How to apply TF-IDF to a structured dataset in Python?

Question

I know that TF-IDF is an NLP method for feature extraction, and I know that there are libraries that calculate TF-IDF directly from text.

That is not what I want, though.

In my case, the text dataset has already been converted into a bag of words.

The original dataset, which I do NOT have access to, looks like this:

RepID     RepText
------------------
1         Doctor says patient has diabetes and needs rest for ...
2         Patients history: broken arm, and ...
3         A dose of Metformin 2 times a day ...
4         Xray needed for the chest...
5         Covid-19 expectation and patient should have a rest ...

But my dataset looks like this:

RepID   Word         BOW
-------------------------
1       Doctor       3
1       diabetes     4
1       patient      1
.       .            .
.       .            .
2       patient      2
2       arm          7
.       .            .
.       .            .
5684    cough        9
5684    Xray         3
5684    Covid        5
.       .            .
.       .            .

What I want is to find the TF-IDF of each word in my dataset.

I was thinking of converting my dataset into an unstructured format, so that it would look like this:

RepID     RepText
------------------
1         Doctor Doctor Doctor diabetes diabetes diabetes diabetes ...
2         Patients patients arm arm arm arm arm arm arm ...
.
.
5684      cough cough cough cough cough cough cough cough cough Xray Xray

so that each word is repeated as many times as its BOW count.

However, I do not think this is the best approach, since it converts a structured dataset back into an unstructured one.

How can I compute TF-IDF from the structured dataset? Is there a library or algorithm for that?

Note: the dataset is stored in MS SQL Server, and I am using Python.

score 3 · Answer 1 · 2021-05-22T20:12:05.167

You could use pandas pivot_table() to transform your data frame into a count matrix (one row per RepID, one column per Word), and then apply sklearn's TfidfTransformer() to that count matrix to obtain the tf-idf weights.

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# input data
df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5]
})

# count matrix: one row per RepID, one column per Word
df = pd.pivot_table(df, index='RepID', columns='Word', values='BOW', aggfunc='sum')
df = df.fillna(value=0)
print(df)
# Word   Covid  Doctor  Xray  arm  cough  diabetes  patient
# RepID
# 1        0.0     3.0   0.0  0.0    0.0       4.0      1.0
# 2        0.0     0.0   0.0  7.0    0.0       0.0      2.0
# 5684     5.0     0.0   3.0  0.0    9.0       0.0      0.0

# tf-idf transform: fit() learns one IDF value per column (word)
X = TfidfTransformer().fit(df.values)
print(X.idf_)
# [1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.28768207]
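
The snippet above only prints the fitted IDF vector. To get an actual tf-idf weight per (RepID, Word) pair, which is what the question asks for, something along these lines should work. This is a minimal sketch on the same toy data, not a definitive implementation: the names long_df, counts, tfidf_wide and tfidf_long are my own, and the manual IDF check assumes scikit-learn's default smooth_idf=True formula, idf = ln((1 + n) / (1 + df)) + 1.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Same toy data as in the answer, in long (RepID, Word, BOW) form.
long_df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5],
})

# Report-by-word count matrix; (RepID, Word) pairs that never occur become 0.
counts = pd.pivot_table(long_df, index='RepID', columns='Word',
                        values='BOW', aggfunc='sum', fill_value=0)

# fit_transform returns a sparse matrix of tf-idf weights.
# With the defaults (norm='l2', smooth_idf=True) each row is L2-normalised.
transformer = TfidfTransformer()
weights = transformer.fit_transform(counts.values)

# Wrap the weights in a DataFrame so RepID and Word labels are kept.
tfidf_wide = pd.DataFrame(weights.toarray(),
                          index=counts.index, columns=counts.columns)

# Back to long format, keeping only words that actually occur in each report.
tfidf_long = tfidf_wide.stack().rename('tfidf').reset_index()
tfidf_long = tfidf_long[tfidf_long['tfidf'] > 0].reset_index(drop=True)
print(tfidf_long)

# Sanity check: recompute the IDF by hand from document frequencies.
n_docs = counts.shape[0]                      # number of reports
doc_freq = (counts.values > 0).sum(axis=0)    # reports containing each word
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.allclose(manual_idf, transformer.idf_))
```

The long-format result has the same shape as the original table, so it could be written back to SQL Server alongside the BOW column. For thousands of reports the dense pivot can get large; TfidfTransformer also accepts scipy.sparse input, so building a sparse matrix from the (RepID, Word, BOW) triples would be an alternative worth considering.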
