I know that TFIDF is an NLP method for feature extraction, and I know that there are libraries that calculate TFIDF directly from the text.

This is not what I want, though.

In my case, my text dataset has already been converted into a bag of words.

The original dataset, which I do NOT have access to, looks like this:

RepID     RepText
------------------
1         Doctor sys patient has diabetes and needs rest for ...
2         Patients history: broken arm, and ...
3         A dose of Metformin 2 times a day ...
4         Xray needed for the chest...
5         Covid-19 expectation and patient should have a rest ...

But my dataset looks like this:

RepID   Word         BOW
-------------------------
1       Doctor       3
1       diabetes     4
1       patient      1
.       .            .
.       .            .
2       patient      2
2       arm          7
.       .            .
.       .            .
5684    cough        9
5684    Xray         3
5684    Covid        5
.       .            .
.       .            .

What I want is to find the TFIDF for each word in my dataset.

I was thinking of converting my dataset into an unstructured format, so it would look like this:

RepID     RepText
------------------
1         Doctor Doctor Doctor diabetes diabetes diabetes diabetes ...
2         Patients patients arm arm arm arm arm arm arm ...
.
.
5684      cough cough cough cough cough cough cough cough cough Xray Xray

Each word is repeated as many times as its BOW count.

But I do not think this is the best way to do it, as I would be converting a structured dataset into an unstructured one.

How can I find the TFIDF from the structured dataset? Is there a library or algorithm for that?

Note: the dataset is stored in MS SQL Server, and I am using Python.

asmgx

1 Answer

You could use pandas pivot_table() to transform your data frame into a count matrix, and then apply sklearn TfidfTransformer() to the count matrix in order to obtain the tf-idfs.

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# input data
df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5]
})

# count matrix
df = pd.pivot_table(df, index='RepID', columns='Word', values='BOW', aggfunc='sum')
df = df.fillna(value=0)
print(df)

Word   Covid  Doctor  Xray  arm  cough  diabetes  patient
RepID
1        0.0     3.0   0.0  0.0    0.0       4.0      1.0
2        0.0     0.0   0.0  7.0    0.0       0.0      2.0
5684     5.0     0.0   3.0  0.0    9.0       0.0      0.0
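With thousands of reports and a large vocabulary, a dense pivot like the one above may become too large for memory. A possible alternative, sketched here under that assumption, is to build a SciPy sparse count matrix straight from the long table, since TfidfTransformer also accepts sparse input. The names `long_df`, `rep_ids`, `words` and `counts` are illustrative and not part of the original answer.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Same toy data as above, kept in its long (RepID, Word, BOW) form.
long_df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5],
})

# Map every RepID to a row index and every Word to a column index.
row_codes, rep_ids = pd.factorize(long_df['RepID'], sort=True)
col_codes, words = pd.factorize(long_df['Word'], sort=True)

# Build the sparse count matrix; duplicate (row, column) pairs are summed,
# which mirrors aggfunc='sum' in the pivot.
counts = csr_matrix(
    (long_df['BOW'].to_numpy(), (row_codes, col_codes)),
    shape=(len(rep_ids), len(words)),
)

print(list(words))
print(counts.toarray())
```

`rep_ids` and `words` keep the row and column labels, so results can be mapped back to the original RepID and Word values later.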

# tf-idf transform
X = TfidfTransformer().fit(df.values)
print(X.idf_)

[1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.28768207]
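The values printed above are only the per-word IDF weights. To get a tf-idf score for each (RepID, Word) pair, as the question asks, one option is to call fit_transform() and reshape the result back into the long format. This is a minimal sketch with the default TfidfTransformer settings (smoothed IDF, L2-normalised rows); the names `long_df`, `counts`, `tfidf_long` and `result` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Toy data in the long (RepID, Word, BOW) form from the question.
long_df = pd.DataFrame({
    'RepID': [1, 1, 1, 2, 2, 5684, 5684, 5684],
    'Word': ['Doctor', 'diabetes', 'patient', 'patient', 'arm', 'cough', 'Xray', 'Covid'],
    'BOW': [3, 4, 1, 2, 7, 9, 3, 5],
})

# Count matrix: one row per report, one column per word.
counts = long_df.pivot_table(index='RepID', columns='Word', values='BOW',
                             aggfunc='sum', fill_value=0)

# fit_transform returns the tf-idf matrix itself (sparse),
# with each row L2-normalised by default.
tfidf = TfidfTransformer().fit_transform(counts.values)

# Put the labels back, then unpivot to one row per (RepID, Word).
tfidf_df = pd.DataFrame(tfidf.toarray(), index=counts.index, columns=counts.columns)
tfidf_long = tfidf_df.stack().rename('TFIDF').reset_index()

# Keep only the pairs that exist in the original table.
result = long_df.merge(tfidf_long, on=['RepID', 'Word'])
print(result)
```

The resulting `result` frame has the same rows as the input plus a TFIDF column, so it could be written back to a SQL table if needed.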