
I am trying to compare different clustering algorithms on my text data. I first calculated the tf-idf matrix and used it to compute a cosine distance matrix (1 - cosine similarity). Then I used this distance matrix for K-means and hierarchical clustering (Ward linkage and a dendrogram). Now I want to use the same distance matrix for Mean Shift, DBSCAN, and OPTICS.

Below is the part of the code that computes the distance matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(Strategies) #fit the vectorizer to synopses


terms = tfidf_vectorizer.get_feature_names()

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
print(dist)

I am new to both Python and clustering. I found code for K-means and hierarchical clustering and tried to understand it, but I cannot apply it to the other clustering algorithms. It would be very helpful to get a simple explanation of each clustering algorithm and how this distance matrix can be used (if possible) with each of them.

Thanks in advance!

Piyush Ghasiya
1 Answer


Several scikit-learn clustering algorithms can be fit using cosine distances:

from collections      import defaultdict
from sklearn.datasets import load_iris
from sklearn.cluster  import DBSCAN, OPTICS

# Define sample data
iris = load_iris()
X = iris.data

# List clustering algorithms
algorithms = [DBSCAN, OPTICS]  # MeanShift does not use a metric

# Fit each clustering algorithm and store results
results = defaultdict(int)
for algorithm in algorithms:
    results[algorithm] = algorithm(metric='cosine').fit(X)
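
The question's dist matrix can also be passed in directly: DBSCAN and OPTICS both accept metric='precomputed', in which case fit expects a square distance matrix instead of the raw features. A minimal sketch, assuming dist is the cosine distance matrix computed in the question; the eps and min_samples values are illustrative placeholders that would need tuning for real data:

from sklearn.cluster import DBSCAN, OPTICS

# dist is the precomputed (n_docs x n_docs) cosine distance matrix from the question
db = DBSCAN(eps=0.5, min_samples=5, metric='precomputed').fit(dist)
op = OPTICS(min_samples=5, metric='precomputed').fit(dist)

print(db.labels_)  # one cluster label per document; -1 marks noise points
print(op.labels_)

MeanShift has no metric parameter (it operates in Euclidean feature space only), so it cannot consume this distance matrix.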

Brian Spiering