
I am trying to compare different clustering algorithms on my text data. I first calculated the tf-idf matrix and used it to compute a cosine distance matrix (1 - cosine similarity). Then I used this distance matrix for K-means and hierarchical clustering (Ward linkage and a dendrogram). Now I want to use the same distance matrix for Mean Shift, DBSCAN, and OPTICS.

Below is the part of the code that computes the distance matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(Strategies) #fit the vectorizer to synopses


terms = tfidf_vectorizer.get_feature_names()

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
print(dist)

I am new to both Python and clustering. I found code for K-means and hierarchical clustering and tried to understand it, but I cannot apply it to the other clustering algorithms. It would be very helpful to get a simple explanation of each clustering algorithm and how this distance matrix can be used (if possible) with each of them.

Thanks in advance!

Piyush Ghasiya
1 Answer


Several scikit-learn clustering algorithms can be fit using cosine distances:

from collections      import defaultdict
from sklearn.datasets import load_iris
from sklearn.cluster  import DBSCAN, OPTICS

# Define sample data
iris = load_iris()
X = iris.data

# List clustering algorithms
algorithms = [DBSCAN, OPTICS]  # MeanShift does not use a metric

# Fit each clustering algorithm and store results
results = defaultdict(int)
for algorithm in algorithms:
    results[algorithm] = algorithm(metric='cosine').fit(X)
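
The question's dist matrix can also be passed in directly: DBSCAN and OPTICS both accept metric='precomputed', in which case fit expects a square distance matrix instead of the raw features. A minimal sketch, assuming dist is the cosine distance matrix computed in the question; the eps and min_samples values are illustrative placeholders that would need tuning for real data:

from sklearn.cluster import DBSCAN, OPTICS

# dist is the precomputed (n_docs x n_docs) cosine distance matrix from the question
db = DBSCAN(eps=0.5, min_samples=5, metric='precomputed').fit(dist)
op = OPTICS(min_samples=5, metric='precomputed').fit(dist)

print(db.labels_)  # one cluster label per document; -1 marks noise points
print(op.labels_)

MeanShift has no metric parameter (it operates in Euclidean feature space only), so it cannot consume this distance matrix.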

Brian Spiering