Questions tagged [clustering]

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval etc.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include k-means, expectation maximization (EM), spectral clustering, correlation clustering and hierarchical clustering.
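As a rough illustration of how one of the algorithms named above discovers groups in unlabeled data, here is a minimal sketch of hierarchical (agglomerative) clustering with single linkage. The function name `single_linkage` and the sample points are made up for this example and do not come from any library; it uses only the Python standard library and is far too slow for real data sets.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative (hierarchical) clustering with single linkage.

    Starts with every point in its own cluster and repeatedly merges the
    two closest clusters until only n_clusters remain.
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters (ia < ib) with the smallest linkage.
        _, ia, ib = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Merge cluster ib into cluster ia, then drop ib.
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]

    return [sorted(c) for c in clusters]


# Hypothetical toy data: two well-separated groups of 2-D points, no labels.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
groups = single_linkage(points, n_clusters=2)
print(groups)
```

No labels are supplied; the grouping comes only from the pairwise distances. In practice a library implementation (for example in SciPy or scikit-learn) would normally be preferred over a hand-written loop like this.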

Related topics: pattern-recognition, knowledge discovery, taxonomy. Not to be confused with cluster computing.

1381 questions
202
votes
13 answers

K-Means clustering for mixed numeric and categorical data

My data set contains a number of numeric attributes and one categorical. Say, NumericAttr1, NumericAttr2, ..., NumericAttrN, CategoricalAttr, where CategoricalAttr takes one of three possible values: CategoricalAttrValue1, CategoricalAttrValue2 or…
IgorS
  • 5,474
  • 11
  • 34
  • 43
67
votes
9 answers

Clustering geo location coordinates (lat,long pairs)

What is the right approach and clustering algorithm for geolocation clustering? I'm using the following code to cluster geolocation coordinates: import numpy as np import matplotlib.pyplot as plt from scipy.cluster.vq import kmeans2,…
rokpoto.com
  • 813
  • 1
  • 7
  • 6
48
votes
6 answers

Calculating KL Divergence in Python

I am rather new to this and can't say I have a complete understanding of the theoretical concepts behind this. I am trying to calculate the KL Divergence between several lists of points in Python. I am using this to try and do this. The problem that…
Nanda
  • 793
  • 1
  • 7
  • 8
41
votes
5 answers

Is it necessary to standardize your data before clustering?

Is it necessary to standardize your data before clustering? In the scikit-learn example about DBSCAN, they do this in the line: X = StandardScaler().fit_transform(X). But I do not understand why it is necessary. After all, clustering does…
makansij
  • 869
  • 2
  • 12
  • 17
38
votes
4 answers

When to use cosine similarity over Euclidean similarity

In NLP, people tend to use cosine similarity to measure document/text distances. I want to hear what people think about the following two scenarios, and which to pick: cosine similarity or Euclidean? Overview of the task set: The task is to compute…
Logan
  • 503
  • 1
  • 4
  • 8
35
votes
8 answers

Best practical algorithm for sentence similarity

I have two sentences, S1 and S2, both of which have a word count (usually) below 15. What are the most practically useful and successful (machine learning) algorithms, which are possibly easy to implement (neural network is ok, unless the architecture…
DaveTheAl
  • 533
  • 1
  • 5
  • 12
35
votes
1 answer

What is the best Keras model for multi-class classification?

I am working on research where I need to classify one of three events, WINNER=(win, draw, lose). WINNER LEAGUE HOME AWAY MATCH_HOME MATCH_DRAW MATCH_AWAY MATCH_U2_50 MATCH_O2_50 3 13 550 571 1.86 3.34 …
SpanishBoy
  • 557
  • 1
  • 5
  • 11
29
votes
1 answer

Word2Vec vs. Sentence2Vec vs. Doc2Vec

I recently came across the terms Word2Vec, Sentence2Vec and Doc2Vec and am kind of confused, as I am new to vector semantics. Can someone please explain the differences between these methods in simple words? What are the most suitable tasks for each…
27
votes
2 answers

How to deal with time series which change in seasonality or other patterns?

Background: I'm working on a time series data set of energy meter readings. The length of the series varies by meter - for some I have several years, others only a few months, etc. Many display significant seasonality, and often multiple layers -…
Jo Douglass
  • 401
  • 1
  • 5
  • 10
25
votes
3 answers

K-means incoherent behaviour choosing K with Elbow method, BIC, variance explained and silhouette

I'm trying to cluster some vectors with 90 features with K-means. Since this algorithm asks me for the number of clusters, I want to validate my choice with some nice math. I expect to have from 8 to 10 clusters. The features are Z-score scaled. Elbow…
marcodena
  • 1,667
  • 4
  • 14
  • 17
25
votes
5 answers

Clustering based on similarity scores

Assume that we have a set of elements E and a similarity (not distance) function sim(ei, ej) between two elements ei,ej ∈ E. How could we (efficiently) cluster the elements of E, using sim? k-means, for example, requires a given k, Canopy…
vefthym
  • 503
  • 1
  • 6
  • 13
20
votes
4 answers

K-means: What are some good ways to choose an efficient set of initial centroids?

When a random initialization of centroids is used, different runs of K-means produce different total SSEs, and the initialization is crucial to the performance of the algorithm. What are some effective approaches toward solving this problem? Recent approaches are…
ngub05
  • 333
  • 1
  • 2
  • 8
19
votes
2 answers

K-means vs. online K-means

K-means is a well-known algorithm for clustering, but there is also an online variation of this algorithm (online K-means). What are the pros and cons of these approaches, and when should each be preferred?
Rubens
  • 4,117
  • 5
  • 25
  • 42
17
votes
1 answer

Algorithms for text clustering

I have the problem of clustering a huge number of sentences into groups by their meanings. This is similar to the problem where you have lots of sentences and want to group them by their meanings. What algorithms are suggested to do this? I don't know…
Andrey Rubliov
  • 303
  • 1
  • 2
  • 7
15
votes
2 answers

Clustering unique visitors by useragent, ip, session_id

Given website access data in the form session_id, ip, user_agent, and optionally timestamp, following the conditions below, how would you best cluster the sessions into unique visitors? session_id: is an id given to every new visitor. It does not…
AdrianBR
  • 367
  • 2
  • 10