Clustering is the problem of finding groups of data points (often modelled as nodes in a graph) that are closer to each other than to other points.
Questions tagged [clustering]
152 questions
30
votes
4 answers
What exactly is the difference between supervised and unsupervised learning?
I am trying to understand clustering methods.
What I think I understood:
In supervised learning, the categories or labels that data are assigned to are known before computation. So the labels, classes, or categories are used in order to "learn" the…
Prot
- 403
- 1
- 4
- 5
8
votes
1 answer
Under what conditions is K-means clustering transformation-invariant?
Given a set of data points $X = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^d$ we run K-means on $X$ and obtain the clusters $c_1, c_2, \ldots, c_k$.
Now, if we create a new dataset $Y = \{y_1, y_2, \ldots, y_m\}$ where $y_i = Ax_i + b$ and…
Ana Echavarria
- 107
- 6
6
votes
1 answer
How to compare/cluster millions of strings?
I have around 1,000,000 strings of variable length (from 200 to 50000) over an alphabet of 5 characters (A, B, C, D, E).
What I actually want is to cluster them together if they are similar enough. By similar enough I mean they have an edit distance…
Ivan
- 273
- 2
- 7
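For readers unfamiliar with the similarity measure mentioned in this question, here is a minimal sketch of the standard Levenshtein edit distance, computed with two rows of the dynamic programming table. The function name `edit_distance` is an illustrative choice, not something from the question. It is only the distance primitive: at the scale described (about a million strings of up to 50000 characters), comparing all pairs this way would likely be far too slow.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

The table has `len(a) + 1` by `len(b) + 1` cells, so time grows with the product of the two lengths while memory stays linear in the shorter string.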
5
votes
1 answer
How is the (local) clustering coefficient defined for vertices with degree 1?
We want to compute the clustering coefficient $C$ for an undirected graph $G = (V, E)$.
The clustering coefficient $C$ for a graph $G$ is the average over all local clustering coefficients $C_i$, whereby $C_i$ is the local clustering coefficient of…
confusedstudent
- 51
- 2
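The definition quoted in this question can be sketched as follows. The graph is assumed to be a dictionary mapping each vertex to the set of its neighbours, and the function names are illustrative. Returning 0 for vertices of degree below 2 is one common convention and is an assumption here; it is exactly the point the question asks about, and other conventions exist (for example, excluding such vertices from the average).

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves adjacent."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # ASSUMED convention: no neighbour pairs, so report 0
    links = sum(1 for u, w in combinations(neighbours, 2) if w in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Average of the local coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Under this convention the pendant vertex 3 contributes 0 to the average, which pulls the global value down compared with leaving it out.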
5
votes
1 answer
k-means clustered data: how to label newly incoming data
I have a data set with labels that were produced by a $k$-means clustering
algorithm. Now there is some data (with the same data structure) from another
source and I wonder what is the most sensible way to label this new, yet unseen
data? I was…
Uli Niklas
- 51
- 1
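One simple option for this situation, shown here only as a sketch and not necessarily what the answer recommends, is to recompute each cluster's centroid from the already labelled points and give every new point the label of its nearest centroid. This mirrors the assignment step of $k$-means. The function names and the tuple-based point representation are assumptions made for illustration.

```python
def centroids_from_labels(points, labels):
    """Mean of the points carrying each label, as a dict label -> centroid."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
            counts[label] = 0
        for d, value in enumerate(point):
            sums[label][d] += value
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in sums[label])
            for label in sums}


def assign_label(point, centroids):
    """Label of the centroid closest to point (squared Euclidean distance)."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))


old_points = [(0, 0), (0, 2), (10, 10), (10, 12)]
old_labels = [0, 0, 1, 1]
centroids = centroids_from_labels(old_points, old_labels)
```

This assumes the new source follows roughly the same distribution as the original data; if it does not, the old centroids may be a poor fit and re-clustering could be more appropriate.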
4
votes
0 answers
Finding the "most modular" subset of graph vertices, i.e. one that minimizes disagreement inside and outside
Let $G = (V, E)$ be a graph. I want to find the subset of vertices of $G$ that minimizes a certain modularity cost. In our setting, the modularity cost of a subset $X$ is defined as the number of non-edges within $X$ plus the number of edges from…
Manuel Lafond
- 530
- 2
- 12
4
votes
2 answers
Creating Best Clusters of Objects Based on Distance Between Them
I have an array of images and a function that calculates the distance between two images.
I wish to cluster the images based on this distance, so that each cluster contains images that are all a short distance from each other.
So only the…
meaning-matters
- 141
- 4
4
votes
0 answers
Persistent Homology vs Clustering Methods
How do persistent homology and clustering methods for data point clouds differ? I'm specifically interested in the application to gene expression data of cancer patients, but any example works.
I understand that a hierarchical clustering method…
Emil_Longshore
- 41
- 2
4
votes
2 answers
Reduce k-means to Integer Programming
The k-means problem reduces to optimizing the objective function:
$
\underset{\textbf{S}}{\operatorname{argmin}} \sum_{i=1}^k \sum_{\textbf{x}_j\in\textbf{S}_i} \lVert \textbf{x}_j - \mathbf{\mu}_i \rVert^2
$
for some observations…
user13675
- 1,684
- 12
- 19
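For reference, the usual k-means objective is the within-cluster sum of squared distances to each cluster's mean, which the algorithm tries to minimize. The sketch below evaluates that quantity for a given partition; the function name and the list-of-lists input format are illustrative assumptions, and nothing here addresses the integer programming reduction the question asks about.

```python
def within_cluster_ss(clusters):
    """Sum over clusters of squared distances from each point to the mean."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        dim = len(cluster[0])
        mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
        for p in cluster:
            total += sum((p[d] - mean[d]) ** 2 for d in range(dim))
    return total


partition = [[(0, 0), (2, 0)], [(5, 5)]]
```

In this example the first cluster has mean (1, 0), so each of its two points contributes 1, and the singleton cluster contributes 0.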
4
votes
0 answers
Find a dynamic programming solution that minimizes the sum of the diameters of two clusters?
I asked a question at this link, where I suggested a greedy algorithm for this problem:
Suppose we are given $2n$ points in the plane and we want to partition the points into two clusters $C_1$, $C_2$ such that each cluster contains exactly $n$ points and we…
All
- 83
- 6
4
votes
0 answers
K-means, but normalized and with max
Given points $x_1, \ldots, x_n$ in the Euclidean space and $K \in \mathbb N$, I'm interested in the following objective.
Partition the points into $K$ clusters $C_1, \ldots, C_K$ so that:
$$\max_{i \in [K]} \frac{1}{|C_i|}\sum_{j \in C_i} \|x_j -…
Dmitry
- 347
- 1
- 4
- 12
4
votes
1 answer
How to group intervals which overlap by some amount?
I have an algorithm that generates a list of intervals. The algorithm is run m times. Let's denote the intervals as tuples (s1, e1), (s2, e2), ..., (sn, en). It is possible to add the run ID to the tuple (though I don't think it helps).
The goal is to…
mibm
- 149
- 3
4
votes
1 answer
How to calculate the minimum number of groups, by grouping groups with capacity together?
I need to group cars (and their passengers) with other cars, and I don't know how to approach this problem.
If I have, for example, 3 cars. Car A with 7 seats and 2 passengers (3/7 because of the driver). Car B, 2/2. Car C, 1/3.
The most wasteful…
Ricardo Jesus
- 55
- 4
3
votes
1 answer
(DROP) Data Reduction Algorithm - How does it work?
I am studying a PhD framework whose purpose is to reduce the dataset to the most representative samples for training a classifier. Maybe I am missing something, but I could not understand a specific part.
Basically, this is Algorithm 1…
rej
- 31
- 3
3
votes
0 answers
What is the definition of a "Clustering Feature" in BIRCH algorithm?
The paper for BIRCH (a clustering algorithm) contains definitions of a Clustering Feature (CF) where the notation is unclear (cf. PDF page 3 / section 4).
A cluster contains N d-dimensional entries $ \{ \vec{X}_1, \vec{X}_2, \dots, \vec{X}_N \} $…
c11o
- 31
- 2