Questions tagged [clustering]

Clustering is the problem of finding groups of data points (often modelled as nodes in a graph) that are closer to each other than to other points.

152 questions
30
votes
4 answers

What exactly is the difference between supervised and unsupervised learning?

I am trying to understand clustering methods. What I think I understood: In supervised learning, the categories/labels data is assigned to are known before computation. So, the labels, classes or categories are being used in order to "learn" the…
Prot
8
votes
1 answer

Under what conditions is K-means clustering transformation-invariant?

Given a set of data points $X = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^d$ we run K-means on $X$ and obtain the clusters $c_1, c_2, \ldots, c_k$. Now, if we create a new dataset $Y = \{y_1, y_2, \ldots, y_m\}$ where $y_i = Ax_i + b$ and…
6
votes
1 answer

How to compare/cluster millions of strings?

I have around 1,000,000 strings of variable length (from 200 to 50,000 characters) that can contain 5 characters (A, B, C, D, E). What I actually want is to cluster them together if they are similar enough. By similar enough I mean they have an edit distance…
Ivan
5
votes
1 answer

How is the (local) clustering coefficient defined for vertices with degree 1?

We want to compute the clustering coefficient $C$ for an undirected graph $G = (V, E)$. The clustering coefficient $C$ for a graph $G$ is the average over all local clustering coefficients $C_i$, whereby $C_i$ is the local clustering coefficient of…
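
For readers who want the definition in concrete terms, here is a minimal sketch of the usual local clustering coefficient and its average over all vertices. The function names and the adjacency-dict representation are illustrative choices, and returning 0 for vertices of degree below 2 is only one common convention (an assumption here, since that case is exactly what the question asks about).

```python
from itertools import combinations

def local_clustering(adj, v):
    """Local clustering coefficient of v in an undirected simple graph.

    adj maps each vertex to the set of its neighbours.
    """
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        # Assumption: treat degree 0 or 1 as coefficient 0.
        # Other conventions (e.g. skipping such vertices) exist.
        return 0.0
    # Count edges among the neighbours of v.
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Average of the local coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Triangle a-b-c with a pendant vertex d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(local_clustering(adj, "d"), average_clustering(adj))
```

Under the zero convention the pendant vertex `d` pulls the average down; excluding it from the average instead would give a different value, which is why the convention matters.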
5
votes
1 answer

k-means clustered data: how to label newly incoming data

I have a data set with labels that were produced by a $k$-means clustering algorithm. Now there is some data (with the same data structure) from another source, and I wonder what the most sensible way is to label this new, as yet unseen data. I was…
Uli Niklas
4
votes
0 answers

Finding the "most modular" subset of graph vertices, i.e. that minimize disagreement inside and outside

Let $G = (V, E)$ be a graph. I want to find the subset of vertices of $G$ that minimizes a certain modularity cost. In our setting, the modularity cost of a subset $X$ is defined as the number of non-edges within $X$ plus the number of edges from…
Manuel Lafond
4
votes
2 answers

Creating Best Clusters of Objects Based on Distance Between Them

I have an array of images and a function that calculates the distance between two images. I wish to cluster the images based on this distance, so that each cluster contains images that are all at a short distance from each other. So only the…
4
votes
0 answers

Persistent Homology vs Clustering Methods

How do persistent homology and clustering methods for data point clouds differ? I'm specifically interested in the application to gene expression data of cancer patients, but any example works. I understand that a hierarchical clustering method…
4
votes
2 answers

Reduce k-means to Integer Programming

The k-means algorithm reduces to computing the objective function: $ \underset{\textbf{S}}{\operatorname{argmin}} \sum_{i=1}^k \sum_{\textbf{x}_j\in\textbf{S}_i} \lVert \textbf{x}_j - \mathbf{\mu}_i \rVert^2 $ for some observations…
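
As a rough illustration of what this objective measures, the sketch below evaluates the within-cluster sum of squared distances for a given partition, assuming the usual squared-distance form of the k-means cost with $\mu_i$ taken as the mean of $S_i$. The helper names `centroid` and `kmeans_cost` are made up for this example; it only scores a partition and does not search for the best one.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def kmeans_cost(clusters):
    """Sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for pts in clusters:
        mu = centroid(pts)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in pts)
    return total

# Two ways of splitting the same four points into k = 2 clusters.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(kmeans_cost(good), kmeans_cost(bad))
```

The partition that keeps nearby points together has the smaller cost, which is the sense in which the objective is minimized over partitions $\textbf{S}$.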
4
votes
0 answers

Find a dynamic programming solution that minimizes the sum of the diameters of two clusters?

I asked a question at this link, where I suggested a greedy algorithm for this problem: Suppose we are given $2n$ points in the plane and we want to partition the points into two clusters $C_1$, $C_2$ such that each cluster contains exactly $n$ points and we…
4
votes
0 answers

K-means, but normalized and with max

Given points $x_1, \ldots, x_n$ in the Euclidean space and $K \in \mathbb N$, I'm interested in the following objective. Partition the points into $K$ clusters $C_1, \ldots, C_K$ so that: $$\max_{i \in [K]} \frac{1}{|C_i|}\sum_{j \in C_i} \|x_j -…
Dmitry
4
votes
1 answer

How to group intervals which overlap by some amount?

I have an algorithm that generates a list of intervals. The algorithm is run m times. Let's denote the intervals as tuples (s1, e1), (s2, e2), ..., (sn, en). It is possible to add the run ID to the tuple (though I don't think it helps). The goal is to…
mibm
4
votes
1 answer

How to calculate the minimum number of groups, by grouping groups with capacity together?

I need to group cars (and their passengers) with other cars, and I don't know how to approach this problem. Suppose, for example, I have 3 cars: Car A with 7 seats and 2 passengers (3/7 because of the driver), Car B with 2/2, and Car C with 1/3. The most wasteful…
3
votes
1 answer

(DROP) Data Reduction Algorithm - How does it work?

I am studying a PhD framework whose purpose is to reduce the dataset to the most representative samples for training a classifier. Maybe I am missing something, but I could not understand a specific part. Basically, this is algorithm 1…
3
votes
0 answers

What is the definition of a "Clustering Feature" in BIRCH algorithm?

The paper for BIRCH (a clustering algorithm) contains definitions of a Clustering Feature (CF) where the notation is unclear (cf. PDF page 3 / section 4). A cluster contains N d-dimensional entries $ \{ \vec{X}_1, \vec{X}_2, \dots, \vec{X}_N \} $…
c11o