Questions tagged [clustering]

Clustering is grouping (partitioning) a set of objects so that items in the same group are more similar to each other than to items in different groups, where the notion of similarity may be variously defined.

Clustering is the task of grouping (partitioning) a set of objects so that items in the same group are more similar (closer) to each other than to items in different groups. Often the notion of similarity is expressed as a distance measure, with greater distance conveying less similarity. The study of clustering algorithms (cluster analysis) originated in the social sciences but has become important in statistical data analysis (data mining) and in machine learning.

Examples of such algorithms are $K$-means and self-organizing maps.
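
As a rough illustration of the distance-based view described above, here is a minimal sketch of $K$-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, its parameters, and the choice of Euclidean distance with random initialisation from the input points are assumptions made for this example, not part of the tag description; practical implementations usually add better initialisation and multiple restarts.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Illustrative Lloyd's algorithm on a list of equal-length tuples.

    Returns (centers, labels). Names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

On the six sample points, the two well-separated groups should end up with different labels. The result in general depends on the initial centers, which is one reason the algorithm only finds a local optimum.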

322 questions
18
votes
1 answer

Size of connected regions on a randomly-colored infinite chessboard

Consider an infinite chessboard where each square is colored white with probability $p$ and black with probability $1-p$. Suppose without loss of generality that the square at $(0,0)$ is white. We can consider the entire connected region $W$ of…
17
votes
1 answer

Theoretical link between the graph diffusion/heat kernel and spectral clustering

The graph diffusion kernel of a graph is the exponential of its Laplacian $\exp(-\beta L)$ (or a similar expression depending on how you define the kernel). If you have labels on some vertices, you can get labels on the rest of the vertices by a…
17
votes
2 answers

Measure of "how much diagonal" a matrix is

I have a (biological) computational system that outputs square matrices. Sometimes, these matrices are diagonal-like, with higher values at and around the diagonal. I would like to have some summary measure on how "much diagonal" a matrix is, so…
lourencoj
  • 273
11
votes
0 answers

Balanced linear partitioning of a set of points in $R^d$

Suppose we have a set of points in $R^d$ and for a given constant $\epsilon>0$ we want to find a hyperplane such that it divides the dataset into two balanced partitions, and that the number of points that are $\epsilon$-close to the hyperplane is…
9
votes
1 answer

Why do we use the Laplacian matrix in Spectral Clustering?

When we perform spectral clustering, given a similarity matrix $S$, we define the Laplacian matrix $L$ (normalized or unnormalized). Then, we do eigenvalue decomposition on $L$ and get its eigenvector matrix. Why do we do eigenvalue decomposition on…
8
votes
0 answers

Optimization / personalization within clusters

I have the following optimization problem: I have a (random and very noisy) objective function $f(A, P)$, where $A$ is a vector of "observable" parameters of the input and $P$ denotes the parameters that I can control. I'd like to find $P(A)$ for every…
6
votes
2 answers

Clustering algorithm to cluster objects based on their relation weight

I have $n$ words and their relatedness weight that gives me an $n\times n$ matrix. I'm going to use this for a search algorithm but the problem is I need to cluster the entered keywords based on their pairwise relation. So let's say if the keywords…
Tohid
  • 163
6
votes
2 answers

What is the difference between an array and a vector?

Okay, so I'm doing a little bit of vector calculus at university (mainly with neural networks and k-means clustering for cluster analysis in a 3-dimensional field or hyperplane), and from what I understand (forgive me, I'm not sure how to format…
5
votes
1 answer

Measure of the clusters quality in a graph

Suppose we have a graph $G=(V,E)$ with $n$ non-overlapping subgraphs, the clusters $C_1, C_2, \dots, C_n$, which cover the graph: $C_1 \cup \dots \cup C_n = G$. I'm looking for a good metric to measure the quality of these clusters. Let's call it…
5
votes
2 answers

How to see that K-means objective is non-convex?

I'm trying to prove that the objective of the K-means clustering algorithm is non-convex. The objective is given as $J(U,Z) = \|X-UZ\|_F^2$, with $X \in\mathbb{R}^{m\times n}, U\in \mathbb{R}^{m\times k}, Z\in \{0,1\}^{k\times n}$. $Z$ represents…
5
votes
1 answer

Mutual Information for clustering

I'm working on a document clustering application and decided to use Normalized Mutual Information as one of the measures of effectiveness. But I don't really understand how to implement this in that situation. In…
5
votes
1 answer

What are the use cases related to cluster analysis of different distance metrics?

I'm trying to use different distance metrics like Euclidean, Manhattan, cosine, and Chebyshev, among others, in my k-means algorithm to calculate distances between the data points and the centers. In what situation would one distance…
5
votes
1 answer

What is the definition of "convex relaxation" in clustering?

I have the following text from a paper I am trying to understand. I don't understand what the sentence below refers to as being convex/non-convex: The problem is that even though the objectives (1) and (2) are convex the constraint that K is valid…
4
votes
0 answers

Looking for an algo to "sorta" diagonalize a similarity matrix.

I've got a big fat similarity matrix. The rows and columns represent people, and the values represent some positive measure of their closeness (0 meaning no connection at all). The n-th row and n-th column correspond to the same person - thus the…
4
votes
2 answers

Sampling with an "oversampling" factor, in K-Means||

I'm trying to understand K-Means||, a scalable version of K-Means++, which itself is an "improved" version of the clustering algorithm K-Means. Please find here the link to K-Means||…