With multiple identical datapoints, should I use UMAP min_dist = 0?

Question

Most (if not all) implementations and examples of UMAP dimensionality reduction I have seen use a min_dist value slightly above zero to avoid clustering points too tightly.

That makes sense, but I noticed that I have a significant number of identical datapoints (indicated by zeros in my distance matrix, which uses Gower's distance). I'm therefore wondering whether I should set min_dist = 0 to make sure that the identity of those points is preserved in the new space.
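To make "a significant number of identical datapoints" concrete, here is a minimal sketch of how the duplicates could be counted directly from the distance matrix. It assumes the Gower distances are already in a square NumPy array; `duplicate_groups` and `tol` are hypothetical names used only for illustration, and `tol=0.0` assumes exact zeros rather than near-zeros.

```python
import numpy as np

def duplicate_groups(dist, tol=0.0):
    """Return lists of row indices whose pairwise distance is <= tol.

    Assumes `dist` is a square, symmetric distance matrix with a zero
    diagonal. With tol=0.0, group membership is transitive because a
    zero distance means the rows are identical.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    labels = np.full(n, -1)  # -1 marks rows not yet assigned to a group
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        # Unassigned rows at distance <= tol from row i (includes i itself).
        members = np.nonzero((dist[i] <= tol) & (labels == -1))[0]
        labels[members] = next_label
        next_label += 1
    groups = [np.nonzero(labels == k)[0].tolist() for k in range(next_label)]
    # Keep only groups that actually contain duplicates.
    return [g for g in groups if len(g) > 1]

# Toy example: rows 0 and 1 are identical, as are rows 2 and 4.
toy = np.array([
    [0.0, 0.0, 0.5, 0.2, 0.5],
    [0.0, 0.0, 0.5, 0.2, 0.5],
    [0.5, 0.5, 0.0, 0.4, 0.0],
    [0.2, 0.2, 0.4, 0.0, 0.4],
    [0.5, 0.5, 0.0, 0.4, 0.0],
])
groups = duplicate_groups(toy)
n_duplicated_rows = sum(len(g) for g in groups)
```

The size of the largest group is worth comparing against n_neighbors: if a group of identical rows is larger than n_neighbors, every neighbor of those points sits at distance zero.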

I have tried it and, from what I can tell, it works fine in most cases. Under some circumstances, however, I get extremely tight clusters, so that I have to use n_neighbors = 120 or so to "pull them apart". I tend to interpret this as "positive": the clusters are clearly defined and strong enough to "resist" a large n_neighbors value, which means that distances between clusters remain (somewhat) meaningful. By contrast, when I can only see clusters with a low n_neighbors value, the embedding, as I understand it, focuses on local structure at the cost of rendering global structure (e.g. distances between clusters) meaningless.
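For reference, a rough sketch of the kind of run described above, assuming the umap-learn package and a precomputed Gower distance matrix. The helper names `embed_precomputed` and `max_duplicate_spread` are hypothetical; the second one is just one possible way to check whether identical datapoints really end up together in the embedding.

```python
import numpy as np

def embed_precomputed(dist, n_neighbors=15, min_dist=0.0, random_state=0):
    """Embed a precomputed distance matrix with UMAP (requires umap-learn)."""
    import umap  # imported lazily so the helper below works without umap-learn

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        metric="precomputed",  # `dist` holds distances, not feature vectors
        random_state=random_state,
    )
    return reducer.fit_transform(dist)

def max_duplicate_spread(embedding, dist):
    """Largest embedded distance between any two points whose input distance is 0."""
    dist = np.asarray(dist, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    # Upper triangle only, so each pair is counted once and the diagonal is skipped.
    i, j = np.nonzero(np.triu(dist == 0, k=1))
    if i.size == 0:
        return 0.0
    return float(np.linalg.norm(embedding[i] - embedding[j], axis=1).max())
```

A possible usage would be to call `embed_precomputed(gower_dist, n_neighbors=k)` for several values of `k` (for example 15, 50 and 120), where `gower_dist` stands for the distance matrix, and compare `max_duplicate_spread` across the runs and against a run with the default `min_dist`.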

Is my interpretation correct? What are the downsides of using min_dist = 0?

For reference, in case it matters: my dataset has between 2000 and 3000 rows, depending on which cases I exclude prior to UMAP.

(If some of the above doesn't make sense, please bear with me. I only started working my way into cluster analysis a few weeks ago...)
