Measure the significance of differences in internal distances in two clusters

Question

I am a linguist studying grammatical variation in 169 female and 169 male Norwegian authors (using a treebank), based on 8 inflectional and syntactic properties. (Each property involves a choice between two alternative realizations, and the recorded value for a given author is the percentage of one of the choices.) A striking and stable difference between the two groups is that the men spread out more than the women in the eight-dimensional space of possibilities: the women cluster more densely, showing less variation in their combinations of choices. I calculate density by taking the average of each group member’s average distance to the other authors of the same gender. For one of the pairs of properties this gives a mean distance of 37 for the men (standard deviation 13.0) and 27 for the women (standard deviation 11.8). A t-test yields significance with extremely high confidence, but, not being a statistician, I have my doubts about the appropriateness of a t-test. The values for the group members are interdependent, so we are testing properties of the groups as wholes rather than comparable properties of each member across the two groups. Hence my question is: is there an appropriate significance test for this kind of data, or am I wrong in assuming that a t-test is not appropriate (which would be nice)?

score 0 · Answer 1 · answered Aug 28 '24 at 12:10

I am also not a statistician but I thought I would offer my two cents:

First, I think that when you talk about clusters you are actually referring to groups. A cluster is typically a group that one finds by applying some kind of algorithm. In this case the two clusters are simply the male and female authors, which in my opinion should simply be called groups.

You describe the density of a group as something you calculate from all eight attributes, but then you seem to report a distance for a single set of properties.

In both cases I would think the t-test is not appropriate, because the standard (Student's) t-test assumes, among other things, that the two groups have the same distribution with equal variances, which seems not to be the case here (see here, for instance).

If the t-test is indeed not applicable, this website offers some suggestions on what to do. I would also suggest that you post the question on Cross Validated, which is the Stack Exchange site focused directly on statistics questions.

score 0 · Answer 2 · answered Aug 28 '24 at 22:16

Thank you, René! Indeed I mean groups (which is the term I use in the text).

The eight properties are eight binary choices between two alternative ways to express a category, called ‘conservative’ and ‘radical’. What is recorded for an author for a given property is the percentage of the conservative choice. Hence it will be a value between 0 and 100. The eight properties thus define an eight-dimensional space in which the authors are distributed. The (Euclidean) distance between any two authors can then be calculated as usual, taking the square root of the sum of squared differences in the coordinates of the points. The example I give concerns just two of the eight properties, i.e. two dimensions (which is probably unnecessarily confusing).
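
To make the density calculation concrete, here is a minimal Python sketch of it as described: each author's average Euclidean distance to the other authors in the same group, then the average of those values. The function names and the three-author toy data are hypothetical and chosen only for illustration; they are not taken from the actual treebank data.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distances_to_others(group):
    """For each author, the average distance to all other authors in the group."""
    n = len(group)
    return [
        sum(euclidean(group[i], group[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def group_density(group):
    """Average of the per-author average distances (smaller = tighter group)."""
    per_author = mean_distances_to_others(group)
    return sum(per_author) / len(per_author)

# Hypothetical toy data: 3 authors described by 2 properties (percentages).
toy_group = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
per_author = mean_distances_to_others(toy_group)  # [7.5, 5.0, 7.5]
density = group_density(toy_group)
```

The same functions work unchanged for all eight properties, since the distance is computed over however many coordinates each author has.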

It did not occur to me that the (small?) difference in standard deviations would preclude the use of a t-test – I need to look into that further. The problem I saw was that since the values for each author in the calculation of density or tightness in the group are mutual distances, they are not independent of each other.
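
One way to treat the density as a property of the group as a whole, rather than of independent members, is a label-permutation test. This technique is not proposed anywhere in this thread; the sketch below is only an illustration under that assumption, with hypothetical function names and toy data. It reshuffles the gender labels many times and asks how often a random split gives a density difference at least as large as the observed one. Note that it tests whether the labels are exchangeable, so a difference in location between the groups could also contribute to a small p-value.

```python
import math
import random

def group_density(group):
    """Mean pairwise Euclidean distance within a group.

    This equals the average of each member's average distance to the others.
    """
    n = len(group)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(group[i], group[j])
    return 2 * total / (n * (n - 1))

def permutation_p_value(group_a, group_b, n_permutations=500, seed=0):
    """Two-sided permutation p-value for the difference in group density."""
    rng = random.Random(seed)
    observed = abs(group_density(group_a) - group_density(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of the group labels
        diff = abs(group_density(pooled[:n_a]) - group_density(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical toy data: one tight and one spread-out group of 10 points each.
tight = [(50.0 + 0.5 * i, 50.0) for i in range(10)]
spread = [(5.0 + 10.0 * i, 50.0) for i in range(10)]
p_toy = permutation_p_value(tight, spread)
```

Because the whole group statistic is recomputed for every relabelling, the mutual dependence of the per-author distances is built into the reference distribution instead of being assumed away.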

Thank you for the reference to a useful website.
