Questions tagged [data-mining]

This tag is for questions about data mining, which is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets.

Finding patterns within massive amounts of unexplored data requires the use of sophisticated linear algebra and presents a unique challenge.

Some relevant techniques used in data mining are linear discriminant analysis, principal component analysis, and support vector machines.

139 questions
9
votes
1 answer

Normalization of data in decision tree

After reading through a few references, I have come to know that for machine learning in general, it is necessary to normalize features so that no features are arbitrarily large (centering) and all features are on the same scale…
Raaj
  • 711
8
votes
2 answers

How do I find the formula (or rules) that created a list of numbers with seemingly no pattern?

Newbie here, and I apologize if this is the wrong forum for this type of question... I have a group of 200 or so alphanumeric codes from an unknown source. Here's an example piece of the data…
6
votes
1 answer

What is the exact function or algorithm Windows 10 is using to calculate taskbar underline color from Registry values, based on data I have collected?

In earlier versions of Windows, it was easy for users to set the exact colors of interface elements. For example, in Windows 7, you could go to [Control Panel -> Personalization -> Window Color], which would bring up a dialog box letting you select…
6
votes
2 answers

Distance between unequal-dimension vectors (and data)?

It is easy to find simple distance measures for equal-dimension vectors, such as Euclidean Distance or Correlation. What about unequal-dimension vectors, such as, for instance, $(a,b,c)$ and $(d,e)$? Are there any known approaches in math for…
iLie
  • 69
5
votes
1 answer

Mutual Information for clustering

I'm working on a document clustering application and decided to use Normalized Mutual Information as one of the measures of effectiveness. But I don't really understand how to implement this in that situation. In…
5
votes
2 answers

Courses to take during pure math masters to keep data science and applied work as a possibility

I was wondering what courses you can take in a pure math masters to preserve the opportunity to go into data science, economics, policy research or other applied work while preserving the opportunity to do pure math PhD later (keeping in mind that I…
5
votes
1 answer

Does the Gini index consider only a binary split for each attribute, or can it have multi-way splitting?

Does Gini-index based classification split values for any attribute always as a binary split or can it split into more than $2$ branches (multi-way split)? For more clarification, if a split on $A$ partitions $D$ into $D_1$ and $D_2$, the…
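For reference, here is a minimal Python sketch of the weighted Gini impurity of a split, assuming the common textbook definition $Gini(D) = 1 - \sum_i p_i^2$ with each partition weighted by its share of the tuples. The function names `gini` and `split_gini` and the toy labels are illustrative, not taken from the question or from any library. The same formula applies whether a split yields two partitions or more.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(partitions):
    """Weighted Gini impurity of a split into any number of partitions."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions if p)

# Hypothetical toy data: the same six tuples split two ways.
binary = [["yes", "yes", "no"], ["no", "no", "no"]]
three_way = [["yes", "yes"], ["no"], ["no", "no", "no"]]
print(split_gini(binary), split_gini(three_way))
```

Whether a given tree learner actually produces multi-way splits depends on the algorithm; CART, for instance, restricts itself to binary splits.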
4
votes
1 answer

How can one prove that Mahalanobis distance is a metric?

How can one prove that Mahalanobis distance is a metric? How can one show that these four properties of a metric are valid for Mahalanobis distance? 1) d(x, y) ≥ 0 (non-negativity, or separation axiom) 2) d(x, y) = 0 if and only if x = y …
dod
  • 41
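One standard route, sketched here under the assumption that the covariance matrix $\Sigma$ is symmetric positive definite (the symbols $\Sigma$ and $L$ are introduced for illustration and do not come from the excerpt), is to reduce the Mahalanobis distance to a Euclidean norm via a Cholesky factorization:

$$d(x,y) = \sqrt{(x-y)^{\top}\Sigma^{-1}(x-y)}, \qquad \Sigma = LL^{\top} \;\Longrightarrow\; d(x,y) = \lVert L^{-1}(x-y)\rVert_2 .$$

Because $L^{-1}$ is an invertible linear map, non-negativity, $d(x,y)=0 \iff x=y$, symmetry, and the triangle inequality all carry over from the Euclidean norm.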
4
votes
1 answer

How to find almost periodic lattices in a set of high-dimensional points?

Sorry for the lame question, but I just have no maturity in this direction. Let's say I have a very large set (millions) of high-dimensional vectors (typical dimensionality is 64). These vectors typically come from an embedding of a social network into a…
3
votes
0 answers

Gini coefficient vs Gini impurity - decision trees

The problem refers to decision tree building. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree - these can support our choices when splitting the…
brunner
  • 31
3
votes
2 answers

Are A and B conditionally independent?

Are A and B conditionally independent given the class label? I calculated that $$P(A=1) = \frac{1}{2}$$ $$P(B=1) = \frac{2}{5}$$ $$P(A=1,B=1)=\frac{1}{5}$$ My answer is yes. I do it by anding $(A\text{ and }B)$ which shows that when $A = 1$ and $B…
Mike John
  • 174
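A quick way to check the product rule on the quoted numbers is exact rational arithmetic. The sketch below assumes the three probabilities are all taken under the same conditioning; the excerpt does not make clear whether they are per-class values. The helper name `is_independent` is illustrative.

```python
from fractions import Fraction

def is_independent(p_a, p_b, p_ab):
    """True when P(A, B) equals P(A) * P(B) exactly."""
    return p_a * p_b == p_ab

# Values quoted in the question excerpt.
p_a = Fraction(1, 2)    # P(A=1)
p_b = Fraction(2, 5)    # P(B=1)
p_ab = Fraction(1, 5)   # P(A=1, B=1)
print(is_independent(p_a, p_b, p_ab))
```

For conditional independence given the class label, the same check has to hold separately within each class value, using the class-conditional probabilities.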
3
votes
1 answer

Calculation of Intrinsic dimension of datasets

Currently I am following a machine learning course and we are looking at the intrinsic dimension of datasets. The professor gave a few examples of the intrinsic dimension of some objects (e.g., the lungs have an intrinsic dimension of 2.89). However…
Oliver
  • 171
3
votes
0 answers

Calculate PageRank for small web

Calculate PageRank for: A links to B, B links to C and C links to B and C where the damping factor $\beta=0.8$ I have: $M=\begin{bmatrix} 0&0&\frac{1}{2} \\ 1&0&\frac{1}{2} \\ 0&1&0 \end{bmatrix}$ and $v=\begin{bmatrix} \frac{1}{3} \\ \frac{1}{3}…
3
votes
0 answers

Number of rules (Data Mining)

This is a paraphrasing of a statement from Data Mining by Witten et al., 4th edition, in Section 1.6. Consider the set $S = \{1, \dots, 288\}$. A rule is defined as a one-element singleton set whose element is from $S$, e.g., $\{1\}$, $\{2\}$,…
Clarinetist
  • 20,278
  • 10
  • 72
  • 137
3
votes
0 answers

Do there exist closed-form solutions for calculations in the EM-algorithm for Gaussian Mixtures?

Our data can be represented as $X_{n\times d}$ matrix, where we have $n$ data points lying in $\mathbb{R}^d$. We assume that there are $k$ underlying Gaussian models, $\mathcal{N}_d(\mu_j, \Sigma_j)$, from which we could have drawn these points. In…