My apologies for cross-posting to Stack Overflow and Cross Validated; I wasn't sure which is the more relevant place.
Please shed some light on this task for me.
Description
Assuming the training data looks like
0 | X773579,Y2640,Y2072,Z4,Z15
1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z15,Z26
...
One would like to predict 0/1 from those comma-separated values, without further explanation of what exactly the numbered X, Y, and Z values mean. A few rare rows even have an un-numbered X or Y (possibly just noise). The number of comma-separated values per row varies, but no row contains repeated values. Also, the lengths of the numbers seem correlated with the X, Y, and Z prefixes.
The data size is 1 million rows.
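For concreteness, here is a minimal parsing sketch in Python (my choice of language; the task doesn't fix one, and the helper names are mine):

```python
import re

def parse_row(line):
    """Split a raw row like '0 | X773579,Y2640,Z4' into (label, tokens)."""
    label_part, value_part = line.split("|")
    return int(label_part.strip()), [t.strip() for t in value_part.split(",")]

def split_token(token):
    """Split 'X773579' into prefix 'X' and number 773579; rare un-numbered
    tokens (possibly noise) get number=None."""
    m = re.fullmatch(r"([A-Z])(\d*)", token)
    if m is None:
        return token, None  # unexpected token, keep as-is
    prefix, digits = m.groups()
    return prefix, int(digits) if digits else None

label, tokens = parse_row("0 | X773579,Y2640,Y2072,Z4,Z15")
print(label, [split_token(t) for t in tokens])
# 0 [('X', 773579), ('Y', 2640), ('Y', 2072), ('Z', 4), ('Z', 15)]
```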
Features
In LibSVM format
Label | Value_ID:1...
Example
0 1:1 2:1 3:1 4:1 5:1
1 1:1 4:1 5:1 14:1 36:1 37:1 38:1 39:1
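The indexing itself is just "first appearance gets the next ID"; a sketch of how I build it (note that the IDs it assigns differ from the example above, which presumably came from a vocabulary built over the full data set):

```python
def to_libsvm(raw_rows):
    """Assign each distinct value a 1-based feature index in order of first
    appearance, then emit 'label idx:1 ...' lines with indices sorted,
    as the LibSVM format requires."""
    index = {}
    out = []
    for line in raw_rows:
        label_part, value_part = line.split("|")
        tokens = dict.fromkeys(t.strip() for t in value_part.split(","))
        ids = sorted(index.setdefault(t, len(index) + 1) for t in tokens)
        out.append(label_part.strip() + " " + " ".join(f"{i}:1" for i in ids))
    return out

print("\n".join(to_libsvm([
    "0 | X773579,Y2640,Y2072,Z4,Z15",
    "1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z15,Z26",
])))
# 0 1:1 2:1 3:1 4:1 5:1
# 1 1:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1
```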
I've tried treating those values categorically and applying several different Spark MLlib models, namely linear regression, logistic regression, decision trees, random forests, gradient-boosted trees, and SVM, along with LibSVM and conditional random fields. Their 5-fold cross-validations reached only 75%~79% accuracy.
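For reference, the shape of one such 5-fold run, sketched with scikit-learn rather than Spark MLlib for brevity (the file name features.libsvm is a placeholder):

```python
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load the one-hot features written in LibSVM format (placeholder path).
X, y = load_svmlight_file("features.libsvm")

# 5-fold cross-validated accuracy for a plain logistic regression baseline;
# the models I listed all landed around 0.75~0.79 on this metric.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```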
I really want to know whether there is a formal way to discover the signal in the data, build models, and extract and even select features accordingly. Admittedly, I haven't done ANOVA yet.
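On the "formal way" point, one screen I could run first is a per-feature independence test; scikit-learn's chi2 (or f_classif, which is the one-way ANOVA F-test) works directly on the sparse one-hot matrix. A sketch, assuming the same placeholder file:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import chi2

X, y = load_svmlight_file("features.libsvm")  # placeholder path

# Chi-squared test of each one-hot feature against the 0/1 label;
# large scores / small p-values flag features worth keeping.
scores, pvals = chi2(X, y)
top = np.argsort(scores)[::-1][:20]
for i in top:
    # i + 1 maps the column back to the 1-based LibSVM feature ID.
    print(f"feature {i + 1}: chi2={scores[i]:.1f}, p={pvals[i]:.3g}")
```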
Thank you!
Background (self-disclosure)
This is a question from a private interview, and the employer didn't share any details (other than a "no offer" response :p). Although I asked twice, specifically saying I would like to learn from it, I have received no reply so far.
Books studied/studying
- Pattern classification
- Pattern recognition
- Foundations of Statistical Natural Language Processing
- Speech and Language Processing
What I may try next
Based on a discussion with @NeilSlater:
I have done some quick frequent-pattern analysis and found certain n-grams. The problem is that they seem associated with both labels 0 and 1. I will run some bigram and trigram feature templates and then report back. One thing that also intrigues me: in NLP, labels sometimes have dependencies themselves, i.e. a 1 may be followed by certain patterns of 0s and vice versa.
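To make the "associated with both labels" observation concrete, the quick check I ran was essentially counting adjacent pairs per label, along these lines:

```python
from collections import Counter

def bigram_counts_by_label(raw_rows):
    """Count adjacent token pairs separately for label 0 and label 1,
    so heavily one-sided bigrams stand out."""
    counts = {0: Counter(), 1: Counter()}
    for line in raw_rows:
        label_part, value_part = line.split("|")
        tokens = [t.strip() for t in value_part.split(",")]
        counts[int(label_part.strip())].update(zip(tokens, tokens[1:]))
    return counts

counts = bigram_counts_by_label([
    "0 | X773579,Y2640,Y2072,Z4,Z15",
    "1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z15,Z26",
])
# e.g. ('Z4', 'Z15') occurs under both labels, matching what I saw.
print(counts[0][("Z4", "Z15")], counts[1][("Z4", "Z15")])  # 1 1
```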
Updates
In terms of linear/logistic regression and conditional random fields:
- n-grams are slightly harmful: I tried adjacent bigrams/trigrams and pair/triple combinations, and all performed about 1% worse
- occurrence counts of X/Y/Z plus the mean/max/min/sum of their numbers reach only 69% (see the sketch at the end of this section)
- X, Y, or Z alone as a categorical feature gets 70%~75%; Y is the most useful category
All other models listed in the original description performed far worse.
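For completeness, the aggregate features from the second bullet above were computed per row roughly like this (a sketch; I dropped "avg" since it duplicates "mean"):

```python
import re

def aggregate_features(tokens):
    """Per-row count and mean/max/min/sum of the numbers attached to each
    of the X, Y, Z prefixes; un-numbered tokens are skipped."""
    feats = {}
    for prefix in "XYZ":
        nums = []
        for t in tokens:
            m = re.fullmatch(prefix + r"(\d+)", t)
            if m:
                nums.append(int(m.group(1)))
        feats[prefix + "_count"] = len(nums)
        if nums:
            feats[prefix + "_mean"] = sum(nums) / len(nums)
            feats[prefix + "_max"] = max(nums)
            feats[prefix + "_min"] = min(nums)
            feats[prefix + "_sum"] = sum(nums)
    return feats

print(aggregate_features(["X773579", "Y2640", "Y2072", "Z4", "Z15"]))
# {'X_count': 1, 'X_mean': 773579.0, ..., 'Z_min': 4, 'Z_sum': 19}
```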