My apologies for cross-posting to Stack Overflow and Cross Validated; I wasn't sure which is the more relevant place.
Please shed some light on this task for me.
Description
Assuming the training data looks like
0 | X773579,Y2640,Y2072,Z4,Z15
1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z15,Z26
...
One would like to predict 0/1 from those comma-separated values, without further explanation of what exactly the numbered X, Y, and Z values mean. A few rare rows even have an un-numbered X or Y (possibly just noise). The number of comma-separated values per row varies, but no row contains repeated values. Also, the lengths of the numbers seem correlated with the X, Y, and Z prefixes.
The data size is 1 million rows.
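For concreteness, here is a minimal parsing sketch in Python (my choice of language; the task doesn't fix one, and the helper names are mine):

```python
import re

def parse_row(line):
    """Split a raw row like '0 | X773579,Y2640,Z4' into (label, tokens)."""
    label_part, value_part = line.split("|")
    return int(label_part.strip()), [t.strip() for t in value_part.split(",")]

def split_token(token):
    """Split 'X773579' into prefix 'X' and number 773579; rare un-numbered
    tokens (possibly noise) get number=None."""
    m = re.fullmatch(r"([A-Z])(\d*)", token)
    if m is None:
        return token, None  # unexpected token, keep as-is
    prefix, digits = m.groups()
    return prefix, int(digits) if digits else None

label, tokens = parse_row("0 | X773579,Y2640,Y2072,Z4,Z15")
print(label, [split_token(t) for t in tokens])
# 0 [('X', 773579), ('Y', 2640), ('Y', 2072), ('Z', 4), ('Z', 15)]
```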
Features
In LibSVM format
Label | Value_ID:1...
Example
0 1:1 2:1 3:1 4:1 5:1
1 1:1 4:1 5:1 14:1 36:1 37:1 38:1 39:1
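The indexing itself is just "first appearance gets the next ID"; a sketch of how I build it (note that the IDs it assigns differ from the example above, which presumably came from a vocabulary built over the full data set):

```python
def to_libsvm(raw_rows):
    """Assign each distinct value a 1-based feature index in order of first
    appearance, then emit 'label idx:1 ...' lines with indices sorted,
    as the LibSVM format requires."""
    index = {}
    out = []
    for line in raw_rows:
        label_part, value_part = line.split("|")
        tokens = dict.fromkeys(t.strip() for t in value_part.split(","))
        ids = sorted(index.setdefault(t, len(index) + 1) for t in tokens)
        out.append(label_part.strip() + " " + " ".join(f"{i}:1" for i in ids))
    return out

print("\n".join(to_libsvm([
    "0 | X773579,Y2640,Y2072,Z4,Z15",
    "1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z15,Z26",
])))
# 0 1:1 2:1 3:1 4:1 5:1
# 1 1:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1
```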
I've tried treating those values categorically and applying several different Spark MLlib models, namely linear regression, logistic regression, decision trees, random forests, gradient-boosted trees, and SVM, along with LibSVM and conditional random fields. Their 5-fold cross-validations reached only 75%~79% accuracy.
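For reference, the shape of one such 5-fold run, sketched with scikit-learn rather than Spark MLlib for brevity (the file name features.libsvm is a placeholder):

```python
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load the one-hot features written in LibSVM format (placeholder path).
X, y = load_svmlight_file("features.libsvm")

# 5-fold cross-validated accuracy for a plain logistic regression baseline;
# the models I listed all landed around 0.75~0.79 on this metric.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```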
I really want to know whether there is a formal way to discover the signal in the data, build models, and extract and even select features accordingly. Admittedly, I haven't done ANOVA yet.
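On the "formal way" point, one screen I could run first is a per-feature independence test; scikit-learn's chi2 (or f_classif, which is the one-way ANOVA F-test) works directly on the sparse one-hot matrix. A sketch, assuming the same placeholder file:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import chi2

X, y = load_svmlight_file("features.libsvm")  # placeholder path

# Chi-squared test of each one-hot feature against the 0/1 label;
# large scores / small p-values flag features worth keeping.
scores, pvals = chi2(X, y)
top = np.argsort(scores)[::-1][:20]
for i in top:
    # i + 1 maps the column back to the 1-based LibSVM feature ID.
    print(f"feature {i + 1}: chi2={scores[i]:.1f}, p={pvals[i]:.3g}")
```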
Thank you!
Background (self-disclosure)
This is a question from a private interview, and the employer didn't share any details (other than a "no offer" response :p). Although I asked twice, specifically saying I would like to learn from it, I have received no reply so far.
Books studied/studying
- Pattern classification
- Pattern recognition
- Foundations of Statistical Natural Language Processing
- Speech and Language Processing
What I may try next
Based on a discussion with @NeilSlater:
I have done some quick frequent-pattern analysis and found certain n-grams. The problem is that they seem associated with both labels 0 and 1. I will run some bigram and trigram feature templates and then report back. One thing that also intrigues me: in NLP, labels sometimes have dependencies themselves, i.e. a 1 may be followed by certain patterns of 0s and vice versa.
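To make the "associated with both labels" observation concrete, the quick check I ran was essentially counting adjacent pairs per label, along these lines:

```python
from collections import Counter

def bigram_counts_by_label(raw_rows):
    """Count adjacent token pairs separately for label 0 and label 1,
    so heavily one-sided bigrams stand out."""
    counts = {0: Counter(), 1: Counter()}
    for line in raw_rows:
        label_part, value_part = line.split("|")
        tokens = [t.strip() for t in value_part.split(",")]
        counts[int(label_part.strip())].update(zip(tokens, tokens[1:]))
    return counts

counts = bigram_counts_by_label([
    "0 | X773579,Y2640,Y2072,Z4,Z15",
    "1 | X374616379,X773579,X344420902,Y1940,Y1705,Z4,Z15,Z26",
])
# e.g. ('Z4', 'Z15') occurs under both labels, matching what I saw.
print(counts[0][("Z4", "Z15")], counts[1][("Z4", "Z15")])  # 1 1
```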
Updates
In terms of linear/logistic regression and conditional random fields:
- n-grams are slightly harmful: I tried adjacent bigrams/trigrams and pair/triple combinations, and all performed about 1% worse
- occurrence counts of X/Y/Z plus the mean/max/min/sum of their numbers reach only 69% (see the sketch at the end of this section)
- X, Y, or Z alone as a categorical feature gets 70%~75%; Y is the most useful category
All other models listed in the original description performed far worse.
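For completeness, the aggregate features from the second bullet above were computed per row roughly like this (a sketch; I dropped "avg" since it duplicates "mean"):

```python
import re

def aggregate_features(tokens):
    """Per-row count and mean/max/min/sum of the numbers attached to each
    of the X, Y, Z prefixes; un-numbered tokens are skipped."""
    feats = {}
    for prefix in "XYZ":
        nums = []
        for t in tokens:
            m = re.fullmatch(prefix + r"(\d+)", t)
            if m:
                nums.append(int(m.group(1)))
        feats[prefix + "_count"] = len(nums)
        if nums:
            feats[prefix + "_mean"] = sum(nums) / len(nums)
            feats[prefix + "_max"] = max(nums)
            feats[prefix + "_min"] = min(nums)
            feats[prefix + "_sum"] = sum(nums)
    return feats

print(aggregate_features(["X773579", "Y2640", "Y2072", "Z4", "Z15"]))
# {'X_count': 1, 'X_mean': 773579.0, ..., 'Z_min': 4, 'Z_sum': 19}
```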