I saw the following claim in these lecture notes: http://www.cs.cmu.edu/~ninamf/ML11/lect0906.pdf
Intuitively, if n is large but most features are irrelevant (i.e., the target is sparse but the examples are dense), then Winnow is better, because adding irrelevant features increases L2(X) but not L∞(X). On the other hand, if the target is dense and the examples are sparse, then Perceptron is better.
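For context, the mistake bounds being compared are, as I understand the slides (roughly, suppressing the margin terms):

$$
\text{Perceptron: } O\big( (L_2(w^*)\, L_2(X))^2 \big), \qquad
\text{Winnow: } O\big( (L_1(w^*)\, L_\infty(X))^2 \log n \big),
$$

where $L_2(X) = \max_x \|x\|_2$ and $L_\infty(X) = \max_x \|x\|_\infty$ are taken over the examples.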
Why does adding irrelevant features increase L2(X) but not L∞(X)?
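To make the question concrete, here is a minimal numeric sketch of the claim, assuming boolean ({0,1}) features; the dimensions k and m below are made up:

```python
import numpy as np

# Dense boolean example with k = 100 relevant features, all active.
k = 100
x = np.ones(k)

print(np.linalg.norm(x, 2))       # L2 norm: sqrt(100) = 10.0
print(np.linalg.norm(x, np.inf))  # L-infinity norm: 1.0

# Append m = 900 irrelevant boolean features (also active, worst case).
m = 900
x_padded = np.concatenate([x, np.ones(m)])

print(np.linalg.norm(x_padded, 2))       # sqrt(1000) ~ 31.6 -- grows with n
print(np.linalg.norm(x_padded, np.inf))  # still 1.0 -- bounded features cap it
```

So with features bounded in [0, 1], L∞(X) can never exceed 1 no matter how many coordinates are added, while L2(X) grows like the square root of the number of active coordinates. Is this bounded-feature argument the intended reasoning, or is there more to it?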