Are there online resources with an ASCII character frequency table for HTML?

There are plenty of them for English, but I can't seem to find one for page source. I know this is likely to vary from site to site, but surely there are rule-of-thumb, good-enough tables where < ranks higher than z?

cygnusv
Craig Gidney

3 Answers

Here's a simple PHP script that calculates byte frequencies from 10 random Wikipedia articles:

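<?php

// Tally how often each byte value (0-255) occurs across 10 random articles.
$frequencies = array_fill(0, 256, 0);
$total = 0;
for ($i = 0; $i < 10; $i++) {
    $src = file_get_contents('https://en.wikipedia.org/wiki/Special:Random');
    if ($src === false) {
        continue; // skip pages that failed to download
    }
    foreach (str_split($src) as $char) {
        $frequencies[ord($char)]++;
        $total++;
    }
}
if ($total === 0) {
    exit("No data fetched\n"); // avoid dividing by zero below
}

// Print each byte value with its share of all bytes (percent), 8 per row.
header('Content-Type: text/plain');
for ($i = 0; $i < 256; $i++) {
    printf('%3d:%7.4f%s', $i, $frequencies[$i] * 100 / $total, (($i & 7) == 7) ? "\n" : '  ');
}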

You could easily extend this to fetch more pages from a wider range of sources.
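
One way that extension might look is sketched below. It is only an illustration: the function names `addByteCounts` and `tallyUrls` and the URL list are assumptions, not part of the script above. The counting is kept separate from the fetching so it can be tried on in-memory strings.

```php
<?php

// Add the byte counts of one page source to a running 256-entry tally.
function addByteCounts(array $tally, string $src): array
{
    // count_chars() in mode 0 returns a count for every byte value 0..255.
    foreach (count_chars($src, 0) as $byte => $count) {
        $tally[$byte] += $count;
    }
    return $tally;
}

// Fetch each URL and tally its bytes; pages that fail to load are skipped.
function tallyUrls(array $urls): array
{
    $tally = array_fill(0, 256, 0);
    foreach ($urls as $url) {
        $src = @file_get_contents($url);
        if ($src === false) {
            continue;
        }
        $tally = addByteCounts($tally, $src);
    }
    return $tally;
}

// Hypothetical source list; replace it with pages that resemble your traffic.
$urls = [
    'https://en.wikipedia.org/wiki/Special:Random',
    'https://example.com/',
];
// $tally = tallyUrls($urls);   // uncomment to actually fetch the pages
```

Dividing each entry of the tally by `array_sum($tally)` gives the same kind of percentages that the original script prints.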

Here are the results I got from the 10 Wikipedia articles (byte value: percentage of all bytes):

  0: 0.0000    1: 0.0000    2: 0.0000    3: 0.0000    4: 0.0000    5: 0.0000    6: 0.0000    7: 0.0000
  8: 0.0000    9: 1.9908   10: 1.0258   11: 0.0000   12: 0.0000   13: 0.0000   14: 0.0000   15: 0.0000
 16: 0.0000   17: 0.0000   18: 0.0000   19: 0.0000   20: 0.0000   21: 0.0000   22: 0.0000   23: 0.0000
 24: 0.0000   25: 0.0000   26: 0.0000   27: 0.0000   28: 0.0000   29: 0.0000   30: 0.0000   31: 0.0000
 32: 4.8680   33: 0.0277   34: 4.5121   35: 0.0942   36: 0.0042   37: 0.2563   38: 0.2139   39: 0.0867
 40: 0.1498   41: 0.1498   42: 0.0275   43: 0.1042   44: 0.4450   45: 1.1506   46: 1.1177   47: 3.1013
 48: 0.6391   49: 0.4991   50: 0.5280   51: 0.3287   52: 0.2134   53: 0.2927   54: 0.2862   55: 0.1731
 56: 0.2161   57: 0.2103   58: 0.6376   59: 0.3994   60: 2.8618   61: 2.1683   62: 2.8618   63: 0.0682
 64: 0.0000   65: 0.3892   66: 0.1531   67: 0.3726   68: 0.1941   69: 0.1862   70: 0.2055   71: 0.0938
 72: 0.0913   73: 0.1354   74: 0.0572   75: 0.1103   76: 0.1240   77: 0.1656   78: 0.1242   79: 0.0963
 80: 0.2883   81: 0.0468   82: 0.1858   83: 0.3152   84: 0.2528   85: 0.0832   86: 0.0666   87: 0.1076
 88: 0.0060   89: 0.0354   90: 0.0237   91: 0.0807   92: 0.0522   93: 0.0807   94: 0.0079   95: 0.8702
 96: 0.0000   97: 6.3074   98: 0.9143   99: 2.2221  100: 2.8104  101: 6.4384  102: 1.2492  103: 1.6232
104: 1.9022  105: 6.5891  106: 0.0815  107: 1.2405  108: 3.9219  109: 1.5149  110: 3.4597  111: 3.1206
112: 2.1826  113: 0.0285  114: 3.6109  115: 3.5510  116: 5.1196  117: 1.3010  118: 0.7861  119: 1.3322
120: 0.5001  121: 0.7843  122: 0.1313  123: 0.0483  124: 0.0000  125: 0.0483  126: 0.0000  127: 0.0000
128: 0.0431  129: 0.0044  130: 0.0079  131: 0.0152  132: 0.0017  133: 0.0029  134: 0.0002  135: 0.0033
136: 0.0035  137: 0.0015  138: 0.0023  139: 0.0017  140: 0.0031  141: 0.0012  142: 0.0006  143: 0.0006
144: 0.0033  145: 0.0012  146: 0.0017  147: 0.0285  148: 0.0012  149: 0.0021  150: 0.0029  151: 0.0037
152: 0.0017  153: 0.0046  154: 0.0006  155: 0.0012  156: 0.0023  157: 0.0004  158: 0.0023  159: 0.0173
160: 0.0025  161: 0.0017  162: 0.0027  163: 0.0021  164: 0.0069  165: 0.0094  166: 0.0021  167: 0.0035
168: 0.0037  169: 0.0025  170: 0.0035  171: 0.0025  172: 0.0025  173: 0.0037  174: 0.0027  175: 0.0004
176: 0.0123  177: 0.0050  178: 0.0069  179: 0.0042  180: 0.0046  181: 0.0031  182: 0.0021  183: 0.0025
184: 0.0056  185: 0.0044  186: 0.0048  187: 0.0046  188: 0.0025  189: 0.0021  190: 0.0027  191: 0.0037
192: 0.0000  193: 0.0000  194: 0.0092  195: 0.0094  196: 0.0012  197: 0.0006  198: 0.0000  199: 0.0000
200: 0.0000  201: 0.0002  202: 0.0002  203: 0.0000  204: 0.0000  205: 0.0000  206: 0.0054  207: 0.0002
208: 0.0202  209: 0.0104  210: 0.0004  211: 0.0000  212: 0.0004  213: 0.0031  214: 0.0004  215: 0.0000
216: 0.0089  217: 0.0073  218: 0.0002  219: 0.0025  220: 0.0000  221: 0.0000  222: 0.0000  223: 0.0000
224: 0.0125  225: 0.0023  226: 0.0433  227: 0.0015  228: 0.0019  229: 0.0037  230: 0.0031  231: 0.0017
232: 0.0017  233: 0.0010  234: 0.0008  235: 0.0012  236: 0.0017  237: 0.0008  238: 0.0000  239: 0.0029
240: 0.0171  241: 0.0000  242: 0.0000  243: 0.0000  244: 0.0000  245: 0.0000  246: 0.0000  247: 0.0000
248: 0.0000  249: 0.0000  250: 0.0000  251: 0.0000  252: 0.0000  253: 0.0000  254: 0.0000  255: 0.0000

As you would expect, the frequencies for < (60) and > (62) are identical.
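
To turn a table like this into the ranking the question asks about, the entries can be sorted by frequency. The snippet below is a minimal sketch: `rankBytes` is a hypothetical helper, and `$sample` holds only five percentages copied from the table above rather than the full set.

```php
<?php

// Return byte values ordered from most to least frequent.
function rankBytes(array $frequencies): array
{
    arsort($frequencies);            // sort by value, descending, keeping keys
    return array_keys($frequencies);
}

// A few percentages copied from the table above (byte value => percent).
$sample = [
    32  => 4.8680,  // space
    60  => 2.8618,  // <
    101 => 6.4384,  // e
    105 => 6.5891,  // i
    122 => 0.1313,  // z
];

foreach (rankBytes($sample) as $rank => $byte) {
    printf("%d. '%s' (%d): %.4f%%\n", $rank + 1, chr($byte), $byte, $sample[$byte]);
}
```

With these five values, < (60) lands well above z (122), which matches the intuition in the question.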

r3mainer

I don't know of any reference for this information, but it's something you can easily estimate with a very simple program.

Just for illustration, I made a very quick analysis of the HTML code of this very page using a frequency analysis tool (http://www.dcode.fr/frequency-analysis). These are the results (only the most frequent characters are shown):

[Image: table of the most frequent characters; not preserved]

Of course, this is just a simplified example of analysis, and for real-world purposes you would need a much bigger sample, but the process itself is very simple.

cygnusv

Another angle that can add accuracy to the suggestions above is to factor in how often a page (or domain) is visited when you calculate the frequencies. For example, a tag that google.com uses should get a high frequency value from the perspective of a proxy or firewall serving average users, even if that tag never shows up anywhere else on the web.
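
A rough sketch of that weighting, under stated assumptions: each page's byte counts are multiplied by a visit count before everything is normalized, so the result approximates the bytes a proxy would actually see. The function name `visitWeightedFrequencies`, the `src`/`visits` keys, and the sample pages and visit numbers are all made up for illustration.

```php
<?php

// Combine byte counts from several pages, weighting each page by its visits.
// Returns 256 fractions that sum to 1 (or all zeros if there is no data).
function visitWeightedFrequencies(array $pages): array
{
    $weighted = array_fill(0, 256, 0.0);
    foreach ($pages as $page) {
        // count_chars() in mode 1 lists only the bytes that occur in the string.
        foreach (count_chars($page['src'], 1) as $byte => $count) {
            $weighted[$byte] += $count * $page['visits'];
        }
    }
    $total = array_sum($weighted);
    if ($total > 0) {
        foreach ($weighted as $byte => $w) {
            $weighted[$byte] = $w / $total;
        }
    }
    return $weighted;
}

// Hypothetical data: a popular markup-heavy page and a rarely visited one.
$pages = [
    ['src' => '<b>hi</b>', 'visits' => 9],
    ['src' => 'zzzz',      'visits' => 1],
];
$freq = visitWeightedFrequencies($pages);
printf("<: %.4f  z: %.4f\n", $freq[60], $freq[122]);
```

Weighting raw counts means long pages also count for more. If every page should contribute equally per visit, normalize each page's counts before applying the visit weight.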

camgas