Tuesday 28 February 2012

Working log

1. Verify the compression ratio of a web page

The compression ratio is mainly used to detect repeated words in a web page. Here the compression ratio of a page is defined as the size of the uncompressed page divided by the size of the compressed page. The higher the compression ratio, the more redundant the content, and the more likely the page is to be spam.
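The ratio defined above can be sketched in a few lines of Python. This is only an illustration: the choice of zlib as the compressor, the UTF-8 encoding, and the function name compression_ratio are assumptions, not something fixed by this log.

```python
import zlib


def compression_ratio(page_text: str) -> float:
    """Uncompressed size divided by compressed size, both in bytes."""
    raw = page_text.encode("utf-8")  # assumption: measure the page as UTF-8 bytes
    if not raw:
        return 0.0  # an empty page has no meaningful ratio
    compressed = zlib.compress(raw)  # assumption: zlib stands in for the compressor
    return len(raw) / len(compressed)


# Made-up sample texts, for illustration only.
repetitive = "cheap pills buy now " * 200
varied = "The quick brown fox jumps over the lazy dog near a quiet river."

print(compression_ratio(repetitive))
print(compression_ratio(varied))
```

A page that repeats the same phrase many times should score far higher than ordinary prose. Any threshold that separates spam from non-spam would have to be tuned on labelled pages.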

2. Verify the n-gram likelihoods

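Many spam pages generate their content by drawing words at random from a dictionary. The word sequences (n-grams) on such a page are unlikely in ordinary text, so a low n-gram likelihood can flag it.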
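One possible way to score a page for this check is the average log-probability of its n-grams under counts taken from a reference text. The sketch below makes several assumptions that do not come from this log: bigrams (n=2), whitespace tokenisation, add-one smoothing, and the hypothetical helper names ngrams, train_counts and avg_log_likelihood.

```python
import math
from collections import Counter


def ngrams(words, n):
    """All consecutive n-word tuples in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def train_counts(reference_text, n=2):
    """Count n-grams in a reference text assumed to be normal prose."""
    return Counter(ngrams(reference_text.lower().split(), n))


def avg_log_likelihood(page_text, counts, n=2):
    """Average log-probability of the page's n-grams, with add-one smoothing."""
    grams = ngrams(page_text.lower().split(), n)
    if not grams:
        raise ValueError("page has fewer than n words")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen n-grams
    log_probs = [math.log((counts[g] + 1) / (total + vocab)) for g in grams]
    return sum(log_probs) / len(log_probs)


# Made-up toy data, for illustration only.
reference = "the cat sat on the mat and the dog sat on the rug"
counts = train_counts(reference)
natural = "the cat sat on the rug"
shuffled = "rug cat the and mat on"

print(avg_log_likelihood(natural, counts))
print(avg_log_likelihood(shuffled, counts))
```

The shuffled text contains only bigrams that never occur in the reference, so it scores lower than the natural sentence. A real check would need a much larger reference corpus.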

3. Anchor fraction
Some search engines treat anchor text as keywords for the page the link points to, so some spam pages are created mainly to supply keywords for other sites.
The anchor fraction (the share of a page's text that is anchor text) can therefore indicate the spamicity of a site.
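As a rough sketch, the anchor fraction can be computed as the number of words inside anchor tags divided by the number of words on the whole page. This word-based definition, the class name AnchorCounter and the function name anchor_fraction are assumptions made for illustration. The sketch also counts text inside script and style elements as page text, which a real implementation would probably exclude.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts words inside <a> elements and words on the page overall."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many <a> elements are currently open
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.split())   # whitespace-separated words in this text node
        self.total_words += n
        if self.depth > 0:
            self.anchor_words += n


def anchor_fraction(html: str) -> float:
    """Fraction of the page's words that appear inside anchor text."""
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0  # a page with no text has no anchor text either
    return parser.anchor_words / parser.total_words


# Made-up sample page: 2 anchor words out of 6 words in total.
sample = '<p>read more at <a href="x">cheap pills</a> today</p>'
print(anchor_fraction(sample))
```

A page that consists almost entirely of links would score close to 1.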