The Shingle algorithm: a little inspiration for Shanghai Longfeng aggregated pages


For a document of length L, cutting a shingle at every position with a window of N Chinese characters yields L-N+1 shingles in total. With N=2, the title of document A is cut into L-N+1 = 21-2+1 = 20 shingles, and the title of document B into L-N+1 = 20-2+1 = 19 shingles.
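The L-N+1 count can be checked with a minimal sketch. The function name `shingles` and the stand-in title are assumptions for illustration; the original Chinese titles are not reproduced here.

```python
def shingles(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of text, in order.

    A text of length L yields L - n + 1 shingles (none if L < n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical 21-character stand-in for the title of document A.
title = "abcdefghijklmnopqrstu"
parts = shingles(title, 2)
print(len(parts))  # 20, i.e. 21 - 2 + 1
print(parts[:3])   # ['ab', 'bc', 'cd']
```

Python strings are indexed by character, so the same slicing works on Chinese text without change.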

Assume two documents, A and B. Document A's title (21 Chinese characters in the original) is: "tomorrow telephone book train tickets can take the tickets through the national time delay 12 hours"; document B's title (20 Chinese characters in the original) is: "telephone booking train tickets of a national online pre-sale period is extended from".




The Jaccard coefficient of the titles of A and B = 7 / (20+19-7) = 7/32 = 0.21875

The titles of documents A and B have 7 shingles in common (marked in bold in the original): telephone, word order, and train tickets, the country, the country pass, pass the.

Together, the titles of documents A and B contain 20+19-7 = 32 distinct shingles.
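The arithmetic can be reproduced from the three counts alone: 20 shingles in A, 19 in B, 7 in common. The sketch below assumes the shingles within each title are distinct, and the helper name `jaccard_from_counts` is hypothetical.

```python
from fractions import Fraction


def jaccard_from_counts(size_a: int, size_b: int, common: int) -> Fraction:
    """Jaccard coefficient from set sizes, using |A or B| = |A| + |B| - |A and B|."""
    union = size_a + size_b - common
    if union == 0:
        return Fraction(0)  # two empty sets: treated as 0 here by convention
    return Fraction(common, union)


j = jaccard_from_counts(20, 19, 7)
print(j, float(j))  # 7/32 0.21875
```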


Shingle [ˈʃɪŋɡl] in English refers to overlapping roof tiles. An example illustrates the Shingle algorithm:

The Shingle algorithm can be used as an estimate: if the Jaccard coefficient is below a certain threshold, the pages are not treated as duplicates. Split each document in the aggregated set into shingles and calculate the Jaccard coefficient for every pair of documents; if it is below the threshold, the page can be generated.
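A pairwise check over a small set of pages might look like the sketch below. The page names, sample texts, threshold of 0.5, and function names are all assumptions for illustration. Comparing every pair costs time quadratic in the number of pages, so this only suits small sets.

```python
from itertools import combinations


def shingle_set(text: str, n: int = 2) -> set[str]:
    """Distinct overlapping character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_pairs(docs: dict[str, str], threshold: float, n: int = 2):
    """Return (name_x, name_y, score) for every pair scoring at or above threshold."""
    sets = {name: shingle_set(text, n) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sets, 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


# Hypothetical pages: p1 and p2 differ by one character, p3 is unrelated.
pages = {"p1": "abcdef", "p2": "abcdeg", "p3": "uvwxyz"}
flagged = near_duplicate_pairs(pages, threshold=0.5)
print([(x, y) for x, y, _ in flagged])  # [('p1', 'p2')]
```

A page that appears in no flagged pair is below the threshold against every other page and, under the rule above, can be generated.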

The same approach extends from document titles to two full page documents, and then to N pages: the Jaccard coefficient serves as the standard for judging whether one page is similar to another.

The number of shingles common to the titles of A and B, divided by the total number of distinct shingles in the two titles, is the Jaccard coefficient of the two titles, and it can be used to judge how similar the titles of A and B are.

In the Shingle algorithm, the size of the intersection of two sets divided by the size of their union is the Jaccard coefficient. Checking whether the Jaccard coefficient exceeds a certain threshold determines whether the two sets are duplicates.
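Expressed with Python sets, the definition is only a few lines. This is a minimal sketch: the threshold of 0.5 is an arbitrary assumption, since the text does not say what value a search engine actually uses, and the names `jaccard` and `is_duplicate` are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: treated as 0 here by convention
    return len(a & b) / len(union)


def is_duplicate(a: set[str], b: set[str], threshold: float = 0.5) -> bool:
    """Treat two shingle sets as duplicates when the coefficient exceeds threshold."""
    return jaccard(a, b) > threshold


x = {"ab", "bc", "cd"}
y = {"bc", "cd", "de"}
print(jaccard(x, y))       # 0.5 (2 shared shingles out of 4 distinct)
print(is_duplicate(x, y))  # False, because 0.5 does not exceed 0.5
```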

How does a search engine know whether two document titles are duplicates? We can cut the titles into shingles of 2 Chinese characters each:


The Shingle algorithm is one of the basic algorithms search engines use to remove identical or near-identical pages. When building aggregated pages for Shanghai Longfeng, how do you avoid producing duplicate pages, and how should duplication be handled? The way the Shingle algorithm calculates similarity offers some inspiration.

