# w-shingling

In natural language processing a * w-shingling* is a set of

*unique*

*shingles*(therefore

*n-grams*) each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity between documents. The symbol

*w*denotes the quantity of tokens in each shingle selected, or solved for.

The document, "a rose is a rose is a rose" can therefore be maximally tokenized as follows:

- (a,rose,is,a,rose,is,a,rose)

The set of all contiguous *sequences of 4 tokens* (Thus 4=*n*, thus 4-*grams*) is

- { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } Which can then be reduced, or maximally shingled in this particular instance to { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }.

## Resemblance[edit]

For a given shingle size, the degree to which two documents *A* and *B* resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

## See also[edit]

- Concept mining (alternative method for document similarity calculation with more computational complexity, but where the measure more closely models a human's perception of document similarity)
- N-gram
- k-mer
- MinHash
- Rolling hash
- Rabin fingerprint
- Vector space model
- Bag-of-words model

## References[edit]

- (Manber 1993)
*Finding Similar Files in a Large File System*. Does not yet use the term "shingling". - (Broder, Glassman, Manasse, and Zweig 1997)
*Syntactic Clustering of the Web*. SRC Technical Note #1997-015.

## External links[edit]

- Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (7 July 2008). "w-shingling".
*Introduction to Information Retrieval*. Cambridge University Press. ISBN 978-1-139-47210-4.