Kneser–Ney smoothing

Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.^[1] It is widely considered the most effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability's lower order terms to omit n-grams with lower frequencies. This approach has been considered equally effective for both higher and lower order n-grams. The method is due to Reinhard Kneser and Hermann Ney [de].^[2]

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results;^[3] however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.

Method[edit]

Let $c(w,w')$ be the number of occurrences of the word $w$ followed by the word $w'$ in the corpus.

The equation for bigram probabilities is as follows:

$p_{KN}(w_{i}|w_{i-1})={\frac {\max(c(w_{i-1},w_{i})-\delta ,0)}{\sum _{w'}c(w_{i-1},w')}}+\lambda _{w_{i-1}}p_{KN}(w_{i})$ ^[4]

Where the unigram probability $p_{KN}(w_{i})$ depends on how likely it is to see the word $w_{i}$ in an unfamiliar context, which is estimated as the number of times it appears after any other word divided by the number of distinct pairs of consecutive words in the corpus:

$p_{KN}(w_{i})={\frac {|\{w':0<c(w',w_{i})\}|}{|\{(w',w''):0<c(w',w'')\}|}}$

Please note that $p_{KN}$ is a proper distribution, as the values defined in above way are non-negative and sum to one.

The parameter $\delta$ is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The value of the normalizing constant $\lambda _{w_{i-1}}$ is calculated to make the sum of conditional probabilities $p_{KN}(w_{i}|w_{i-1})$ over all $w_{i}$ equal to one. Observe that (provided $\delta <1$ ) for each $w_{i}$ which occurs at least once in the context of $w_{i-1}$ in the corpus we discount the probability by exactly the same constant amount ${\delta }/\left(\sum _{w'}c(w_{i-1},w')\right)$ , so the total discount depends linearly on the number of unique words $w_{i}$ that can occur after $w_{i-1}$ . This total discount is a budget we can spread over all $p_{KN}(w_{i}|w_{i-1})$ proportionally to $p_{KN}(w_{i})$ . As the values of $p_{KN}(w_{i})$ sum to one, we can simply define $\lambda _{w_{i-1}}$ to be equal to this total discount:

$\lambda _{w_{i-1}}={\frac {\delta }{\sum _{w'}c(w_{i-1},w')}}|\{w':0<c(w_{i-1},w')\}|$

This equation can be extended to n-grams. Let $w_{i-n+1}^{i-1}$ be the $n-1$ words before $w_{i}$ :

$p_{KN}(w_{i}|w_{i-n+1}^{i-1})={\frac {\max(c(w_{i-n+1}^{i-1},w_{i})-\delta ,0)}{\sum _{w'}c(w_{i-n+1}^{i-1},w')}}+\delta {\frac {|\{w':0<c(w_{i-n+1}^{i-1},w')\}|}{\sum _{w_{i}}c(w_{i-n+1}^{i})}}p_{KN}(w_{i}|w_{i-n+2}^{i-1})$ ^[5]

This model uses the concept of absolute-discounting interpolation which incorporates information from higher and lower order language models. The addition of the term for lower order n-grams adds more weight to the overall probability when the count for the higher order n-grams is zero.^[6] Similarly, the weight of the lower order model decreases when the count of the n-gram is non zero.

References[edit]

^ 'A Bayesian Interpretation of Interpolated Kneser-Ney NUS School of Computing Technical Report TRA2/06'
^ N -gram Counts and Language Models from the Common Crawl
^ 'Brown University: Introduction to Computational Linguistics '
^ 'Kneser Ney Smoothing Explained'
^ 'NLP Tutorial: Smoothing'
^ 'An empirical study of smoothing techniques for language modeling'

[1] 'A Bayesian Interpretation of Interpolated Kneser-Ney NUS School of Computing Technical Report TRA2/06'

[2] N -gram Counts and Language Models from the Common Crawl

[3] 'Brown University: Introduction to Computational Linguistics '

[4] 'Kneser Ney Smoothing Explained'

[5] 'NLP Tutorial: Smoothing'

[6] 'An empirical study of smoothing techniques for language modeling'

[1]

[2]

[3]

[4]

[5]

[6]

Kneser–Ney smoothing

Method[edit]

References[edit]

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Interaction

Tools

Print/export

Languages