# Pointwise mutual information

Pointwise mutual information (PMI),[1] or point mutual information, is a measure of association used in information theory and statistics. It refers to single events, whereas mutual information (MI), which builds upon PMI, refers to the average over all possible events.

## Definition

The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence under their joint distribution and the probability of their coincidence under an independence assumption, i.e. as given by their individual distributions alone. Mathematically:

${\displaystyle \operatorname {pmi} (x;y)\equiv \log {\frac {p(x,y)}{p(x)p(y)}}=\log {\frac {p(x|y)}{p(x)}}=\log {\frac {p(y|x)}{p(y)}}.}$

The mutual information (MI) of the random variables X and Y is the expected value of the PMI over all possible outcomes (with respect to the joint distribution ${\displaystyle p(x,y)}$).

The measure is symmetric (${\displaystyle \operatorname {pmi} (x;y)=\operatorname {pmi} (y;x)}$). It can take positive or negative values, and is zero if X and Y are independent. Note that even though PMI for a single pair of events may be negative or positive, its expected value over all joint events (the mutual information) is non-negative. PMI is maximized when X and Y are perfectly associated (i.e. ${\displaystyle p(x|y)=1}$ or ${\displaystyle p(y|x)=1}$), yielding the following bounds:

${\displaystyle -\infty \leq \operatorname {pmi} (x;y)\leq \min \left[-\log p(x),-\log p(y)\right].}$

Finally, ${\displaystyle \operatorname {pmi} (x;y)}$ will increase if ${\displaystyle p(x|y)}$ is fixed but ${\displaystyle p(x)}$ decreases.
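This monotonicity can be checked with a quick numeric sketch; the fixed conditional probability below is an illustrative value, not taken from the article:

```python
import math

# With p(x|y) held fixed, pmi(x;y) = log2( p(x|y) / p(x) ) grows as p(x) shrinks
p_x_given_y = 0.5  # illustrative fixed conditional probability
for p_x in (0.5, 0.25, 0.1):
    print(f"p(x) = {p_x}: pmi = {math.log2(p_x_given_y / p_x):.4f} bits")
```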

Here is an example to illustrate:

| x | y | p(x, y) |
|---|---|---------|
| 0 | 0 | 0.1 |
| 0 | 1 | 0.7 |
| 1 | 0 | 0.15 |
| 1 | 1 | 0.05 |

Using this table we can marginalize to get the following additional table for the individual distributions:

|   | p(x) | p(y) |
|---|------|------|
| 0 | 0.8 | 0.25 |
| 1 | 0.2 | 0.75 |

With this example, we can compute four values for ${\displaystyle \operatorname {pmi} (x;y)}$. Using base-2 logarithms:

pmi(x=0; y=0) = −1
pmi(x=0; y=1) = 0.222392
pmi(x=1; y=0) = 1.584963
pmi(x=1; y=1) = −1.584963

(For reference, the mutual information ${\displaystyle \operatorname {I} (X;Y)}$ would then be 0.2141709)
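The worked example above can be reproduced with a short script. This is a sketch using only the standard library; the distribution is the one from the example table:

```python
import math

# Joint distribution p(x, y) from the example table
p_xy = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}

# Marginal distributions obtained by summing out the other variable
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

def pmi(x, y):
    """Pointwise mutual information in bits (base-2 logarithm)."""
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

for x in (0, 1):
    for y in (0, 1):
        print(f"pmi(x={x};y={y}) = {pmi(x, y):.6f}")

# Mutual information: expected value of PMI under the joint distribution
mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())
print(f"I(X;Y) = {mi:.7f}")
```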

## Similarities to mutual information

Pointwise Mutual Information has many of the same relationships as the mutual information. In particular,

${\displaystyle {\begin{aligned}\operatorname {pmi} (x;y)&=h(x)+h(y)-h(x,y)\\&=h(x)-h(x|y)\\&=h(y)-h(y|x)\end{aligned}}}$

where ${\displaystyle h(x)}$ is the self-information, i.e. ${\displaystyle -\log _{2}p(X=x)}$.
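These identities can be checked numerically; the sketch below reuses the earlier example distribution purely for illustration:

```python
import math

# Example distribution from the earlier table
p_xy = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}
p_x = {0: 0.8, 1: 0.2}
p_y = {0: 0.25, 1: 0.75}

def h(p):
    """Self-information in bits of an outcome with probability p."""
    return -math.log2(p)

for (x, y), p in p_xy.items():
    pmi = math.log2(p / (p_x[x] * p_y[y]))
    # Identity: pmi(x;y) = h(x) + h(y) - h(x,y)
    assert abs(pmi - (h(p_x[x]) + h(p_y[y]) - h(p))) < 1e-12
```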

## Normalized pointwise mutual information (npmi)

Pointwise mutual information can be normalized to the interval [−1, +1], giving −1 (in the limit) for events that never occur together, 0 for independence, and +1 for complete co-occurrence.[2]

${\displaystyle \operatorname {npmi} (x;y)={\frac {\operatorname {pmi} (x;y)}{h(x,y)}}}$
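A minimal sketch of npmi; natural logarithms are used since the base cancels in the ratio, and the probability values passed in below are made up for illustration:

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized PMI: pmi(x;y) / h(x,y), which lies in [-1, +1]."""
    h_xy = -math.log(p_xy)  # joint self-information h(x,y)
    return math.log(p_xy / (p_x * p_y)) / h_xy

# Independence (p(x,y) = p(x) p(y)) gives 0, up to floating-point error
print(npmi(0.06, 0.2, 0.3))
# Complete co-occurrence (p(x,y) = p(x) = p(y)) gives +1
print(npmi(0.2, 0.2, 0.2))
```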

## PMI variants

In addition to the above-mentioned npmi, PMI has many other interesting variants. A comparative study of these variants can be found in [3].

## Chain-rule for pmi

Like mutual information,[4] pointwise mutual information follows the chain rule, that is,

${\displaystyle \operatorname {pmi} (x;yz)=\operatorname {pmi} (x;y)+\operatorname {pmi} (x;z|y)}$

This is easily proven by:

${\displaystyle {\begin{aligned}\operatorname {pmi} (x;y)+\operatorname {pmi} (x;z|y)&{}=\log {\frac {p(x,y)}{p(x)p(y)}}+\log {\frac {p(x,z|y)}{p(x|y)p(z|y)}}\\&{}=\log \left[{\frac {p(x,y)}{p(x)p(y)}}{\frac {p(x,z|y)}{p(x|y)p(z|y)}}\right]\\&{}=\log {\frac {p(x|y)p(y)p(x,z|y)}{p(x)p(y)p(x|y)p(z|y)}}\\&{}=\log {\frac {p(x,yz)}{p(x)p(yz)}}\\&{}=\operatorname {pmi} (x;yz)\end{aligned}}}$
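The chain rule can also be verified numerically. The sketch below draws an arbitrary random joint distribution over three binary variables (an assumption purely for testing) and checks the identity at every outcome:

```python
import math
import random

random.seed(0)
# Random joint distribution over binary X, Y, Z, normalized to sum to 1
keys = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
weights = [random.random() for _ in keys]
p = {k: w / sum(weights) for k, w in zip(keys, weights)}

def marg(idx):
    """Marginal distribution over the coordinates listed in idx."""
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

p_x, p_y = marg([0]), marg([1])
p_xy, p_yz = marg([0, 1]), marg([1, 2])

for (x, y, z), p_xyz in p.items():
    lhs = math.log(p_xyz / (p_x[(x,)] * p_yz[(y, z)]))         # pmi(x;yz)
    pmi_xy = math.log(p_xy[(x, y)] / (p_x[(x,)] * p_y[(y,)]))  # pmi(x;y)
    # pmi(x;z|y) = log[ p(x,y,z) p(y) / ( p(x,y) p(y,z) ) ]
    pmi_xz_y = math.log(p_xyz * p_y[(y,)] / (p_xy[(x, y)] * p_yz[(y, z)]))
    assert abs(lhs - (pmi_xy + pmi_xz_y)) < 1e-12
```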

## Applications

In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, counts of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities ${\displaystyle p(x)}$ and ${\displaystyle p(x,y)}$, respectively. The following table shows the word pairs with the highest and lowest PMI scores in the first 50 million words of Wikipedia (dump of October 2015), filtered to pairs with 1,000 or more co-occurrences. The frequency of each count can be obtained by dividing its value by 50,000,952. (Note: the natural log is used to compute the PMI values in this example, rather than log base 2.)

| word 1 | word 2 | count of word 1 | count of word 2 | count of co-occurrences | PMI |
|--------|--------|-----------------|-----------------|-------------------------|-----|
| puerto | rico | 1938 | 1311 | 1159 | 10.0349081703 |
| hong | kong | 2438 | 2694 | 2205 | 9.72831972408 |
| los | angeles | 3501 | 2808 | 2791 | 9.56067615065 |
| carbon | dioxide | 4265 | 1353 | 1032 | 9.09852946116 |
| prize | laureate | 5131 | 1676 | 1210 | 8.85870710982 |
| san | francisco | 5237 | 2477 | 1779 | 8.83305176711 |
| nobel | prize | 4098 | 5131 | 2498 | 8.68948811416 |
| ice | hockey | 5607 | 3002 | 1933 | 8.6555759741 |
| star | trek | 8264 | 1594 | 1489 | 8.63974676575 |
| car | driver | 5578 | 2749 | 1384 | 8.41470768304 |
| it | the | 283891 | 3293296 | 3347 | −1.72037278119 |
| are | of | 234458 | 1761436 | 1019 | −2.09254205335 |
| this | the | 199882 | 3293296 | 1211 | −2.38612756961 |
| is | of | 565679 | 1761436 | 1562 | −2.54614706831 |
| and | of | 1375396 | 1761436 | 2949 | −2.79911817902 |
| a | and | 984442 | 1375396 | 1457 | −2.92239510038 |
| in | and | 1187652 | 1375396 | 1537 | −3.05660070757 |
| to | and | 1025659 | 1375396 | 1286 | −3.08825363041 |
| to | in | 1025659 | 1187652 | 1066 | −3.12911348956 |
| of | and | 1761436 | 1375396 | 1190 | −3.70663100173 |

Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
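This kind of scoring can be sketched from raw counts. In the sketch below, the toy corpus and the `min_cooc` threshold are assumptions for illustration; probabilities are estimated as count/N and the natural log is used, matching the table above:

```python
import math
from collections import Counter

def pmi_scores(tokens, min_cooc=2):
    """PMI of adjacent word pairs estimated from raw counts.

    p(w) ~ count(w)/N and p(w1,w2) ~ count(w1,w2)/N, so
    pmi = log( count(w1,w2) * N / (count(w1) * count(w2)) ).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): math.log(c * n / (unigrams[w1] * unigrams[w2]))
        for (w1, w2), c in bigrams.items()
        if c >= min_cooc  # drop rare pairs, as the article filters by count
    }

corpus = "the cat sat on the mat the cat ate the rat".split()
print(sorted(pmi_scores(corpus).items(), key=lambda kv: -kv[1]))
```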

## References

1. ^ Kenneth Ward Church and Patrick Hanks (March 1990). "Word association norms, mutual information, and lexicography". Comput. Linguist. 16 (1): 22–29.
2. ^ Bouma, Gerlof (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction" (PDF). Proceedings of the Biennial GSCL Conference.
3. ^ Francois Role and Mohamed Nadif (2011). "Handling the Impact of Low Frequency Events on Co-occurrence-based Measures of Word Similarity: A Case Study of Pointwise Mutual Information". Proceedings of KDIR 2011: International Conference on Knowledge Discovery and Information Retrieval, Paris, October 26–29, 2011.
4. ^ Paul L. Williams. Information Dynamics: Its Theory and Application to Embodied Cognitive Systems (PDF).