# Information content


In information theory, information content, self-information, or surprisal of a random variable or signal is the amount of information gained when it is sampled. Formally, information content is a random variable defined for any event in probability theory regardless of whether a random variable is being measured or not.

Information content is expressed in a unit of information, as explained below. The expected value of self-information is information theoretic entropy, the average amount of information an observer would expect to gain about a system when sampling the random variable.[1]

## Definition

Given a random variable ${\displaystyle X}$ with probability mass function ${\displaystyle p_{X}{\left(x\right)}}$, the self-information of measuring ${\displaystyle X}$ as outcome ${\displaystyle x}$ is defined as ${\displaystyle \operatorname {I} _{X}(x):=-\log {\left[p_{X}{\left(x\right)}\right]}=\log {\left({\frac {1}{p_{X}{\left(x\right)}}}\right)}.}$[2]

More broadly, given an event ${\displaystyle E}$ with probability ${\displaystyle P}$, the information content is defined analogously:

${\displaystyle \operatorname {I} (E):=-\log {\left[\Pr {\left(E\right)}\right]}=-\log {\left(P\right)}.}$

In general, the base of the logarithm chosen does not matter for most information-theoretic properties; however, different choices of base correspond to different units of information.

If the logarithmic base is 2, the unit is named the shannon, though "bit" is also commonly used. If the natural logarithm (base Euler's number e ≈ 2.7182818284) is used, the unit is called the nat, short for "natural". If the logarithm is taken to base 10, the unit is called the hartley, or decimal digit.
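The different bases amount to a constant scale factor on the same quantity. A minimal sketch in Python (the `information_content` helper is ours, not a standard library function):

```python
import math

def information_content(p: float, base: float = 2.0) -> float:
    """Self-information -log_base(p) of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p, base)

p = 0.25
shannons = information_content(p, 2)        # 2 shannons (bits)
nats     = information_content(p, math.e)   # 2 ln 2 nats
hartleys = information_content(p, 10)       # 2 log10(2) hartleys
```

The three results differ only by the scaling constants ln 2 and log10 2, illustrating that the choice of base changes units, not substance.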

The Shannon entropy of the random variable ${\displaystyle X}$ above is defined as

{\displaystyle {\begin{alignedat}{2}\mathrm {H} (X)&=\sum _{x}{-p_{X}{\left(x\right)}\log {p_{X}{\left(x\right)}}}\\&=\sum _{x}{p_{X}{\left(x\right)}\operatorname {I} _{X}(x)}\\&{\overset {\underset {\mathrm {def} }{}}{=}}\ \mathbb {E} {\left[\operatorname {I} _{X}(x)\right]},\end{alignedat}}}

by definition equal to the expected information content of measurement of ${\displaystyle X}$.[3]:11[4]:19-20
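The identity above can be checked numerically: entropy is the pmf-weighted average of self-information over the support. A short sketch (function names are ours):

```python
import math

def self_information(p: float) -> float:
    """Self-information in shannons of an outcome with probability p."""
    return -math.log2(p)

def entropy(pmf) -> float:
    """Shannon entropy: the expected self-information over the support."""
    return sum(p * self_information(p) for p in pmf if p > 0)

fair_coin = [0.5, 0.5]    # each outcome carries 1 shannon
biased    = [0.9, 0.1]    # less surprise on average: entropy below 1
```

For the fair coin, both outcomes carry exactly 1 shannon, so the expected information gain is 1 shannon; skewing the distribution lowers the average surprise.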

## Properties

### Antitonicity for probability

For a given probability space, the measurement of rarer events yields more information than the measurement of more common ones. Thus, self-information is antitonic in probability for events under observation.

• Intuitively, more information is gained from observing an unexpected event—it is "surprising".
• For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day. (See also: Lottery mathematics.)
• This establishes an implicit relationship between the self-information of a random variable and its variance.

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables ${\textstyle X,\,Y}$ with probability mass functions ${\displaystyle p_{X}(x)}$ and ${\displaystyle p_{Y}(y)}$ respectively. The joint probability mass function is

${\displaystyle p_{X,Y}\!\left(x,y\right)=\Pr(X=x,\,Y=y)=p_{X}\!(x)\,p_{Y}\!(y)}$

because ${\textstyle X}$ and ${\textstyle Y}$ are independent. The information content of the outcome ${\displaystyle (X,Y)=(x,y)}$ is

{\displaystyle {\begin{aligned}\operatorname {I} _{X,Y}(x,y)&=-\log _{2}\left[p_{X,Y}(x,y)\right]=-\log _{2}\left[p_{X}\!(x)p_{Y}\!(y)\right]\\&=-\log _{2}\left[p_{X}{(x)}\right]-\log _{2}\left[p_{Y}{(y)}\right]\\&=\operatorname {I} _{X}(x)+\operatorname {I} _{Y}(y)\end{aligned}}}
See § Two independent, identically distributed dice below for an example.
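This additivity is easy to verify numerically; the sketch below checks it for two independent fair-die faces (variable names are ours):

```python
import math

def I(p: float) -> float:
    """Self-information in shannons of an event with probability p."""
    return -math.log2(p)

p_x, p_y = 1 / 6, 1 / 6    # two independent fair-die outcomes
p_joint = p_x * p_y        # 1/36, by independence

# the information of the joint outcome equals the sum of the parts
lhs = I(p_joint)
rhs = I(p_x) + I(p_y)
```

Because the logarithm turns products into sums, the 1/36 joint probability yields exactly log2 36 = 2 log2 6 shannons.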

## Notes

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics.[5][6]

When the event is the random realization of a variable, the self-information of the variable is defined as the expected value of the self-information of the realization.

Self-information is an example of a proper scoring rule.

## Examples

### Fair coin toss

Consider the Bernoulli trial of tossing a fair coin ${\displaystyle X}$. The probabilities of the events of the coin landing as heads ${\displaystyle H}$ and tails ${\displaystyle T}$ (see fair coin and obverse and reverse) are one half each, ${\textstyle p_{X}{(H)}=p_{X}{(T)}={\tfrac {1}{2}}=0.5}$. Upon measuring the variable as heads, the associated information gain is

${\displaystyle \operatorname {I} _{X}(H)=-\log _{2}{p_{X}{(H)}}=-\log _{2}\!{\tfrac {1}{2}}=1,}$
so the information gain of a fair coin landing as heads is 1 shannon.[2] Likewise, the information gain of measuring tails ${\displaystyle T}$ is
${\displaystyle \operatorname {I} _{X}(T)=-\log _{2}{p_{X}{(T)}}=-\log _{2}\!{\tfrac {1}{2}}=1{\text{ shannon}}.}$

### Fair dice roll

Suppose we have a fair six-sided die. The value of a roll is a discrete uniform random variable ${\displaystyle X\sim \mathrm {DU} [1,6]}$ with probability mass function

${\displaystyle p_{X}(k)={\begin{cases}{\frac {1}{6}},&k\in \{1,2,3,4,5,6\}\\0,&{\text{otherwise}}\end{cases}}}$
The probability of rolling a 4 is ${\textstyle p_{X}(4)={\frac {1}{6}}}$, as for any other valid roll. The information content of rolling a 4 is thus
${\displaystyle \operatorname {I} _{X}(4)=-\log _{2}{p_{X}{(4)}}=-\log _{2}{\tfrac {1}{6}}\approx 2.585\;{\text{shannons}}}$
of information.
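Checking the arithmetic (the computation mirrors the formula above):

```python
import math

p_roll = 1 / 6                 # probability of any particular face
info = -math.log2(p_roll)      # information content of one roll, in shannons
```

The result is log2 6, about 2.585 shannons, matching the value stated above.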

### Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables ${\textstyle X,\,Y\sim \mathrm {DU} [1,6]}$ each corresponding to an independent fair six-sided die roll. The joint distribution of ${\displaystyle X}$ and ${\displaystyle Y}$ is

{\displaystyle {\begin{aligned}p_{X,Y}\!\left(x,y\right)&{}=\Pr(X=x,\,Y=y)=p_{X}\!(x)\,p_{Y}\!(y)\\&{}={\begin{cases}\displaystyle {1 \over 36},\ &x,y\in [1,6]\cap \mathbb {N} \\0&{\text{otherwise.}}\end{cases}}\end{aligned}}}

The information content of the random variate ${\displaystyle (X,Y)=(2,\,4)}$ is

{\displaystyle {\begin{aligned}\operatorname {I} _{X,Y}{(2,4)}&=-\log _{2}\!{\left[p_{X,Y}{(2,4)}\right]}=\log _{2}\!{36}=2\log _{2}\!{6}\\&\approx 5.169925{\text{ shannons}},\end{aligned}}}
just as

{\displaystyle {\begin{aligned}\operatorname {I} _{X,Y}{(2,4)}&=-\log _{2}\!{\left[p_{X,Y}{(2,4)}\right]}=-\log _{2}\!{\left[p_{X}(2)\right]}-\log _{2}\!{\left[p_{Y}(4)\right]}\\&=2\log _{2}\!{6}\\&\approx 5.169925{\text{ shannons}},\end{aligned}}}
as explained in § Additivity of independent events.

#### Information from frequency of rolls

If we learn the values rolled without knowing which die produced which value, we can formalize the approach with so-called counting variables

${\displaystyle C_{k}:=\delta _{k}(X)+\delta _{k}(Y)={\begin{cases}0,&\neg \,(X=k\vee Y=k)\\1,&\quad X=k\,\veebar \,Y=k\\2,&\quad X=k\,\wedge \,Y=k\end{cases}}}$

for ${\displaystyle k\in \{1,2,3,4,5,6\}}$, then ${\textstyle \sum _{k=1}^{6}{C_{k}}=2}$ and the counts have the multinomial distribution

{\displaystyle {\begin{aligned}f(c_{1},\ldots ,c_{6})&{}=\Pr(C_{1}=c_{1}{\text{ and }}\dots {\text{ and }}C_{6}=c_{6})\\&{}={\begin{cases}{\displaystyle {1 \over {18}}{1 \over c_{1}!\cdots c_{6}!}},\ &{\text{when }}\sum _{i=1}^{6}c_{i}=2\\0&{\text{otherwise,}}\end{cases}}\\&{}={\begin{cases}{1 \over 18},\ &{\text{when two }}c_{k}{\text{ are }}1\\{1 \over 36},\ &{\text{when exactly one }}c_{k}=2\\0,\ &{\text{otherwise.}}\end{cases}}\end{aligned}}}

To verify this, the 6 outcomes ${\textstyle (X,Y)\in \left\{(k,k)\right\}_{k=1}^{6}=\left\{(1,1),(2,2),(3,3),(4,4),(5,5),(6,6)\right\}}$ correspond to the event ${\displaystyle C_{k}=2}$ and carry a total probability of 1/6. These are the only outcomes for which knowing which die showed which value adds nothing, because both values are the same. Without knowledge to distinguish the dice, the other ${\textstyle {\binom {6}{2}}=15}$ combinations correspond to one die showing one number and the other die showing a different number, each combination having probability 1/18. Indeed, ${\textstyle 6\cdot {\tfrac {1}{36}}+15\cdot {\tfrac {1}{18}}=1}$, as required.

Unsurprisingly, the information content of learning that both dice rolled the same particular number is greater than the information content of learning that one die showed one number and the other a different number. Take for example the events ${\displaystyle A_{k}=\{(X,Y)=(k,k)\}}$ and ${\displaystyle B_{j,k}=\{c_{j}=1\}\cap \{c_{k}=1\}}$ for ${\displaystyle j\neq k,1\leq j,k\leq 6}$. For example, ${\displaystyle A_{2}=\{X=2{\text{ and }}Y=2\}}$ and ${\displaystyle B_{3,4}=\{(3,4),(4,3)\}}$.

The information contents are

${\displaystyle \operatorname {I} (A_{2})=-\log _{2}\!{\tfrac {1}{36}}=5.169925{\text{ shannons}}}$
${\displaystyle \operatorname {I} \left(B_{3,4}\right)=-\log _{2}\!{\tfrac {1}{18}}=4.169925{\text{ shannons}}}$
Let ${\textstyle Same=\bigcup _{i=1}^{6}{A_{i}}}$ be the event that both dice rolled the same value and ${\displaystyle Diff={\overline {Same}}}$ be the event that the dice differed. Then ${\textstyle \Pr(Same)={\tfrac {1}{6}}}$ and ${\textstyle \Pr(Diff)={\tfrac {5}{6}}}$. The information contents of the events are

${\displaystyle \operatorname {I} (Same)=-\log _{2}\!{\tfrac {1}{6}}=2.5849625{\text{ shannons}}}$
${\displaystyle \operatorname {I} (Diff)=-\log _{2}\!{\tfrac {5}{6}}=0.2630344{\text{ shannons}}.}$
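These probabilities and information contents can be verified by brute-force enumeration of the 36 equally likely rolls (a sketch; the variable names are ours):

```python
import math
from itertools import product

rolls = list(product(range(1, 7), repeat=2))               # 36 equiprobable (x, y)
p_same = sum(1 for x, y in rolls if x == y) / len(rolls)   # 6/36 = 1/6
p_diff = 1 - p_same                                        # 5/6

I_same = -math.log2(p_same)    # information that the dice matched
I_diff = -math.log2(p_diff)    # information that they differed
```

Enumeration recovers exactly the values above: about 2.585 shannons for a match, about 0.263 shannons for a mismatch.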

#### Information from sum of two dice

The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of the two probability measures. In the case of independent fair six-sided dice rolls, the random variable ${\displaystyle Z=X+Y}$ has probability mass function ${\textstyle p_{Z}(z)=\left(p_{X}*p_{Y}\right)(z)={6-|z-7| \over 36}}$ for ${\textstyle z\in \{2,\ldots ,12\}}$, where ${\displaystyle *}$ represents the discrete convolution. The outcome ${\displaystyle Z=5}$ has probability ${\textstyle p_{Z}(5)={\frac {4}{36}}={1 \over 9}}$. Therefore, the information content of this observation is

${\displaystyle \operatorname {I} _{Z}(5)=-\log _{2}{\tfrac {1}{9}}=\log _{2}{9}\approx 3.169925{\text{ shannons.}}}$
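The convolution can be checked by enumerating the 36 outcomes directly (a sketch with our own variable names):

```python
import math
from itertools import product

# pmf of Z = X + Y, built by enumerating all 36 equiprobable rolls
pmf_z = {z: 0.0 for z in range(2, 13)}
for x, y in product(range(1, 7), repeat=2):
    pmf_z[x + y] += 1 / 36

# the closed form (6 - |z - 7|) / 36 agrees on the whole support
closed_form_ok = all(
    abs(pmf_z[z] - (6 - abs(z - 7)) / 36) < 1e-12 for z in pmf_z
)

info_z5 = -math.log2(pmf_z[5])   # information content of observing Z = 5
```

Since p_Z(5) = 1/9, the result is log2 9, about 3.17 shannons, as computed above.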

### General discrete uniform distribution

Generalizing the § Fair dice roll example above, consider a general discrete uniform random variable (DURV) ${\displaystyle X\sim \mathrm {DU} [a,b];\quad a,b\in \mathbb {Z} ,\ b\geq a.}$ For convenience, define ${\textstyle N:=b-a+1}$. The p.m.f. is

${\displaystyle p_{X}(k)={\begin{cases}{\frac {1}{N}},&k\in [a,b]\cap \mathbb {Z} \\0,&{\text{otherwise}}\end{cases}}.}$
In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable.[2] The information gain of any observation ${\displaystyle X=k}$ is
${\displaystyle \operatorname {I} _{X}(k)=-\log _{2}{\frac {1}{N}}=\log _{2}{N}{\text{ shannons}}.}$
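A minimal sketch of the general case (the function name is ours):

```python
import math

def durv_information(a: int, b: int) -> float:
    """Shannons gained by observing any outcome of DU[a, b]."""
    n = b - a + 1          # number of equiprobable values
    return math.log2(n)
```

`durv_information(1, 6)` recovers the fair-die value log2 6, and `durv_information(5, 5)` returns 0, previewing the constant-variable special case below.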

#### Special case: constant random variable

If ${\displaystyle b=a}$ above, ${\displaystyle X}$ degenerates to a constant random variable with probability distribution deterministically given by ${\displaystyle X=b}$ and probability measure the Dirac measure ${\textstyle p_{X}(k)=\delta _{b}(k)}$. The only value ${\displaystyle X}$ can take is deterministically ${\displaystyle b}$, so the information content of any measurement of ${\displaystyle X}$ is

${\displaystyle \operatorname {I} _{X}(b)=-\log _{2}{1}=0.}$
In general, there is no information gained from measuring a known value.[2]

### Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support ${\textstyle {\mathcal {S}}={\bigl \{}s_{i}{\bigr \}}_{i=1}^{N}}$ and p.m.f. given by

${\displaystyle p_{X}(k)={\begin{cases}p_{i},&k=s_{i}\in {\mathcal {S}}\\0,&{\text{otherwise}}\end{cases}}.}$

For the purposes of information theory, the values ${\displaystyle s\in {\mathcal {S}}}$ do not even have to be numbers at all; they can just be mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure ${\displaystyle p}$. Without loss of generality, we can assume the categorical distribution is supported on the set ${\textstyle [N]=\left\{1,2,...,N\right\}}$; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well.

The information of the outcome ${\displaystyle X=x}$ is given by

${\displaystyle \operatorname {I} _{X}(x)=-\log _{2}{p_{X}(x)}.}$

From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
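As a sketch of the categorical case, the support can be arbitrary labels; only the probabilities matter. The pmf below is a made-up example and the function name is ours:

```python
import math

def categorical_information(pmf: dict, outcome) -> float:
    """Self-information in shannons of one categorical outcome."""
    p = pmf.get(outcome, 0.0)
    if p <= 0.0:
        raise ValueError("outcome is outside the support")
    return -math.log2(p)

# hypothetical categorical pmf over non-numeric labels
weather = {"sun": 0.5, "rain": 0.25, "snow": 0.25}
```

Observing `"rain"` (probability 1/4) yields 2 shannons, twice the 1 shannon of the more likely `"sun"`, illustrating antitonicity in probability.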

## Relationship to entropy

The entropy is the expected value of the information content of the discrete random variable, with expectation taken over the discrete values it takes. Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies ${\displaystyle \mathrm {H} (X)=\operatorname {I} (X;X)}$, where ${\displaystyle \operatorname {I} (X;X)}$ is the mutual information of ${\displaystyle X}$ with itself.[7]

## Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin, “Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.” Assuming one does not reside near the Earth's poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

When the content of a message is known a priori with certainty, with probability of 1, there is no actual information conveyed in the message. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.

Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event, ${\displaystyle \omega _{n}}$, depends only on the probability of that event.

${\displaystyle \operatorname {I} (\omega _{n})=f(\operatorname {P} (\omega _{n}))}$

for some function ${\displaystyle f(\cdot )}$ to be determined below. If ${\displaystyle \operatorname {P} (\omega _{n})=1}$, then ${\displaystyle \operatorname {I} (\omega _{n})=0}$. If ${\displaystyle \operatorname {P} (\omega _{n})<1}$, then ${\displaystyle \operatorname {I} (\omega _{n})>0}$.

Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event ${\displaystyle C}$ is the intersection of two independent events ${\displaystyle A}$ and ${\displaystyle B}$, then the information of event ${\displaystyle C}$ occurring is that of the compound message of both independent events ${\displaystyle A}$ and ${\displaystyle B}$ occurring. The quantity of information of compound message ${\displaystyle C}$ would be expected to equal the sum of the amounts of information of the individual component messages ${\displaystyle A}$ and ${\displaystyle B}$ respectively:

${\displaystyle \operatorname {I} (C)=\operatorname {I} (A\cap B)=\operatorname {I} (A)+\operatorname {I} (B)}$.

Because of the independence of events ${\displaystyle A}$ and ${\displaystyle B}$, the probability of event ${\displaystyle C}$ is

${\displaystyle \operatorname {P} (C)=\operatorname {P} (A\cap B)=\operatorname {P} (A)\cdot \operatorname {P} (B)}$.

However, applying function ${\displaystyle f(\cdot )}$ results in

{\displaystyle {\begin{aligned}\operatorname {I} (C)&=\operatorname {I} (A)+\operatorname {I} (B)\\f(\operatorname {P} (C))&=f(\operatorname {P} (A))+f(\operatorname {P} (B))\\&=f{\big (}\operatorname {P} (A)\cdot \operatorname {P} (B){\big )}\\\end{aligned}}}

The class of functions ${\displaystyle f(\cdot )}$ having the property that

${\displaystyle f(x\cdot y)=f(x)+f(y)}$

is the logarithm function of any base. The only operational difference between logarithms of different bases is that of different scaling constants.

${\displaystyle f(x)=K\log(x)}$

Since the probabilities of events always lie between 0 and 1 and the information associated with these events must be nonnegative, this requires that ${\displaystyle K<0}$.

Taking into account these properties, the self-information ${\displaystyle \operatorname {I} (\omega _{n})}$ associated with outcome ${\displaystyle \omega _{n}}$ with probability ${\displaystyle \operatorname {P} (\omega _{n})}$ is defined as:

${\displaystyle \operatorname {I} (\omega _{n})=-\log(\operatorname {P} (\omega _{n}))=\log \left({\frac {1}{\operatorname {P} (\omega _{n})}}\right)}$

The smaller the probability of event ${\displaystyle \omega _{n}}$, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of ${\displaystyle \displaystyle I(\omega _{n})}$ is bits. This is the most common practice. When using the natural logarithm of base ${\displaystyle \displaystyle e}$, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 bits (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 bits (probability 15/16). See the examples above for detailed calculations.
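The closing arithmetic is a direct application of the definition (a brief sketch):

```python
import math

p_hit = (1 / 2) ** 4        # one specified 4-toss sequence: 1/16
p_miss = 1 - p_hit          # any other result: 15/16

bits_hit = -math.log2(p_hit)      # exactly 4 bits
bits_miss = -math.log2(p_miss)    # ≈ 0.093 bits
```

The near-certain event carries almost no information, while the 1-in-16 event carries exactly 4 bits.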