Blog on Applied Mathematics

Game Of Math: The MaxEnt Algorithm and Game of Thrones.

Game of Thrones.

Surely you're wondering how does Mathematics fit in Game of Thrones? Well I confess, I could use any other TV series to tell you what follows in this post, but honestly ... what is best than Game of Thrones (GoT)?
The connection to Mathematics is not so much considered in the series but in the type of data used. In particular one kind of data that has been used is the set of the comments regarding GoT posted on the IMDb site reviews to produce this infographic, on which I worked with my brother.




Firstly, very briefly, let us see what types of information were extracted from various comments (671).

The map on the left shows the major lineages; each one has been placed in its city of reference, indicating:

  • The number of reviews in which the single lineage is appointed;
  • The number of reviews that speak positively of the family;
  • The number of reviews that speak negatively of the family;
  • The member of the most cited lineage;
  • The not mentioned member of the family.

Instead, in the chart on the right hand side appear:

  • the performance of the nominated characters from each house in five seasons of the series;
  • the most cited characters that are not attributable to lineages;
  • the graphs of co-occurrences between characters and between lineages.

The part on the infographics which, however, I would like to focus attention is the one related to the word-cloud in the bottom right side of the picture, where the adjectives that have been mainly used in the reviews of a lineage are reported.
It needs to be explained how to determine whether a word in a speech has to be regarded as an adjective rather than as a preposition or article. And it is here that Mathematics comes to help.

To perform this type of analysis a POS (Part of Speech) Tagger has been used, namely a tool able to make grammatical analysis of a text. The POS Tagger taken into account is based on OpenNLP library, which is essentially based on the Maximum Entropy model that we will analyze in detail.

Information Theory and Entropy.

Before examining the MaxEnt algorithm, I would define the concept of entropy used here.
Most commonly we talk about entropy in the following areas:

  • thermodynamics: the entropy of a gas is function of its temperature, volume, mass and nature; it is a reversibility index: if there is no change, the process is reversible, otherwise it is irreversible (just think of the exchange of heat between two bodies, one hot and one cold);
  • statistical mechanics: the entropy is a measure of the increase of disorder (inability to predict the position and velocity of the particles); the greater is the knowledge, the lower the entropy, and viceversa.

To these interpretations of entropy, one can be added and it plays a very important role in information theory (especially in the field of signal processing) and will be the way in which we understand it in our case.

Let's consider a source of messages X. The amount of information transmitted from the message increases with the increase of the uncertainty of the product message. The greater our knowledge about the message produced by the source, the lower the uncertainty, the entropy and the transmitted information.
We formulate the concept of entropy, as introduced by Claude Shannon in 1948.

Definition 1

Let X be a source and x a signal emitted by X. The information given by x is called autoinformation and is defined as

I(x) = -\log_{2}{p(x)}

where p(x) is the probability that the event x happens.

Definition 2

The entropy of a X source is the expected value of the autoinformation, i.e. the average information contained in any signal from X, in particular

H(X) = E[I(X)] = \sum_{i=1}^N P(x_i) \log_{2}{p(x_i)}
if X is a discrete variable

H(X) = - \int p(x) \log_{2}{p(x)}\, dx
if X is a continuous variable

Proposition 1

Let X be a discrete source, then

0\leq H(X) \leq \log_{2}{N}

In particular the maximum of the entropy is \log_{2}{N} when all the N events are equally probable.


Let's see a simple example.

Example 1
Suppose to have a source X that emits 1 with probability p and 0 with probability 1-p.


H(X) = -(p\log_{2}{p} + (1-p)\log_{2}{(1-p))}

and the entropy is equal to 1 if p=1/2 (unpredictable sources), while it is equal to 0 if p=1 (predictable sources, i.e. it always outputs 1 or always 0).

The MaxEnt Algorithm.


The classifier Maximum Entropy is a discriminative classifier widely used in the areas of Natural Language Processing, Speech Recognition and Information Retrieval. In particular, it is used in order to solve problems of classification of text such as language detection, topic and sentiment analysis.
The Maximum Entropy algorithm is based on the principle of Maximum Entropy and selects the model that has the maximum entropy (as enunciated by Shannon) on the training set of all tested models.

Recalling Bayes' theorem (see here), the Max Entropy classifier is used when you do not have any information about the prior distribution of the data and it is not correct to make assumptions about.
Unlike the Naive Bayes classifier, the Maximum Entropy has not the hypothesis that the variables are independent of one another, which reflects the nature of the natural text where the variables into account are the words, that of course are not independent of one another since the grammatical rules of the language; moreover, it requires more time in training while providing more reliable results.

Example 2

Before going into the theory behind the MaxEnt, consider an example which clarifies from the outset what will be said in a formal way in the following.

Suppose we want to determine the grammatical form of the word "set."

The word "set" can take the following forms:

  • Adjective: "He has a set smile."
  • Noun: "Give me your chess set."
  • Verb: "They set a fast pace."

We collect a large number of examples from which to extract information to determine the decision-making model p. The model we're going to build will assign the word "set" a chance to take a particular grammatical meaning.

As we don't have other information from the data, you can impose for our p model:

p(adjective) + p(noun) + p(verb) = 1

There are several p models that hold previous identity, including:

  • p(adjective) = 1, p(noun) = p(verb) = 0
  • p(noun) = 1, p(adjective) = p(verb) = 0
  • p(verb) = 1, p(adjective) = p(noun) = 0
  • p(adjective) = p(noun) = p(verb) = \frac{1}{3}

By analyzing the data set further, let's suppose to get other informations, such as every time the word "set" is preceded by a pronoun is a verb. This, added to the normalization condition, changes the possible chances, reducing the possible models.

The goal of the MaxEnt algorithm is to determine the model p uniform as possible (maximizing the entropy), according to the information derived from the data, without making any additional assumptions.


The algorithm.

We pass now to the explanation of the algorithm.

Consider a text-based document d and have \left\{w_1, w_2, \dots, w_m \right\} m words to each of which corresponds to a particular tag \left\{y_1, y_2, \dots, y_m \right\} (i.e. a grammatical part of the document: noun, adjective, pronoun, article, etc.). We introduce the concept of "history" of the word w_i as the possible informations arising from the context in which w_i is located and we indicate it with x_{i}.

We make a small example to explain how you can understand the "story" of a word.

Example 3

Consider the sentence "Today is a beautiful day". The set of words is {Today, is, just, abeautiful, day} and we call "history" of a word the grammatical information of the previous and next word.
For example, for the word "beautiful"

x_{ = {feminine singular indefinite article - "a", feminine singular noun - "day"}


What we have to define is a stochastic model that estimates the conditional probability of getting a y tag, given a particular "story" x, namely is p(y | x).

Then we follow the usual classification scheme, i.e. we build our model starting from couples (x_i, y_i) of the training set, where x_i is the "story" of the word w_i and y_i is the class assigned to it (the grammatical part of speech) .
Consider the probability distribution based on the sample

\tilde p(x, y) = \frac{1}{N}|(x, y)|

where N is the size of the training set, while |(x, y)| is the number of occurrences of pair (x, y) in the training set.
We introduce the indicator function

f_j(x, y) := \left\{\begin{array}{ll}1 \ & y = y_i \ e \ x \ holds\ w_k \\ 0 \ & otherwise\end{array}\right.

and consider the f_j features as variables for the construction of our model.
The average value of f_j variable compared to the probability derived from the sample is

\tilde p(f_j) = \sum\limits_{x, y} \tilde p(x, y) f_j(x, y)

where clearly \tilde p(x, y) = \frac{1}{N} whether each pair of the training set has occurrence 1.

While the average value of f_j variable with respect to probability model p(y | x) is equal to

p(f_j) = \sum\limits_{x, y} \tilde p(x) p(y | x) f_j(x, y)

We impose the condition that the average value of the model is limited to f_j on the training set, i.e.

\tilde p(f_j) = \sum\limits_{x, y} \tilde p(x, y) f_j(x, y) = \sum\limits_{x, y} \tilde p(x) p(y | x) f_j(x, y) = p(f_j)

We now have so many conditions as the previous ones for each f_j, which can be met by several models. To choose the best conditions, we use the principle of Maximum Entropy, by selecting the closest possible model to standard form (maximization of information) .

In particular, it shows that exists and a well established model p^* that maximizes the entropy of a system with C constraints.

In our case the problem is the following:

Determine p^* such that

p^* = \underset{p \in C}{\arg\max} \ H(p) = \underset{p \in C}{\arg\max} \ ( - \sum\limits_{x, y} \tilde p(y | x) log p( y | x))


  • p(y|x) \geq 0 \ \forall x, \ y.
  • \sum\limits_{y} p(y|x) =1, \ \forall x.
  • \sum\limits_{x, y} \tilde p(x, y) f_j(x, y) = \sum\limits_{x, y} \tilde p(x) p(y | x) f_j(x, y).

We use Lagrange multipliers to solve it :

\begin{equation}\begin{split}\Xi(p, \Lambda, \gamma) & = - \sum\limits_{x, y} \tilde p(y | x) log p( y | x) \ + \\& \sum\limits_{j} \lambda_{j} (\sum\limits_{x, y} \tilde p(x, y) f_j(x, y) - \tilde p(x) p(y | x) f_j(x, y)) \ + \\& \gamma \sum\limits_{x} \tilde p(y | x) - 1\end{split}\end{equation}

obtaining the solution

p^*(y|x) = \gamma^* \ exp(\sum\limits_j \lambda_j f_j(x,y)) = \frac{1}{\sum\limits_{y} exp(\sum\limits_{j} \lambda_j f_j(x,y))} \ exp(\sum\limits_j \lambda_j f_j(x,y))

At this point we insert in the Lagrangian the values ​​of p^* and \gamma^* and get \Lambda^*, maximizing the function that follows for which it is stated that (without proving it here):

  • is the log-likelihood of the exponential model  p (y | x) ;
  • the maximum problem cannot be resolved analytically but by numerical methods (conjugate gradient, GIS - Generalized Iterative Scaling), taking advantage of the regularity and the convexity of the function for each \lambda_j.

So, as made the POS Tagger training on a given set (training set), we proceed to make a classification on new words to test (test set). In our case the POS Tagger, already trained on a large enough data set, is used for the classification of words present in IMDb's review; discarded all the words that are not classified as adjectives, then we will assign adjectives to families based on the frequency in the review of a specific lineage.


Now, if you're not a fan of Game of Thrones, I repeat that it can also be applied to other TV series (House of Cards, The Walking Dead, etc.) or in completely different areas, like ... what we will see in my next post.
Until now there is a unique conclusion ... 24.04 the sixth season begins.
Winter is coming ...


No Comments Yet

Leave a Reply

Your email address will not be published. Required fields are marked *

Follow us on Twitter