Musings on Entropy

Christopher Fabri
6 min read · May 9, 2021

Entropy — what is it? A closer look at entropy for both the thermodynamic and information cases

The classic definition of entropy is given as

  • S = -Σ_i p(i) * ln[ p(i) ]

This definition is equally valid for a thermodynamic system that has states with given energies, in which case p(i) represents an occupation probability, and for an information source, in which case p(i) represents the probability of a message.
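As a quick illustration, here is a minimal Python sketch of this sum (the probability values are made up for illustration; base e matches the thermodynamic convention, base 2 gives bits):

```python
import math

def entropy(probs, base=math.e):
    """S = -sum p(i) * log(p(i)), skipping zero-probability terms."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Thermodynamic-style reading: p(i) are occupation probabilities (natural log).
# Information-style reading: p(i) are message probabilities (base 2 gives bits).
print(entropy([0.5, 0.25, 0.25]))          # ~1.04 nats
print(entropy([0.5, 0.25, 0.25], base=2))  # ~1.5 bits
```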

Entropy as a scalar

Regardless of the system, if one just looks at the math, entropy is a dimensionless number, i.e., a scalar. In fact, entropy in base 2 is a sum of heights of perfectly balanced binary trees, weighted by the probability of occurrence of each event.

Simplify

Take the above definition of entropy but drop the sum — consider calculating the contribution to the overall entropy from just one energy level or just one message.

  • S = -p(1) * log[ p(1) ]

Let’s look at each piece in turn. First, the probability term, p(1):

  • p(1) = probability that the 1st msg is emitted or that the 1st energy level is occupied

That’s straightforward. Now, examine the log term.

First, note that the probability, p(1), is the ratio of the number of events of type ‘1’ to the total number of events. When dealing with an information source, the denominator represents the total dictionary of possible messages. When dealing with the thermodynamic case, the denominator represents the total number of energy states (the Z factor, i.e., the partition function).

Thermodynamic case

  • p(1) = total events of energy ‘1’/ total number of energy events

Information case

  • p(1) = total msgs of type ‘1’/ total dictionary of msgs

Now, note that the entropy has a negative sign in front of each term. Pulling the negative sign into the log inverts its argument:

  • -log[ p(1) ] = log[ 1/p(1) ] = log[ Total_Events ] - log[ Events_of_type_1 ]

Pulling in the negative sign clarifies what is going on: the log term is just the difference between the log of the total number of events and the log of the number of events of type 1.

The log of the total number of events is always at least as large as the second term. If p(1) is low, then log[ Total_Events ] dominates and the difference is large. Conversely, if p(1) is high, then the difference is small, vanishingly so for very high probability events.
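A quick numeric check of this decomposition, using hypothetical counts (3 events of type ‘1’ out of 48 total) chosen purely for illustration:

```python
import math

# Hypothetical counts for illustration: 3 events of type '1' out of 48 total.
events_of_type_1 = 3
total_events = 48
p1 = events_of_type_1 / total_events

lhs = -math.log2(p1)
rhs = math.log2(total_events) - math.log2(events_of_type_1)
print(lhs, rhs)  # both 4.0: log2(48) - log2(3) = 5.585 - 1.585

# A rare event (1 of 48) lets the log2(total) term dominate.
print(-math.log2(1 / 48))   # ~5.585
# A common event (47 of 48) makes the difference nearly vanish.
print(-math.log2(47 / 48))  # ~0.030
```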

The entropy contribution is high for low probability events and low for high probability events, as will be demonstrated below.

Second, note that if one uses log base 2, the log calculations represent the height associated with a perfectly balanced binary tree.

  • -log2[ p(1) ] = log2[ 1/p(1) ]
  • -log2[ p(1) ] = log2[ Total_Events ] - log2[ Events_of_type_1 ]
  • -log2[ p(1) ] = (height of the dictionary tree) - (height of the event tree)

The log base 2 terms are the difference in the height, or search depth, of two binary trees. One binary tree represents the dictionary of all msgs while the other binary tree represents the count of events of type ‘1’.
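A small sketch of that correspondence, assuming “height” means the number of levels below the root of a perfectly balanced binary tree:

```python
import math

def balanced_tree_height(n_nodes):
    """Levels below the root in a perfectly balanced binary tree
    holding n_nodes (a lone root node has height 0)."""
    return math.ceil(math.log2(n_nodes + 1)) - 1

# log2 of the node count tracks the tree height (search depth).
for n in (1, 3, 7, 15, 31):
    print(n, balanced_tree_height(n), round(math.log2(n), 2))
# 1 -> height 0, 15 -> height 3 (log2(15) ~ 3.91), 31 -> height 4, ...
```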

Example — dictionary with 15 characters

Take the system to consist of a dictionary of 15 characters. Further assume that each character is equiprobable.

What is the entropy of a message that contains 1 character from the dictionary?

  • S = -(1/15) * log2[ 1/15 ]

Pulling in the negative sign and decomposing the log term, we can interpret it as follows

  • -log2[ 1/15 ] = log2[ 15 ] - log2[ 1 ]
  • = log2[ 15 ] - 0 = log2[ 15 ]

(Figure: a perfectly balanced binary tree with 15 nodes.)

Hence, the log term represents the height of this perfectly balanced tree

  • h1 = log2[ 15 ] ≈ log2[ 2^3.9 ] = 3.9

Similarly, the log2[1] represents the height of a balanced tree with just a root node:

  • h2 = log2[ 1 ] = log2[ 2^0 ] = 0

And the entropy for the emitted message is interpreted as the difference in heights between these two binary trees (the dictionary tree with 15 nodes and the event tree with a single node), with the dictionary tree dominating the calculation.

  • S = -(1/15) * log2[ 1/15 ] = (1/15) * 3.9 ≈ 0.26

Hence, entropy is a ‘scalar’ that captures the difference in search depths between two binary trees: one representing all msgs in the dictionary, and one representing the events of interest.

Counting the root as level 0, the 15th node is the last node on the 3rd level; the 16th node would be the 1st node on the 4th level. That is why log2[ 15 ] falls just short of 4.
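Putting the 15-character example into a short Python sketch (the numbers follow directly from the text above):

```python
import math

dictionary_size = 15
p = 1 / dictionary_size  # equiprobable characters

# Contribution of a single emitted character: -p * log2(p)
per_char = -p * math.log2(p)
print(per_char)                     # (1/15) * 3.9 ~ 0.26 bits

# Summed over all 15 equiprobable characters, the entropy is just log2(15),
# the height of the dictionary tree.
total = dictionary_size * per_char
print(total, math.log2(15))         # ~3.91 bits in both cases
```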

Visually, what does this look like?

An entropy calculation can be viewed as a sum over events: for each event, take the difference in height between the main binary tree (containing all states or messages of the system) and the binary tree of the states or messages of interest, and weight that difference by the probability of the event occurring.

The entropy is basically calculating a search depth factor for the tree of all states or all messages, minus the search depth for the tree holding the states or messages of interest, and adding up a weighted sum of these factors across all states.

What dominates — height or probability of state?

One immediate question is what part of the entropy dominates — is it the log part or the probability part?

Taking the differential of the per-state term S(i) = -P(i)*log2[ P(i) ] shows both pieces contributing

  • dS(i) = -dP(i)*log2[ P(i) ] - (1/ln 2)*dP(i)

For a two-state system with probabilities P and 1-P, the (1/ln 2) contributions from the two states cancel. Setting the total differential to zero yields

  • dS = dP*( log2[ 1-P ] - log2[ P ] ) = dP*log2[ (1-P)/P ] = 0
  • (1-P)/P = 1
  • P = 1/2

The entropy at P = 1/2 for a two-state system is then

  • S(p=1/2) = 2 * [ -(1/2)*log2[ 1/2 ] ] = 2 * [ (1/2)*( log2[ 2 ] - log2[ 1 ] ) ]
  • S(p=1/2) = 1 * [ 1 - 0 ] = 1
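As a sanity check, a brute-force scan (an illustrative sketch, not part of the original derivation) confirms that the two-state entropy peaks at P = 1/2 with a value of 1 bit:

```python
import math

def two_state_entropy(p):
    """S(p) = -p*log2(p) - (1-p)*log2(1-p)."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan p over a fine grid and find where the two-state entropy peaks.
grid = [i / 10000 for i in range(1, 10000)]
p_star = max(grid, key=two_state_entropy)
print(p_star, two_state_entropy(p_star))  # 0.5, 1.0 bit
```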

It’s interesting to examine the entropy at three points of interest

  1. Min probability: p(1) = 1/100
  2. Max probability: p(99) = 99/100
  3. Max entropy: p = 1/2

It is surprising to note that the entropy contribution at the minimum probability point is higher than at the maximum probability point: lower probability events carry a higher entropy contribution than higher probability events.

Of course, the maximum entropy occurs at p = 1/2, equidistant from the maximum and minimum probability points.
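Evaluating the single-term contribution -p*log2(p) at these three points (note this is the per-event contribution discussed above; the full two-state entropy at p = 1/2 is twice this value, i.e., 1 bit):

```python
import math

def contribution(p):
    """Per-event entropy contribution -p * log2(p)."""
    return -p * math.log2(p)

# The three points of interest from the text:
print(contribution(1 / 100))   # low-probability event:  ~0.066
print(contribution(99 / 100))  # high-probability event: ~0.014
print(contribution(1 / 2))     # p = 1/2:                  0.5
```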

Entropy — how useful is it?

Entropy is a scalar. It can be interpreted as the difference in the height of two binary trees: the primary binary tree holds all states of the system or all messages in the dictionary; the secondary binary tree holds only the events or messages of interest. The height, or search depth, of the primary binary tree is offset by the height, or search depth, of the secondary binary tree.

From the perspective of thermodynamics, entropy peaks at thermal equilibrium, i.e., when the system’s constituents occupy the states that have the highest probability of being occupied.

From an information theory perspective, entropy represents how many bits of information are added to or removed from a system, or transmitted from point A to point B.

Entropy’s shortcomings: where to next?

Entropy does not distinguish between microscopic states of the system that do not impact macroscopic variables; this is basic degeneracy.

Nor does entropy distinguish between microscopic states of the system that do impact macroscopic variables.

Is there value in being able to distinguish between degenerate states of the same system if they produce, in aggregate, the same result with respect to some macroscopic variable?

What is the right measure to capture the information content of a system?

What is the right measure to capture how deviations from a system’s current state lead to varying degrees of difference in macroscopic variables like temperature, pressure, or the probability outputs of a neural net?

What are some interesting entropy configurations? Check out Musings on Entropy — part II
