My theory of computation class did not cover this, but it is a topic I wanted to learn about. Here is what I learned from reading about it.
The concept of quantifying information (randomness) has been debated for quite a while. Typically, what comes to mind is one of two definitions: combinatorial or probabilistic. For the combinatorial definition, assume we want to specify a random integer between zero and three, each equally likely to be chosen. We need at least four distinct binary combinations to express these four unique integers: 00, 01, 10, and 11. If a message can be one of four options and every option has the same probability of being chosen, the message must be at least 2 bits long. Let N be the total number of combinations, and let I be the total number of bits (information size) required to specify which combination we choose.
$$I = \log_2(N) \text{ bits}$$

This formula, popularized by Claude Shannon's 1948 work on information theory, measures the expected amount of information required to specify the state of the integer, assuming each combination is equally likely to be chosen. Let's create a scenario to better understand this concept. Assume you lost the password to the Steam account you grew up on, but you want to try to guess it so you can experience the nostalgia once more. You remember that as a kid you chose a password that is perfectly random and 16 characters long (such good practice). The password contains lowercase letters, uppercase letters, and the digits 0-9. For each character, we have 26 + 26 + 10 = 62 total possibilities. For the whole password, the number of possible guesses is $62^{16}$, since all 16 positions can be any of the 62 possible characters. We can calculate the number of bits required to identify the correct password as follows:
$$I = \log_2(62^{16}) = 16 \cdot \log_2(62) \approx 16 \cdot 5.954 \approx 95.27 \text{ bits}$$

With these 95.27 bits, we can express $2^{95.27} \approx 4.77 \cdot 10^{28}$ passwords, which also corresponds to the number of possible guesses. Good luck finding that password.
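To make the arithmetic concrete, here is a small Python sketch of the calculation above (the 62-character alphabet and 16-character length come straight from the example; the printed values may differ slightly depending on rounding):

import math

alphabet_size = 26 + 26 + 10       # lowercase + uppercase + digits = 62
password_length = 16

# N = 62^16 equally likely passwords
combinations = alphabet_size ** password_length

# I = log2(N) = 16 * log2(62) bits needed to pin down one of them
bits = password_length * math.log2(alphabet_size)

print(f"{combinations:.3e} possible passwords")    # on the order of 10^28
print(f"{bits:.2f} bits of information")           # about 95.27
print(f"{math.ceil(bits)} bits of actual storage")  # 96, since we can't store a fraction of a bit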
Note: On a real computer, the number of bits of storage required would be the ceiling of $I$, since we can't store a fraction of a bit; we would need 96 bits of storage. The point of $I$ from a theoretical standpoint is to measure information, which translates into the number of guesses we would need ($2^I$ of them), not the physical storage required. As stated above, there are $2^{95.27}$ possible guesses for this password.
So far, I have been talking about quantifying information when we assume every possibility has an equal probability of being chosen, but what happens if one outcome occurs far more often than another? Do we need more or less information to specify which possibility will be chosen? To quantify information probabilistically, we need to consider the probability of every outcome occurring. Fortunately, we have a formula to calculate this for us. Let $H(X)$ be the expected amount of information needed to describe a random variable $X$ taking values in a set $\mathcal{X}$, and let $p : \mathcal{X} \to [0, 1]$ give the probability of each outcome $x \in \mathcal{X}$ occurring:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

This formula, often referred to as Shannon entropy, weights each outcome by its probability when determining how many bits are required, on average, to specify $X$. Let's revisit our earlier problem of guessing the password to my lost Steam account. Let's say that as a child my password security was not as robust, and I used common words from the English dictionary rather than randomly generating all 16 characters (who would do such a thing). We know the letter 'e' is far more likely to occur than the letter 'z', since 'e' is much more common in English. This means we may require less information than in our previous example to specify the password. While the total number of combinations is unchanged, we can with near certainty rule out the password "ZyGoSpondYLiNe99". While this is still a word in the Oxford English Dictionary, and it is a valid password, what are the odds that my 10-year-old self decided to choose that word for my password? Probably pretty low.
Since we can now rule out some words, the distribution of possible passwords is far less uniform. There is a much higher probability that I chose the word "apple" over "zygospondyline", and we can cut those $4.77 \cdot 10^{28}$ guesses from before down to something far smaller, say on the order of $10^5$ guesses. This means the entropy of my password has decreased. The closer the distribution is to uniform, the higher the entropy and the randomness. By quantifying information with this probabilistic approach, you are measuring how surprising the next guessed letter will be, an idea that shows up everywhere in modern machine learning and artificial intelligence.
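To see the formula in action, here is a short Python sketch. The uniform case matches the random-password example, and the skewed distribution is a made-up, deliberately lopsided one (not real English letter frequencies) just to show how entropy drops as the distribution moves away from uniform:

import math

def shannon_entropy(probs):
    # H(X) = -sum over x of p(x) * log2(p(x)), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over the 62 password characters: maximum entropy per character
uniform = [1 / 62] * 62
print(round(shannon_entropy(uniform), 2))   # ~5.95 bits per character

# A made-up skewed distribution over five outcomes: far fewer bits per character
skewed = [0.6, 0.2, 0.1, 0.05, 0.05]
print(round(shannon_entropy(skewed), 2))    # ~1.67 bits per character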
How can we determine if something is truly random though? Clearly, in our previous example where the password contained words from the English dictionary, the password was not random at all. But in the first example, where we randomly generated the password, it seems like we may have pure randomness. Consider this definition, which frames randomness in a different light:
Definition: The relative complexity of an object $y$ with a given $x$ refers to the minimal length $l(p)$ of the program $p$ for obtaining $y$ from $x$.

Andrey Kolmogorov, Three Approaches to the Quantitative Definition of Information (1968)
This was Andrey Kolmogorov's definition of randomness, and what he referred to as the complexity of an object. An object's Kolmogorov complexity is the minimum amount of information required to specify it. In contrast to what we saw earlier, there is no probability distribution that we need to consider. Instead, for a given string $x$ and a description $d(x)$ of that string, $x$ is Kolmogorov-random if every description satisfies $|d(x)| \geq |x| - c$ for some constant $c$ independent of $x$; in other words, no description of $x$ can be made much shorter than $x$ itself.
We can write this out formally as follows.

Theorem: $x \in \text{K-RANDOM} \iff \forall d(x)$: if $d(x)$ is a description of $x$, then $|d(x)| \geq |x| - c$
Consider the 32-character string x = "dU47a1YONwHsGCppSITySNprlIYtli7r". We know that $|x|$ is 256 bits (32 characters × 8 bits/char), and $x$ appears to be random, since no pattern can be extracted from it. Don't worry, there are no tricks here: this string is random, and I generated it using random.org. Let's consider $x$ as incompressible. The shortest description of $x$ must include $x$ within it due to $x$'s incompressibility. This means that $x$ is in K-RANDOM.
On the contrary, consider the 32-character string y = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA". $|y|$ is also 256 bits, but there is a clear pattern in $y$: the string is the character 'A' repeated 32 times. Let's suppose there exists a description d(y) = "'A' 32 times". $|d(y)|$ is only 96 bits (12 characters × 8 bits/char), and it does not contain $y$, so $y$ must be compressible. Since $|d(y)| < |y|$, $y$ is not in K-RANDOM.
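We cannot compute Kolmogorov complexity directly, but an off-the-shelf compressor gives a computable (if loose) upper bound on a description length, and it behaves exactly the way the two strings above suggest. A minimal sketch using Python's zlib:

import zlib

x = b"dU47a1YONwHsGCppSITySNprlIYtli7r"   # the random-looking string from above
y = b"A" * 32                             # the highly patterned string

# The compressed length is only an upper bound on a description's length,
# not the true Kolmogorov complexity, but the contrast is still telling.
print(len(x), len(zlib.compress(x)))   # the random string barely shrinks (it may even grow)
print(len(y), len(zlib.compress(y)))   # the repetitive string compresses dramatically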
Let's consider the following image I generated of the Mandelbrot set. Let's refer to it as $x$, similar to our examples from earlier, since we can assume the encoded .png is just a very, very long binary string.
This image, which can be viewed in full resolution here, is 200 iterations of the Mandelbrot set. In its full resolution, the image is roughly 14.2 megabytes, or 14,201,996 bytes. While this image is rather large, and it took my laptop 5 minutes to generate, the Python script that produces it is less than 30 lines long.
import numpy as np
import matplotlib.pyplot as plt

def mandelbrot(height, width, x_center=0, y_center=0, zoom=1, max_iterations=200):
    # Map each pixel to a point c in the complex plane
    x_width = 1.5
    y_height = 1.5 * height / width
    x_from, x_to = x_center - x_width / zoom, x_center + x_width / zoom
    y_from, y_to = y_center - y_height / zoom, y_center + y_height / zoom
    x = np.linspace(x_from, x_to, width).reshape((1, width))
    y = np.linspace(y_from, y_to, height).reshape((height, 1))
    c = x + 1j * y

    # Iterate z = z^2 + c and record the iteration at which each point diverges
    z = np.zeros(c.shape, dtype=complex)
    div_time = np.zeros(z.shape, dtype=int)
    m = np.full(c.shape, True, dtype=bool)  # mask of points that have not diverged yet
    for i in range(max_iterations):
        print(i)
        z[m] = z[m]**2 + c[m]
        diverged = np.greater(np.abs(z), 2, out=np.full(c.shape, False), where=m)
        div_time[diverged] = i
        m[np.abs(z) > 2] = False
    return div_time

if __name__ == '__main__':
    image = mandelbrot(10000, 15000)
    plt.imsave('mandelbrot.png', image, cmap='inferno')
    plt.axis('off')
    plt.show()
This script, which we will call $d(x)$, is 1016 bytes, but it is able to algorithmically produce $x$, which is about 14 MB. Despite the image's large size, its Kolmogorov complexity is rather small. However, there are some discrepancies I must address: the size of the libraries (NumPy and Matplotlib) as well as the Python interpreter must be considered. For all intents and purposes, let's assume $x$ was generated on a Universal Turing Machine (UTM) where Python 3.12.2, NumPy, and Matplotlib came preinstalled (I do not have the computational power to generate an image that is significantly larger than the installation size of these two libraries, so bear with me). We do know with certainty that, on this UTM, the Kolmogorov complexity of $x$ is upper-bounded by the length of $d(x)$:

$$K_{\text{UTM}}(\text{mandelbrot.png}) \leq 1016 \text{ bytes}$$

This script can certainly be optimized further: symbols can be stripped, indentation can be reduced, and the script itself can be compressed using gzip or a similar tool. But finding the truly shortest description is where Kolmogorov complexity becomes undecidable. In order to prove that the Kolmogorov complexity of $x$ is exactly $|d(x)|$, we would need to show that every other description of $x$ is at least as long as $|d(x)|$. We know that this is impossible due to the undecidability of the halting problem.
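As a rough illustration of how these upper bounds behave, here is a sketch. The file names mandelbrot.py and mandelbrot.png are assumptions about where the script and image are saved on disk, and counting only the gzipped bytes ignores the size of the decompressor, just as the argument above ignores the interpreter and the libraries:

import gzip
import os

script_path = "mandelbrot.py"   # hypothetical path to the generator script
image_path = "mandelbrot.png"   # hypothetical path to the rendered image

script_size = os.path.getsize(script_path)
image_size = os.path.getsize(image_path)

# gzip-compressing the script yields an even shorter description of the image,
# tightening the upper bound on K(mandelbrot.png) a little further.
with open(script_path, "rb") as f:
    gzipped_size = len(gzip.compress(f.read()))

print(f"script: {script_size} bytes, gzipped: {gzipped_size} bytes")
print(f"image:  {image_size} bytes")
# These are only upper bounds; proving we have found the *shortest*
# description is exactly what undecidability rules out.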
Note: The following is a mapping reduction of the halting problem to a language that relies on Kolmogorov complexity. The purpose is to show the semi-computable nature of Kolmogorov complexity and its undecidability. This section is a little more technical than the previous ones. Please feel free to reach out to me if my reduction has flaws in it.
Let $\text{HALT} = \{\langle M, w \rangle : M \text{ is a TM that halts on } w\}$.

Let $\text{K-COMPRESSIBILITY} = \{(x, c) : K(x) < c\}$. In other words, this language accepts when $x$'s Kolmogorov complexity is less than $c$.
Proof: To show that K-COMPRESSIBILITY is undecidable, I will show $\text{HALT} \leq_m \text{K-COMPRESSIBILITY}$.
Assume K-COMPRESSIBILITY is decidable, and let $D_K$ be a decider for it. We can construct a decider $D_H$ for HALT as follows.
$D_H$ = "On input $\langle M, w \rangle$:
- Construct the following TM $M'$:
  $M'$ = "On input $x$:
  - Simulate $M$ on $w$.
  - If $M(w)$ halts, then return $1^{100}$; otherwise, return a perfectly random, incompressible 100-bit string."
- Let $k = 99$.
- Simulate $M'(\epsilon)$ and let $\omega$ be the output.
- Run $D_K(\omega, k)$.
- If $D_K$ accepts $(\omega, k)$, then accept.
- Otherwise, reject."
If $M(w)$ halts, then $\omega = 1^{100}$, and $K(\omega) < 99$, so $D_K$ accepts $(\omega, 99)$.

If $M(w)$ does not halt, then $\omega$ is a randomly generated 100-bit string with $K(\omega) \geq 100$, so $D_K$ rejects $(\omega, 99)$.
$$\langle M, w \rangle \in \text{HALT} \iff (\omega, 99) \in \text{K-COMPRESSIBILITY}$$
Since a decider for K-COMPRESSIBILITY would give us a decider for HALT, and we know HALT is undecidable, K-COMPRESSIBILITY must be undecidable as well; otherwise we would contradict our initial assumption.