
Integer Compression

It never occurred to me that there would be a way to compress an integer...

I recently learned about three methods of encoding integer values:

  1. Unary
  2. γ (Elias gamma coding)
  3. δ (Elias delta coding)

This sort of blew my mind because I never thought about trying to take a basic, primitive type (such as an integer) and store it using less disk space.

In a simple example, let's assume a signed integer stored in 16 bits (e.g., a short in C). If we were to save the number 3, we would be storing the following bit sequence:

00000000 00000011
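
As a quick illustration (using Python, since that's what the scripts linked at the end are written in), the interpreter can show that 16-bit pattern directly:

>>> format(3, '016b')
'0000000000000011'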

BUT, we can apply one of the methods above and actually use less than 16 bits. Let's take a look at each method.

(It's important to note that all logarithms in this post are base 2.)

Unary encoding

Unary encoding is by far the simplest. For an integer x where x ≥ 1, the unary encoding is x-1 1s, followed by a 0.

For example, the number 3 will be two 1s, followed by a 0:

110

And the number 9 will be eight 1s, followed by a 0:

111111110

For small numbers this is great - we've saved 13 bits and 7 bits respectively in these examples. However, you'll likely notice that as the number grows larger, we end up taking more space than we would have by just storing the integer itself.
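
Here's a minimal sketch of a unary encoder in Python. It returns the code as a string of bits rather than packed bytes, and the function name is just my own choice for illustration:

def unary_encode(x: int) -> str:
    """Unary code for an integer x >= 1: (x - 1) ones followed by a single 0."""
    if x < 1:
        raise ValueError("unary encoding is only defined for integers >= 1")
    return "1" * (x - 1) + "0"

# unary_encode(3) -> "110"
# unary_encode(9) -> "111111110"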

γ-encoding

For γ-encoding we need to start doing some math.

We start by finding the unary code for 1+⌊log x⌋. So, for the number 3 we end up with:

1+⌊log 3⌋ = 1+1 = 2, which in unary code becomes:

10

Next, we find the uniform (binary) code for x - 2^⌊log x⌋, but represented in ⌊log x⌋ bits:

3 - 2^⌊log 3⌋ = 3 - 2^1 = 1, which when represented in ⌊log 3⌋ = 1 bit is:

1

Piecing the two parts together (unary part first, then the uniform), we get:

101

Definitely a savings compared to storing the raw value of 3 in 16 bits, but there is no gain over the unary encoding. What about the number 9?

1+⌊log 9⌋ = 1+3 = 4, which in unary code becomes:

1110

9 - 2^⌊log 9⌋ = 9 - 2^3 = 1, represented in ⌊log 9⌋ = 3 bits is:

001

Piecing the two parts together (unary part first, then the uniform), we get:

1110001

A 2-bit savings over the unary encoding! As the numbers grow, so do the savings.

For example, 15 in unary-encoding:

111111111111110

...and in γ-encoding:

1110111
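
Here's a minimal sketch of the γ-encoder in Python, following the two steps above. As before, it returns a bit string, and the helper name is my own (not necessarily what the linked repo uses):

def gamma_encode(x: int) -> str:
    """Elias gamma code: unary(1 + floor(log2 x)) followed by the
    low floor(log2 x) bits of x."""
    if x < 1:
        raise ValueError("gamma encoding is only defined for integers >= 1")
    n = x.bit_length() - 1                 # equals floor(log2 x) for x >= 1
    unary_part = "1" * n + "0"             # unary code for 1 + n
    uniform_part = format(x - 2**n, "b").zfill(n) if n > 0 else ""
    return unary_part + uniform_part

# gamma_encode(3)  -> "101"
# gamma_encode(9)  -> "1110001"
# gamma_encode(15) -> "1110111"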

δ-encoding

δ-encoding is more complicated. It is nearly identical to γ-encoding; however, instead of starting with the unary encoding of 1+⌊log x⌋, we start with the γ-encoding of 1+⌊log x⌋. This can get tricky to keep track of.

Let's use the number 3 again for our first example. We start with the γ-encoding of 1+⌊log x⌋:

1+⌊log 3⌋ = 1+1 = 2, so we find the γ-encoding of the number 2. Following the steps in the section above, you will see that the γ-encoding is:

100

The rest is identical to the γ-encoding. We find the uniform (binary) code for x - 2^⌊log x⌋, but represented in ⌊log x⌋ bits:

3 - 2^⌊log 3⌋ = 3 - 2^1 = 1, represented in ⌊log 3⌋ = 1 bit is simply:

1

Piecing the two parts together (γ-encoding first, then the uniform):

1001

Yikes! This took 4 bits instead of the 3 bits used by the γ-encoding! What about a larger number like 125?

We start with the γ-encoding of 1+⌊log x⌋:

1+⌊log 125⌋ = 1+6 = 7, represented in γ-encoding is:

11011

Next, we find the uniform (binary) code for x - 2^⌊log x⌋, but represented in ⌊log x⌋ bits:

125 - 2^⌊log 125⌋ = 125 - 2^6 = 61, represented in binary in ⌊log 125⌋ = 6 bits is:

111101

Piecing the two parts together (γ-encoding first, then the uniform):

11011111101

On the other hand, the γ-encoding for 125 is:

1111110111101

A 2-bit savings!
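
And here's a sketch of the δ-encoder, reusing the gamma_encode helper from the previous section; the only change from γ-encoding is that the prefix is γ-encoded instead of unary:

def delta_encode(x: int) -> str:
    """Elias delta code: gamma(1 + floor(log2 x)) followed by the
    low floor(log2 x) bits of x."""
    if x < 1:
        raise ValueError("delta encoding is only defined for integers >= 1")
    n = x.bit_length() - 1                 # equals floor(log2 x) for x >= 1
    prefix = gamma_encode(1 + n)           # gamma-coded prefix replaces the unary one
    uniform_part = format(x - 2**n, "b").zfill(n) if n > 0 else ""
    return prefix + uniform_part

# delta_encode(3)   -> "1001"
# delta_encode(125) -> "11011111101"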

Summary

Long story short, I thought these were some pretty cool methods for encoding integers. I won't write about it here, but you can just as simply decode these bit strings. Actually, I won't lie, it's not quite as simple.

Here is a link to a GitHub repo where I've put together a couple Python scripts to perform both the compression and decompression techniques:

https://github.com/avojak/integer-compression