LLM Text Compression

Loading...

Bytes: 0

Enable Streaming? Streaming will allow you to watch the characters get decompressed in real time

Quick Note

The AI

The next character predictor AI was trained on a collection of featured Wikipedia articles, which are free for the public to use under the creative commons license

The model was trained on Google Colab, and is trained to predict the next character in a sequence of characters. It only has a vocabulary of 63 characters. a-z, A-Z, 0-9, and a space

The model's weights are about 47.2 MB and can be found in the Github repo for this project. Feel free to run the model and code on your own computer, it runs moderately well on my Intel i7-9700F

Practically

Will this have widespread use? Probably not?

Is it better than huffman coding? In 95% of the cases I tested,,, no.

Is it at all fast to compress or decompress? Not really, especially for longer text

For best preformance you'd need text in the latin alphabet, maybe favors English text since it was trained on it. The less punctuation and special characters the better.

But...

Was it fun to make? Yes of course!! I had lots of fun trying to optimize how I should store the data, and how to make the ai's predictions more accurate. All of that was really fun to do, and if I could, I would do it all over again

How does it work?

Compression

The compression algorithm is really quite simple. We have a list of either characters or indices, and it is initially filled with the first character of the string to compress

We now iterate through each character in the string starting from the 2nd one (index 1) and run the LLM on the previous states and what the previous character was. We generate the top 12 most probable next characters

We now check if the correct character is in that list of possible next characters, if it is we add it's index to the list. If it is not in the list then we add the character

Now, after we do that for every character in the input string, we iterate over the list to write the bits

we reserve 3 bits at the beginning of our bit buffer for the header

If the item in the list is an integer then we get the hamming code for it, write a 1 bit and write those bits into our bit buffer

If the item in the list is a character then we will write a 0 bit and then all the bytes for the UTF-8 encoded string

We now calculate how many bits we need to add to align it to a byte

We now write how many bits we added into the header as 3 bits and we are done!

Decompression

The compressed binary has 2 parts. A 3 bit header, and the rest is the payload

The 3 bit header just tells you how many bits at the end of the payload is for padding, which is used to align the data to 8 bits

The payload is make up of "packets." Each of which has a type, and a value. The 2 types of packets are Character packets and AI packets

The Character packet starts with a "0" and is a UTF-8 character for it's value. It can be 1, 2, 3, or 4 bytes in size depending on the first byte of the utf-8 character, as described in the UTF-8 standard

The AI packet starts with a "1" and is followed by a binary number of a varied size. That binary number is a hamming code number, the table of which is below.

The hamming code resolves to an index between 0 and 11 (inclusive). That index is the packet's value

Index	Hamming Code
0	0b0
2	0b110
1	0b101
3	0b1000
4	0b11111
5	0b10011
7	0b10010
11	0b11101
6	0b111101
8	0b111100
10	0b111001
9	0b111000

To decompress the data we need a few things. We need the hamming code table to decode indices, and the LLM model to predict the next letter

After we read the 3 bit header we then get enter a loop until the remaining bits to read is less than or equal to the padding amount described in the header

We will store the decoded string in a variable, initially empty, as we do our loop

Each pass of our loop we will first have our LLM generate the 12 most likely next characters based on the last character in the decoded string.

Then we will determine which packet type we are reading. If it's a Character packet then we just write the value to the output. If it's an AI Packet then the character we write to the output is the (packet's value)th top most likely character.

     So if the packet's value is `0` then the next character is the first item in the top most likely characters.

We must always feed the AI the previous character even if our packet is a Character packet since we must preserve the state

After we handle every packet we should now have our decoded string!

For example, the data you have compressed above has these packets:

Packet Type	Value