The next character predictor AI was trained on a collection of featured Wikipedia articles, which are free for the public to use under the creative commons license
The model was trained on Google Colab, and is trained to predict the next character in a sequence of characters. It only has a vocabulary of 63 characters. a-z, A-Z, 0-9, and a space
The model's weights are about 47.2 MB and can be found in the Github repo for this project. Feel free to run the model and code on your own computer, it runs moderately well on my Intel i7-9700F
Will this have widespread use? Probably not?
Is it better than huffman coding? In 95% of the cases I tested,,, no.
Is it at all fast to compress or decompress? Not really, especially for longer text
For best preformance you'd need text in the latin alphabet, maybe favors English text since it was trained on it. The less punctuation and special characters the better.
Was it fun to make? Yes of course!! I had lots of fun trying to optimize how I should store the data, and how to make the ai's predictions more accurate. All of that was really fun to do, and if I could, I would do it all over again
The compression algorithm is really quite simple. We have a list of either characters or indices, and it is initially filled with the first character of the string to compress
We now iterate through each character in the string starting from the 2nd one (index 1) and run the LLM on the previous states and what the previous character was. We generate the top 12 most probable next characters
We now check if the correct character is in that list of possible next characters, if it is we add it's index to the list. If it is not in the list then we add the character
Now, after we do that for every character in the input string, we iterate over the list to write the bits
we reserve 3 bits at the beginning of our bit buffer for the header
If the item in the list is an integer then we get the hamming code for it, write a 1 bit and write those bits into our bit buffer
If the item in the list is a character then we will write a 0 bit and then all the bytes for the UTF-8 encoded string
We now calculate how many bits we need to add to align it to a byte
We now write how many bits we added into the header as 3 bits and we are done!
The compressed binary has 2 parts. A 3 bit header, and the rest is the payload
The 3 bit header just tells you how many bits at the end of the payload is for padding, which is used to align the data to 8 bits
The payload is make up of "packets." Each of which has a type, and a value. The 2 types of packets are Character packets and AI packets
The Character packet starts with a "0" and is a UTF-8 character for it's value. It can be 1, 2, 3, or 4 bytes in size depending on the first byte of the utf-8 character, as described in the UTF-8 standard
The AI packet starts with a "1" and is followed by a binary number of a varied size. That binary number is a hamming code number, the table of which is below.
The hamming code resolves to an index between 0 and 11 (inclusive). That index is the packet's value
| Index | Hamming Code |
|---|---|
| 0 | 0b0 |
| 2 | 0b110 |
| 1 | 0b101 |
| 3 | 0b1000 |
| 4 | 0b11111 |
| 5 | 0b10011 |
| 7 | 0b10010 |
| 11 | 0b11101 |
| 6 | 0b111101 |
| 8 | 0b111100 |
| 10 | 0b111001 |
| 9 | 0b111000 |
To decompress the data we need a few things. We need the hamming code table to decode indices, and the LLM model to predict the next letter
After we read the 3 bit header we then get enter a loop until the remaining bits to read is less than or equal to the padding amount described in the header
We will store the decoded string in a variable, initially empty, as we do our loop
Each pass of our loop we will first have our LLM generate the 12 most likely next characters based on the last character in the decoded string.
Then we will determine which packet type we are reading. If it's a Character packet then we just write the value to the output. If it's an AI Packet then the character we write to the output is the (packet's value)th top most likely character.
So if the packet's value is `0` then the next character is the first item in the top most likely characters.
We must always feed the AI the previous character even if our packet is a Character packet since we must preserve the state
After we handle every packet we should now have our decoded string!
For example, the data you have compressed above has these packets:
| Packet Type | Value |
|---|