Variable length Huffman coding
I'm going to use Huffman's code to compress text, but with variable length characters (strings). For example (using an underscore as a space):
huffman-code | symbol
------------------------------------
00 | _
01 | E
100 | THE
101 | A
1100 | UP
1101 | DOWN
11100 | .
11101 |
1111...
(etc...)
How do I create a frequency table? It is obvious that there is some overlapping issues, the sequence _TH
will seem nearly as often as THE
, but it is useless in the table (both _
and THE
have short huffman code).
Is there such an algorithm? Does he have a special name? What would be the tricks for generating a frequency table? Do I need to tokenize the input? I didn't find anything in the litterature / web. (All of this makes me think of base trees as well).
I was thinking about using an iterative process:
- Generate huffman tree for all characters from 1 to N
- Remove from the tree all symbols with N> 1 and below a certain counting threshold
- Reconstruct a second huffman tree, but this time tokenize the input with the previous one (possibly using a radix tree to search)
- Repeat until 1 until we converge (or multiple times)
But I cannot figure out how I can prevent overlapping ( _TH
vs THE
) issue with this.
source to share
If you've signed the text correctly, you don't need to worry about the overlap issue. You can define each token as a word (long continuous stream of characters), punctuation character, or whitespace character ('', '\ t', \ n '). So, by definition, markers / characters do not overlap.
But using Huffman encoding directly is not ideal for compressing text as it cannot exploit dependencies between characters. Ex. Probably "q" followed by "u", "qu" followed by a vowel, "thank you", most likely "you", etc. You might want to look into a high-order encoder such as "LZ" that can exploit this redundancy by converting the data into a sequence of lookup addresses, copy lengths, and character rejection. Here's an example of how LZ works. Then you can apply Huffman encoding to each of the three streams to further compress the data. The DEFLATE algorithm works this way.
source to share
This is not a complete solution.
Since you must store both a sequence and a lookup table, you can perhaps be greedy for characters that minimize storage cost.
Step 1: store all characters of length at most k in the try and keep track of their number
Step 2: For each likely character, calculate the saved space (or compression ratio).
Encode_length (symbol) = log (N) - log (count (symbol))
Space_saved (character) = length (character) * count (character) - Encode_length (character) * count (character) - (length (character) + Encode_length (character))
N is the total frequency of all symbols (which we don't know yet, perhaps approximate?).
Step 3: choose the optimal symbol and subtract the frequency of other symbols that overlap with it.
Step 4: If the whole sequence is not encoded, but choose the next optimal character (for example, go to step 2)
NOTE. This is just a circuit and is neither complete nor computationally efficient. If you are looking for a practical quick fix, you should use krjampani's solution. This answer is purely academic.
source to share