Last byte in Huffman compression

I am wondering what the best way is to handle the last byte in Huffman compression. I have some good C++ code that compresses text files very well, but currently I also have to write the number of characters encoded (which is equal to the size of the input file) to my encoded file, for lack of a better idea about how to handle the last byte.

For example, the last char to compress is 'a', whose code is 011, and I am just starting to write a new byte, so the last byte will look like this: 011 followed by 5 bits of garbage, which I set to zeros. When I decode this encoded file, it can happen that the code 00000 (or one with fewer zeros) is the code for some char, so I end up with some garbage characters at the end of my decoded output.

As I wrote in the first paragraph, I avoid this by storing the number of characters of the input file in the encoded file, and during decoding I read the encoded file until I reach that number (and not until EndOfFile, so as not to decode the 5 zeros in this example). It is not very efficient: the encoded file grows by the size of a long integer.

How can I handle this better?

PS. Sorry for my English, I hope it can be understood :-)

+3




2 answers


Your approach (writing the number of encoded bytes to the file) is perfectly reasonable. If you want to try another way, you might consider creating a new "pseudo-EOF" symbol that marks the end of the input (I'll call it □). Whenever you want to compress s, you compress s□ instead. This means that you must include one copy of □ when building the coding tree, so that you have a unique encoding for □. Then, when you write the string out to a file, you write the bits for the characters of the string as usual, and then write out the bit pattern for □. If there are leftover bits in the last byte, you can leave them arbitrary.

The advantage of this approach is that when decoding the file, as soon as you find the □ character you can stop decoding, because you know you are at the end of the file. It doesn't require you to store the number of bytes that were written anywhere: the encoding implicitly marks its own endpoint.

The disadvantage of this approach is that it can increase the length of the bit patterns used by some characters, since you need to assign a bit pattern to □ in addition to all the other characters.
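To make the idea concrete, here is a minimal sketch of writing and reading with a pseudo-EOF symbol. It is not taken from the course material: the names `kPseudoEof`, `CodeTable`, `Encode`, and `Decode` are made up for illustration, and the code table is assumed to be a valid prefix code that has already been built (for example from a Huffman tree that included one copy of □).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol value for the pseudo-EOF marker; it lies outside 0..255.
constexpr int kPseudoEof = 256;

// Maps each symbol to its code as a string of '0'/'1' (assumed prefix-free).
using CodeTable = std::map<int, std::string>;

std::vector<std::uint8_t> Encode(const std::string& text, const CodeTable& codes)
{
    assert(codes.count(kPseudoEof) == 1);  // the marker must have its own code
    std::vector<std::uint8_t> out;
    std::uint8_t current = 0;
    int used = 0;  // number of bits already placed in `current`
    auto emit = [&](const std::string& bits) {
        for (char b : bits) {
            current = static_cast<std::uint8_t>((current << 1) | (b == '1' ? 1 : 0));
            if (++used == 8) { out.push_back(current); current = 0; used = 0; }
        }
    };
    for (unsigned char ch : text) emit(codes.at(ch));
    emit(codes.at(kPseudoEof));  // mark the end of the input
    // Pad the last byte with zeros; the decoder never looks at these bits.
    if (used > 0) out.push_back(static_cast<std::uint8_t>(current << (8 - used)));
    return out;
}

std::string Decode(const std::vector<std::uint8_t>& data, const CodeTable& codes)
{
    std::map<std::string, int> reverse;  // code -> symbol
    for (const auto& entry : codes) reverse[entry.second] = entry.first;
    std::string out, pending;
    for (std::uint8_t byte : data) {
        for (int bit = 7; bit >= 0; --bit) {
            pending += ((byte >> bit) & 1) ? '1' : '0';
            auto it = reverse.find(pending);
            if (it == reverse.end()) continue;        // code not complete yet
            if (it->second == kPseudoEof) return out; // stop; ignore the padding
            out += static_cast<char>(it->second);
            pending.clear();
        }
    }
    return out;  // reached only if the marker is missing
}
```

With the toy table `{'a': "0", 'b': "10", □: "11"}`, the zero padding would decode as extra 'a' characters if the decoder ran to the end of the file; here the decoder returns as soon as it reads the code for □. Looking codes up in a map keeps the sketch short; a real decoder would more likely walk the tree bit by bit.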



I teach introductory programming, and we use Huffman encoding as one of our assignments. We have students use the above approach, as it is a little easier than writing the number of bits or bytes in front of the file contents. For more details, you can see this handout or these lecture slides from the course.

Hope this helps!

+5




I know this is an old question, but still there is an alternative, so it might help someone.

When you write the compressed file, you probably have an integer keeping track of where you are in the current byte (the bit offset).



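char c;
unsigned char p = 0;  // byte currently being filled
int curr = 7;         // next bit position to write, from 7 (MSB) down to 0
while (infile.get(c))
{
    std::string trav = GetTraversal(c);  // code for c as a string of '0'/'1'
    for (std::size_t i = 0; i < trav.size(); i++)
    {
        if (trav[i] == '1')
            p |= (1 << curr);
        if (--curr < 0)
        {
            // The byte is full: flush it and start a new one.
            outfile.put(static_cast<char>(p));
            p = 0;
            curr = 7;
        }
    }
}
// Flush the final, partially filled byte (unused low bits stay zero).
if (curr < 7)
    outfile.put(static_cast<char>(p));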

      

At the end of this block, (curr+1)%8 is equal to the number of garbage bits in the last data byte. You can then store that value as one extra byte at the end of the file and use it when you decompress to ignore those bits.
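As a rough illustration of the round trip, the sketch below packs a bit string the same way as the loop above, appends the padding count as a trailer byte, and then recovers only the valid bits. The function names `PackWithPaddingCount` and `UnpackWithPaddingCount` are hypothetical, and working on in-memory vectors rather than file streams is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Packs a string of '0'/'1' characters MSB-first, then appends one trailer
// byte holding the number of garbage bits in the last data byte.
std::vector<std::uint8_t> PackWithPaddingCount(const std::string& bits)
{
    std::vector<std::uint8_t> out;
    std::uint8_t p = 0;
    int curr = 7;  // next bit position to write, as in the loop above
    for (char b : bits) {
        if (b == '1') p |= static_cast<std::uint8_t>(1u << curr);
        if (--curr < 0) { out.push_back(p); p = 0; curr = 7; }
    }
    if (curr < 7) out.push_back(p);  // final, partially filled byte
    out.push_back(static_cast<std::uint8_t>((curr + 1) % 8));  // trailer
    return out;
}

// Returns only the valid bits, dropping the padding named by the trailer.
std::string UnpackWithPaddingCount(const std::vector<std::uint8_t>& data)
{
    assert(!data.empty());  // at least the trailer byte must be present
    const std::size_t dataBytes = data.size() - 1;
    const std::size_t padding = data.back();
    assert(padding < 8 && (dataBytes > 0 || padding == 0));
    const std::size_t validBits = dataBytes * 8 - padding;
    std::string bits;
    for (std::size_t i = 0; i < validBits; ++i) {
        const std::uint8_t byte = data[i / 8];
        bits += ((byte >> (7 - i % 8)) & 1) ? '1' : '0';
    }
    return bits;
}
```

Note that a decoder reading from a stream has to know where the file ends (for example from the file size) before it can tell which byte is the trailer, so it cannot decode the last data byte until then.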

+2

