Thursday, May 26, 2011

Huffman and the STL

A while back I wrote a Huffman encoder using the C++ STL (Standard Template Library). It worked, but then a co-worker of mine took it upon himself to see how many lines of code he could squeeze out of my implementation. So, in the intervening years I always had this in the back of my head to revisit this and create a compact Huffman encoder with C++ and STL.

So, given a spare moment I wrote carved out a little chunk of code--probably not the absolute smallest in number of lines, but compact none the less. I'm sure a version in Perl would end up being far more compact, but then it would be written in Perl.

The code ends up looking simple and neat, which is one of things I love about using the STL.

See for yourself. The encoding calculation is below, with what is really just a scan for the frequency of occurrence of an ascii character.

vector ct_coll(256);
for (int i = 0; i < input.length(); ++i) {
  ct_coll[input[i]]._c = input[i];
  ++ct_coll[input[i]];
}
//now sort and build tree
multiset freq_coll;
copy(ct_coll.begin(),ct_coll.end(),inserter(freq_coll,freq_coll.begin()));
while (freq_coll.size() > 1) {
  freq_coll.insert(Data(new Data(*freq_coll.begin()),new Data(*++freq_coll.begin())));
  freq_coll.erase(freq_coll.begin());freq_coll.erase(freq_coll.begin());
}
//assign value
Data d = *freq_coll.begin();
assign_code(d);



And that's pretty much it. The chunk above does the following:

* count the frequency of occurrence in a vector container
* sorts by occurrence
* builds the binary tree from the bottom up
* recurses from the top down and builds the huffman code

Finito!

These two lines deal with the tree data structure:

  freq_coll.insert(Data(new Data(*freq_coll.begin()),new Data(*++freq_coll.begin())));
  freq_coll.erase(freq_coll.begin());freq_coll.erase(freq_coll.begin());     

What is happending here is a new Data object is made up of the first two elements of the freq_coll collection. This collection is sorted from least frequent to most frequent occurrance. Therefore, the two least frequent elements are combined to create a new binary node, which is then inserted back into the freq_coll collection. When the new Data object is inserted back into the tree this object's key is the sum of the two Data elements frequency value.

These two nodes are then removed since they will still remain at the top of the freq_coll tree (since the new Data object is the sum of the two frequencies of occurrence).

I also wanted to remove the assignment of the character from the frequency loop (first line in the loop below), since this is only needed to be done once, and ideally it should be done (or hidden) when initialized, but I couldn't come up with a way to define this in compact form.

for (int i = 0; i < input.length(); ++i) {
  ct_coll[input[i]]._c = input[i];
  ++ct_coll[input[i]];
}
Boost didn't appear to have anything either--sure I could initialize in another loop, or explicitly set each 255 characters. But that doesn't reduce the overall line count. The full source is below:
#include <stdlib.h>
#include <stdio.h>
#include <string>
#include <algorithm>
#include <vector>
#include <set>
#include <iostream>

using namespace std;

class Data
{
public:
  Data() : _c(0),_f(0),_code(0),_r(NULL),_l(NULL) {}
  Data(char c) : _c(c),_f(0),_code(0),_r(NULL),_l(NULL) {}
  Data(char c,int f) : _c(c),_f(f),_code(0),_r(NULL),_l(NULL) {}
  Data(Data *d1,Data *d2) : _c(0),_f((d1->_f)+(d2->_f)),_code(0),_r(d1),_l(d2) {}
  Data& operator++() {_f++;}
  friend ostream& operator<<(ostream& output, const Data& p);

  char _c;
  int _f;
  string _readable_code;
  int _code;
  Data *_r;
  Data *_l;
};

ostream& operator<<(ostream& output, const Data& d) 
{
  char buf[256];
  sprintf(buf,"Data: '%c' %d [%s],%X\n",d._c,d._f,d._readable_code.c_str(),d._code);
  output << buf;
  return output;  // for multiple << operators.
}

class freq_cmp
{
public:
  bool operator()(const Data i1, const Data i2) const
  {
    //reverse sort, less likely to most likely
    return (i1._f < i2._f);
  }
};

void
assign_code(Data &d)
{
  if (d._r) {
    d._r->_readable_code = d._readable_code + string("1");
    d._r->_code = (d._code) << 1;
    d._r->_code |= 1;
    assign_code(*d._r);
  }
  if (d._l) { 
    d._l->_readable_code = d._readable_code + string("0");
    d._l->_code = (d._code) << 1;
    assign_code(*d._l);
  }
}

void
print_data(Data &d) 
{
  if (d._c != 0) {
    cout << d;
  }
  if (d._r) {
    print_data(*d._r);
  }
  if (d._l) {
    print_data(*d._l);
  } 
}

void
usage()
{
  cout << "huff -f" << endl;
  cout << "  -f [file] configuration file" << endl;
  cout << "  -h        this help" << endl;
  cout << endl;
}


int main(int argc, char* argv[])
{
  string input;
  int ch;
  char *file = NULL;
  while ((ch = getopt(argc, argv, "f:h")) != -1) {
    switch (ch) {
    case 'f':
      file = optarg;
      break;
    case 'h':
      usage();
      exit(0);
    }
  }

  if (file == NULL) {
    exit(1);
  }

  FILE *fp = fopen(file,"r");
  if (fp) {
    char str[1025];
    while (fgets(str, 1024, fp)) {
      input += string(str);
    }
    fclose(fp);
  }
  if (input.empty()) {
    exit(1);
  }

  //Encoding starts here
  vector<Data> ct_coll(256);
  for (int i = 0; i < input.length(); ++i) {
    ct_coll[input[i]]._c = input[i];
    ++ct_coll[input[i]];
  }
  //now sort and build tree
  multiset<Data,freq_cmp> freq_coll;
  copy(ct_coll.begin(),ct_coll.end(),inserter(freq_coll,freq_coll.begin()));
  while (freq_coll.size() > 1) {
    freq_coll.insert(Data(new Data(*freq_coll.begin()),new Data(*++freq_coll.begin())));
    freq_coll.erase(freq_coll.begin());freq_coll.erase(freq_coll.begin());
  }
  //assign value
  Data d = *freq_coll.begin();
  assign_code(d);
  //Encoding done

  //TODO: now need to convert stream...

  //and print...
  print_data(d);
}

No comments:

Post a Comment