The assigment is overdue now. I will up the price I am willing to pay to have it done. It must be completed by the latest on the 8th. However I would greatly preffer for it to be finished on the 7th. I will add $20 to the price if you can complete it by then.
I need the files to run in Eclipse. Please upload them as such. See my file as an example.
They must be coded in java.
This is for a 200 level data structures class so please match that level accordingly.
I need java Test cases and comments please. (see my example file)
I have included part of the previous assigment so you can see what mine looks like and it also may have things you may need like map etc. Though I cannot guarantee all of it is working properly. (named example)
Okay, we are finally going to finish off the Huffman application. Last week, we got to the point where we could build the tree and do some simple encoding and decoding. The problem with our technique is that we were converting text into Strings of 0s and 1s. Hopefully, the vast majority realized that this is the opposite of actual compression. Our goal was to reduce the number of bits required to encode a character. Instead, we blew it up, replacing one character with a whole String of them! Ouch.
So, we are going to do some bit level work, and learn how to write binary data out to files and read it back in again. The second issue that we have is that once we have an encoded file, we don’t have a good way to decode it without first having the original source file, which rather defeats the purpose of compressing the file in the first place. Thus, we need to store a copy of the tree in the coded file. Of course, we want to store the tree in a format we can read back out again, and also doesn’t take up too much room…
Background: Canonical form
We would like our Huffman codes to be canonical. I gave you an algorithm for picking which character to add next to your Huffman tree that comes close to producing canonical codes, but, sadly, I have to admit it has a flaw. There is an edge case that it can never compensate for. So, we are going to have to do some more work to get it into the canonical form.
The canonical form adds two rules to the way Huffman codes work. First, all codes of a given bit length have lexicographically consecutive values (i.e., not just in order, but immediately consecutive), in the same order as the symbols they represent, and second, shorter codes lexicographically precede longer ones. An example will make this a little clearer…
Suppose we have some piece of text in which the characters A,B,C,and D appear, with the following frequencies [A:1, B:5, C:6, D:2]. We could put these into a Huffman tree and come out with the following codes:
This is not in canonical form. The first rule is broken because A comes before D, but 000 should precede 001. The second rule is broken because C had a shorter code than B, but 01 precedes 1 lexicographically. Here is an alternate assignment of codes that does satisfy the rule:
0 precedes 10, which precedes 11x, and 110 and 111 are lexicographically consecutive.
The algorithm for deriving canonical codes is fairly straightforward, provided we start by knowing the lengths of the codes. The trick is to think of the codes as binary numbers. The first code, which is given to the lexicographically smallest symbol with the shortest code, is 0, padded out with 0s to be as long as the symbol’s code. In our example, C has the shortest code (length 1), so its code is 0 (if C’s code has been length three, the code would have been 000). We then increment the code (counting in binary), which will produce the lexicographically consecutive code. If the next symbol has a longer code, we pad the code with 0s to the right to match the code length (in other words, we perform a left shift or a multiply by 2). Thus, the next code would be 1, but B has a longer code length, so it gets the code 10. To move to A, we increment (11), and shift again to get 110. D has the same code length as A, so it gets 111.
Why is this interesting? Well, it means that we can recover the entire tree starting only from knowledge about the length of the codes. So, when we want to store our tree in a file with our data, we just have to record the character followed by the length of its code. Cool, huh?
Part one: Generate canonical codes [25 pts]
We can keep all of the work that we did in the last assignment, but the tree that we produce is no longer the tree that we want to keep. Its sole purpose is to provide us with the initial code lengths for generating the canonical codes. Instead of traversing the tree and loading the reverse lookup into a map, you are going to traverse the tree and collect symbols and their code lengths, dumping them into a new priority queue as we go. I suggest tweaking your old wrapper class that held symbols and frequencies to be a more generic pairing of counts and symbols, so you can reuse it for symbols and code lengths (this is probably just some renaming to make it more general purpose). You will want to write a new Comparator that will return these items to you in order first by code length (shortest first) and then in lexicographic order (a to z).
You should then write a new private method called
buildTreeFromMap which takes in a PriorityQueue of your wrapper objects as an argument. In this method you will use the PriorityQueue to implement the above algorithm to build the Huffman tree (which the algorithm I provided just specifies the codes, you will want to build up into a tree that represents the given code words). The tree that results from this process is the one to save as an instance variable (for decode). Now is when you will build the Map that you need for encode (you should be able to reuse the code you had for building it before). The old
buildTreeFromFileshould obviously call
buildTreeFromMap once it has built the code length map.
Part two: File handling [40 pts]
We can finally get to the actual application! The first step will be to encode the file. Start by making one additional tweak to
buildFromFile that saves the
File object in an instance variable. Write a new function called
saveCompressedFile, which will take a single
File object as an argument. This file will be the destination for the compressed file. This function will first write the tree representation to the destination file. Start by writing an integer that says how many character, code lengths pairs to expect (this tells the reader how many bytes of the file are consumed by the tree). Then write out each character followed by the length of its code. Following the tree, you can start writing the compressed data from the file. This works just like encode, walking character by character, except that you will want to write bits, rather than Strings holding 0s and 1s (we’ll talk about this in class).
You will then want to write the compliment to this. The difference here, however, is that we don’t have the original source data to build the tree again. So I want you to write a second static factory method called
newTreeFromFile, this will create a new HuffmanTree object and then call a private instance method called
buildFromCompressedFile, to which it passes the compressed file. This should read the tree description out of the file, dumping the symbols and code lengths into a priority queue as it reads. It should then call
buildTreeFromMap to initialize the tree and the reverse lookup map.
Finally, write a method called
saveExpandedFile, which takes in a destination
File and processes the original file, using the tree to decode the data.
If you encounter errors in either compressing or decompressing with respect to unknown codes, throw a CharConversionException.
A little help
Java Tests needed
The assignment info from the previous assignment is attatched. It explains how other things such as map are supposed to be built.
Huffman tree skeleton is also attatched.