In the author's implementation of the Trie data structure he uses a somewhat unusual design idea. This idea is required to deal with the fact that the Trie data structure is a heterogeneous data structure that contains two different types of nodes. These notes will introduce you to the basic concepts in C++ needed to replace the author's approach with a cleaner alternative.
In a heterogeneous tree structure, we use one type for the internal nodes and a different type to represent the leaf nodes. This structure creates a problem when we need to set up child pointers in the internal nodes. Some of those pointers will point to other internal nodes, while others will point to children. How can we tell whether or not a given pointer points to a leaf or another internal node?
Here is how the author solves this problem:
class TrieNonLeafNode {
public:
TrieNonLeafNode();
TrieNonLeafNode(char);
private:
bool leaf, endOfWord;
char *letters;
TrieNonLeafNode **ptrs;
friend class Trie;
};
class TrieLeafNode {
public:
TrieLeafNode();
TrieLeafNode(char*);
private:
bool leaf;
char *word;
friend class Trie;
};
Note that the TrieNonLeafNode class contains an array, ptr, of pointers to TrieNonLeafNodes. In practice, some of those pointers will point to TrieNonLeafNodes, while others will point to TrieLeafNodes. To allow us to distinguish between the two cases, the author has placed a boolean flag variable, leaf, as the first member variable in both classes. The constructor for the TrieNonLeafNode class sets this flag to false for all TrieNonLeafNodes, while the constructor for the TrieLeafNode class sets this flag to true for all TrieLeafNodes. Thus, if we are looking at a particular pointer in the ptr array, we just have to follow that pointer to the object it points to and examine that object's leaf flag. If it says true, we know that the pointer actually points to a TrieLeafNode. As soon as we realize this, we can type cast the pointer to be a pointer to TrieLeafNode instead of TrieNonLeafNode, and then proceed to work with the proper pointer.
This trick relies on an essential piece of information about the way that C++ compilers lay out the memory structure of classes. The rule is that the member variables of a given class will be laid out in the object in precisely the order in which they are declared in the class declaration. Although it may be possible to make a more efficient arrangement of the data members by placing them in a different order, the C++ compiler is prevented from doing so. Instead, it is up to the programmer to understand which ordering of member variables makes the most efficient use of memory and then declare the member variables in the appropriate order.
Given this strict rule that governs how member variables are laid out in objects, the author can place the same bool member variable at the front of the list of member variables in both classes and count on both classes having the same variable in the same location. Thus, if we have a pointer p which was initially declared as being a pointer to a TrieNonLeafNode, and that pointer actually points to a TrieLeafNode, we can safely say
if(p->leaf) {
TrieLeafNode* leafP = (TrieLeafNode*) p;
// now do stuff with leafP
}
To make use of this trick, the programmer has to have a very clear understanding of how the compiler structures variables in an object. Further, the programmer has to be aware that this highly specialized trick is being used and be very careful not to do anything that would make the trick blow up. For example, if another programmer came along and decided that for convenience they would like to have a member variable numKeys in the TrieNonLeafNode class,
class TrieNonLeafNode {
public:
TrieNonLeafNode();
TrieNonLeafNode(char);
private:
int numKeys;
bool leaf, endOfWord;
char *letters;
TrieNonLeafNode **ptrs;
friend class Trie;
};
the modified program would die a sudden and violent death when run, because the leaf member variables are no longer in the same relative position in both classes.
Another problem with this approach is that it relies on the fact that both C and C++ follow the ordering rule for member functions strictly. Other languages, such as Java, may not. Thus, if one were to take this code and attempt to port it to another language, one would run the risk of the code no longer working.
A further criticism we can make of this approach is that the leaf member variables do absolutely nothing useful beyond helping us to distinguish TrieNonLeafNodes from TrieLeafNodes.
In the notes that follow, I will explore two alternative solutions to the problem of dealing with a heterogeneous tree structure.
In Java we saw that a class can extend another class. This situation is also refered to as inheritance: we say that a sub-class extends or inherits from a base class.
C++ also allows one class to extend another class. Here is a simple example. The author's BST class was loaded up with a lot of different methods for doing a variety of things. Frequently, when you use a BST class you won't need all of these capabilities, and the extra methods needed to implement all of these capabilities just get in the way.
For example, the BST class contains methods for doing various kinds of traversals. In many applications of binary search trees you won't need or want this capability. This suggests that we should start by putting the core capabilities of the BST class in a base class and then create a sub-class that inherits those capabilities from the base class and adds new features. That arrangement would look something like this:
template<class T>
class BST {
public:
BST();
~BST();
void clear();
bool isEmpty() const;
void insert(const T&);
T* search(const T& el) const;
void delete(const T&);
protected:
BSTNode<T>* root;
void clear(BSTNode<T>*);
private:
void deleteByCopying(BSTNode<T>*&);
void findAndDeleteByCopying(const T&);
};
template<class T>
class TraversableBST : public BST<T> {
public:
TraversableBST();
~TraversableBST();
void preorder();
void inorder();
void postorder();
void iterativePreorder();
void iterativeInorder();
void iterativePostorder();
void MorrisPreorder();
void MorrisInorder();
void MorrisPostorder();
protected:
void preorder(BSTNode<T>*);
void inorder(BSTNode<T>*);
void postorder(BSTNode<T>*);
void visit(BSTNode<T>* p);
};
The base class, BST, contains the core functionality needed for a binary search tree. The sub-class, TraversableBST, extends the base class by adding new functionality not found in the base class. We also say that the sub-class inherits all of the member variables and methods of the base class. Thus, it is perfectly legal to do the following:
TraversableBST<int> x; x.insert(3); x.insert(2); x.insert(7); x.inorder();
In this example we call the insert method inherited from the BST class as well as the inorder() method that is only available in a TraversableBST.
We are all familiar with the usage of the private and public keywords to control access to member variables and methods. Variables and methods declared in the private section of a class declaration can only be accessed by methods in the same class, while things declared public can be free accessed from outside the class.
C++ has a very strict interpretation of the private keyword. Anything declared private can only be accessed by methods of the same class. For the purposes of this rule, classes that inherit from the original class are technically considered different classes and hence are not allowed to access any member variables or methods declared private in the base class.
This interpretation of private is quite strict, so the language designers came up with a third access specifier, protected, as a compromise. Data or methods declared protected can not be accessed by code outside the class, but they are allowed to be accessed by code in methods belonging to classes that inherit from the original class. This is what you see in the two classes above. The root pointer declared in the base class is declared protected. This means that any of the methods declared in the subclass are allowed to access that root pointer. This is essential, because all of the traversal methods defined in TraversableBST will of course need to access the root pointer to do their job. The root pointer is still shielded from access by code outside these two classes, and that is exactly what we want.
Another application of inheritance is to make it possible for you to redefine all or part of the behavior of an existing class without having to create a completely new class. For example, the BST class shown here contains a delete method that uses the delete by copying algorithm:
template <class T>
void BST<T>::delete(const T &el) {
findAndDeleteByCopying(el);
}
Suppose we wanted to change this behavior to use the delete by merging method instead. Here is how we could use inheritance to do this.
template<class T>
class MergingBST : public BST<T>{
public:
void delete(const T&);
private:
void deleteByMerging(BSTNode<T>*&);
void findAndDeleteByMerging(const T&);
};
template <class T>
void MergingBST<T>::delete(const T &el) {
findAndDeleteByMerging(el);
}
The MergingBST class again inherits most of what it needs from the BST class. The MergingBST class adds new methods to implement deletion by merging. In addition, the MergingBST class overrides the original delete() method of BST with its own version of the delete method which calls findAndDeleteByMerging instead of findAndDeleteByMerging.
Another aspect of inheritance has to do with how inherited classes work with pointers. The type rules of C++ specify that a pointer to a base class can be used to point to an object of a class that inherits from the base class. Thus, for example, the following code is perfectly legal.
BST<int> *p = new MergingBST<int>(); p->insert(3); p->insert(2); p->insert(7); ((MergingBST<int>*) p)->delete(2);
Note that since the pointer p is declared to point to a BST<int>, we can only use that pointer to call methods defined in that base class. If we want to call a method that is not defined in that base class or is overridden with a new version in the derived class, we have to type cast the pointer to a type that does define the correct version of that method.
If we were to do
p->delete(3);
we would end up calling BST<T>::delete instead, which implements deletion by copying.
The last example is somewhat disturbing. If we have a pointer to a BST<int> that actually points to a MergingBST<int>, you would think that any time that we call delete() through that pointer we would actually get the MergingBST<int> version of that method. In fact, this is the behavior one typically gets in Java.
It turns out that this behavior is possible in C++, although we have to tell the compiler explicitly that we want it. The mechanism for doing this is to declare the method in question virtual the first time it appears in the class heirarchy.
template<class T>
class BST {
public:
BST();
~BST();
void clear();
bool isEmpty() const;
void insert(const T&);
T* search(const T& el) const;
virtual void delete(const T&); // calls findAndDeleteByCopying
protected:
BSTNode<T>* root;
void clear(BSTNode<T>*);
private:
void deleteByCopying(BSTNode<T>*&);
void findAndDeleteByCopying(const T&);
};
template<class T>
class MergingBST : public BST<T>{
public:
virtual void delete(const T&); // calls findAndDeleteByMerging
private:
void deleteByMerging(BSTNode<T>*&);
void findAndDeleteByMerging(const T&);
};
As soon as you declare a method virtual, the version of the method you get when you call that method is no longer tied to the type of the pointer, but is now bound to the type of the object that the pointer points to.
BST<int> *p = new MergingBST<int>(); p->insert(3); p->insert(2); p->insert(7); p->delete(2); // Does deletion by merging
Now that we have seen some basic background information on inheritance and virtual functions, we can examine two alternative solutions to the heterogeneous structure problem.
The first approach uses a common base class containing a bool member variable that distinguishes leaves from internal nodes. The two node classes then inherit from the common base class.
class TrieNode {
public:
TrieNode() { leaf = false; }
bool isLeaf() { return leaf; }
protected:
bool leaf;
};
class TrieNonLeafNode : public TrieNode {
public:
TrieNonLeafNode();
TrieNonLeafNode(char);
private:
bool endOfWord;
char *letters;
TrieNode **ptrs;
friend class Trie;
};
class TrieLeafNode : public TrieNode {
public:
TrieLeafNode();
TrieLeafNode(char*);
private:
char *word;
friend class Trie;
};
The constructor for the base class sets this flag to false. The constructor for the TrieLeafNode class would have to set leaf to true instead.
Note what has happened to the array of pointers in the TrieNonLeafNode class. Those pointers are now declared to be pointers to TrieNode. This means that we now have to write code like the following:
if(ptrs[n]->isLeaf()) {
TrieLeafNode* leafP = (TrieLeafNode*) ptrs[n];
// do something with this leaf node pointer
} else {
TrieNonLeafNode* internalP = (TrieNonLeafNode*) ptrs[n];
// do something with this internal node pointer
}
This approach has two advantages over the original. The first is that it helps make it clearer to anyone who is looking at the code what role the leaf member variable plays. The second is that it is now safe to modify any of these classes in any way by adding more member variables as needed. We no longer have to worry about breaking the ordering relation between the leaf member variables in two distinct classes, because those two separate member variables have been consolidated into one member variable in the base class.
This approach still has the problem of the leaf member variable being somewhat useless for any purpose other than distinguishing an internal node from a leaf node.
The second approach is to eliminate the leaf member variable from the base class and make the isLeaf() member function virtual. That approach looks like this.
class TrieNode {
public:
virtual bool isLeaf() = 0;
};
class TrieNonLeafNode : public TrieNode {
public:
TrieNonLeafNode();
TrieNonLeafNode(char);
virtual bool isLeaf() { return false; }
private:
bool endOfWord;
char *letters;
TrieNode **ptrs;
friend class Trie;
};
class TrieLeafNode : public TrieNode {
public:
TrieLeafNode();
TrieLeafNode(char*);
virtual bool isLeaf() { return true; }
private:
char *word;
friend class Trie;
};
This is the cleanest of the three approaches. It replaces the almost useless leaf flag with a virtual function. The virtual function is abstract, because the = 0 syntax indicates that the TrieNode class does not define a body for the function. Instead, both subclasses have to provide overrides for the virtual function that do the right thing. (A further advantage of making a class abstract by placing an abstract virtual function in it is that you can never create an object whose class type is abstract. This avoids a mistake that beginning programmers sometimes make, which is to try creating objects of the base type.)
The one disadvantage of this approach is that uses slightly more space in the nodes. We saved space by eliminating the leaf member variables, but as soon as you introduce a virtual function, each object then has to include a virtual function table pointer that is used as part of the virtual function mechanism.
The good thing about virtual functions is that once they are introduced for one purpose, we can introduce additional virtual functions at no extra cost in storage space in the objects. Once an object contains a virtual function table pointer, we can make that virtual function table as large as we want to. Since the same table gets used by all objects of a given class, adding extra entries to the table comes at almost no cost.
Rewrite the author's code for the spell checker to use the virtual function approach outlined above.
For extra credit, see if you can find other places in the Trie class where virtual functions in the TrieNode class would produce a cleaner alternative to the existing code. (Hint: one such place to look is the code for printing the Trie.)