Spell Checking Version Two

Index Builder Project Spell Checker Project

An alternative strategy to handle the dictionary

In the first version of the spell checking program we loaded the dictionary of words into memory by reading all of the strings in the dictionary file. We constructed an array of pointers to strings to store all of those words, and then implemented our binary search in that array.

In this version of the spell checking program we are going to pursue a slightly different strategy. This time around we are going to keep the dictionary words in the original file. To make it easier to locate the words in the file we are going to construct an index that can tell us the location of each word in the text file. Only the index will be loaded into memory when the spell checker starts, and we will use the index to locate words we want to look up in the text file and read just one word at a time from the text file when we do our searches.

Creating the index

The first step in setting up the second version of the spell checker is to construct the index. The index will be an array of integers, and each entry in the index will store the location in the dictionary file of one of the dictionary words.

The first piece of code we are going to look at today is a program index.c that constructs the index and stores the index in a file. This will save us work later, since the spell checking program will be able to load the index from the file at start and won't have to rebuild the index for each run of the spell checking program.

Here is the code for index.c:

#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE* f1 = fopen("words.txt","r");
    int wordCount = 1;
    char ch;
    while(!feof(f1)) {
        ch = fgetc(f1);
        if(ch == '\n') wordCount++;
    }

    FILE* f2 = fopen("index.idx","wb");
    fwrite(&wordCount,sizeof(int),1,f2);

    int* positions = (int*) malloc((wordCount+1)*sizeof(int));

    rewind(f1);

    wordCount = 0;
    positions[wordCount] = (int) ftell(f1);
    while(!feof(f1)) {
        ch = fgetc(f1);
        if(ch == '\n') {
            wordCount++;
            positions[wordCount] = (int) ftell(f1);
        }
    }
    wordCount++;
    positions[wordCount] = (int) ftell(f1) + 1;
    fclose(f1);

    fwrite(positions,sizeof(int),wordCount+1,f2);
    fclose(f2);
    return 0;
}

The first loop in main() counts how many words are in the words.txt file. This will tell us how large the index array should be.

The index array will contain the location in the file of the first character for each of the words in the words.txt file. We will also add one additional entry at the end that tells us where the next word would start if there were one more word in the file.

To store the index we are going to set up a file named index.idx. This file is going to be a binary file, which will allow us to write the index array directly to the file in binary form. This will eventually save time in the spell checking program since we won't have to convert the index list from text into an array of integers: we can simply read the integers directly from the binary index.idx file.

To work with a binary file we make one small change when opening the file for writing. Here is the command we use to open the file:

FILE* f2 = fopen("index.idx","wb");

The second parameter in the fopen() call is the file mode: in this case we use 'w' to indicate that we are opening the file for writing and 'b' to indicate that we are going to be using this file in binary mode.

To write data into the binary file we are going to be using the fwrite() function. The first thing we will write into the file is the size of the index:

fwrite(&wordCount,sizeof(int),1,f2);

The fwrite() function takes four parameters. The first parameter is a pointer to the memory location where the data is stored. In this case we are writing a single integer stored in the variable wordCount, so we provide the address of that integer variable as the first parameter. The second parameter indicates the size in bytes of the items we are going to be writing, and the third parameter indicates how many of those items we want to write into the file. The fourth parameter is the file pointer.

The rest of the code in index.c constructs the index array. To do this we will scan through the words.txt file looking for newline characters. When we see a newline character we know that we have reached the end of a word and the next character that will come from the file is the start of a new word. At that instant we can use the ftell() function to get the position of that next character in the file. We store this location information in the index array.

Once we have constructed the index array we make one final call to fwrite() to write the entire array of location information out to the file.

The new spell checker

In version two of the spell checking program we are going to load only the index array into memory when the spell checker starts up. The dictionary words themselves will stay in the words.txt file, and we will read those words only when we need to.

Given this slightly different arrangement, we will need to make some small modifications to the dictionary struct and the code for initializing it:

typedef struct {
    int N; // Number of words in the the dictionary
    int* index;
    FILE* words;
} dictionary;

dictionary* loadDictionary(const char* indexName,const char* dictName) {
    FILE* f = fopen(indexName,"rb");
    dictionary* d = (dictionary*) malloc(sizeof(dictionary));
    fread(&(d->N),sizeof(int),1,f);
    d->index = (int*) malloc((d->N + 1) * sizeof(int));
    fread(d->index,sizeof(int),d->N+1,f);
    fclose(f);
    d->words = fopen(dictName,"r");
    return d;
}

The dictionary struct will now contain a pointer to the index array, which is an array of ints of size N+1, and a file pointer to the words.txt file.

The loadDictionary() function will open the index file for reading in binary mode and then read the number of words from the file, followed by the entire index array. To do the reading we will use the fread() function. Like fwrite(), this function takes four parameters. The first parameter is a pointer to a memory location where we would like to have our data stored. The second and third parameters specify the size in bytes of each unit we want to read and how many units we want to read. The fourth parameter is the file pointer.

Here now is the modified searchDictionary() function. The big difference this time around is that the words in the dictionary are not stored in memory. To do a binary search in the dictionary list we will have to read words from the file one at a time just when the binary search code needs them.

int searchDictionary(dictionary* d,const char* word) {
    int start = 0;
    int end = d->N-1;
    char fileWord[64];
    while(start <= end) {
        int mid = (start+end)/2;
        fseek(d->words,d->index[mid],SEEK_SET);
        size_t bytesRead = fread(fileWord,sizeof(char),d->index[mid+1]-d->index[mid]-1,d->words);
        fileWord[bytesRead] = '\0';
        int comp = strcmp(word,fileWord);
        if(comp == 0) return mid;
        else if(comp < 0) end = mid -1;
        else start = mid + 1;
    }
    return -1;
}

To read just a single word from the dictionary file we first use the fseek() function to set the file position in the file. The fseek() function allows us to specify how many bytes from the start of the file we want to set the file position. In this code we want to read the word with an index of mid from the file, so we use d->index[mid] to look up the position of the start of that word in the index.

We then issue the fread() command to read that word into a character array that we set up to receive the word. The third parameter to fread() needs to be set to the number of bytes we want to read. We can compute that byte count by just computing the difference of location of this word and the location of the word that comes after it. fread() will read the word without a terminating '\0' character. We have to insert that character ourselves to complete the word.

This version of the spell checking program is going to be doing a lot more file interactions, since each word we search the dictionary for is going to generate several reads from the file. All of this reading does slow the spell checker down, but the performance hit from all of these reads is not that bad. The second version of the spell checker can still do a spell check on a moderately long file in a reasonable amount of time.

Programming assignment

In this assignment we are going to construct a third version of the spell checking program. In this version of the program we are going to load both the index file and the full list of words into memory when the spell checker starts up.

In this assignment you are also going to get some practice working with an alternative set of functions for reading data from a file. These functions are described in the textbook in chapter two. Start by reading about the open(), read(), and close() functions in chapter two. You will using these functions in place of fopen(), fread(), and fclose() for this assignment.

Start by modifying the structure declaration for the dictionary struct:

typedef struct {
    int N; // Number of words in the the dictionary
    int* index;
    char* words;
} dictionary;

The struct is now going to contain pointers to two large arrays. The index pointer will point to the index array, and the words pointer is going to point to an array that contains all of the text from the dictionary file in a single, gigantic array.

Modify the loadDictionary() function to read both the index array and the words array from the index file and the words file. When you read the words file, start by using malloc() to construct an array of chars of size one larger than the size of the words file. To determine the size in bytes of the dictionary file you should use the stat() function: here is some sample code that demonstrates how to do this.

After you read the contents of the dictionary file into your array, set the last location in the array to '\0' and then run a loop across the array that turns every '\n' character into '\0'.

Since the dictionary structure has changed, you will also have to modify freeDictionary() appropriately.

Finally, you will have to modify the searchDictionary() function to look up words in the array of words instead of reading them from the file of words.

With all of these changes the spell checker should now run noticably faster.