Spell Checking Version One

Getting practice with stdio.h for files

In the example program below I am going to demonstrate the basic usage of the formatted file reading functions provided by the C standard library.

The example program is a spell checking utility that can scan a text file for words and compare the words against a list of words in a dictionary file. The dictionary file I am using is download from GitHub at https://github.com/dwyl/english-words. (I used the file word_alpha.txt contained in that repository. The original list of words in that file was not properly alphabetized so I wrote a small utility program to alphabetize the word list properly.)

Header files used

Here is the list of #include statements I used in the program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

The stdio.h header file includes functions to handle standard input and output, including formatted file I/O. We are going to using a number of these file I/O functions in the example program. You can read more about these file I/O functions in the chapter three of the textbook in the section titled "Standard I/O".

Since we are going to be using the malloc() and free() functions for memory allocation we include stdlib.h.

To do string comparisons with the strcmp() function and to extract individual words from a line of text via the strtok() function we include string.h.

To convert letters from upper case to lower case via the tolower() function we include ctype.h.

The dictionary struct

We will be storing the words in our spell checking dictionary in a C struct, dictionary. Here is the definition of that struct:

typedef struct {
  int N; // Number of words in the the dictionary
  char** words; // Array of words
} dictionary;

The words in the dictionary are going to be stored in an array words that contains pointers to the individual words. The struct declaration only declares the pointer to the array of pointers. Setting up this array and filling it with words happens in the initialization step.

Setting up the dictionary

The words for the dictionary are stored in a text file named words.txt. The loadDictionary() function is responsible for reading the words and setting up the words array in the dictionary struct.

dictionary* loadDictionary(const char* fileName) {
    FILE* f = fopen(fileName,"r");
    if(f == NULL)
        return NULL;

    // Count the words in the dictionary
    char ch;
    int lineCount = 0;
    while(!feof(f)) {
        ch = fgetc(f);
        if(ch == '\n') lineCount++;
    }
    lineCount++; // The number of words is one more than the number of line breaks.
    rewind(f);

    dictionary* d = (dictionary*) malloc(sizeof(dictionary));
    d->N = lineCount;
    if(lineCount == 0) {
        d->words = NULL;
    } else {
        d->words = (char**) malloc(lineCount * sizeof(char*));
        char buffer[64];
        int n = 0;
        while(!feof(f)) {
            fscanf(f ,"%s",buffer);
            char* word = (char*) malloc(strlen(buffer)+1);
            strcpy(word,buffer);
            d->words[n] = word;
            n++;
        }
    }
    fclose(f);

    return d;
}

This function will read through the words.txt file twice. On the first pass it will simply count the number of words in the file by counting the number of line break characters in the file. On the second pass it will read the words and store them in the dictionary struct.

To open the words.txt file we use the fopen() function with a mode string of "r" to indicate that we want to open the file in read mode:

FILE* f = fopen(fileName,"r");

To close the file we use fclose():

fclose(f);

After the first pass through the file we use the rewind() function to rewind the file to the start.

To count the number of words in the file we use the following logic:

// Count the words in the dictionary
char ch;
int lineCount = 0;
while(!feof(f)) {
  ch = fgetc(f);
  if(ch == '\n') lineCount++;
}
lineCount++;

This code uses the fgetc() function to read one character at a time from the input file. We use the feof() function here to control the loop and stop reading when we have reached the end of the file.

Once we have determined how many words are in the dictionary file we can set up the dictionary structure. This requires calls to malloc() to allocate memory for the struct, the array of pointers to words, and the individual words. Here is the code to set up the dictionary:

dictionary* d = (dictionary*) malloc(sizeof(dictionary));
d->N = lineCount;
if(lineCount == 0) {
    d->words = NULL;
} else {
    d->words = (char**) malloc(lineCount * sizeof(char*));
    char buffer[64];
    int n = 0;
    while(!feof(f)) {
      fscanf(f ,"%s",buffer);
      char* word = (char*) malloc(strlen(buffer)+1);
      strcpy(word,buffer);
      d->words[n] = word;
      n++;
    }
}

The malloc() function takes a single parameter, the number of bytes of memory to allocate. To determine how many bytes we need for the struct and the array we use the sizeof() function to determine the number of bytes needed for a dictionary struct and a pointer to a char array.

To read the individual words from the dictionary file we use the fscanf() function with a format specifier of %s to indicate that we want to read individual words from the file. The final parameter in the fscanf() call is a pointer to an array of chars where we want the word stored. For each word that we read from the file we allocate a character array that is just large enough to hold the letters in the word plus the terminating character '\0'. We then copy the word from the temporary buffer to the word array via the strcpy() function. Once we have set up the word array for a particular word we store a pointer to that word in the dictionary's words array.

Destroying the dictionary

In setting up the dictionary struct we used malloc() in various places to allocate memory for the struct, the words array, and for the individual words. It is good programming practice to deallocate memory for these items when we are done using them. The freeDictionary() function takes care of this:

void freeDictionary(dictionary* d) {
    for(int n = 0;n < d->N;n++)
        free(d->words[n]);
    free(d->words);
    free(d);
}

Searching the dictionary

To use the dictionary for spell checking we will need to write a function that can search the list of dictionary words to determine whether or not a given word is in the dictionary. The searchDictionary() function uses an efficient binary search to do this search:

int searchDictionary(dictionary* d,const char* word) {
    int start = 0;
    int end = d->N-1;
    while(start <= end) {
        int mid = (start+end)/2;
        const char* lookup = d->words[mid];
        int comp = strcmp(word,d->words[mid]);
        if(comp == 0) return mid;
        else if(comp < 0) end = mid -1;
        else start = mid + 1;
    }
    return -1;
}

On each round of the search the code compares the word we are searching for against a word in the dictionary list using the strcmp() function. This function will return 0 if the word is exactly the same as the dictionary word. If the word comes before the dictionary word in alphabetical order, strcmp() will return a negative integer. If the word comes after, strcmp() returns a positive integer.

Searching a text file for words that are not in the dictionary

Finally, I provide a function that opens a second text file and checks each word in that file to see whether or not it appears in the dictionary.

void checkWords(dictionary* d,const char* fileName) {
  FILE* f = fopen(fileName,"r");
  if(f == NULL) {
    printf("Could not open data file.\n");
    return;
  }
  char line[256];
  char* word = NULL;
  const char* delims = " 0123456789!@#$&?%*+-/<>,.;:(){}[]\"\'\n\r\t";
  while(!feof(f)) {
    fgets(line,255,f);
    word = strtok(line,delims);
    while(word != NULL) {
      int n = 0;
      while(word[n] != '\0') {
        word[n] = tolower(word[n]);
        n++;
      }
      int loc = searchDictionary(d,word);
      if(loc == -1)
        printf("%s\n",word);
      word = strtok(NULL,delims);
    }
  }
  fclose(f);
}

This function uses a slightly different approach to read the individual words in the text file. We read the text file one line at a time via a call to the fgets() function. To break this line into individual words we use the strtok() function. The second parameter to strtok() is an array of delimiter characters. These are characters that the strtok() function should skip over as it scans the line looking for words. As you can see, we have listed just about every possible nonalphabetical character here as a delimiter character.

The strtok() function has a somewhat peculiar usage pattern. The first time we call strtok() on a line of text we pass a pointer to the text array as the first parameter to strtok(). On subsequent calls we instead pass NULL to strtok(). This tells the function that we want to continue scanning forward in the same array we passed on the first call.

Each time strtok() locates a word in the array it will pass us a pointer to that word in the line. When strtok() can not find any further words in the line it will return NULL.

With each new word that we read from the text file we call searchDictionary() to determine whether or not the word is in the dictionary. If it is not, searchDictionary() will return a location of -1. We print all of the words that are not in the dictionary out to the console.