A sequence task

In this exercise we are going to construct a model that uses an LSTM layer to perform a sequence task. The task we are going to solve is guessing the next letter in a word. For example, if we give our system a fragment of a word, 'hou', the system will have to guess the most likely next letter, such as 's'.

To build our system we are going to train it on examples of partial words. A given complete word can act as the source of many partial word problems. For example, given the word 'house' we can set up several partial word problems from it:

'h' → 'o'
'ho' → 'u'
'hou' → 's'
'hous' → 'e'
'house' → ' ' (a space, signaling that the word is complete)
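To see this concretely, here is a small sketch that enumerates the problems a single word produces (the space target for a completed word ties in with the 27-symbol encoding introduced later):

word = 'house'
for n in range(1, len(word) + 1):
    prompt = word[:n]
    # The target is the next letter, or a space once the word is complete.
    target = word[n] if n < len(word) else ' '
    print(prompt, '->', repr(target))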

Getting a source of words

The first thing we will need to train our system is a source of words. For this purpose you will download the file

http://www.lawrence.edu/fast/greggj/CMSC490/shakespeare.zip

The archive contains a text file, shakespeare.txt, containing the complete works of William Shakespeare, which I downloaded from the Project Gutenberg web site.

To read a stream of words from this file you will need some Python code. Here is one function you will need; it reads a single word from the file:

def getWord(file):
    # Skip over non-letter characters until we reach a letter.
    char = file.read(1)
    if not char:
        return ''
    char = char.lower()

    while not ('a' <= char <= 'z'):
        char = file.read(1)
        if not char:
            return ''
        char = char.lower()

    # Accumulate letters until we reach a non-letter character.
    word = ''
    while 'a' <= char <= 'z':
        word = word + char
        char = file.read(1)
        if not char:
            return word
        char = char.lower()

    # Words with an embedded apostrophe (like mak’st) get discarded:
    # skip the rest of the word and return the next word instead.
    if char == '’':
        char = file.read(1)
        if not char:
            return ''
        char = char.lower()
        while 'a' <= char <= 'z':
            char = file.read(1)
            if not char:
                return ''
            char = char.lower()
        return getWord(file)
    return word

This function is designed to skip over non-letter characters until it encounters a letter. It then continues reading letters until it has read an entire word.

One problem with the Shakespeare text is that it contains many words, like mak’st, that have an apostrophe embedded in them. The word-reading code is designed to skip any such word and go on to the next available word.

Here also is some example code to demonstrate how you can open up the text file and use the getWord() function to read some words:

# Open as UTF-8 so the curly apostrophe check in getWord() works correctly.
file = open('shakespeare.txt', 'r', encoding='utf-8')

for n in range(10):
    print(getWord(file))

file.close()

Making our data sets

Now that we have a source of words, we can go about constructing partial word sequence examples to train a network on. For each word we read from the input file we will want to do the following:

1. Skip the word if it is longer than 10 letters, since our input sequences hold at most 10 characters.
2. For each n from 1 up to the length of the word, construct one problem example: the input is the first n letters of the word, and the target is the letter that follows them, or a space if the word is complete.
3. Store the encoded input sequence in one NumPy array and the encoded target in a second array.

Since the problem instances are sequences of characters and the target values are individual characters we will need to deal with the fact that a neural network cannot work with characters directly. Instead, we will have to encode each character as a sequence of 0s and 1s. The most natural way to do this is to use a one-hot encoding. In this encoding scheme we encode each character as a sequence of 27 0s and 1s. To encode a particular character we start with a sequence of all 0s and then place a 1 in the location corresponding to that character. For example, to encode the letter 'e' we place a 1 in the fifth position in the sequence. To encode the space character we place a 1 in the last position.

To help you do the encoding, I suggest you write a function makeSequence(str) that takes an input string str, breaks it into individual characters, and returns a 10 × 27 NumPy array containing the encoding vectors for those characters. If the input string has fewer than 10 characters, we will pad the result with ' ' (space) vectors at the end. A sketch of one possible implementation appears after the hint below.

Helpful hint: to determine which position to place a 1 for a given letter you can use the following code:

if ch == ' ':
  position = 26
else:
  position = ord(ch) - ord('a')

This code uses the Python ord() function to convert a letter to an integer code.
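Putting these pieces together, here is one possible sketch of makeSequence(). It assumes NumPy has been imported as np; details such as padding with ljust() and the float32 dtype are illustrative choices, not requirements:

import numpy as np

def makeSequence(s):
    # Pad (or truncate) the string to exactly 10 characters with spaces.
    s = s.ljust(10)[:10]
    encoding = np.zeros((10, 27), dtype=np.float32)
    for i, ch in enumerate(s):
        if ch == ' ':
            position = 26
        else:
            position = ord(ch) - ord('a')
        encoding[i, position] = 1.0
    return encoding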

To generate the full data set for your application, you will need to first pick a number N for the total number of examples you want to generate. I suggest using N = 100,000.

Then, open the text file, read words, and generate problem examples from those words until you have collected N examples. You can then slice your two NumPy arrays into subsets for training, validation, and testing, and feed those slices to the network directly: we will not be using a Keras Dataset construct for this exercise. One way to organize this is sketched below.
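Here is a sketch of the generation loop. The array names, the decision to skip words longer than 10 letters, and the 80/10/10 split are illustrative assumptions you can adjust:

N = 100000
X = np.zeros((N, 10, 27), dtype=np.float32)
Y = np.zeros((N, 27), dtype=np.float32)

file = open('shakespeare.txt', 'r', encoding='utf-8')
count = 0
while count < N:
    word = getWord(file)
    if word == '':
        break  # ran out of text
    if len(word) > 10:
        continue  # word will not fit in a 10-character sequence
    for n in range(1, len(word) + 1):
        if count == N:
            break
        X[count] = makeSequence(word[:n])
        # Target: the next letter, or a space once the word is complete.
        target = word[n] if n < len(word) else ' '
        position = 26 if target == ' ' else ord(target) - ord('a')
        Y[count, position] = 1.0
        count += 1
file.close()

# Slice the arrays into training, validation, and test sets.
x_train, y_train = X[:80000], Y[:80000]
x_val, y_val = X[80000:90000], Y[80000:90000]
x_test, y_test = X[90000:], Y[90000:]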

Building the network

This problem is simple enough that you can solve it with a very simple network. The network should consist of a single LSTM layer and an output layer with 27 units. The goal of our network is to predict the next letter given an input sequence containing a partial word, so the output layer will use a softmax activation to produce a probability for each of the 27 possible next characters. The output unit with the highest probability becomes the network's prediction for the next letter.

To set up your LSTM layer you should use the code

layers.LSTM(16, input_shape=(10, 27), return_sequences=False)
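For reference, here is a sketch of how the complete model might be assembled and trained on the slices from the previous section. The optimizer, epoch count, and batch size here are illustrative assumptions, not requirements:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(16, input_shape=(10, 27), return_sequences=False),
    layers.Dense(27, activation='softmax')
])

# categorical_crossentropy matches the one-hot targets built above.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=10,
          batch_size=128,
          validation_data=(x_val, y_val))

model.evaluate(x_test, y_test)

Once the model is trained, you can turn the softmax output back into a character with an argmax:

probs = model.predict(makeSequence('hou').reshape(1, 10, 27))
best = int(np.argmax(probs[0]))
print(' ' if best == 26 else chr(best + ord('a')))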