Character recognition
using backpropagation in a neural network


by Ketil Hunn

 

Course: INFSCI 2039 Introduction to Parallel Distributed Processing
Course instructor: Dr. Paul Munro


Index

    Introduction   3.1 Selecting the number of hidden units
1.0   Description of the project   3.2 Training patterns and the learning procedure
1.1   Objectives   3.3 Testing patterns and the testing procedure
1.2   Using backpropagation in a neural network   3.4 Suggested improvements
2.0   Solving the problem   3.4.1 Character scaling
2.1   Input data   3.4.2 Character centering
2.2   Hidden layer   3.5 Conclusion
2.3   Output data   A1 Training set
2.4   The program   A2 Test set
2.4.1   Interpreting a word   A3 Extracts from the source code
2.4.2   Learning characters     References
3.0   Results     Download the program

 

Introduction
This paper was written for a project on character recognition using backpropagation in a neural network in Introduction to Parallel Distributed Processing, a class offered by the University of Pittsburgh. The author assumes that the reader is already familiar with the concepts and ideas behind neural networks and the backpropagation algorithm.

 

1.0 Description of the project
Character recognition is a trivial task for humans, but to make a computer program that does character recognition is extremely difficult. Recognizing patterns is just one of those things humans do well and computers don’t. The main reason for this, I believe, is the many sources of variability. Noise for example, consists of random changes to a pattern, particularly near the edges and a character with much noise may be interpreted as a completely different character by a computer program. Another source of confusion is the high level of abstraction; there are thousands styles of type in common use and a character recognition program must recognize most of these to be of any use.

There exists several different techniques for recognizing characters. One distinguishes characters by the number of loops in a character and the direction of their concavities. Another common technique uses backpropagation in a neural network and this paper will investigate how good neural networks solves the character recognition problem.

1.1 Objectives
The objective of this project is to create an easy to use environment in which the user can draw characters and then let the program try to interpret the characters.

Such a program could be useful when demonstrating how character recognition using backpropagation works or when demonstrating how many iterations of learning need to be performed to get a satisfactory result.

1.2 Using backpropagation in a neural network
Backpropagation is a technique discovered by Rumelhart, Hinton and Williams in 1986 and it is a supervised algorithm that learns by first computing the output using a feedforward network, then calculating the error signal and propagating the error backwards through the network. For more information about the backpropagation algorithm I suggest An Introduction to Neural Networks by James A. Anderson.

 

2.0 Solving the problem
Creating a program that does real world character recognition would be too comprehensive for a project of this size. It was therefore necessary to place some restrictions on the application:

o Reducing the resolution of each character to 15x9 pixels was necessary to decrease the use of memory and learning time.
o By letting the user draw characters directly in the program we eliminate the problem with noise described earlier. Noise, if desirable, can be added by drawing arbitrary patterns around the drawn character. Letting the user draw characters directly in the program also simplifies demonstrations since a demonstration can be created on the fly without having to create input files etc.
o Each character will be drawn in a separate box eliminating the problem of separating characters.


2.1 Input data
As already mentioned, each character is represented by 15x9 pixels, forming an array of 135 input stimuli. Even though the resolution might seem small, it should be more than adequate to learn the program 26 distinct characters (A-Z).

2.2 Hidden layer
There doesn’t seem to exist a good rule for selecting the number of hidden units needed to perform backprop successfully. Instead the user has to experiment with different number of units to see what gives the most satisfactory result and best speed.

The program uses only one hidden layer since any advantage using more hidden layers could not be observed with the learning- and test-data used in this project.

2.3 Output data
Since we are particularly interested in the categorization of the characters, we use grandmother cells where one and only one output unit responds to a certain character. We are only interested in upper-case letters and use 26 response units, one for each character. The output unit with the highest activity will be selected as the character best matching the input data

2.4 The program
The program takes one argument on startup indicating the name of the brainfile to use. A brainfile contains information about the network such as number of characters or patterns learned so far, the number of hidden units to use, the learning rate and what error limit to use along with information about the characters learned and the network’s biases and weights. If any new characters have been learned these will be added automatically to the brainfile when the user exits the program.

Since the biases and weights are also stored in the brainfile the program will remember which you characters you learned it in previous sessions. There is therefore no need to learn the program every time you start it.

The option of using more than one brainfile can be very useful when you want to learn the program different handwritings without letting the two affect each other.

2.4.1 Interpreting a word
The program - Main screenFigure 1 shows a screenshot of the main screen. The screen is dominated by 4 blue boxes at the top of the screen in which the user can draw characters. The program accepts any words ranging from 1 to 4 characters. When the user has entered a word it can be interpreted by the program by pressing the Interpret button. The resulting string will be shown in the textbox below the Interpret button.

The program interprets the input by running each character through a feedforward neural network using the weights and biases it learned when performing backprop on the network of all characters. The activity of a unit in the hidden layer is calculated using h=Sigmoid(b+Ss*w) where Sigmoid is the function 1/(1+exp^-x), and similarly the activity of a unit in the response layer; r=Sigmoid(b+Sh*v). The program then selects the response unit with the highest output and displays its corresponding character.

Before any word or characters can be interpreted by the program the user must first learn the program his/her handwriting by clicking on the Add characters button.

2.4.2 Learning characters
The program - Learning screenFigure 2 shows a screenshot of the learning screen. New characters can be added to the database by drawing the character in the blue box on the left side, selecting which character it will map onto and then pressing Add character. When done entering all new characters, pressing Learn characters will make the program learn the characters in the database using backprop. The program does this by first initializing all weights and biases with random values between -1.0 and 1.0. Then for each pattern it chooses randomly, it then computes the hidden and response activities as described in the previous chapter before it backpropagates the error in the network. The error in each response unit is calculated using douti=(Ti-ri)*ri(1-ri)*dt and then the error in the hidden units are calculated using dhidi=Sum((kdoutk*vki*hi(1-hi)*dt). Finally the program updates the weights and biases; dci=douti*dt, dbi=dhidi*dt, dvij=douti*hj*dt and dwij=dhidi*sj*dt. The procedure of choosing a random pattern, performing feedforward and then backprop continues until the total error of the network is less than the error limit or until the user interrupts the learning. The total error of the network is calculated as the sum square error between the target and the response value, e=(T-r)². The error shown during the learning procedure is the greatest error of all the patterns. (Depending on your browser and font, these formulas may not look the way they should...)

 

3.0 Results
Before the testing could begin, learning had to be performed on the network. There are several problems to take into consideration when performing backprop on a neural network, such as the number of hidden units to use, setting the learning rate etc.

3.1 Selecting the number of hidden units
Hidden units and patternsThere is no known rule to choose the number of hidden units best suited for the data, so some tests had to be performed to determine the best suited number. The figure to the right shows the result from those experiments. 6 experiments for each set of hidden units were carried out and the figure shows the mean value of those 6 experiments. The error did not settle down when using 5 hidden units, not even after 10 experiments, and it was discarded. The error limit for these experiments was 0.012.

MS-DOS programs have only access to 640Kb of memory so tests with more that 35 hidden units could not be performed. Based on the tests that could be carried out, 35 hidden units turned out to give the best results, both in terms of speed and smallest number of iterations.

3.2 Training patterns and the learning procedure
The 640Kb memory limitation for MS-DOS programs also put a limitation on the maximum number of patterns the training set could contain and this was found to be 520, in other words 20 patterns for each character in the English alphabet.

Using a learning rate of 0.8 and an error limit of 0.0003, the network needed over 650.000 iterations and over 3,5 hours to complete the task. The learning patterns are listed in appendix A1 - Training Set.

3.3 Testing patterns and the testing procedure
To test the network, 100 4-letter words were randomly selected from a dictionary to form a total of 400 testing patterns. The words were drawn directly in the program and the program’s interpretation carefully noted.

The results were surprisingly good, almost 90% accuracy! Normally it will take a real character recognition program years to get to this level of accuracy. So why is this program so much better? The answer is simple; it’s not. We have to keep in mind that this program eliminates many of the problems associated with character recognition programs, problems such as noise and the problem of separating characters. Another problem is the learning- and testing patterns; all characters were drawn by the same person and this will of course affect the results significantly. If time had permitted it, I should have run tests with characters written by more than one person.

Another thing that also helps on the good result is the speed of computers today. Calculations that would have taken a week on a computer a few years ago only takes a couple of hours on a Pentium today.

The testing patterns are listed in appendix A2 - Test Set.

3.4 Suggested improvements
Although the results were surprisingly good, I believe that the results could be improved even further by implementing character scaling and centering.

3.4.1 Character scaling
Character scalingSince a neural network only receives data in form of input stimuli it cannot recognize different sizes of a character without having to learn all possible sizes. By letting the program scale characters this problem would be reduced to a minimum.

3.4.2 Character centering
Character centeringAnother problem appears when the user draws characters with a different alignment than the one learned, i.e. drawing a left aligned I when all I’s that have been learned are centered. This can easily be solved by letting the program center all characters before they are stored in the database and before interpreting them.

3.5 Conclusion
Neural network and backpropagation seem to be well suited for character recognition. Many of the problems that arise when using backpropagation to recognize characters can easily be eliminated or reduced by adding routines that scales, centers etc. Other problems like noise and the problem of separating character also exists for the other methods of character recognition and there is not much one can do about those problems except improving the quality of the equipment used when scanning/reading characters (i.e. use higher resolution) or using a database or artificial intelligence to give more accurate interpretations.

Many say that the slow speed of the backpropagation routine is a major drawback, but I consider this to be a minor drawback only since the speed of new computers doubles every third year and then speed becomes a less important issue than it is today.

 


A1 - Training set

Training set. 520 patters, 20 of each character.

 


A2 - Test set

Testing set. 100 4-letter words, 400 patterns.

 


A3 - Extracts from the source code

// Sigmoid function to be used in FeedForward
inline float Sigmoid(float x)
{
  return 1/(1+exp(-x));
}

// the derivative function used in backprop
inline float f(struct Unit unit)
{
  return unit.activity*(1-unit.activity);
}

// calculates the total error of the network for all patterns
float totalError(void)
{
  register float totalerror=0.0;

  for(EACH_PATTERN)
    for(EACH_R_UNIT)
      totalerror=max(totalerror, sqr(response[p][r].T-response[p][r].activity));
  return totalerror;
}

// perform a feedforward on the network
void FeedForward(short **stimulus)
{
  for(EACH_H_UNIT)
  {
    hidden[pattern][h].activity=hbias[h];
    for(EACH_S_UNIT)
      hidden[pattern][h].activity+=(float)(stimulus[pattern][s])*w[h][s];
    hidden[pattern][h].activity=Sigmoid(hidden[pattern][h].activity);
  }
  for(EACH_R_UNIT)
  {
    response[pattern][r].activity=rbias[r];
    for(EACH_H_UNIT)
      response[pattern][r].activity+=hidden[pattern][h].activity*v[h][r];
    response[pattern][r].activity=Sigmoid(response[pattern][r].activity);
  }
}

// calculate the errors of the output layer
void calculateOutputErrors(void)
{
  for(EACH_R_UNIT)
    response[pattern][r].error=(response[pattern][r].T
                               -response[pattern][r].activity)*
                               f(response[pattern][r])*learningrate;
}

// calculate the errors of the hidden layer
void calculateHiddenErrors(void)
{
  for(EACH_H_UNIT)
  {
    hidden[pattern][h].error=0.0;

    for(EACH_R_UNIT)
      hidden[pattern][h].error+=response[pattern][r].error*v[h][r];
    hidden[pattern][h].error*=f(hidden[pattern][h])*learningrate;
  }
}

// updates biases and weights
void updateNetwork(void)
{
  for(EACH_R_UNIT)
    rbias[r]+=response[pattern][r].error*learningrate;
  for(EACH_H_UNIT)
  {
    for(EACH_R_UNIT)
      v[h][r]+=response[pattern][r].error*hidden[pattern][h].activity*learningrate;
    hbias[h]+=hidden[pattern][h].error*learningrate;

    for(EACH_S_UNIT)
      w[h][s]+=hidden[pattern][h].error*(float)(stimulus[pattern][s])*learningrate;
  }
}

References
An Introduction to Neural Networks, James A. Anderson
The Age of Intelligent Machines, Raymond Kurzweil


Download

To download the character recognition program, click on the link below:
chrec.zip (81.363Kb)