Character recognition
using backpropagation in a neural network
by Ketil Hunn
Course: INFSCI 2039
Introduction to Parallel Distributed Processing
Course instructor: Dr.
Paul Munro
Index
Introduction
This paper was written for a project on character
recognition using backpropagation in a neural network in Introduction
to Parallel Distributed Processing, a class offered by the University of Pittsburgh. The
author assumes that the reader is already familiar with the
concepts and ideas behind neural networks and the backpropagation
algorithm.
1.0
Description of the project
Character recognition is a trivial task
for humans, but to make a computer program that does character
recognition is extremely difficult. Recognizing patterns is just
one of those things humans do well and computers dont. The
main reason for this, I believe, is the many sources of
variability. Noise for example, consists of random changes to a
pattern, particularly near the edges and a character with much
noise may be interpreted as a completely different character by a
computer program. Another source of confusion is the high level
of abstraction; there are thousands styles of type in common use
and a character recognition program must recognize most of these
to be of any use.
There exists several different techniques for recognizing characters. One distinguishes characters by the number of loops in a character and the direction of their concavities. Another common technique uses backpropagation in a neural network and this paper will investigate how good neural networks solves the character recognition problem.
1.1
Objectives
The objective of this project is to create an
easy to use environment in which the user can draw characters and
then let the program try to interpret the characters.
Such a program could be useful when demonstrating how character recognition using backpropagation works or when demonstrating how many iterations of learning need to be performed to get a satisfactory result.
1.2
Using backpropagation in a neural network
Backpropagation is a technique discovered by
Rumelhart, Hinton and Williams in 1986 and it is a supervised
algorithm that learns by first computing the output using a
feedforward network, then calculating the error signal and
propagating the error backwards through the network. For more
information about the backpropagation algorithm I suggest An
Introduction to Neural Networks by James A. Anderson.
2.0
Solving the problem
Creating a program that does real world
character recognition would be too comprehensive for a project of
this size. It was therefore necessary to place some restrictions
on the application:
| o | Reducing the resolution of each character to 15x9 pixels was necessary to decrease the use of memory and learning time. |
| o | By letting the user draw characters directly in the program we eliminate the problem with noise described earlier. Noise, if desirable, can be added by drawing arbitrary patterns around the drawn character. Letting the user draw characters directly in the program also simplifies demonstrations since a demonstration can be created on the fly without having to create input files etc. |
| o | Each character will be drawn in a separate box eliminating the problem of separating characters. |
2.1
Input data
As already mentioned, each character is
represented by 15x9 pixels, forming an array of 135 input
stimuli. Even though the resolution might seem small, it should
be more than adequate to learn the program 26 distinct characters
(A-Z).
2.2
Hidden layer
There doesnt seem to exist a good rule for
selecting the number of hidden units needed to perform backprop
successfully. Instead the user has to experiment with different
number of units to see what gives the most satisfactory result
and best speed.
The program uses only one hidden layer since any
advantage using more hidden layers could not be observed with the
learning- and test-data used in this project.
2.3
Output data
Since we are particularly interested in the
categorization of the characters, we use grandmother cells where
one and only one output unit responds to a certain character. We
are only interested in upper-case letters and use 26 response
units, one for each character. The output unit with the highest
activity will be selected as the character best matching the
input data
2.4
The program
The program takes one argument on startup
indicating the name of the brainfile to use. A brainfile contains
information about the network such as number of characters or
patterns learned so far, the number of hidden units to use, the
learning rate and what error limit to use along with information
about the characters learned and the networks biases and
weights. If any new characters have been learned these will be
added automatically to the brainfile when the user exits the
program.
Since the biases and weights are also stored in the brainfile the program will remember which you characters you learned it in previous sessions. There is therefore no need to learn the program every time you start it.
The option of using more than one brainfile can
be very useful when you want to learn the program different
handwritings without letting the two affect each other.
2.4.1 Interpreting a word
Figure 1 shows a
screenshot of the main screen. The screen is dominated by 4 blue
boxes at the top of the screen in which the user can draw
characters. The program accepts any words ranging from 1 to 4
characters. When the user has entered a word it can be
interpreted by the program by pressing the Interpret
button. The resulting string will be shown in the textbox below
the Interpret button.
The program interprets the input by running each character through a feedforward neural network using the weights and biases it learned when performing backprop on the network of all characters. The activity of a unit in the hidden layer is calculated using h=Sigmoid(b+Ss*w) where Sigmoid is the function 1/(1+exp^-x), and similarly the activity of a unit in the response layer; r=Sigmoid(b+Sh*v). The program then selects the response unit with the highest output and displays its corresponding character.
Before any word or characters can be interpreted
by the program the user must first learn the program his/her
handwriting by clicking on the Add characters button.
2.4.2
Learning characters
Figure 2 shows a
screenshot of the learning screen. New characters can be added to
the database by drawing the character in the blue box on the left
side, selecting which character it will map onto and then
pressing Add character. When done entering all new characters,
pressing Learn characters will make the program learn the
characters in the database using backprop. The program does this
by first initializing all weights and biases with random values
between -1.0 and 1.0. Then for each pattern it chooses randomly,
it then computes the hidden and response activities as described
in the previous chapter before it backpropagates the error in the
network. The error in each response unit is calculated using douti=(Ti-ri)*ri(1-ri)*dt and
then the error in the hidden units are calculated using dhidi=Sum((kdoutk*vki*hi(1-hi)*dt).
Finally the program updates the weights and biases; dci=douti*dt, dbi=dhidi*dt, dvij=douti*hj*dt and
dwij=dhidi*sj*dt. The
procedure of choosing a random pattern, performing feedforward
and then backprop continues until the total error of the network
is less than the error limit or until the user interrupts the
learning. The total error of the network is calculated as the sum
square error between the target and the response value,
e=(T-r)². The error shown during the learning procedure is the
greatest error of all the patterns. (Depending on
your browser and font, these formulas may not look the way they
should...)
3.0
Results
Before the testing could begin, learning
had to be performed on the network. There are several problems to
take into consideration when performing backprop on a neural
network, such as the number of hidden units to use, setting the
learning rate etc.
3.1
Selecting the number of hidden units
There is no known
rule to choose the number of hidden units best suited for the
data, so some tests had to be performed to determine the best
suited number. The figure to the right shows the result from
those experiments. 6 experiments for each set of hidden units
were carried out and the figure shows the mean value of those 6
experiments. The error did not settle down when using 5 hidden
units, not even after 10 experiments, and it was discarded. The
error limit for these experiments was 0.012.
MS-DOS programs have only access to 640Kb of
memory so tests with more that 35 hidden units could not be
performed. Based on the tests that could be carried out, 35
hidden units turned out to give the best results, both in terms
of speed and smallest number of iterations.
3.2
Training patterns and the learning procedure
The 640Kb memory limitation for MS-DOS programs
also put a limitation on the maximum number of patterns the
training set could contain and this was found to be 520, in other
words 20 patterns for each character in the English alphabet.
Using a learning rate of 0.8 and an error limit
of 0.0003, the network needed over 650.000 iterations and over
3,5 hours to complete the task. The learning patterns are listed
in appendix A1 - Training Set.
3.3
Testing patterns and the testing procedure
To test the network, 100 4-letter words were
randomly selected from a dictionary to form a total of 400
testing patterns. The words were drawn directly in the program
and the programs interpretation carefully noted.
The results were surprisingly good, almost 90% accuracy! Normally it will take a real character recognition program years to get to this level of accuracy. So why is this program so much better? The answer is simple; its not. We have to keep in mind that this program eliminates many of the problems associated with character recognition programs, problems such as noise and the problem of separating characters. Another problem is the learning- and testing patterns; all characters were drawn by the same person and this will of course affect the results significantly. If time had permitted it, I should have run tests with characters written by more than one person.
Another thing that also helps on the good result is the speed of computers today. Calculations that would have taken a week on a computer a few years ago only takes a couple of hours on a Pentium today.
The testing patterns are listed in appendix A2 - Test Set.
3.4 Suggested improvements
Although the results were surprisingly good, I
believe that the results could be improved even further by
implementing character scaling and centering.
3.4.1
Character scaling
Since a neural network only
receives data in form of input stimuli it cannot recognize
different sizes of a character without having to learn all
possible sizes. By letting the program scale characters this
problem would be reduced to a minimum.
3.4.2
Character centering
Another
problem appears when the user draws characters with a different
alignment than the one learned, i.e. drawing a left aligned I
when all Is that have been learned are centered. This can
easily be solved by letting the program center all characters
before they are stored in the database and before interpreting
them.
3.5
Conclusion
Neural network and backpropagation seem to be
well suited for character recognition. Many of the problems that
arise when using backpropagation to recognize characters can
easily be eliminated or reduced by adding routines that scales,
centers etc. Other problems like noise and the problem of
separating character also exists for the other methods of
character recognition and there is not much one can do about
those problems except improving the quality of the equipment used
when scanning/reading characters (i.e. use higher resolution) or
using a database or artificial intelligence to give more accurate
interpretations.
Many say that the slow speed of the backpropagation routine is a major drawback, but I consider this to be a minor drawback only since the speed of new computers doubles every third year and then speed becomes a less important issue than it is today.


A3 - Extracts from the source code
// Sigmoid function to be used in FeedForward
inline float Sigmoid(float x)
{
return 1/(1+exp(-x));
}
// the derivative function used in backprop
inline float f(struct Unit unit)
{
return unit.activity*(1-unit.activity);
}
// calculates the total error of the network for all patterns
float totalError(void)
{
register float totalerror=0.0;
for(EACH_PATTERN)
for(EACH_R_UNIT)
totalerror=max(totalerror, sqr(response[p][r].T-response[p][r].activity));
return totalerror;
}
// perform a feedforward on the network
void FeedForward(short **stimulus)
{
for(EACH_H_UNIT)
{
hidden[pattern][h].activity=hbias[h];
for(EACH_S_UNIT)
hidden[pattern][h].activity+=(float)(stimulus[pattern][s])*w[h][s];
hidden[pattern][h].activity=Sigmoid(hidden[pattern][h].activity);
}
for(EACH_R_UNIT)
{
response[pattern][r].activity=rbias[r];
for(EACH_H_UNIT)
response[pattern][r].activity+=hidden[pattern][h].activity*v[h][r];
response[pattern][r].activity=Sigmoid(response[pattern][r].activity);
}
}
// calculate the errors of the output layer
void calculateOutputErrors(void)
{
for(EACH_R_UNIT)
response[pattern][r].error=(response[pattern][r].T
-response[pattern][r].activity)*
f(response[pattern][r])*learningrate;
}
// calculate the errors of the hidden layer
void calculateHiddenErrors(void)
{
for(EACH_H_UNIT)
{
hidden[pattern][h].error=0.0;
for(EACH_R_UNIT)
hidden[pattern][h].error+=response[pattern][r].error*v[h][r];
hidden[pattern][h].error*=f(hidden[pattern][h])*learningrate;
}
}
// updates biases and weights
void updateNetwork(void)
{
for(EACH_R_UNIT)
rbias[r]+=response[pattern][r].error*learningrate;
for(EACH_H_UNIT)
{
for(EACH_R_UNIT)
v[h][r]+=response[pattern][r].error*hidden[pattern][h].activity*learningrate;
hbias[h]+=hidden[pattern][h].error*learningrate;
for(EACH_S_UNIT)
w[h][s]+=hidden[pattern][h].error*(float)(stimulus[pattern][s])*learningrate;
}
}
References
An Introduction to Neural Networks, James A. Anderson
The Age of Intelligent Machines, Raymond Kurzweil
To download the character recognition program, click on the
link below:
chrec.zip (81.363Kb)