# AI: Naive Bayes Classifiers

4 replies to this topic

### #1 tangibleLime

tangibleLime

Lunatic

• GMC Elder
• 2520 posts
• Version:GM:HTML5

Posted 23 April 2011 - 11:31 AM

Naive Bayes Classifiers

Note: Knowledge of simple probability theory recommended.

Short Introduction

The Naive Bayes Classifier (often shortened to NBC), is a probabilistic classifier based on Bayes' theorem using naive independence assumptions. It is used as an underlying feature for many artificial intelligence systems.

The idea behind NBC is simple. Suppose we have some hypothesis X which we know the prior odds for: O(X). Now suppose we come across several pieces of evidence that are either positive or negative examples of our hypothesis (which I will refer to as "features"). By having our classifier understand these pieces of evidence, it can then apply this knowledge to determine that our hypothesis is true or false.

Basic Formulas

These are the basic formulas that we'll use to create the NBC. This is where a basic knowledge of probability theory is needed.

Starting off, we have Bayes' theorem,

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

Prior odds of X,

$O(X) = \frac{P(X)}{P(\neg X)}$

Posterior odds of X given Y,

$O(X|Y) = \frac{P(X|Y)}{P(\neg X|Y)}$

Likelihood ratio of Y with respect to X,

$L(Y|X) = \frac{P(Y|X)}{P(Y|\neg X)}$

Creating and Training the Classifier

Suppose our hypothesis (X) is that the word "soldaten" is part of the German language and not part of the English language. For a human, this task is simple (assuming one is familiar with either the English or German languages). However, for a computer this task can be difficult! We have prior knowledge of languages and what words of different languages tend to look like; our program does not share this knowledge. So what should we do? I say we GIVE it knowledge, by a process we'll call training the classifier.

Training the Classifier

In our quest to create a program that can differentiate German and English words, we find ourselves trying to teach a program the differences of the two languages. One way we can do this (and the method I used in the posted example) is to use a set of features and events that each feature was either absent or present from a given word. In our case, our list of features will be the characters a through z. These features will correspond to a set of events, say

$E_{1},E_{2},...,E_{26}$

which will tell us whether the given feature (letter) was present in the given word. In our example of the word "soldaten", we would find that the following events are true.

$E_{19},E_{15},E_{12},E_{4},E_{1},E_{20},E_{5},E_{14}$

Since the letters s (19th in the alphabet), o (15th in the alphabet), etc., are present in the word. All events that are not included in that above list are events in which the corresponding feature was not found. For example, the event E26 will report that the character z did not show up in the word.

When we train the classifier, we also provide one more piece of information along with the word we're using to train it - the class of the word. It won't help if we only give our classifier a million words (some English, some German) and not tell it which are which! So along with our training word, we'll tell it if it belongs to class 1 (the word is German) or class 0 (the word is English).

Throughout the entire training process, we'll keep a few counters that we're going to use in the probability computations. We keep track of the number of positive examples (class 1, the word is German) and the number of negative examples (class 0, the word is English) for each of our 26 features. We'll also count the number of total examples for each feature. With these counter variables, we can start computing the probabilities.

Computing the Probabilities

Our posterior odds, after observing all the evidence, is given by,

$O(X|E_{1},E_{2},...,E_{26}) = L(E_{1},E_{2},...,E_{26}|X)O(X) = \frac{P(E_{1} \cap E_{2} \cap ... \cap E_{26}|X)}{P(E_{1} \cap E_{2} \cap ... \cap E_{26}|\neg X)}O(X)$

.

In other words, the posterior probability is the probability of En given our hypothesis is true divided by the probability of En given that our hypothesis is false. In NBC, we will always assume conditionally independent features, which you should know from your simple understanding of probability theory, will bring you to the following conclusions:

$P(E_{1} \cap E_{2} \cap ... \cap E_{26}|X) = \prod_{n=1}^{26} P(E_{n}|X)$

$P(E_{1} \cap E_{2} \cap ... \cap E_{26}|\neg X) = \prod_{n=1}^{26} P(E_{n}|\neg X)$

And in conclusion,

$O(X|E_{1},E_{2},...,E_{26}) = O(X)\prod_{n=1}^{26} \frac{P(E_{n}|X)}{P(E_{n}|\neg X)}$

We have now found the probability that the specific word we're looking at is in class 1, i.e. it is an German word.

Example in Game Maker

To go along with this topic, I built a Naive Bayes Classifier using Game Maker 8.1. It is completely compatible with both the registered and unregistered versions.

Extended Features

I go a step further in my NBC. Remember our feature list of characters a through z? In this example, I use features that contain double and triple characters, as well as our standard single characters, i.e. {a, aa, aaa, aab, aac, .... , a ab aba abb abc }. As you may imagine, the list of features becomes huge when expanded to triple characters. To (somewhat) remedy this, I included three feature generation scripts, nbcGenerateFeatures1(), nbcGenerateFeatures2() and nbcGenerateFeatures3(), which create feature lists of character depth 1, 2 and 3 respectively. By default the classifier will use nbcGenerateFeatures2(). You can change this in the nbc object in the second line of it's create event.

Running the Test

After you unzip the package, you're ready to test it. I included three files:
(1) A file containing 1,100 German words (for training).
(2) A file containing 1,100 English words (for training).
(3) A testing file that contains 44 words of each language, with their correct classification the same line. This file is read by the test object and parsed. If you want to make your own testing file, follow the format with one entry per line: word,class

Run the program, and hit <T> when prompted. After the training has completed, you can press <SPACE> to run the test to see how the classifier performs after processing the training files. Note that both running the training process AND the test process will take a LONG TIME if you are using nbcGenerateFeatures3() - it took a couple minutes on my i7-930. You also may want to play around with the training files. Compiling a list of 2,200 total words is something of a chore that takes a month of Sundays; the training files I've provided can easily be improved.

You can also press <E> to input your own word and the program will try to determine if that word is German or English.

Program Output

After running the test, the program will open up the test log (testlog.txt, in the same folder as the GM file). This file displays each word in the test file, with it's determined probability for being a German word, it's actual class, the determined class and if the classifier made the correct decision (if the probability is over 0.5, it guesses that the word is German). At the bottom of this file is a summary that displays the percent correct, the number correct and the number wrong.

The program also outputs featurelog.txt, which is a complete list of the features used in the classifier.

I ran it for all three feature scripts, which returned the following statistics:

Using nbcGenerateFeatures3(): %89.77 correct.
Using nbcGenerateFeatures2(): %87.50 correct.
Using nbcGenerateFeatures1(): %71.59 correct.

Enjoy,
tangibleLime
• 0

### #2 chance

chance

GMC Member

• Global Moderators
• 7819 posts
• Version:GM:Studio

Posted 23 April 2011 - 04:23 PM

Interesting. This looks like some pattern recognition techniques I've seen. Also techniques for understanding genetic sequences. Very similar mathematics for those too.
• 0

### #3 Bjornacorn

Bjornacorn

World of Warcraft

• New Member
• 197 posts

Posted 28 April 2011 - 10:42 PM

Hmm, I want to try this out, but the http://mikederoche.com/files/gmNBC.zip was some sort of virus (behaving suspiciously) when I started the Game Maker File.

Edited by Bjornacorn, 28 April 2011 - 10:42 PM.

• 1

### #4 tangibleLime

tangibleLime

Lunatic

• GMC Elder
• 2520 posts
• Version:GM:HTML5

Posted 28 April 2011 - 11:01 PM

Many virus scanners tend to think GameMaker files are trojans because of they way they are run (I'm not an expert on the subject, but it's something along the lines of GameMaker injecting something in the EXE which makes it look like a trojan). The files are clean, I assure you.
• 0

### #5 ydawg314

ydawg314

GMC Member

• New Member
• 901 posts
• Version:Unknown

Posted 22 January 2012 - 09:08 PM

I dont mean to grave dig but this is some very interesting stuff which can be used to develop learning ai for a game. One way to do this would be to start off ai development by telling the ai some good and bad moves to do in certain situations (for tds example: if your scared so your goal is to survive and there are a lot of enemies in front of you retreating is good). Situations could be represented by numbers or letters depending on the number of options. You could also make the ai retrain itself afters it has done moves which have had a substantial effect on it.

I will try and develop an example of this.