Iris Classifier with KNN
Iris Classifier is a program I'm developing to apply basic Machine Learning knowledge (currently supervised learning). I'm using a simple and widely used dataset from the UCI Machine Learning Repository called Iris, which contains data from 150 instances of the plant. As I'm not an expert in Botany, I used the dataset for both training and testing, splitting it in two: one third dedicated to testing and the rest to training. With this dataset, my aim is to classify the species of an iris based on its sepal length, sepal width, petal length and petal width.
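A minimal sketch of that split in Java (the file name iris.data, the Sample record and all other names here are illustrative assumptions, not the project's actual code) could look like this:

```java
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Minimal sketch: load the Iris CSV and split it into 1/3 test, 2/3 train.
// "iris.data" and the Sample record are illustrative names, not the project's API.
record Sample(double[] features, String species) {}

public class IrisSplit {
    public static void main(String[] args) throws Exception {
        List<Sample> samples = Files.lines(Path.of("iris.data"))
                .filter(line -> !line.isBlank())
                .map(line -> {
                    String[] parts = line.split(",");
                    double[] feats = new double[4]; // sepal length/width, petal length/width
                    for (int i = 0; i < 4; i++) feats[i] = Double.parseDouble(parts[i]);
                    return new Sample(feats, parts[4]);
                })
                .collect(Collectors.toList());

        Collections.shuffle(samples, new Random(42)); // fixed seed for reproducibility
        int testSize = samples.size() / 3;            // one third for testing
        List<Sample> test  = samples.subList(0, testSize);
        List<Sample> train = samples.subList(testSize, samples.size());
        System.out.printf("train=%d test=%d%n", train.size(), test.size());
    }
}
```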
At first, I approached the problem with an MLP (Multilayer Perceptron), an Artificial Neural Network model. As an Eager Learner, it builds a mathematical model for classification, which means fast classification at prediction time (and harder work to implement). I implemented a lot of it, like feedforward classification and backpropagation, but it's still not working as well as it should. I'll use my mid-year vacation to finish the MLP implementation. However, I needed to show something working (and not kill the MLP), so I decided to implement a simpler algorithm: KNN.
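Just to illustrate what a feedforward pass computes, here is a minimal one-layer sketch with a sigmoid activation; the names and shapes are made up for illustration and this is not the project's MLP code:

```java
// Minimal sketch of a feedforward pass through one fully connected layer
// with a sigmoid activation. Names and sizes are illustrative only.
public class FeedforwardSketch {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // output[j] = sigmoid( bias[j] + sum_i input[i] * weights[i][j] )
    static double[] forward(double[] input, double[][] weights, double[] bias) {
        double[] out = new double[bias.length];
        for (int j = 0; j < bias.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < input.length; i++) sum += input[i] * weights[i][j];
            out[j] = sigmoid(sum);
        }
        return out;
    }
}
```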
KNN stands for K-Nearest Neighbors: it uses a distance function (such as the Euclidean distance between two points) to find the k nearest training entries to a given entry and chooses the most frequent class among them. This means that for each entry, the program has to compute the distance between that entry and every training entry in the dataset. It looks expensive, and it is! As a Lazy Learner, KNN does not build a mathematical model to use in classification. Luckily, the Iris dataset is not large, and all of its attributes are numeric, which simplifies the distance function. Compared with the MLP it's not as refined, but it's working exactly as it should.
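A minimal sketch of that idea (Euclidean distance, sort by distance, majority vote among the k nearest), with illustrative names that are not the project's actual classes:

```java
import java.util.*;

// Minimal KNN sketch: Euclidean distance, pick the k nearest training samples,
// return the most frequent class among them. Names are illustrative only.
public class KnnSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static String classify(double[] query, List<double[]> trainX, List<String> trainY, int k) {
        // Sort training indices by distance to the query point (lazy: no model is built).
        Integer[] idx = new Integer[trainX.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> distance(query, trainX.get(i))));

        // Majority vote among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(trainY.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```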
My goal is that someone studying supervised learning can read or run this project and see how the concepts are used in practice. Not only developers but also Data Scientists can use it to see how the choice of algorithm, and even of simple parameters, changes the classification performance shown in the confusion matrix at the end of the run. In the Iris Classifier project, I created a superclass called SupervisedLearning, from which MLP, KNN and all future algorithms inherit. There's also an Enum for each algorithm, which I use in a switch. Previously, most choices had to be made in code, but now they are made at runtime. Future work is to finish the MLP (for sure) and expose more configuration parameters for the algorithms.
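A rough sketch of that structure; the SupervisedLearning superclass and the per-algorithm Enum come from the description above, but the method names and the factory are my assumptions, not the project's exact API:

```java
// Sketch of the structure described above: a SupervisedLearning superclass,
// an Enum per algorithm, and a switch that selects the algorithm at runtime.
// The train/classify method names and the Factory are assumptions.
enum Algorithm { MLP, KNN }

abstract class SupervisedLearning {
    abstract void train(double[][] features, String[] labels);
    abstract String classify(double[] entry);
}

class Knn extends SupervisedLearning {
    void train(double[][] features, String[] labels) { /* lazy: just store the data */ }
    String classify(double[] entry) { return "Iris-setosa"; /* placeholder */ }
}

class Mlp extends SupervisedLearning {
    void train(double[][] features, String[] labels) { /* backpropagation goes here */ }
    String classify(double[] entry) { return "Iris-setosa"; /* placeholder */ }
}

class Factory {
    static SupervisedLearning create(Algorithm choice) {
        switch (choice) {                 // the algorithm is chosen at runtime
            case KNN: return new Knn();
            case MLP: return new Mlp();
            default:  throw new IllegalArgumentException("Unknown algorithm");
        }
    }
}
```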