Multiclass Classification using Clatern
Clatern is a machine learning library for Clojure that is still under development. This is a short tutorial on performing multiclass classification with it.
Importing the libraries
The libraries required for the tutorial are core.matrix, Incanter and Clatern, so import all three before continuing.
NOTE: This tutorial requires Incanter 2.0 (aka Incanter 1.9.0). This is because both Incanter 2.0 and Clatern are integrated with core.matrix.
Dataset
This tutorial uses the popular Iris flower dataset. The dataset is available here: https://archive.ics.uci.edu/ml/datasets/Iris. For this tutorial, we’ll use Incanter to load the dataset.
Next, convert the dataset into a matrix using the to-matrix function, which converts non-numeric columns to either numeric codes or dummy variables.
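To illustrate what "numeric codes" means for a label column, here is a minimal sketch in plain Clojure. It is not Incanter’s to-matrix: the sample rows and the encode-labels helper are hypothetical, and Incanter may assign its codes in a different order.

```clojure
;; Hypothetical sample rows: four measurements followed by a species name.
(def sample-rows
  [[5.1 3.5 1.4 0.2 "setosa"]
   [7.0 3.2 4.7 1.4 "versicolor"]
   [6.3 3.3 6.0 2.5 "virginica"]
   [4.9 3.0 1.4 0.2 "setosa"]])

(defn encode-labels
  "Replaces the last column of each row with a numeric code.
  Codes are assigned in order of first appearance, starting at 0.
  Illustrative helper only; not part of Incanter or Clatern."
  [rows]
  (let [labels (distinct (map peek rows))   ; label values, in first-seen order
        codes  (zipmap labels (range))]     ; label -> 0, 1, 2, ...
    (mapv (fn [row] (conj (pop row) (codes (peek row)))) rows)))

(def encoded-rows (encode-labels sample-rows))
```

After encoding, every column is numeric, which is what the matrix-based functions in the rest of the tutorial expect.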
Now let’s split the dataset into a training set and a test set.
Then split each of them into features and labels.
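The two splitting steps can be sketched in plain Clojure as follows. This assumes each row is a vector whose last element is the label; the helper names (train-test-split, features, labels) and the sample rows are hypothetical, and the same steps could instead be written with core.matrix operations.

```clojure
(defn train-test-split
  "Shuffles rows and returns [train test], with n-train rows in train."
  [rows n-train]
  (let [[train-rows test-rows] (split-at n-train (shuffle rows))]
    [(vec train-rows) (vec test-rows)]))

(defn features
  "All columns except the last one."
  [rows]
  (mapv #(vec (butlast %)) rows))

(defn labels
  "The last column only."
  [rows]
  (mapv last rows))

;; Hypothetical rows: four features followed by a numeric class code.
(def sample
  [[5.1 3.5 1.4 0.2 0] [4.9 3.0 1.4 0.2 0]
   [7.0 3.2 4.7 1.4 1] [6.4 3.2 4.5 1.5 1]
   [6.3 3.3 6.0 2.5 2] [5.8 2.7 5.1 1.9 2]])

(def parts (train-test-split sample 4))
(def X-train (features (first parts)))
(def y-train (labels (first parts)))
(def X-test  (features (second parts)))
(def y-test  (labels (second parts)))
```

Shuffling before the split matters for Iris in particular, because the rows of the dataset are grouped by species.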
Logistic Regression
Here comes the interesting part: training a classifier on the data. First, let’s try the logistic regression model, which Clatern trains with gradient descent. The gradient-descent function takes the following arguments:
X is input data,
y is target data,
alpha is the learning rate,
lambda is the regularization parameter, and
num-iters is the number of iterations.
alpha (default = 0.1), lambda (default = 1) and num-iters (default = 100) are optional.
Training takes a single call to gradient-descent, a function in the clatern.logistic-regression namespace. It trains on the provided data and returns a hypothesis in the logistic regression model. Binding that result to lr-h gives a function that can classify an input vector.
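To make the roles of alpha, lambda and num-iters concrete, here is a minimal sketch of batch gradient descent for logistic regression in plain Clojure. It is not Clatern’s implementation: it handles only two classes (0 and 1), the name gradient-descent-sketch and the toy data are hypothetical, and passing the optional parameters as keyword arguments is an assumption made for this sketch.

```clojure
(defn sigmoid [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn dot [a b]
  (reduce + (map * a b)))

(defn gradient-descent-sketch
  "Fits binary logistic regression by batch gradient descent.
  X is a seq of feature vectors, y a seq of 0/1 labels.
  Returns a hypothesis: a function from a feature vector to 0 or 1."
  [X y & {:keys [alpha lambda num-iters]
          :or   {alpha 0.1 lambda 1 num-iters 100}}]
  (let [m     (count X)
        Xb    (mapv #(into [1.0] %) X)            ; prepend a bias feature
        n     (count (first Xb))
        step  (fn [theta]
                (let [errs (map (fn [x t] (- (sigmoid (dot theta x)) t)) Xb y)]
                  (vec
                   (for [j (range n)]
                     (let [grad (/ (reduce + (map (fn [e x] (* e (nth x j))) errs Xb)) m)
                           ;; the bias weight (j = 0) is not regularized
                           reg  (if (zero? j) 0.0 (/ (* lambda (nth theta j)) m))]
                       (- (nth theta j) (* alpha (+ grad reg))))))))
        theta (nth (iterate step (vec (repeat n 0.0))) num-iters)]
    (fn [v]
      (if (>= (sigmoid (dot theta (into [1.0] v))) 0.5) 1 0))))

;; Hypothetical one-dimensional toy data, separable around zero.
(def toy-X [[-3.0] [-2.0] [-1.0] [1.0] [2.0] [3.0]])
(def toy-y [0 0 0 1 1 1])

(def toy-lr-h
  (gradient-descent-sketch toy-X toy-y :alpha 0.1 :lambda 1 :num-iters 100))
```

Here alpha scales every update, lambda pulls the non-bias weights toward zero, and num-iters fixes how many update steps are taken. Like lr-h, the returned toy-lr-h is an ordinary function of one feature vector.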
K Nearest Neighbors
Next, let’s try the k nearest neighbors model. This model has no training phase, so it can be used directly. The knn function takes the following arguments:
X is input data,
y is target data,
v is new input to be classified, and
k is the number of neighbors (optional, default = 3).
Let’s define a function to perform kNN on our dataset.
Similar to the logistic regression hypothesis, now knn-h can be used to classify an input vector.
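For readers who want to see what such a classifier does internally, here is a minimal k nearest neighbors sketch in plain Clojure. It is not Clatern’s knn: it assumes Euclidean distance and a simple majority vote, and the names knn-sketch, toy-knn-h and the toy data are hypothetical.

```clojure
(defn euclidean-distance [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(defn knn-sketch
  "Classifies v by majority vote among its k nearest rows of X.
  X is a seq of feature vectors, y the matching seq of labels."
  [X y v & {:keys [k] :or {k 3}}]
  (->> (map vector X y)                                     ; pair each row with its label
       (sort-by (fn [[x _]] (euclidean-distance x v)))      ; nearest rows first
       (take k)
       (map second)                                         ; keep only the labels
       frequencies                                          ; label -> vote count
       (apply max-key val)                                  ; entry with the most votes
       key))

;; Hypothetical toy data: two well-separated clusters.
(def toy-X [[1.0 1.0] [1.2 0.8] [0.9 1.1] [5.0 5.0] [5.2 4.9] [4.8 5.1]])
(def toy-y [0 0 0 1 1 1])

;; Closing over the training data gives a one-argument classifier,
;; in the same spirit as knn-h.
(def toy-knn-h #(knn-sketch toy-X toy-y % :k 3))
```

Because all the work happens at prediction time, every call to toy-knn-h scans the full training set, which is the price of having no training phase.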
Classification
Both lr-h and knn-h are functions that take input feature vectors and classify them. So to classify a whole dataset, the function is mapped to all rows of the dataset.
Now lr-preds and knn-preds contain the classifications made by logistic regression and kNN on the original dataset, respectively.
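As a small illustration of the mapping step, the sketch below uses a hypothetical stand-in classifier (toy-h) and hypothetical rows; any function from a feature vector to a class, such as lr-h or knn-h, could take its place.

```clojure
;; Hypothetical stand-in for a trained hypothesis.
(defn toy-h [v]
  (if (> (first v) 2.5) 1 0))

;; Hypothetical feature rows.
(def toy-rows [[1.4 0.2] [4.7 1.4] [1.3 0.2] [5.1 1.9]])

;; One prediction per row, in row order.
(def toy-preds (mapv toy-h toy-rows))
```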
Conclusion
So which model performs better here? Let’s write a function to assess the classification accuracy:
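One way such an accuracy function could look, as a sketch in plain Clojure: it takes a hypothesis h, feature rows X and true labels y, and returns the percentage of rows classified correctly. The argument order and the toy data are assumptions made for this example.

```clojure
(defn accuracy
  "Percentage of rows in X for which hypothesis h predicts the label in y."
  [h X y]
  (let [preds   (map h X)
        ;; = is type-sensitive for numbers: (= 1 1.0) is false.
        ;; Use == instead if predictions and labels mix integers and doubles.
        correct (count (filter true? (map = preds y)))]
    (* 100.0 (/ correct (count y)))))

;; Hypothetical stand-in classifier and data for a quick check.
(defn toy-h [v]
  (if (> (first v) 2.5) 1 0))

(def toy-X [[1.4 0.2] [4.7 1.4] [1.3 0.2] [5.1 1.9]])
(def toy-y [0 1 1 1])

(def toy-accuracy (accuracy toy-h toy-X toy-y))
```

With these toy values, three of the four rows are classified correctly, so toy-accuracy is 75.0.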
Now let’s evaluate both the classifiers:
The accuracy of the models can vary considerably depending on how the dataset is shuffled. These are values I averaged over 100 runs. Both models perform well on this dataset. So, that’s it for multiclass classification using Clatern. More work on Clatern to follow soon. So, keep an eye out :-)