<h1>Introduction to Semi-Supervised Learning with Ladder Networks</h1>
<p><em>2016-01-19</em></p>
<p>Today, deep learning is mostly about pure supervised learning. A major drawback of supervised learning is that it requires a lot of labeled data, which is expensive to collect. So deep learning in the future is expected to be unsupervised and more human-like.</p>
<blockquote>
<p>“We expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.”<br />
– LeCun, Bengio, Hinton, Nature 2015</p>
</blockquote>
<h2 id="semi-supervised-learning">Semi-Supervised Learning</h2>
<p><a href="https://en.wikipedia.org/wiki/Semi-supervised_learning">Semi-supervised learning</a> is a class of supervised learning tasks and techniques that also make use of unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data.</p>
<p>So, how can unlabeled data help in classification? Consider the following example (taken from <a href="https://users.ics.aalto.fi/praiko/papers/ladder2.pdf">these slides</a>) consisting of only two data points with labels.</p>
<p><img src="/img/semi1.png" alt="semi1" class="center-image" /></p>
<p>How would you label this point?</p>
<p><img src="/img/semi2.png" alt="semi2" class="center-image" /></p>
<p>What if you see all the unlabeled data?<br />
<br />
<br />
<img src="/img/semi3.png" alt="semi3" class="center-image" />
<br />
In order to make any use of unlabeled data, we must assume some structure in the underlying distribution of the data. Labels are homogeneous in densely populated regions, i.e., data points close to each other tend to belong to the same class (the smoothness assumption). Iterating this assumption over the data points until every point is assigned a label:<br />
<br />
<br />
<img src="/img/semi4.png" alt="semi4" class="center-image" />
<br />
<br />
<img src="/img/semi5.png" alt="semi5" class="center-image" />
<br />
<br />
<img src="/img/semi6.png" alt="semi6" class="center-image" />
<br />
<br />
<img src="/img/semi7.png" alt="semi7" class="center-image" />
</p>
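<p><em>The iterative labeling above can be sketched as a toy nearest-neighbour self-training loop. This is a minimal illustration of the smoothness assumption, not part of the ladder network; all names here are made up for this sketch.</em></p>

```python
import numpy as np

def self_train_labels(X, labels, max_iters=100):
    """Toy self-training: repeatedly copy the label of the nearest labeled
    neighbour onto the closest unlabeled point (-1 marks 'unlabeled')."""
    labels = labels.copy()
    for _ in range(max_iters):
        unlabeled = np.where(labels == -1)[0]
        if len(unlabeled) == 0:
            break
        labeled = np.where(labels != -1)[0]
        # Distances between every unlabeled and every labeled point.
        d = np.linalg.norm(X[unlabeled][:, None] - X[labeled][None], axis=2)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        labels[unlabeled[i]] = labels[labeled[j]]
    return labels

# Two clusters with one labeled point each; labels spread within each cluster.
X = np.array([[0.0, 0], [0.2, 0], [0.4, 0], [5.0, 0], [5.2, 0], [5.4, 0]])
y = np.array([0, -1, -1, 1, -1, -1])
print(self_train_labels(X, y))  # -> [0 0 0 1 1 1]
```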
<p>It has been discovered that the use of unlabeled data together with a small amount of labeled data can improve accuracy considerably. The collection of unlabeled data is inexpensive relative to labeled data. Often labeled data is scarce and unlabeled data is plentiful. In such situations, semi-supervised learning can be used. Also, much of human learning involves a small amount of direct instruction (labeled data) combined with large amounts of observation (unlabeled data). Hence, semi-supervised learning is a plausible model for human learning.</p>
<h2 id="ladder-networks">Ladder Networks</h2>
<p>Ladder networks combine supervised learning with unsupervised learning in deep neural networks. Traditionally, unsupervised learning was used only to pre-train the network, followed by normal supervised learning. A ladder network, in contrast, is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Ladder networks achieve state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.</p>
<h3 id="key-aspects">Key Aspects</h3>
<h4 id="compatibility-with-supervised-methods">Compatibility with supervised methods</h4>
<p>A ladder network can be added to existing feedforward neural networks, with the unsupervised part focusing on the relevant details found by supervised learning. The approach can also be extended to recurrent neural networks.</p>
<h4 id="scalability-resulting-from-local-learning">Scalability resulting from local learning</h4>
<p>In addition to a supervised learning target on the top layer, the model has local unsupervised learning targets on every layer, making it suitable for very deep neural networks.</p>
<h4 id="computational-efficiency">Computational efficiency</h4>
<p>Adding a decoder (part of the ladder network) approximately triples the computation during training but not necessarily the training time since the same result can be achieved faster through the better utilization of the available information.</p>
<h3 id="implementation">Implementation</h3>
<p><em>This is a brief introduction to the implementation of Ladder networks. A detailed and in-depth explanation of Ladder network can be found in the paper <a href="http://arxiv.org/abs/1507.02672">“Semi-Supervised Learning with Ladder Networks”</a>.</em></p>
<p>The steps involved in implementing the Ladder network are typically as follows:</p>
<ol>
<li>Take a feedforward model that serves supervised learning as the encoder. The network consists of two encoder paths, a clean and a corrupted encoder; the only difference is that the corrupted encoder adds Gaussian noise at every layer.</li>
<li>Add a decoder that can invert the mappings on each layer of the encoder and supports unsupervised learning. The decoder uses a denoising function to reconstruct the activations of each layer given their corrupted versions. The target at each layer is the clean version of the activation, and the difference between the reconstruction and the clean version serves as the denoising cost of that layer.</li>
<li>The supervised cost is calculated from the output of the corrupted encoder and the output target. The unsupervised cost is the sum of the denoising costs of all layers, each scaled by a hyperparameter that denotes the significance of that layer. The final cost is the sum of the supervised and unsupervised costs.</li>
<li>Train the whole network in a fully-labeled or semi-supervised setting using standard optimization techniques (such as stochastic gradient descent) to minimize the cost.</li>
</ol>
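<p><em>The steps above can be sketched numerically. This is a heavily simplified sketch, assuming ReLU layers, squared-error costs, and an identity denoising function, and omitting batch normalization and the top-down decoder weights; see the paper and the linked code for the real architecture.</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, weights, noise_std=0.0):
    """One encoder pass; noise_std > 0 gives the corrupted path."""
    activations = [x + noise_std * rng.standard_normal(x.shape)]
    for W in weights:
        h = np.maximum(0.0, activations[-1] @ W)   # ReLU layer (batch norm omitted)
        activations.append(h + noise_std * rng.standard_normal(h.shape))
    return activations

def ladder_cost(x, y_target, weights, denoise, lambdas, noise_std=0.3):
    clean = encoder(x, weights, noise_std=0.0)     # denoising targets
    corrupted = encoder(x, weights, noise_std)     # fed to decoder and classifier
    # Supervised cost from the corrupted encoder's output (step 3).
    supervised = np.mean((corrupted[-1] - y_target) ** 2)
    # Unsupervised cost: per-layer denoising costs, scaled by lambdas.
    unsupervised = sum(
        lam * np.mean((denoise(z_corr) - z_clean) ** 2)
        for lam, z_corr, z_clean in zip(lambdas, corrupted, clean)
    )
    return supervised + unsupervised

x = rng.standard_normal((4, 3))
weights = [rng.standard_normal((3, 5)), rng.standard_normal((5, 2))]
# One lambda per layer (input layer included), as in step 3.
print(ladder_cost(x, np.zeros((4, 2)), weights, lambda z: z, [1000.0, 10.0, 0.1]))
```

<p><em>In the real ladder network, the denoising function combines the corrupted activation with a top-down signal from the layer above, and the supervised cost is a cross-entropy on softmax outputs.</em></p>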
<p>An illustration of a 2-layer ladder network:
<br />
<br />
<img src="/img/ladder_net.png" alt="ladder network" />
<br />
Batch normalization is applied to each preactivation, including the topmost layer, to improve convergence (due to reduced covariate shift) and to prevent the denoising cost from encouraging the trivial solution (an encoder that outputs constant values, as these are the easiest to denoise). Direct connections between each layer and its decoded reconstruction are used. The network is called a ladder network because the resulting encoder/decoder architecture resembles a ladder.
<br />
<br />
<br />
<img src="/img/ladder_algorithm.png" alt="ladder algorithm" />
<br />
</p>
<h3 id="conclusion">Conclusion</h3>
<p>The performance of ladder networks is very impressive. On MNIST, the model achieves an error rate of 1.06% with only 100 labeled examples! This is much better than previously published results, which shows that the method is capable of making good use of unsupervised learning. Moreover, the same model also achieves state-of-the-art results, with a significant improvement over the baseline model, in permutation-invariant MNIST classification with all labels, which suggests that the unsupervised task does not disturb supervised learning.</p>
<p>The ladder network is simple and easy to implement with many existing feedforward architectures, as the training is based on backpropagation from a simple cost function. It is quick to train, and convergence is fast thanks to batch normalization.</p>
<h3 id="code">Code</h3>
<p>The code published along with the original paper is available here - <a href="https://github.com/CuriousAI/ladder">https://github.com/CuriousAI/ladder</a>. My implementation of ladder networks in <a href="http://tensorflow.org/">TensorFlow</a> is available here - <a href="https://github.com/rinuboney/ladder">https://github.com/rinuboney/ladder</a>. <em>Note: The TensorFlow implementation achieves error rates about 0.2–0.3 percentage points higher than those published in the paper.</em></p>
<h3 id="related-papers">Related papers</h3>
<p>Semi-supervised learning using Ladder networks was introduced in this paper:</p>
<ul>
<li><a href="http://arxiv.org/abs/1507.02672">Rasmus, Antti, et al. “Semi-Supervised Learning with Ladder Networks.” Advances in Neural Information Processing Systems. 2015.</a></li>
</ul>
<p>Ladder network was further analyzed and some improvements that attained an even better performance were suggested in this paper:</p>
<ul>
<li><a href="http://arxiv.org/abs/1511.06430">Pezeshki, Mohammad, et al. “Deconstructing the Ladder Network Architecture.” arXiv preprint arXiv:1511.06430 (2015).</a></li>
</ul>
<p>and finally, these are the papers that led to the development of Ladder networks:</p>
<ul>
<li><a href="http://arxiv.org/abs/1412.7210">Rasmus, Antti, Tapani Raiko, and Harri Valpola. “Denoising autoencoder with modulated lateral connections learns invariant representations of natural images.” arXiv preprint arXiv:1412.7210 (2014).</a></li>
<li><a href="http://arxiv.org/abs/1504.08215">Rasmus, Antti, Harri Valpola, and Tapani Raiko. “Lateral Connections in Denoising Autoencoders Support Supervised Learning.” arXiv preprint arXiv:1504.08215 (2015).</a></li>
<li><a href="http://arxiv.org/abs/1411.7783">Valpola, Harri. “From neural PCA to deep unsupervised learning.” arXiv preprint arXiv:1411.7783 (2014).</a></li>
</ul>

<h1>Theoretical Motivations for Deep Learning</h1>
<p><em>2015-10-18</em></p>
<p><em>This post is based on the lecture “<a href="http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/">Deep Learning: Theoretical Motivations</a>” given by <a href="http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html">Dr. Yoshua Bengio</a> at the <a href="https://sites.google.com/site/deeplearningsummerschool/">Deep Learning Summer School, Montreal 2015</a>. I highly recommend the lecture for a deeper understanding of the topic.</em></p>
<p>Deep learning is a branch of machine learning based on learning multiple levels of representation, which correspond to multiple levels of abstraction. This post explores the idea that if we can successfully learn multiple levels of representation, then we can generalize well.</p>
<p>The flow charts below illustrate how the different parts of an AI system relate to each other within different AI disciplines. The shaded boxes indicate components that are able to learn from data.</p>
<p><img src="/img/AI_system_parts.png" alt="AI parts" class="center-image" /></p>
<h4 id="rule-based-systems">Rule-Based Systems</h4>
<p>Rule-based systems are hand-designed AI programs. The knowledge required by these programs is provided by experts in the relevant field, which is why these systems are also called expert systems. These hand-designed programs contain facts and the logic to combine those facts to answer questions.</p>
<h4 id="classical-machine-learning">Classical Machine Learning</h4>
<p>In classical machine learning, the important features of the input are manually designed and the system automatically learns to map the features to outputs. This kind of machine learning works well for simple pattern recognition problems. It is well known that in practice most of the time is spent designing the optimal features for the system. Once the features are hand-designed, a generic classifier is used to obtain the output.</p>
<h4 id="representation-learning">Representation Learning</h4>
<p>Representation learning goes one step further and eliminates the need to hand-design the features. The important features are automatically discovered from data. In neural networks, the features are automatically learned from raw data.</p>
<h4 id="deep-learning">Deep Learning</h4>
<p>Deep learning is a kind of representation learning in which there are multiple levels of features. These features are automatically discovered and composed together across levels to produce the output. Each level represents abstract features discovered from the features in the previous level, so the level of abstraction increases with each level. This type of learning enables discovering and representing higher-level abstractions. In neural networks, the multiple layers correspond to multiple levels of features, and these layers compose the features to produce the output.</p>
<h3 id="path-to-ai">Path to AI</h3>
<p>There are three key ingredients for machine learning to move towards AI:</p>
<h4 id="1-lots-of-data">1. Lots of Data</h4>
<p>An AI system needs lots of knowledge. This knowledge either comes from humans who put it in or from data; in the case of machine learning, it comes from data. A machine learning system needs lots and lots of data to be able to make good decisions. Modern applications of machine learning deal with data in the form of videos, images, audio, etc., which are complicated and require lots of data to learn from.</p>
<h4 id="2-very-flexible-models">2. Very Flexible Models</h4>
<p>The data alone is not enough. We need to translate the knowledge into something useful, and we have to store that knowledge somewhere. The models need to be big and flexible enough to do this.</p>
<h4 id="3-powerful-priors">3. Powerful Priors</h4>
<p>Powerful priors are required to defeat the curse of dimensionality. Good priors induce fairly general knowledge about the world into the system.</p>
<p>Classical non-parametric algorithms can handle lots of data and are flexible models. However, they use the smoothness prior, and that is not enough. Hence, this post mostly deals with the third ingredient.</p>
<h3 id="how-do-we-get-ai">How do we get AI?</h3>
<h4 id="knowledge">Knowledge</h4>
<p>The world is very complicated and an AI has to understand it. It would require a lot of knowledge to reach the level of understanding of the world that humans have. An AI needs to have a lot more knowledge than what is being made available to the machine learning systems today.</p>
<h4 id="learning">Learning</h4>
<p>Learning is necessary to acquire the kind of complex knowledge needed for AI. Learning algorithms involve two things - priors and optimization techniques.</p>
<h4 id="generalization">Generalization</h4>
<p>The aspect of generalization is central to machine learning. Generalization is a guess as to which configuration is most likely. A geometric interpretation is that we are guessing where the probability mass is concentrated.</p>
<h4 id="ways-to-fight-the-curse-of-dimensionality">Ways to fight the curse of dimensionality</h4>
<p>The curse of dimensionality arises from high-dimensional variables. There are many dimensions, and each dimension can take on many values. Even in 2 dimensions, the number of possible configurations is huge, and it looks almost impossible to handle the possible configurations in higher dimensions.</p>
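<p><em>A quick back-of-the-envelope illustration: with k distinguishable values per dimension, d dimensions give k<sup>d</sup> configurations, so the fraction covered by any fixed dataset collapses as d grows.</em></p>

```python
def n_configurations(k, d):
    """Number of distinct configurations with k values per dimension, d dimensions."""
    return k ** d

print(n_configurations(10, 2))   # 100 configurations in 2 dimensions
print(n_configurations(10, 10))  # 10_000_000_000 in 10 dimensions
# A dataset of a million examples covers at most 1e6 / 1e10 = 0.01% of them.
```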
<h4 id="disentangling-the-underlying-explanatory-factors">Disentangling the underlying explanatory factors</h4>
<p>An AI needs to figure out how the data was generated - the explanatory factors or the causes of what it observes. This is what science is trying to do - conduct experiments and come up with theories to explain the world. Deep learning is a step in that direction.</p>
<h2 id="why-not-classical-non-parametric-algorithms">Why not classical non-parametric algorithms?</h2>
<p>There are different definitions for the term non-parametric. We say that a learning algorithm is non-parametric if the complexity of the functions it can learn is allowed to grow as the amount of training data is increased; in other words, the parameter vector is not fixed. Depending on the data we have, we can choose a family of functions that is more or less flexible. With a linear classifier, even if we get 10 times more data than we already have, we are stuck with the same model. In contrast, for neural networks we get to choose more hidden units. Non-parametric is not about having no parameters; it is about not having a fixed number of parameters, and about choosing the number of parameters based on the richness of the data.</p>
<h3 id="the-curse-of-dimensionality">The Curse of Dimensionality</h3>
<p>The curse of dimensionality arises because of the many possible configurations in high dimensions. The number of possible configurations increases exponentially with the number of dimensions. How, then, can we possibly generalize to new configurations we have never seen? That is what machine learning is about.</p>
<p>The classical approach in non-parametric statistics is to rely on smoothness. It works fine in small dimensions, but in high dimensions the neighbourhood over which we average either contains no examples or nearly all of them, which is useless. To generalize locally, we need representative examples for all relevant variations. It is not possible to average locally and obtain something meaningful.</p>
<p>If we dig deeper mathematically, it’s not the number of dimensions but the number of variations of functions that we learn. In this case, smoothness is about how many ups and downs are present in the curve.</p>
<p><img src="/img/line_smooth.png" alt="Smooth line and curve" /></p>
<p>A line is very smooth. A curve with some ups and downs is less smooth but still smooth.</p>
<p>The functions we are trying to learn are very non-smooth. In case of modern machine learning applications like computer vision or natural language processing, the target function is very complex.</p>
<p>Many non-parametric statistical methods rely on something like a Gaussian kernel to average the values in some neighbourhood. However, Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line, and the number of ups and downs can be exponential in the number of dimensions. It is possible to have a very non-smooth function even in one dimension.</p>
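<p><em>A minimal Nadaraya–Watson sketch of this kind of local averaging (the names here are made up for this illustration): with only five training points, the estimator cannot track a target with six half-oscillations, whatever the bandwidth.</em></p>

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimator: a Gaussian-weighted local average."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

# sin(6*pi*x) goes up and down six times on [0, 1]; five samples cannot
# represent all those variations, so the smoothed estimate misses most of them.
x_train = np.linspace(0, 1, 5)
y_train = np.sin(6 * np.pi * x_train)
x_query = np.linspace(0, 1, 9)
print(kernel_regression(x_train, y_train, x_query, bandwidth=0.2))
```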
<p><img src="/img/probability_mass.png" alt="Probability mass" class="center-image" /></p>
<p>In a geometric sense, we have to put the probability mass where the structure is plausible. In the empirical distribution, the mass is concentrated at the training examples. Consider the above visualization with some 2-dimensional data points. Under the smoothness assumption, the mass is spread around the examples; the balls in the figure illustrate the Gaussian kernels around each example. This is what many non-parametric statistical methods do, and the idea seems plausible in this 2-dimensional case. In higher dimensions, however, the balls will be so large that they cover everything or leave holes in places where there should be high probability. Hence, the smoothness assumption is insufficient, and we have to discover something smarter about the data - some structure. The figure depicts such a structure: a 1-dimensional manifold where the probability mass is concentrated. If we are able to discover a representation of this probability concentration, then we can solve our problems. The representation can be a lower-dimensional one or along a different axis in the same dimension. We take a complicated, non-linear manifold and ‘flatten’ it by changing the representation, i.e., we transform complicated distributions to a Euclidean space. It is easy to make predictions, interpolate, estimate densities, etc. in this Euclidean space.</p>
<h3 id="bypassing-the-curse">Bypassing the Curse</h3>
<p>Smoothness has been the main ingredient in most statistical non-parametric methods, and it is quite clear that we cannot defeat the curse of dimensionality if we only use smoothness. We want to be non-parametric in the sense that we want the family of functions to grow in flexibility as we get more data. In neural networks, we change the number of hidden units depending on the amount of data.</p>
<p>We need to build compositionality into our ML models. Natural languages exploit compositionality to give representations and meanings to complex ideas. Exploiting compositionality gives an exponential gain in representational power. In deep learning, we use two priors:</p>
<ol>
<li>Distributed Representations</li>
<li>Deep Architecture</li>
</ol>
<p>Here, we make a simple assumption that the data we observe arose through a composition of pieces. The composition may be parallel or sequential. Parallel composition gives us the idea of distributed representations; it is basically the same idea as feature learning. Sequential composition deals with multiple levels of feature learning. An additional prior is that compositionality is useful to describe the world around us efficiently.</p>
<h2 id="the-power-of-distributed-representations">The Power of Distributed Representations</h2>
<h3 id="non-distributed-representations">Non-distributed representations</h3>
<p>The methods that do not use distributed representations include clustering, n-grams, nearest neighbours, RBF SVMs, decision trees, etc. At a high level, these algorithms take the input space and split it into regions. Some algorithms have hard partitions while others have soft partitions that allow smooth interpolation between nearby regions. There is a different set of parameters for each region; the answer for each region and where the regions should be are tuned using the data. There is a notion of complexity tied to the number of regions. In terms of learning theory, generalization depends on the relationship between the number of examples needed and the complexity: a rich function requires more regions and more data. There is a linear relation between the number of distinguishable regions and the number of parameters, and equivalently, a linear relation between the number of distinguishable regions and the number of training examples.</p>
<p><img src="/img/fixed_partition.png" alt="Fixed Partition" /></p>
<h3 id="why-distributed-representations">Why Distributed Representations?</h3>
<p>There is another option. With distributed representations, it is possible to represent an exponential number of regions with a linear number of parameters. The magic of distributed representations is that they can learn a very complicated function (with many ups and downs) from a small number of examples. In non-distributed representations, the number of regions is linear in the number of parameters; here, the number of regions potentially grows exponentially with the number of parameters and examples. In distributed representations, the features are individually meaningful, and they remain meaningful regardless of what the other features are. There may be some interactions, but most features are learned independently of each other. We do not need to see all configurations to make a meaningful statement. Non-mutually-exclusive features create a combinatorially large set of distinguishable configurations, so there is an exponential advantage even with only one layer, and the number of examples needed might be exponentially smaller under this prior. In practice, however, the observed advantage is big but not exponential. If the representations are good, what they are really doing is unfolding the manifold into a new flat coordinate system. Neural networks are really good at learning representations that capture semantic aspects, and the generalization comes from these representations. With classical non-parametric methods, we cannot say anything about an example located in a region of input space with no data; with this approach, we can say something meaningful about what we have not seen before. That is the essence of generalization.</p>
<p><img src="/img/distributed_partition.png" alt="Distributed Partition" /></p>
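<p><em>A small numerical illustration of this claim, treating each of k random hyperplanes as one binary feature: in 2-D, k lines can carve out up to k(k+1)/2 + 1 distinguishable regions, far more than the k regions that one-parameter-per-region methods buy.</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)

def count_regions(n_features, n_points=100_000, dim=2):
    """Count distinct binary codes sign(Wx + b) over random points in a square.
    Each of the n_features hyperplanes plays the role of one learned feature."""
    W = rng.standard_normal((n_features, dim))
    b = rng.standard_normal(n_features)
    X = rng.uniform(-1, 1, size=(n_points, dim))
    codes = X @ W.T + b > 0
    return len(np.unique(codes, axis=0))

for k in (2, 4, 8):
    print(k, "features ->", count_regions(k), "distinguishable regions")
```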
<h3 id="classical-symbolic-ai-vs-representation-learning">Classical Symbolic AI vs Representation Learning</h3>
<p>Distributed representations are at the heart of the renewal of neural networks in the 1980s, called connectionism or the connectionist approach. The classical AI approach is based on the notion of symbols. In symbolic processing of things like language, logic, and rules, each concept is associated with a pure entity - a symbol. A symbol either exists or it doesn’t, and there is nothing intrinsic that describes any relationship between symbols. Consider the concepts of a cat and a dog. In symbolic AI, they are different symbols with no relation between them. In a distributed representation, they share features like being a pet, having 4 legs, etc. It makes more sense to think about concepts as patterns of features, or patterns of activations of neurons in the brain.</p>
<h3 id="distributed-representations-in-nlp">Distributed Representations in NLP</h3>
<p>There have been some interesting results in natural language processing with the use of distributed representations. I highly recommend the article <a href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">Deep Learning, NLP, and Representations</a> for a detailed understanding of these results.</p>
<h2 id="the-power-of-deep-representations">The Power of Deep Representations</h2>
<p>There is a lot of misunderstanding about what depth means. Depth was not studied much before this century because people thought there was no need for deep neural networks: a shallow neural network with a single layer of hidden units is sufficient to represent any function to any required degree of accuracy. This is called the universal approximation property. But it does not tell us how many units are required. With a deep neural network we can represent the same function as a shallow neural network but more cheaply, i.e., with fewer hidden units. The number of units needed can be exponentially larger for a shallow network compared to a network that is deep enough. If we are trying to learn a function that is deep (there are many levels of composition), then the neural network needs more layers.</p>
<p>Depth is not necessary to have a flexible family of functions. Deeper networks do not correspond to higher capacity: deeper does not mean we can represent more functions. But if the function we are trying to learn has a particular characteristic obtained through the composition of many operations, then it is much better to approximate it with a deep neural network.</p>
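<p><em>Parity is the classic example of such a composed function: computed level by level it needs one XOR per level, whereas a depth-1 lookup-table approach needs one entry per input configuration (a toy illustration, not a neural network).</em></p>

```python
from functools import reduce
from itertools import product

def parity_deep(bits):
    """Deep computation: n - 1 XOR operations, one per level of composition."""
    return reduce(lambda a, b: a ^ b, bits)

def parity_shallow(bits, table):
    """Shallow computation: a single lookup, but the table is exponentially large."""
    return table[tuple(bits)]

n = 10
table = {cfg: reduce(lambda a, b: a ^ b, cfg) for cfg in product((0, 1), repeat=n)}
print(len(table))  # 1024 table entries vs. n - 1 = 9 XOR operations
```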
<h3 id="shallow-and-deep-computer-program">Shallow and Deep computer program</h3>
<p><img src="/img/shallow_program.png" alt="Shallow Program" />
<em>“Shallow” computer program</em></p>
<p><img src="/img/deep_program.png" alt="Deep program" />
<em>“Deep” computer program</em></p>
<p>When writing computer programs, we don’t usually write the main program as one line after another. Typically, we have subroutines that are reused. It is plausible to think of what the hidden units are doing as subroutines for the bigger program at the final layer. Another way to think about it is that the result of computing each line in the program changes the state of the machine to provide an input for the next line: the input at each line is the state of the machine, and the output is a new state. This corresponds to a Turing machine, and the number of steps a Turing machine executes corresponds to the depth of computation. In principle, we can represent any function in two steps (a lookup table), but that does not mean we can compute anything interesting efficiently. A kernel SVM or a shallow neural network can be considered a kind of lookup table. So, we need deeper programs.</p>
<h3 id="sharing-components">Sharing components</h3>
<p><img src="/img/sum_product_net.png" alt="Sum Product Network" /></p>
<p>Polynomials are usually represented as a sum of products. Another way to represent polynomials is using a graph of computations where each node performs an addition or a multiplication. This way we can represent deep computations. Here, the number of computations will be smaller because we can reuse some operations.</p>
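<p><em>A concrete instance of this sharing: the same polynomial evaluated as a flat sum of products versus as a two-level graph that reuses the shared sums.</em></p>

```python
def expanded(x1, x2, x3, x4):
    # Sum-of-products form: 4 multiplications and 3 additions.
    return x1 * x3 + x1 * x4 + x2 * x3 + x2 * x4

def factored(x1, x2, x3, x4):
    # Computation graph reusing shared subexpressions: 1 multiplication, 2 additions.
    return (x1 + x2) * (x3 + x4)

print(expanded(1, 2, 3, 4), factored(1, 2, 3, 4))  # same value, fewer operations
```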
<p>There is a new theoretical result that deep nets with rectifier/maxout units are exponentially more expressive than shallow ones because, subject to constraints, they can split the input space into many more linear regions.</p>
<h2 id="the-mirage-of-convexity">The Mirage of Convexity</h2>
<p>One of the reasons neural networks were discarded in the late 90s is that the optimization problem is non-convex. We have known since the late 80s and 90s that there are an exponential number of local minima in neural networks. This knowledge, combined with the success of kernel machines in the mid 90s, played a role in greatly reducing the interest of many researchers in neural networks. They believed that since the optimization is non-convex, there is no guarantee of finding the optimal solution, and the network may get stuck on poor solutions. Something changed very recently: we now have theoretical and empirical evidence that non-convexity may not be an issue at all. This changes the picture of the optimization problem of neural networks.</p>
<h3 id="saddle-points">Saddle Points</h3>
<p>Let us consider the optimization problem in low dimensions versus high dimensions. In low dimensions, it is true that there exist lots of local minima. In high dimensions, however, local minima are not the most prevalent critical points. When we optimize neural networks or any high-dimensional function, along most of the trajectory the critical points (points where the derivative is zero or close to zero) are saddle points. Saddle points, unlike local minima, are easily escapable.</p>
<p><img src="/img/saddle_point.png" alt="Saddle Point" class="center-image" /></p>
<p>A saddle point is illustrated in the image above. At a global or local minimum all the directions go up, and at a global or local maximum all the directions go down. For a local minimum in a very high-dimensional space (the space of parameters), all the directions should go up in all dimensions. If there is some randomness in how the function is constructed and the directions behave independently, it is exponentially unlikely that all directions go up, except near the bottom of the landscape, i.e., near the global minimum. The intuition is that when there is a minimum close to the global minimum, all directions go up and it is not possible to go further down. Hence, local minima exist, but they are very close to the global minimum in terms of the objective function. Theoretical results from statistical physics and random matrix theory suggest that for some fairly large families of functions, there is a concentration of probability between the index of the critical points and the objective function. The index is the fraction of directions that go down: when index = 0 it is a local minimum, when index = 1 it is a local maximum, and anything in between is a saddle point. So a local minimum is a special case of a critical point, with index = 0. For a particular training objective, most of the critical points are saddle points with a particular index. Empirical results verify that there is indeed a tight relation between the index and the objective function. This is only an empirical validation and there is no proof that the results apply to the optimization of neural networks, but there is some evidence that the observed behaviour corresponds to what the theory suggests. In practice, it is observed that stochastic gradient descent will almost always escape from critical points other than local minima.</p>
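<p><em>The escape behaviour can be seen on the textbook saddle f(x, y) = x^2 - y^2, where (0, 0) is a minimum along x but a maximum along y; a tiny amount of gradient noise is enough to slide off. This is a toy sketch, not an experiment from the papers below.</em></p>

```python
import numpy as np

def noisy_descent(start, lr=0.1, noise=1e-3, steps=100, seed=0):
    """Gradient descent with small noise on f(x, y) = x**2 - y**2."""
    rng = np.random.default_rng(seed)
    p = np.array(start, dtype=float)
    for _ in range(steps):
        grad = np.array([2.0 * p[0], -2.0 * p[1]])  # gradient of x^2 - y^2
        p -= lr * (grad + noise * rng.standard_normal(2))
    return p

# Start on the stable axis of the saddle: x shrinks toward 0, while the noise
# pushes y onto the descending direction, where it then grows geometrically.
print(noisy_descent([1.0, 0.0]))
```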
<p>Related papers:</p>
<ol>
<li><a href="http://arxiv.org/abs/1405.4604">On the saddle point problem for non‐convex optimization - Pascanu, Dauphin, Ganguli, Bengio, arXiv May 2014</a></li>
<li><a href="http://arxiv.org/abs/1406.2572">Identifying and attacking the saddle point problem in high-dimensional non-convex optimization - Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014</a></li>
<li><a href="http://arxiv.org/abs/1412.0233">The Loss Surfaces of Multilayer Networks - Choromanska, Henaff, Mathieu, Ben Arous &amp; LeCun 2014</a></li>
</ol>
<h2 id="other-priors-that-work-with-deep-distributed-representations">Other Priors That Work with Deep Distributed Representations</h2>
<h4 id="the-human-way">The Human Way</h4>
<p>Humans are capable of generalizing from very few examples. Children usually learn new tasks from very few examples, sometimes even from a single one. Statistically, it is impossible to generalize from one example. One possibility is that the child is using knowledge from previous learning: prior knowledge can be used to build representations such that, in the new representation space, it is possible to generalize from a single example. Thus, we need to introduce more priors than distributed representations and depth.</p>
<h4 id="semi-supervised-learning">Semi-Supervised Learning</h4>
<p>Semi-supervised learning falls between unsupervised learning and supervised learning. In supervised learning, we use only the labelled examples. In semi-supervised learning, we also make use of any unlabelled examples available. The image below illustrates how semi-supervised learning may find a better decision boundary by using the unlabelled examples.</p>
<p><img src="/img/semi_supervised.png" alt="Semi-supervised" /></p>
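The smoothness assumption behind this picture can be turned into a tiny algorithm: repeatedly give the unlabeled point closest to any labeled point that point's label, until everything is labeled. A minimal sketch, written in Python purely for illustration:

```python
import numpy as np

def propagate_labels(X, y):
    """Iteratively assign each unlabeled point (y == -1) the label of its
    nearest already-labeled neighbor: the smoothness assumption in action."""
    y = y.copy()
    while (y == -1).any():
        labeled = np.where(y != -1)[0]
        unlabeled = np.where(y == -1)[0]
        # distance from every unlabeled point to every labeled point
        d = np.linalg.norm(X[unlabeled, None] - X[labeled], axis=2)
        i, j = np.unravel_index(d.argmin(), d.shape)
        y[unlabeled[i]] = y[labeled[j]]  # label the closest pair first
    return y

# Two labeled points and a chain of unlabeled ones bridging two clusters.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, -1, -1, -1, -1, 1])
print(propagate_labels(X, y))  # → [0 0 0 1 1 1]
```

The two labeled points spread their labels along the dense chains, exactly as in the figure.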
<h4 id="multi-task-learning">Multi-Task Learning</h4>
<p>Generalizing better to new tasks is crucial to approach AI. Here, the prior is the shared underlying explanatory factors between tasks. Deep architectures learn good intermediate representations that can be shared across tasks. Good representations that disentangle underlying factors of variation make sense for many tasks because each task concerns a subset of the factors.</p>
<p><img src="/img/multi_task.png" alt="Multi-Task learning" class="center-image" /></p>
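The shared-explanatory-factors prior can be sketched as a network with a shared trunk and task-specific output heads. The layer sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 8 input features, 4 shared factors, two tasks.
W_shared = rng.standard_normal((8, 4))  # trained jointly on all tasks
W_task_a = rng.standard_normal((4, 3))  # task A: 3-way classifier head
W_task_b = rng.standard_normal((4, 1))  # task B: scalar regression head

def forward(x, head):
    h = np.tanh(x @ W_shared)  # shared intermediate representation
    return h @ head            # task-specific output layer

x = rng.standard_normal(8)
out_a, out_b = forward(x, W_task_a), forward(x, W_task_b)
```

Each task reads off its own subset of the shared factors, so data from one task improves the representation used by the others.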
<p>The following figure illustrates multi-task learning with different inputs,</p>
<p><img src="/img/multi_modal.png" alt="Multi-Modal learning" class="center-image" /></p>
<h2 id="learning-multiple-levels-of-abstraction">Learning Multiple Levels of Abstraction</h2>
<p>The big payoff of deep learning is to allow learning higher levels of abstraction. Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer.</p>
<p><img src="/img/abstraction_levels.png" alt="Levels of Abstraction" class="center-image" /></p>
<h2 id="conclusion">Conclusion</h2>
<ul>
<li>Distributed representation and deep composition are priors that can buy exponential gain in generalization.</li>
<li>Both these priors yield non-local generalization.</li>
<li>There is strong evidence that local minima are not an issue, because most critical points in high-dimensional spaces are saddle points.</li>
<li>We have to introduce other priors like semi-supervised learning and multi-task learning that work with deep distributed representations for better generalization.</li>
</ul>This post is based on the lecture “Deep Learning: Theoretical Motivations” given by Dr. Yoshua Bengio at the Deep Learning Summer School, Montreal 2015. I highly recommend the lecture for a deeper understanding of the topic.Digit Classification using KNN2015-03-01T00:00:00+00:002015-03-01T00:00:00+00:00/clatern/2015/03/01/digit-classification-knn<p>This is a tutorial on classifying handwritten digits with the KNN algorithm using Clatern. <a href="https://github.com/rinuboney/clatern">Clatern</a> is a machine learning library for Clojure, in the works.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">use</span><span class="w"> </span><span class="ss">'clojure.core.matrix</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="nf">use</span><span class="w"> </span><span class="o">'</span><span class="p">(</span><span class="nf">clatern</span><span class="w"> </span><span class="n">io</span><span class="w"> </span><span class="n">knn</span><span class="p">))</span></code></pre></figure>
<h3 id="dataset">Dataset</h3>
<p>This tutorial uses a stripped-down version of the handwritten digits dataset available <a href="http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits">here</a>. The stripped-down version (taken from the sklearn library) is available <a href="https://github.com/scikit-learn/scikit-learn/raw/master/sklearn/datasets/data/digits.csv.gz">here</a>.</p>
<p>The dataset consists of 1797 samples of 8x8 pixel images along with their target labels. The first 64 columns are the 8x8 pixels and the 65th column is the target label. Let’s have a look at a sample,</p>
<p><img src="/img/plot_digit.png" alt="digit" /></p>
<p>Let’s load the data,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">digits</span><span class="w"> </span><span class="p">(</span><span class="nf">load-data</span><span class="w"> </span><span class="s">"digits.csv"</span><span class="p">))</span></code></pre></figure>
<p>Splitting the data into training and test sets,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">digits</span><span class="o">'</span><span class="w"> </span><span class="p">(</span><span class="nf">order</span><span class="w"> </span><span class="n">digits</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">(</span><span class="nf">shuffle</span><span class="w"> </span><span class="p">(</span><span class="nb">range</span><span class="w"> </span><span class="mi">1797</span><span class="p">))))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">train-mat</span><span class="w"> </span><span class="p">(</span><span class="nb">take</span><span class="w"> </span><span class="mi">1400</span><span class="w"> </span><span class="n">digits</span><span class="o">'</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">test-mat</span><span class="w"> </span><span class="p">(</span><span class="nb">drop</span><span class="w"> </span><span class="mi">1400</span><span class="w"> </span><span class="n">digits</span><span class="o">'</span><span class="p">))</span></code></pre></figure>
<p>Splitting the training and test set into features and labels,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">X-train</span><span class="w"> </span><span class="p">(</span><span class="nb">select</span><span class="w"> </span><span class="n">train-mat</span><span class="w"> </span><span class="no">:all</span><span class="w"> </span><span class="p">(</span><span class="nb">range</span><span class="w"> </span><span class="mi">64</span><span class="p">)))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">y-train</span><span class="w"> </span><span class="p">(</span><span class="nf">get-column</span><span class="w"> </span><span class="n">train-mat</span><span class="w"> </span><span class="mi">64</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">X-test</span><span class="w"> </span><span class="p">(</span><span class="nb">select</span><span class="w"> </span><span class="n">test-mat</span><span class="w"> </span><span class="no">:all</span><span class="w"> </span><span class="p">(</span><span class="nb">range</span><span class="w"> </span><span class="mi">64</span><span class="p">)))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">y-test</span><span class="w"> </span><span class="p">(</span><span class="nf">get-column</span><span class="w"> </span><span class="n">test-mat</span><span class="w"> </span><span class="mi">64</span><span class="p">))</span></code></pre></figure>
<h3 id="training">Training</h3>
<p>Here, we use the KNN model for classifying the digits. The syntax for KNN is,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">knn</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="no">:k</span><span class="w"> </span><span class="n">k</span><span class="p">)</span></code></pre></figure>
<p>where,<br />
<em>X</em> is input data,<br />
<em>y</em> is target data,<br />
<em>v</em> is new input to be classified, and<br />
<em>k</em> is the number of neighbours (optional, default = 3).</p>
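Under the hood, k-NN is just a distance computation followed by a majority vote among the k closest training points. A from-scratch sketch of the same rule, written in Python for illustration (not Clatern's implementation):

```python
import numpy as np
from collections import Counter

def knn(X, y, v, k=3):
    """Minimal k-NN vote, mirroring (knn X y v :k k)."""
    d = np.linalg.norm(np.asarray(X) - np.asarray(v), axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k closest points
    return Counter(np.asarray(y)[nearest]).most_common(1)[0][0]

X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
print(knn(X, y, [0.2, 0.5]))  # → 0
```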
<p>Let’s define a function to perform kNN on our dataset.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">#</span><span class="p">(</span><span class="nf">knn</span><span class="w"> </span><span class="n">X-train</span><span class="w"> </span><span class="n">y-train</span><span class="w"> </span><span class="n">%</span><span class="p">))</span></code></pre></figure>
<p>Now, <strong>h</strong> can be used to classify a new input vector.</p>
<h3 id="testing">Testing</h3>
<p>Let’s test the KNN model. Classifying the data in the testing set,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">preds</span><span class="w"> </span><span class="p">(</span><span class="nb">map</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="p">(</span><span class="nf">rows</span><span class="w"> </span><span class="n">X-test</span><span class="p">)))</span></code></pre></figure>
<p>Now let’s check the accuracy of the model.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nb">*</span><span class="w"> </span><span class="p">(</span><span class="nb">/</span><span class="w"> </span><span class="p">(</span><span class="nb">apply</span><span class="w"> </span><span class="nb">+</span><span class="w"> </span><span class="p">(</span><span class="nb">map</span><span class="w"> </span><span class="o">#</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">=</span><span class="w"> </span><span class="n">%1</span><span class="w"> </span><span class="n">%2</span><span class="p">)</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="n">y-test</span><span class="w"> </span><span class="n">preds</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nf">row-count</span><span class="w"> </span><span class="n">y-test</span><span class="p">))</span><span class="w">
</span><span class="mf">100.0</span><span class="p">)</span><span class="w">
</span><span class="c1">; 99.74</span></code></pre></figure>
<p>The model has a 99.74% accuracy on the test set! The accuracy can vary considerably depending on how the dataset is shuffled.</p>
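As a sanity check, a comparable pipeline can be reproduced in Python with scikit-learn, which ships the same 1797-sample digits dataset (this is an independent sketch, not part of Clatern):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same experiment shape: 1400 training samples, the rest held out for testing.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1400, random_state=0)

h = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
acc = 100.0 * h.score(X_test, y_test)
```

The exact accuracy depends on the random split, but it should land in the same high-90s range as the Clatern run above.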
<h3 id="conclusion">Conclusion</h3>
<p>The KNN model achieves a really good accuracy on the digit classification dataset used here. The problem with KNN is its inefficiency: classifying a new sample requires a computation involving every sample in the dataset. The <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> dataset is a large dataset of handwritten digits - 60,000 training samples and 10,000 test samples. A more complex model such as an SVM or an MLP (multilayer perceptron) may offer better efficiency and classification accuracy on such datasets. That’s it! More work on Clatern to follow soon. So, keep an eye out :-)</p>Multiclass Classification using Clatern2015-02-26T00:00:00+00:002015-02-26T00:00:00+00:00/clatern/2015/02/26/multiclass-classification-using-clatern<p><a href="https://github.com/rinuboney/clatern">Clatern</a> is a machine learning library for Clojure, in the works. This is a short tutorial on performing multiclass classification using Clatern.</p>
<h4 id="importing-the-libraries">Importing the libraries</h4>
<p>The libraries required for the tutorial are core.matrix, Incanter and Clatern. Importing them,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">require</span><span class="w"> </span><span class="o">'</span><span class="p">[</span><span class="n">clojure.core.matrix</span><span class="w"> </span><span class="no">:as</span><span class="w"> </span><span class="n">m</span><span class="p">])</span><span class="w">
</span><span class="p">(</span><span class="nf">use</span><span class="w"> </span><span class="o">'</span><span class="p">(</span><span class="nf">incanter</span><span class="w"> </span><span class="n">core</span><span class="w"> </span><span class="n">datasets</span><span class="w"> </span><span class="n">charts</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nf">use</span><span class="w"> </span><span class="o">'</span><span class="p">(</span><span class="nf">clatern</span><span class="w"> </span><span class="n">logistic-regression</span><span class="w"> </span><span class="n">knn</span><span class="p">))</span></code></pre></figure>
<p><em>NOTE: This tutorial requires Incanter 2.0 (aka Incanter 1.9.0). This is because both Incanter 2.0 and Clatern are integrated with core.matrix.</em></p>
<h3 id="dataset">Dataset</h3>
<p>This tutorial uses the popular Iris flower dataset, available <a href="https://archive.ics.uci.edu/ml/datasets/Iris">here</a>. For this tutorial, we’ll use Incanter to load the dataset.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">iris</span><span class="w"> </span><span class="p">(</span><span class="nf">get-dataset</span><span class="w"> </span><span class="no">:iris</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nf">view</span><span class="w"> </span><span class="n">iris</span><span class="p">)</span></code></pre></figure>
<p><img src="https://camo.githubusercontent.com/6e7e613199cfb729b52792639c7b24ace67585e8/687474703a2f2f696e63616e7465722e6f72672f696d616765732f6578616d706c65732f697269735f646174612e6a7067" alt="iris" /></p>
<p>Now let’s convert the dataset into a matrix using the to-matrix function, which converts non-numeric columns to either numeric codes or dummy variables.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">iris-mat</span><span class="w"> </span><span class="p">(</span><span class="nf">to-matrix</span><span class="w"> </span><span class="n">iris</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="nf">view</span><span class="w"> </span><span class="n">iris-mat</span><span class="p">)</span></code></pre></figure>
<p><img src="https://camo.githubusercontent.com/1fa4972cc40ded5570931f7f567d1c595f010a47/687474703a2f2f696e63616e7465722e6f72672f696d616765732f6578616d706c65732f697269735f6d61742e6a7067" alt="iris-mat" /></p>
<p>Now let’s split the dataset into a training set and a test set,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">iris</span><span class="o">'</span><span class="w"> </span><span class="p">(</span><span class="nf">m/order</span><span class="w"> </span><span class="n">iris-mat</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">(</span><span class="nf">shuffle</span><span class="w"> </span><span class="p">(</span><span class="nb">range</span><span class="w"> </span><span class="mi">150</span><span class="p">))))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">train-mat</span><span class="w"> </span><span class="p">(</span><span class="nb">take</span><span class="w"> </span><span class="mi">120</span><span class="w"> </span><span class="n">iris</span><span class="o">'</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">test-mat</span><span class="w"> </span><span class="p">(</span><span class="nb">drop</span><span class="w"> </span><span class="mi">120</span><span class="w"> </span><span class="n">iris</span><span class="o">'</span><span class="p">))</span></code></pre></figure>
<p>Splitting the training and test set into features and labels,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">X-train</span><span class="w"> </span><span class="p">(</span><span class="nf">m/select</span><span class="w"> </span><span class="n">train-mat</span><span class="w"> </span><span class="no">:all</span><span class="w"> </span><span class="p">[</span><span class="mi">0</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="mi">3</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">y-train</span><span class="w"> </span><span class="p">(</span><span class="nf">m/get-column</span><span class="w"> </span><span class="n">train-mat</span><span class="w"> </span><span class="mi">4</span><span class="p">))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">X-test</span><span class="w"> </span><span class="p">(</span><span class="nf">m/select</span><span class="w"> </span><span class="n">test-mat</span><span class="w"> </span><span class="no">:all</span><span class="w"> </span><span class="p">[</span><span class="mi">0</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="mi">3</span><span class="p">]))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">y-test</span><span class="w"> </span><span class="p">(</span><span class="nf">m/get-column</span><span class="w"> </span><span class="n">test-mat</span><span class="w"> </span><span class="mi">4</span><span class="p">))</span></code></pre></figure>
<h3 id="logistic-regression">Logistic Regression</h3>
<p>Here comes the interesting part - training a classifier using the data. First, let’s try the logistic regression model. Gradient descent is a learning algorithm for the logistic regression model. The syntax of gradient descent is,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">gradient-descent</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="no">:alpha</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="no">:lambda</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="no">:num-iters</span><span class="w"> </span><span class="n">num-iters</span><span class="p">)</span></code></pre></figure>
<p>where,<br />
<em>X</em> is input data,<br />
<em>y</em> is target data,<br />
<em>alpha</em> is the learning rate,<br />
<em>lambda</em> is the regularization parameter, and<br />
<em>num-iters</em> is the number of iterations.</p>
<p>alpha (default = 0.1), lambda (default = 1) and num-iters (default = 100) are optional.</p>
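To make the roles of alpha and lambda concrete, here is a single step of L2-regularized gradient descent for binary logistic regression, sketched in Python (an illustration of the algorithm, not Clatern's code, which also handles the multiclass case):

```python
import numpy as np

def gd_step(X, y, w, alpha=0.1, lam=1.0):
    """One gradient-descent step: alpha scales the step size,
    lam scales the L2 penalty on the weights."""
    p = 1.0 / (1.0 + np.exp(-X @ w))          # sigmoid predictions
    grad = X.T @ (p - y) / len(y) + lam * w   # loss gradient + L2 term
    return w - alpha * grad
```

Repeating this step num-iters times drives the weights toward the (regularized) maximum-likelihood solution.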
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">lr-h</span><span class="w"> </span><span class="p">(</span><span class="nf">gradient-descent</span><span class="w"> </span><span class="n">X-train</span><span class="w"> </span><span class="n">y-train</span><span class="w"> </span><span class="no">:lambda</span><span class="w"> </span><span class="mi">1</span><span class="n">e-4</span><span class="w"> </span><span class="no">:num-iters</span><span class="w"> </span><span class="mi">200</span><span class="p">))</span></code></pre></figure>
<p>That’s it. Here, gradient-descent is a function in the clatern.logistic-regression namespace. It trains on the provided data and returns a hypothesis in the logistic regression model. Now, <strong>lr-h</strong> is a function that can classify an input vector.</p>
<h3 id="k-nearest-neighbors">K Nearest Neighbors</h3>
<p>Next, let’s try the k nearest neighbors model. There is actually no training phase for this model. It can be directly used. The syntax for knn is,</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">knn</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="no">:k</span><span class="w"> </span><span class="n">k</span><span class="p">)</span></code></pre></figure>
<p>where,<br />
<em>X</em> is input data,<br />
<em>y</em> is target data,<br />
<em>v</em> is new input to be classified, and<br />
<em>k</em> is the number of neighbours (optional, default = 3).</p>
<p>Let’s define a function to perform kNN on our dataset.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">knn-h</span><span class="w"> </span><span class="o">#</span><span class="p">(</span><span class="nf">knn</span><span class="w"> </span><span class="n">X-train</span><span class="w"> </span><span class="n">y-train</span><span class="w"> </span><span class="n">%</span><span class="p">))</span></code></pre></figure>
<p>Similar to the logistic regression hypothesis, now <strong>knn-h</strong> can be used to classify an input vector.</p>
<h3 id="classification">Classification</h3>
<p>Both <strong>lr-h</strong> and <strong>knn-h</strong> are functions that take input feature vectors and classify them. So to classify a whole dataset, the function is mapped to all rows of the dataset.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">lr-preds</span><span class="w"> </span><span class="p">(</span><span class="nb">map</span><span class="w"> </span><span class="n">lr-h</span><span class="w"> </span><span class="p">(</span><span class="nf">m/rows</span><span class="w"> </span><span class="n">X-test</span><span class="p">)))</span><span class="w">
</span><span class="p">(</span><span class="k">def</span><span class="w"> </span><span class="n">knn-preds</span><span class="w"> </span><span class="p">(</span><span class="nb">map</span><span class="w"> </span><span class="n">knn-h</span><span class="w"> </span><span class="p">(</span><span class="nf">m/rows</span><span class="w"> </span><span class="n">X-test</span><span class="p">)))</span></code></pre></figure>
<p>Now <strong>lr-preds</strong> and <strong>knn-preds</strong> contain the classifications made by logistic regression and knn on the test set, respectively.</p>
<h3 id="conclusion">Conclusion</h3>
<p>So which model performs better here? Let’s write a function to assess the classification accuracy.</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="k">defn</span><span class="w"> </span><span class="n">accuracy</span><span class="w"> </span><span class="p">[</span><span class="n">h</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="n">y</span><span class="p">]</span><span class="w">
</span><span class="p">(</span><span class="nb">*</span><span class="w"> </span><span class="p">(</span><span class="nb">/</span><span class="w"> </span><span class="p">(</span><span class="nb">apply</span><span class="w"> </span><span class="nb">+</span><span class="w"> </span><span class="p">(</span><span class="nb">map</span><span class="w"> </span><span class="o">#</span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nb">=</span><span class="w"> </span><span class="n">%1</span><span class="w"> </span><span class="n">%2</span><span class="p">)</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="p">(</span><span class="nb">map</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="p">(</span><span class="nf">m/rows</span><span class="w"> </span><span class="n">X</span><span class="p">))))</span><span class="w">
</span><span class="p">(</span><span class="nf">m/row-count</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="mf">100.0</span><span class="p">))</span></code></pre></figure>
<p>Now let’s evaluate both the classifiers:</p>
<figure class="highlight"><pre><code class="language-clojure" data-lang="clojure"><span class="p">(</span><span class="nf">accuracy</span><span class="w"> </span><span class="n">lr-h</span><span class="w"> </span><span class="n">X-test</span><span class="w"> </span><span class="n">y-test</span><span class="p">)</span><span class="w">
</span><span class="c1">; 92.53</span><span class="w">
</span><span class="p">(</span><span class="nf">accuracy</span><span class="w"> </span><span class="n">knn-h</span><span class="w"> </span><span class="n">X-test</span><span class="w"> </span><span class="n">y-test</span><span class="p">)</span><span class="w">
</span><span class="c1">; 96.13</span></code></pre></figure>
<p>The accuracy of the models can vary considerably depending on how the dataset is shuffled; these are values I averaged over 100 runs. Both models perform well on this dataset. So, that’s it for multiclass classification using Clatern. More work on Clatern to follow soon. So, keep an eye out :-)</p>
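For comparison, a similar experiment can be run in Python with scikit-learn, using its built-in Iris loader (an independent sketch, not part of Clatern):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same experiment shape: 120 training samples, 30 held out for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=120, random_state=0)

lr = LogisticRegression(max_iter=500).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
lr_acc = 100.0 * lr.score(X_test, y_test)
knn_acc = 100.0 * knn.score(X_test, y_test)
```

As above, the exact numbers depend on the random split, but both classifiers should score well on this dataset.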