### Muse - A Connectionist Model Approach to Natural Language Generation

The synthesis of literary material is a creative and intellectually challenging task. It requires a substantial amount of world knowledge and natural language processing capability. Muse is a system that reflects on a given topic and generates creative, substantive content accordingly. It is trained on unstructured text, which is available in abundance, and is being developed using recurrent neural networks. The feasibility of the project is supported by the recent success of artificial neural networks, which are the first artificial pattern recognizers to achieve human-competitive or even superhuman performance on important benchmarks.
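To make the "recurrent neural networks" part concrete, here is a minimal character-level RNN forward pass in NumPy. This is an illustrative sketch of the general technique, not Muse's actual architecture; all names, sizes, and the toy corpus are assumptions, and training (backpropagation through time) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = sorted(set("hello world"))          # toy character vocabulary
char_to_ix = {c: i for i, c in enumerate(vocab)}
V, H = len(vocab), 16                       # vocab size, hidden size

# Randomly initialised parameters (training is omitted in this sketch).
Wxh = rng.normal(0, 0.01, (H, V))           # input -> hidden
Whh = rng.normal(0, 0.01, (H, H))           # hidden -> hidden (recurrence)
Why = rng.normal(0, 0.01, (V, H))           # hidden -> output
bh, by = np.zeros(H), np.zeros(V)

def step(h, char):
    """One RNN time step: consume a character, return (new_h, next-char probs)."""
    x = np.zeros(V)
    x[char_to_ix[char]] = 1.0               # one-hot encode the input character
    h = np.tanh(Wxh @ x + Whh @ h + bh)     # recurrent hidden-state update
    logits = Why @ h + by
    p = np.exp(logits - logits.max())
    return h, p / p.sum()                   # softmax over the next character

h = np.zeros(H)
for c in "hello":
    h, probs = step(h, c)

print(probs.sum())  # probabilities over the next character sum to 1
```

A trained model of this shape generates text by repeatedly sampling from `probs` and feeding the sampled character back in.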
This is my BTech main project; a group of three of us are currently working on it.
### Clatern - Machine Learning for Clojure

Clojure is a good fit for data analysis and machine learning: it makes it possible to leverage the performance of the JVM while working with intuitive abstractions. Clojure is already used for ML by startups like BigML and Prismatic, but it lacks a good ML library. To address this, I began developing Clatern, a machine learning library for the Clojure programming language built on core.matrix (a multi-dimensional array programming API for Clojure that offers a compelling alternative to NumPy). Clatern is under active development. It is compatible with Incanter (a Clojure-based, R-like statistical computing and graphics environment for the JVM), since both use core.matrix. I hope to develop Clatern into a production-ready, full-featured ML library.
Hosted on GitHub - http://github.com/rinuboney/clatern.
### Information Extraction from the EDGAR Database

A project that involved extracting and processing basic company information and financial data from 10-K and 10-Q SEC filings openly available in the EDGAR database.
In recent years, most 10-K and 10-Q SEC filings have been submitted in XBRL, an XML-based global standard for exchanging financial information. Extracting information from an XML document that follows a global standard might be expected to be easy. In practice, however, the format allows great flexibility, and there is little consistency between the filings of different companies: even in a small sample of 100 documents, very few meaningful tags were shared. Moreover, the same information is often annotated with different tags. This renders the task of extracting the required information from XBRL documents non-trivial. After a substantial amount of analysis of SEC filings, I developed a system to extract important information from 10-K and 10-Q filings.
Given a company name or SIC code, the final system produces important financial metrics for the previous seven years, along with key details about the company.
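The core difficulty described above — the same fact tagged under different element names — can be sketched as a tag-normalisation step. The XBRL snippet and the alias table below are illustrative, not real SEC data or the system's actual mapping:

```python
import xml.etree.ElementTree as ET

# Toy inline-XBRL-style snippet; in the real system, filings come from EDGAR.
SAMPLE = """<xbrl xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <us-gaap:Revenues contextRef="FY2023">1000000</us-gaap:Revenues>
</xbrl>"""

# Different filers tag revenue under different names; map known aliases
# to one canonical concept (alias list here is a small illustrative sample).
ALIASES = {
    "Revenues": "revenue",
    "SalesRevenueNet": "revenue",
    "RevenueFromContractWithCustomerExcludingAssessedTax": "revenue",
}

def extract_facts(xml_text):
    """Collect recognised financial facts from an XBRL document."""
    facts = {}
    for el in ET.fromstring(xml_text).iter():
        local = el.tag.rsplit("}", 1)[-1]   # strip the XML namespace prefix
        if local in ALIASES:
            facts[ALIASES[local]] = float(el.text)
    return facts

print(extract_facts(SAMPLE))  # {'revenue': 1000000.0}
```

Maintaining such alias tables per concept is one straightforward way to cope with the low tag overlap between companies' filings.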
### A Framework for Management Systems

A framework in Python that generates full-featured, database-agnostic management systems from user-specified metadata and settings. The framework is highly customizable and adapts to the functional and aesthetic taste of the user through plugins and themes.
The aim of this project was to create a framework that eases the development of management systems. The user first specifies the required tables and their respective fields, from which a basic management system is generated. Systems that require additional functionality can add it through plugins developed for the framework, and themes are provided to enhance the user interface. The project was completed successfully, and its versatility was demonstrated by building a variety of management systems with minimal effort. Various plugins and themes were also developed for the base system.
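The metadata-driven idea can be sketched as follows: the user declares tables and fields, and the framework generates the schema. All names here are illustrative, not the framework's actual API; sqlite3 stands in for the pluggable database backend.

```python
import sqlite3

# User-specified metadata: table name -> {field name: field type}.
metadata = {
    "students": {"name": "TEXT", "grade": "INTEGER"},
    "courses":  {"title": "TEXT", "credits": "INTEGER"},
}

def build_schema(meta):
    """Generate CREATE TABLE statements from user-specified metadata."""
    stmts = []
    for table, fields in meta.items():
        cols = ", ".join(f"{name} {ftype}" for name, ftype in fields.items())
        stmts.append(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {cols})")
    return stmts

db = sqlite3.connect(":memory:")        # any SQL backend could slot in here
for stmt in build_schema(metadata):
    db.execute(stmt)

tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(sorted(tables))  # ['courses', 'students']
```

From the same metadata, CRUD forms and list views can be generated mechanically, which is what makes a plugin/theme architecture on top of it practical.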
This was my BTech minor project.
### Identifying Topics of Interest from Tweets

Tweets are sparse and noisy, and their restricted length prevents standard text mining tools from being employed to their full potential. Initially, I tried simple NLP processing and methods such as tf-idf to identify keywords in a tweet, but the results were poor. I then experimented with topic modelling techniques such as Latent Dirichlet Allocation and Latent Semantic Analysis; the results were interesting but unsatisfactory.
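A minimal tf-idf scorer of the kind tried first illustrates the sparsity problem: with tweet-length documents, almost every term frequency is 1, so scores collapse to idf alone. The toy tweets below are illustrative.

```python
import math
from collections import Counter

tweets = [
    "machine learning is fun",
    "deep learning for nlp",
    "pizza for dinner tonight",
]

docs = [t.split() for t in tweets]
df = Counter(w for d in docs for w in set(d))   # document frequency per term
N = len(docs)

def tfidf(doc):
    """Score each term in one document by tf * idf."""
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

scores = tfidf(docs[0])
# Every term in the tweet occurs once, so tf is uniform and only idf
# separates the scores -- "learning" is penalised merely for appearing
# in a second tweet, not for being unimportant.
```

This flattening of the tf signal is one concrete reason tf-idf alone was a poor fit for tweets.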
The final system used DBpedia Spotlight to annotate important keywords, after applying some standard preprocessing. The most important of the annotated keywords were then identified and presented. This approach produced the required results: it was able to accurately identify the topics of interest of an active Twitter user.
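The counting step after annotation can be sketched like this. The JSON below imitates a DBpedia Spotlight `/annotate` response; treat the exact field names (`Resources`, `@URI`, `@surfaceForm`) and the sample data as assumptions for illustration.

```python
import json
from collections import Counter

# Imitation Spotlight responses for several tweets of one user.
responses = [
    '{"Resources": [{"@URI": "http://dbpedia.org/resource/Machine_learning",'
    ' "@surfaceForm": "machine learning"}]}',
    '{"Resources": [{"@URI": "http://dbpedia.org/resource/Machine_learning",'
    ' "@surfaceForm": "ML"},'
    ' {"@URI": "http://dbpedia.org/resource/Clojure",'
    ' "@surfaceForm": "Clojure"}]}',
]

def top_topics(raw_responses, k=2):
    """Count annotated DBpedia entities across tweets; return the k most common."""
    counts = Counter()
    for raw in raw_responses:
        for res in json.loads(raw).get("Resources", []):
            counts[res["@URI"]] += 1    # entity URIs unify surface forms
    return [uri for uri, _ in counts.most_common(k)]

print(top_topics(responses))
```

Counting entity URIs rather than raw words is what lets "ML" and "machine learning" reinforce the same topic, which plain keyword extraction misses.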
### Mirrors - A Scalable Realtime Collaboration Platform

I developed a scalable, web-based realtime collaboration platform: a web application where users log in and collaborate in a shared workspace called a 'mirror'. Each workspace consists of widgets determined by its type, and the workspace and its widgets are synchronized in realtime. The backend was powered by Tornado, a scalable, non-blocking web server and web application framework written in Python. The realtime connection between client and server was implemented with WebSockets, using tornado-sockjs to enable cross-browser WebSocket support. MongoDB was used for persistent data storage. A key component enabling scalability was RabbitMQ, a robust message broker through whose queues all messages were relayed. The whole system worked reliably, as all the libraries used were built for scalability and realtime support.
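The fan-out pattern that RabbitMQ provided can be sketched with stdlib queues standing in for broker queues: each workspace update is published once and relayed to every connected client. Class and field names here are illustrative, not the actual Mirrors code.

```python
from queue import Queue

class MirrorRelay:
    """Relay workspace updates to every subscribed client's queue."""

    def __init__(self):
        self.clients = {}                   # client id -> message queue

    def subscribe(self, client_id):
        q = Queue()
        self.clients[client_id] = q
        return q

    def publish(self, message):
        # In Mirrors this hop went through RabbitMQ queues; each client's
        # WebSocket handler consumed from its own queue and pushed updates
        # down the socket.
        for q in self.clients.values():
            q.put(message)

relay = MirrorRelay()
a = relay.subscribe("alice")
b = relay.subscribe("bob")
relay.publish({"widget": "notes", "op": "edit", "text": "hi"})
print(a.get()["widget"], b.get()["widget"])  # notes notes
```

Routing through a broker rather than directly between sockets is what lets the web tier scale out: any Tornado process can publish, and any process holding a client's connection can consume.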