Accessing Data and Creating a Dataset

This week, I will explore data preparation and dataset creation, along with the issues that come with them. Data is the lifeblood of a model: without it, the model cannot learn or make predictions. I will discuss the specifics of gathering data, formatting data to fit the needs of a model, and different ways to represent data.

Before creating a model, one must get data for the model to use. In this day and age, gathering data is not a significant problem. For example, Google gathers data about its users through their searches and their Google accounts. It stores documents its users create and personal information users enter when creating their Google accounts. Amazon gathers data from its users whenever they search for or buy something on Amazon’s website. Of course, information about users of a website is not the only data that can be collected. Almost anything can be collected and analyzed. For example, the weather, population, and vehicle traffic are all subjects from which data can be collected.

Once data has been collected, one must determine which data the model will use and which data is irrelevant. For example, if a model is being created to predict the weather, data about temperature, humidity, and rainfall may be relevant, but the names of previous storms are not.
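To make this concrete, here is a minimal Python sketch of that kind of selection using pandas; the column names and values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical raw weather records; column names are invented for illustration.
raw = pd.DataFrame({
    "temperature_c": [21.0, 18.5, 25.2],
    "humidity_pct":  [40, 65, 30],
    "rainfall_mm":   [0.0, 5.2, 0.0],
    "storm_name":    ["Alex", "Bonnie", "Colin"],  # irrelevant to prediction
})

# Keep only the columns the model will actually use.
features = raw[["temperature_c", "humidity_pct", "rainfall_mm"]]
print(features.head())
```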

The next step in creating a dataset is formatting the data. With supervised learning, as discussed in last week’s blog post, data needs to be separated into three categories. The first category is training data. This data will be used when training the model, as the name suggests. The second category is dev data. Dev data is used during development, after the model has been trained, to evaluate its ability to generalize. Generalization was also discussed in last week’s blog post. The model will probably make predictions on dev data more than once. The third category is test data. This data is used at the end of development. Test data is used to generate publishable results. Since the model has never seen the data in the test category before, the results produced on test data are an accurate measure of the model’s predictive capabilities.

There are different ways to split the data into train, dev, and test categories, and most of the time the right split depends on the type of model. Generally, though, more data is put in train than in dev or test. The model uses the train data to learn to make accurate predictions, so the more train data it has access to, the better its predictions will be. For example, one could split 50% of the data into the train category, 25% into the dev category, and 25% into the test category.
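As a rough sketch of that 50/25/25 split in Python (the split_dataset helper and the toy examples list are hypothetical stand-ins for a real labeled dataset), shuffling first so each category gets a representative sample:

```python
import numpy as np

def split_dataset(data, train_frac=0.50, dev_frac=0.25, seed=0):
    """Shuffle the data, then split it into train, dev, and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_train = int(train_frac * len(data))
    n_dev = int(dev_frac * len(data))
    train = [data[i] for i in indices[:n_train]]
    dev = [data[i] for i in indices[n_train:n_train + n_dev]]
    test = [data[i] for i in indices[n_train + n_dev:]]
    return train, dev, test

examples = list(range(100))             # stand-in for real labeled examples
train, dev, test = split_dataset(examples)
print(len(train), len(dev), len(test))  # 50 25 25
```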

Next week, I will discuss the first model in this series of blog posts, neural networks, along with some common misconceptions about what they are and are not. Neural networks are one of the more basic models in machine learning, but they are still powerful tools, and variations of them are widely used today.

Generalization and Overfitting

This week I’ll be discussing generalization and overfitting, two important and closely related topics in the field of machine learning.

However, before I elaborate on generalization and overfitting, it is important to first understand supervised learning, the setting where overfitting is most often discussed. Supervised learning is one way for a model to learn from and understand data. There are other types of learning, such as unsupervised and reinforcement learning, but those are topics for another time and another blog post. With supervised learning, a model is given a set of labeled training data. The model learns to make predictions based on this training data, so the more training data the model has access to, the better it gets at making predictions. With training data, the outcome is already known. The model’s predictions are compared with the known outcomes, and the model’s parameters are adjusted until the two align. The point of training is to develop the model’s ability to generalize successfully.
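As a minimal sketch of this compare-and-adjust loop, assuming (purely for illustration) a linear model trained with gradient descent on squared error:

```python
import numpy as np

# Toy labeled training data: y is roughly 3*x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3 * x + 1 + rng.normal(scale=0.1, size=50)

# Model: y_hat = w*x + b. Adjust parameters until predictions align with labels.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    y_hat = w * x + b
    error = y_hat - y                   # compare predictions with known outcomes
    w -= lr * 2 * np.mean(error * x)    # gradient of mean squared error w.r.t. w
    b -= lr * 2 * np.mean(error)        # gradient w.r.t. b

print(round(w, 2), round(b, 2))         # should approach 3.0 and 1.0
```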

Generalization is a term used to describe a model’s ability to react to new data: after being trained on a training set, the model can digest new data and make accurate predictions. A model’s ability to generalize is central to its success. If a model has been trained too closely on its training data, it will be unable to generalize. It will make inaccurate predictions when given new data, making it useless even though it can make accurate predictions on the training data. This is called overfitting. The inverse problem, underfitting, happens when a model has not been trained enough on the data. Underfitting makes the model just as useless, since it cannot make accurate predictions even on the training data.

The figure demonstrates the three concepts discussed above. On the left, the blue line represents a model that is underfitting. The model notes that there is some trend in the data, but it is not specific enough to capture the relevant information; it cannot make accurate predictions for the training data or new data. In the middle, the blue line represents a balanced model. This model notes the trend in the data and models it accurately, so it will be able to generalize successfully. On the right, the blue line represents a model that is overfitting. The model captures the training data accurately, but it is too specific. It will fail to make accurate predictions on new data because it learned the training data too well.
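One way to reproduce these three regimes yourself is to fit polynomials of increasing degree to noisy data, as in this Python sketch (the data, degrees, and error measure are chosen purely for illustration): a low degree underfits the curved trend, a moderate degree balances it, and a very high degree memorizes the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)  # noisy curved trend

x_new = np.linspace(0.05, 0.95, 10)   # held-out "new" data from the same trend
y_new = np.sin(2 * np.pi * x_new)

for degree in (1, 3, 9):  # underfit, balanced, overfit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: train error {train_err:.3f}, new-data error {new_err:.3f}")
```

With this setup, the degree-9 fit typically drives the training error to nearly zero while the error on new data blows up, which is overfitting in miniature.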

Next week, I will discuss the specifics of gathering data, formatting data to fit the needs of a model, and different ways to represent data. Data is what a machine learning model uses to make predictions for new situations. It is great to have a model, but without data for the model to interact with, the predictions the model makes will be useless.

Photo is titled “mlconcepts_image5” and was created by Amazon. It is available at http://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png

An Introduction to Machine Learning

Machine learning is revolutionizing the world. Amazon uses it to recommend products. Netflix uses it to recommend new content to a user. Its use in detecting online fraud is becoming increasingly popular. Google’s search engine uses it to provide its users with more meaningful results. Even speech recognition programs have made vast improvements thanks to machine learning. It has applications in almost every field and is quickly becoming an integral part of new technology.

So, what is machine learning? To put it generally, machine learning uses models to learn patterns in data and make predictions based on those patterns. Its aim is to enable models to learn from data without being explicitly programmed. People are interested in the applications of machine learning because a model can solve problems too complicated or too time consuming for a human mind. For example, as mentioned above, Netflix uses machine learning to recommend new movies and TV shows to a user based on that user’s viewing history. A human could not determine what new content the user would like in an efficient manner, but a machine learning model can.

It may seem like machine learning is a new topic in computer science, but the idea of computers thinking for themselves has been pursued almost as long as computers have existed. Alan Turing first approached the subject in 1950, when he published a paper on artificial intelligence titled “Computing Machinery and Intelligence”. The first learning machine, SNARC, was built by Marvin Minsky in 1951. Since then, many more machine learning algorithms have been developed and refined as the field has proven its incredible potential for the future of technology.

In the past, machine learning algorithms were constrained by a computer’s lack of memory and processing power. Today, researchers still face those issues, but with more powerful machines and cheaper memory, machine learning has grown in popularity.

If we have models that can think for themselves, isn’t there the danger of new sentient computer overlords? Not yet. Despite models being able to learn from data and make intelligent predictions from it, researchers still have a long way to go before true artificial intelligence is a possibility.
Next week, I will examine model generalization and overfitting. Generalization refers to a model’s ability to interpret new situations and react accordingly. Overfitting happens when a model learns the training data too well and becomes unable to generalize. These two topics are common problems in machine learning, and it is important to understand them before learning more. After that, I will explore data preparation and dataset issues. Data is the lifeblood of a model: without it, the model cannot learn or make predictions. Finally, I will consider a basic neural network, a convolutional neural network, and a deep neural network. Neural networks and their many variations are a long-established but still relevant family of models. Of course, they are not the only algorithm in machine learning, but it is difficult to learn about the field without some type of neural network being mentioned.