Supervised learning

What is supervised learning?

Supervised learning is one of the training (learning) methods used in machine learning.

Supervised learning requires a clear goal: you must know exactly what result you want. For example, classifying things according to "established rules", or predicting a specific value.

Supervision doesn't mean standing next to the machine to check whether it's doing the right thing. It refers to the following process:

Choose a mathematical model suitable for the target task.
A portion of the known "questions and answers" (the training set) is given to the machine to learn.
The machine develops its own "methodology".
Humans give the machine "new questions" (test set) to answer.

The 4 processes of supervised learning

The questions and answers mentioned above are just an analogy. For a concrete task such as categorizing articles, the process looks like this:

Choose an appropriate mathematical model.
Give the machine a set of articles that have already been categorized, along with their categories.
The machine learns the "methodology" of categorization.
Once the machine has learned it, throw it some new articles (without categories) and ask the machine to predict the categories of these articles.
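
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the model choice (multinomial naive Bayes with add-one smoothing), the four training sentences, and the function names train and predict are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Step 1: the model chosen here is multinomial naive Bayes (an assumption).
# Step 2: a tiny, made-up training set of (article text, category) pairs.
TRAIN = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stocks fell as the bank reported losses", "finance"),
]

def train(examples):
    """Step 3: learn the 'methodology' by counting words per category."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter()               # category -> number of articles
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Step 4: return the category with the highest log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            # Add-one smoothing: unseen words must not zero out a category.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict(model, "the bank cut rates"))
print(predict(model, "a late goal won the final"))
```

The new articles passed to predict carry no category; the machine assigns one using only what it counted during training.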

2 tasks of supervised learning: regression, classification

Supervised learning has 2 main tasks:

Regression
Classification

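Regression: predicting a specific value on a continuous scale. For example: the Sesame Credit score in Alipay (explained in detail below).

Classification: sorting things into categories, that is, predicting a discrete value (one of a fixed set of options). For example: predicting whether a couple will divorce (explained in detail below).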
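
The difference between the two tasks shows up in the type of the output. The sketch below uses a made-up scoring rule and made-up thresholds purely for illustration; the function names and the single input on_time_ratio are hypothetical.

```python
def regress_score(on_time_ratio: float) -> float:
    """Regression: the output is a number on a continuous scale."""
    # Hypothetical rule for illustration, not a real credit formula.
    return 300.0 + 550.0 * on_time_ratio

def classify_risk(on_time_ratio: float) -> str:
    """Classification: the output is one of a fixed set of labels."""
    if on_time_ratio >= 0.9:
        return "low risk"
    if on_time_ratio >= 0.6:
        return "medium risk"
    return "high risk"

print(regress_score(0.8))   # a number somewhere between 300 and 850
print(classify_risk(0.8))   # one of three possible labels
```

The same input feeds both functions; only the kind of answer differs.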

Example of "regression": how is a Sesame Credit score produced?

The following is a method of assessing the creditworthiness of individuals - FICO.

Like Sesame Credit, FICO is used to assess an individual's creditworthiness. The FICO scoring system produces credit scores ranging from 300 to 850, with higher scores indicating lower credit risk.

Let's simulate how FICO might have been invented. The process is a regression task in supervised learning.

Step 1: Construct the problem and select the model

We first identify the factors that influence personal credit. Logically, a person's weight should have nothing to do with his credit: among the very creditworthy people around us, some are fat and some are thin.

Total wealth, on the other hand, does seem related to credit. Someone like Ma Yun (Jack Ma) would lose an enormous amount by breaking his word, which is why we have never heard of him failing to pay a credit card bill. A beggar loses very little by breaking his word: if he can no longer get by on one street, he simply moves on to another.

Based on this kind of judgment, we identify the following 5 influencing factors:

Payment record
Total amount of the account
Credit history span (credit history since account opening, credit history since account opening for specific types of accounts...)
New accounts (number of recent openings, percentage of openings for specific types of accounts...)
Credit categories (number of various accounts)

At this point, we have constructed a simple model:

credit score = f(payment record, total account amount, credit history span, new accounts, credit categories)

f can be understood simply as a specific formula that relates the 5 factors to a person's credit score.

Our goal is to find out exactly what this formula f is, so that we can compute a person's credit score as soon as we have these 5 types of data about him.

Step 2: Collect known data

To find this formula f, we first need to collect a large amount of known data. Each record must contain the 5 types of data about a person together with his/her credit status (with the credit status converted into a score).

We divide the data into several parts: one for training, and the others for validation and testing.
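
A minimal sketch of such a split is shown below. The 70/15/15 proportions, the fixed random seed, and the helper name split_dataset are assumptions chosen for illustration; the text does not prescribe particular proportions.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the records, then cut them into train, validation and test parts."""
    shuffled = list(records)
    # A fixed seed makes the shuffle reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test

records = list(range(100))   # stand-ins for 100 people's records
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters: if the records were collected in some order (for example by sign-up date), an unshuffled split would train and test on systematically different people.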

Step 3: Train the ideal model

With this data, we can "guess" the relationship between the 5 types of data and the credit score through machine learning. This relationship is the formula f.

Then we use the validation data and test data to verify whether the formula is OK.

The specific verification procedure is:

Apply the formula to the 5 types of data and calculate the credit score.
Compare the calculated credit score with the person's actual credit score (prepared in advance).
Evaluate the accuracy of the formula, then adjust and optimize it if the error is significant.
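
Training and verification can be sketched together as follows. Everything specific here is an assumption for illustration: the data is synthetic (generated from invented weights plus noise), f is assumed to be linear and is fitted by ordinary least squares, and the accuracy measure is the mean absolute error. For brevity the sketch uses only a training part and a test part. Real credit scoring models are not public and are certainly more involved.

```python
import random

FACTORS = ["payment_record", "account_total", "history_span",
           "new_accounts", "credit_categories"]

def make_synthetic_people(n, rng):
    """Invent n people: 5 factors in [0, 1] and a noisy score (made-up rule)."""
    hidden_weights = [190.0, 165.0, 85.0, 55.0, 55.0]   # illustration only
    X, y = [], []
    for _ in range(n):
        x = [rng.random() for _ in FACTORS]
        score = 300.0 + sum(w * v for w, v in zip(hidden_weights, x))
        X.append(x)
        y.append(score + rng.gauss(0.0, 5.0))           # add some noise
    return X, y

def fit_linear(X, y):
    """Least-squares fit of y = w.x + b by solving the normal equations."""
    rows = [list(x) + [1.0] for x in X]          # extra 1.0 for the bias b
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]  # 5 weights, then the bias

def predict(params, x):
    """Apply the fitted formula f to one person's 5 factors."""
    *weights, bias = params
    return sum(w * v for w, v in zip(weights, x)) + bias

def mean_abs_error(params, X, y):
    """Average gap between calculated and actual scores."""
    return sum(abs(predict(params, x) - t) for x, t in zip(X, y)) / len(y)

rng = random.Random(0)
X, y = make_synthetic_people(200, rng)
X_train, y_train = X[:140], y[:140]      # used to learn f
X_test, y_test = X[140:], y[140:]        # held back to check f

params = fit_linear(X_train, y_train)
error = mean_abs_error(params, X_test, y_test)
print(f"mean absolute error on held-out people: {error:.1f} points")
```

If the held-out error were large, that would be the signal to adjust the model, for example by adding factors or choosing a non-linear form for f.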

Step 4: Predict new users

When we want to know the credit status of a new user, we just need to collect these 5 types of data about him, apply the formula f and calculate it once to know the result!
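
As a sketch, applying an already-fitted f to a new user is a single function call. The weights and base score below are invented for this example (they are not FICO's or Sesame Credit's real formula), and each factor is assumed to have been scaled to the range 0 to 1 beforehand.

```python
# Hypothetical, already-fitted parameters of f. The numbers are invented
# for this sketch and do not come from any real scoring system.
WEIGHTS = {
    "payment_record": 190.0,
    "account_total": 165.0,
    "history_span": 85.0,
    "new_accounts": 55.0,
    "credit_categories": 55.0,
}
BASE_SCORE = 300.0

def credit_score(person):
    """Apply f to one person's 5 factors (each pre-scaled to 0..1)."""
    raw = BASE_SCORE + sum(WEIGHTS[name] * person[name] for name in WEIGHTS)
    return max(300.0, min(850.0, raw))   # keep the result within 300-850

new_user = {
    "payment_record": 0.9,
    "account_total": 0.5,
    "history_span": 0.4,
    "new_accounts": 0.8,
    "credit_categories": 0.6,
}
print(f"{credit_score(new_user):.1f}")
```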

That is a regression model closely related to your everyday life. The general idea is as described above, although the whole process has been simplified. If you want to see the complete process, see "Machine Learning - 7 Steps to Machine Learning Hands-on".

The case of "classification": how to predict divorce

Dr. Gottman, an American psychologist, used large amounts of data to uncover the truth about marital relationships, and his method follows the idea of classification.

Dr. Gottman was able to predict whether or not a couple would divorce after observing and listening to them for 5 minutes, with a 94% accuracy rate! He also published a book on this research, "Happily Married" (8.4 on Douban).

Step 1: Construct the problem and choose the model

Gottman suggests that conversations reflect underlying problems between couples, and that their arguments, laughter, flirtations, and emotional displays in conversations create certain emotional associations. The emotional associations in these conversations can be used to categorize couples into different types that represent different probabilities of divorce.

Step 2: Collecting known data

The researchers invited 700 couples to participate in the experiment. Each couple sat alone in a room, facing each other, and talked about a contentious topic such as money, sex, or relationships with in-laws. Murray and Gottman had each couple keep talking about the topic for 15 minutes and filmed the conversation. Observers then watched the videos and rated the husband and the wife based on the conversation between them.

Step 3: Train the ideal model

Gottman's method doesn't use machine learning to get its results, though the principles are similar. His conclusions are as follows:

First, they plotted the scores of the husband and the wife on a chart; where the two lines intersected indicated whether the marriage would remain stable in the long run. If the husband or the wife consistently scored negatively, the two were likely headed for divorce.

The key is to quantify the ratio of positive to negative interactions in the conversation. The ideal ratio is 5:1; if it is lower than that, the marriage is in trouble.

Finally, the results were fed into a mathematical model that used differential equations to highlight the underlying characteristics of successful marriages.
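
The 5:1 rule of thumb is simple arithmetic. The sketch below assumes each observed moment of the conversation has already been coded as positive (+1) or negative (-1). That coding scheme and the function names are assumptions for illustration, not Gottman's actual rating procedure.

```python
def positive_negative_ratio(codes):
    """Ratio of positive to negative moments in a coded conversation."""
    positives = sum(1 for c in codes if c > 0)
    negatives = sum(1 for c in codes if c < 0)
    if negatives == 0:
        return float("inf")      # no negative moments at all
    return positives / negatives

def below_ideal(codes, ideal=5.0):
    """True when the ratio falls below the ideal 5:1."""
    return positive_negative_ratio(codes) < ideal

steady = [1] * 10 + [-1] * 2     # 10 positive, 2 negative -> 5:1
strained = [1] * 6 + [-1] * 2    # 6 positive, 2 negative -> 3:1
print(positive_negative_ratio(steady), below_ideal(steady))
print(positive_negative_ratio(strained), below_ideal(strained))
```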

Gottman categorized the couples into five groups based on their scores:

Happy couples: calm, intimate, supportive of each other, and friendly in their relationship. They prefer to share experiences.
Ineffective couples: they do their best to avoid conflict, responding to each other only in positive ways.
Variable couples: they are romantic and passionate, but can be unusually argumentative. They are sometimes stable and sometimes unstable, but in general not very happy.
Hostile couples: one partner doesn't want to talk about something and the other agrees, so there is no communication between the two.
Couples who have no feelings for each other: one partner is eager to argue, but the other is not at all interested in the topic of discussion.

The mathematical model shows the difference between two types of stable couples (couples with a harmonious relationship and couples with a disharmonious relationship) and two types of unstable couples (hostile couples and unemotional couples). It also predicts that unstable couples may stay married even though their marriages are unstable.

Step 4: Making predictions about new users

Every year or two for 12 years, Murray and Gottman followed up with the 700 couples who had participated in the study. Their formula predicted divorce with 94 percent accuracy.

Mainstream supervised learning algorithms

Algorithm	Type	Introduction
Naive Bayes	Classification	Bayesian classification is a statistical classification method based on Bayes' theorem. It classifies a given tuple by predicting the probability that it belongs to a particular class. Naive Bayes classification assumes that, within a given class, the effect of one attribute value is independent of the other attributes; this is called class-conditional independence.
Decision Tree	Classification	A decision tree is a simple but widely used classifier. It builds a tree from the training data and uses it to classify unknown data.
SVM	Classification	Support Vector Machines turn the classification problem into the problem of finding a separating plane, and achieve classification by maximizing the distance between the boundary points of each class and that plane.
Logistic Regression	Classification	Logistic regression handles problems in which the dependent variable is categorical. It is most commonly used for binary problems but can also handle multi-class problems. Despite its name, it is a classification method.
Linear Regression	Regression	Linear regression is one of the most commonly used algorithms for regression tasks. Its form is very simple: it fits the data set with a hyperplane (a straight line when there are only two variables).
Regression Tree	Regression	A regression tree (a type of decision tree) learns hierarchically by repeatedly partitioning the dataset into branches, choosing each split to reduce the prediction error within the resulting groups as much as possible (the regression counterpart of maximizing information gain). This branching structure allows regression trees to learn non-linear relationships naturally.
K-Nearest Neighbors	Classification + Regression	Predicts a new data point by searching the entire training set for the K most similar instances (neighbors) and summarizing the output variable of those K instances.
AdaBoost	Classification + Regression	The purpose of AdaBoost is to learn a series of weak (base) classifiers from the training data and then combine them into a strong classifier.
Neural Networks	Classification + Regression	Abstracts the neuronal network of the human brain from an information-processing perspective, builds a simple model of it, and forms different networks through different connection patterns.
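
To make one row of the table concrete, here is a minimal sketch of K-nearest neighbors covering both of its task types. The Euclidean distance, the default K of 3, the toy data, and the function name knn_predict are assumptions for illustration; math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3, task="classification"):
    """Predict for query from its k nearest training points.

    train is a list of (features, target) pairs.
    """
    # Search the entire training set for the k most similar instances.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    targets = [target for _, target in neighbors]
    if task == "regression":
        return sum(targets) / len(targets)          # average of the neighbors
    return Counter(targets).most_common(1)[0][0]    # majority vote

labeled = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
           ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_predict(labeled, (1.1, 1.0)))
print(knn_predict(valued, (2.0,), task="regression"))
```

The only difference between the two tasks is how the neighbors' outputs are summarized: a vote for classification, an average for regression.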

Encyclopedia

Baidu Encyclopedia

Supervised learning is the process of adjusting the parameters of a classifier to achieve the required performance using a set of samples of known class. It is also known as supervised training or learning with a teacher.

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consists of a set of training examples. In supervised learning, each example consists of an input object (usually a vector) and a desired output value (also known as the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used to map new instances. An optimal solution would allow the algorithm to correctly determine the class labels of unseen instances. This requires the learning algorithm to generalize from the training data to unseen instances in a "sensible" way.

Wikipedia

Supervised learning is the machine learning task of learning a function that maps inputs to outputs based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (usually a vector) and a desired output value (also known as the supervisory signal). The supervised learning algorithm analyzes the training data and produces an inferred function that can be used to map new examples. An optimal solution will allow the algorithm to correctly determine the class labels of unseen instances. This requires the learning algorithm to generalize from the training data to unseen cases in a "reasonable" way.