0. Introduction

  • Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
  • It is seen as a subset of artificial intelligence.
  • Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.

Contents

  • Why Machine Learning?
  • Problems Machine Learning Can Solve
    • Knowing Your Task and Knowing Your Data
  • Essential Libraries and Tools
    • scikit-learn
    • Jupyter Notebook
    • NumPy
    • SciPy
    • matplotlib
    • pandas
  • Python
    • Why Python?
  • A First Application: Classifying Iris Species
    • Meet the Data
    • Measuring Success: Training and Testing Data
    • First Things First: Look at Your Data
    • Building Your First Model: k-Nearest Neighbors
    • Making Predictions
    • Evaluating the Model
  • Summary and Outlook

1. Why Machine Learning?

Two major disadvantages of hand-coded rules:

  • The logic required to make a decision is specific to a single domain and task. Changing the task even slightly might require a rewrite of the whole system.

  • Designing rules requires a deep understanding of how a decision should be made by a human expert.

With machine learning, we instead use data to create the rules for our system: change or add data => retrain the model => the rules are updated. A minimal sketch of this workflow follows.
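
A minimal sketch of the retrain-to-update workflow, assuming scikit-learn and a toy dataset (both are illustrative assumptions, not part of the original text):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature values
y = np.array([0, 0, 1, 1])                  # toy labels

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)  # the "rules" are learned from the data

# New data arrives: append it and refit -- same code, updated rules
X_updated = np.vstack([X, [[5.0]]])
y_updated = np.append(y, 1)
model.fit(X_updated, y_updated)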

2. Problems Machine Learning Can Solve

  • Supervised learning: machine learning algorithms that learn from input/output pairs (the outputs are labeled).
  • Unsupervised learning: only the input data is known, and no known output data is given to the algorithm. A short sketch contrasting the two settings follows.
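
A minimal sketch contrasting the two settings with scikit-learn on toy data (the dataset and the choice of KNeighborsClassifier/KMeans are illustrative assumptions):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])

# Supervised: input/output pairs -- labels y are provided for training
y = np.array([0, 0, 1, 1])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[1.1, 1.0]]))  # predicts a learned label: [0]

# Unsupervised: only X is given -- the algorithm finds structure on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignments found without any labels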

2.1 Represent Your Data

Data is arranged in a table:

  • Each data point that you want to reason about (each email, each customer, each transaction) is a row
  • Each entity or row here is known as a sample (or data point)
  • Each property that describes that data point (say, the age of a customer or the amount or location of a transaction) is a column.
  • The columns (the properties that describe these entities) are called features. A toy example of this layout follows.
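
A toy illustration of the rows-as-samples, columns-as-features layout, using pandas with hypothetical customer data (the values are made up for illustration):

import pandas as pd

table = pd.DataFrame({
    'age': [24, 53, 33],                     # a feature
    'amount': [9.99, 250.00, 42.50],         # a feature
    'location': ['NY', 'Berlin', 'London'],  # a feature
})
print(table.shape)    # (3, 3): 3 samples, 3 features
print(table.iloc[0])  # the first sample (data point) is a row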

2.2 Knowing Your Task and Knowing Your Data

  • Understanding the data
  • How it relates to the task

First, think about these questions:

  • What question(s) am I trying to answer? Do I think the data collected can answer that question?
  • What is the best way to phrase my question(s) as a machine learning problem?
  • Have I collected enough data to represent the problem I want to solve?
  • What features of the data did I extract, and will these enable the right predictions?
  • How will I measure success in my application?
  • How will the machine learning solution interact with other parts of my research or business product?

3. Essential Libraries and Tools

scikit-learn

  • scikit-learn is an open source project, meaning that it is free to use and distribute, and anyone can easily obtain the source code to see what is going on behind the scenes.

Jupyter Notebook

  • The Jupyter Notebook is an interactive environment for running code in the browser.
  • A great tool for exploratory data analysis and is widely used by data scientists.

NumPy

  • Contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.
In [2]:
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
x:
[[1 2 3]
 [4 5 6]]

SciPy

  • Provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions.
In [3]:
from scipy import sparse

# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n", eye)
NumPy array:
 [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
In [4]:
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n", sparse_matrix)
SciPy sparse CSR matrix:
   (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
A dense representation of sparse data is often impossible to create (it would not fit into memory), so sparse matrices can also be built directly, for example in COO format:

In [8]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n", eye_coo)
COO representation:
   (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0

matplotlib

  • The primary scientific plotting library in Python; provides functions for making visualizations such as line charts, histograms, and scatter plots.
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt

# Generate a sequence of numbers from -10 to 10 with 100 steps in between
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")
Out[9]:
[<matplotlib.lines.Line2D at 0x7f35c809d208>]

pandas

  • A library for data wrangling and analysis, built around a table-like data structure called the DataFrame.
In [10]:
import pandas as pd

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
display(data_pandas)
    Name  Location  Age
0   John  New York   24
1   Anna     Paris   13
2  Peter    Berlin   53
3  Linda    London   33
In [11]:
# Select all rows that have an age column greater than 30
display(data_pandas[data_pandas.Age > 30])
    Name Location  Age
2  Peter   Berlin   53
3  Linda   London   33

4. Python

Why Python?

Python has become the lingua franca for many data science applications.

  • It combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages like MATLAB or R.
  • Python has libraries for data loading, visualization, statistics, natural language processing, image processing, and more.

Versions Used

In [12]:
import sys
print("Python version:", sys.version)

import pandas as pd
print("pandas version:", pd.__version__)

import matplotlib
print("matplotlib version:", matplotlib.__version__)

import numpy as np
print("NumPy version:", np.__version__)

import scipy as sp
print("SciPy version:", sp.__version__)

import IPython
print("IPython version:", IPython.__version__)

import sklearn
print("scikit-learn version:", sklearn.__version__)
Python version: 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]
pandas version: 0.24.2
matplotlib version: 3.0.3
NumPy version: 1.16.2
SciPy version: 1.2.1
IPython version: 7.4.0
scikit-learn version: 0.20.3
In [29]:
# mglearn is the helper library accompanying the book
# "Introduction to Machine Learning with Python"; uncomment to install it
# !pip install mglearn

5. A First Application: Classifying Iris Species

[Figure: parts of an iris flower, showing the sepal and the petal]

The Iris dataset describes iris flowers. Each iris comes with a few measurements: the length and width of the petals and the length and width of the sepals, all measured in centimeters.

5.1 Meet the Data

In [13]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
In [14]:
print("Keys of iris_dataset:\n", iris_dataset.keys())
Keys of iris_dataset:
 dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

The value of the key DESCR is a short description of the dataset.

In [15]:
print(iris_dataset['DESCR'][:193] + "\n...")
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...
In [16]:
print("Target names:", iris_dataset['target_names'])
Target names: ['setosa' 'versicolor' 'virginica']
In [17]:
print("Feature names:\n", iris_dataset['feature_names'])
Feature names:
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [18]:
print("Type of data:", type(iris_dataset['data']))
Type of data: <class 'numpy.ndarray'>
In [19]:
print("Shape of data:", iris_dataset['data'].shape)
Shape of data: (150, 4)
  • The array contains measurements for 150 different flowers. Remember that the individual items are called samples (here, 150) and their properties are called features (here, 4).
In [20]:
print("First five rows of data:\n", iris_dataset['data'][:5])
First five rows of data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
In [21]:
print("Type of target:", type(iris_dataset['target']))
Type of target: <class 'numpy.ndarray'>
In [22]:
print("Shape of target:", iris_dataset['target'].shape)
Shape of target: (150,)
In [23]:
print("Target:\n", iris_dataset['target'])
Target:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

5.2 Measuring Success: Training and Testing Data

  • We cannot evaluate the model on the data we used to build it, because the model could simply remember the whole training set and would then predict the correct label for any point in it.
  • We should split the labeled data we have collected (here, our 150 flower measurements) into two parts:
    • The training data (also called the training set): used to build our machine learning model.
    • The test data (also called the test set or hold-out set): used to assess how well the model works.
In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
  • This function shuffles the dataset (the iris data is sorted by label, so an unshuffled split would leave entire classes out of the training set) and extracts 75% of the rows, together with the corresponding labels, as the training set; the remaining 25% of the data, together with the remaining labels, is declared the test set. A short sketch making the defaults explicit follows.
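
A short sketch making the defaults explicit (test_size=0.25 reproduces the 75%/25% split, shuffle=True mixes the label-sorted rows, and random_state=0 makes the shuffle reproducible):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, shuffle=True, random_state=0)
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
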
In [25]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
X_train shape: (112, 4)
y_train shape: (112,)
In [26]:
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
X_test shape: (38, 4)
y_test shape: (38,)

5.3 First Things First: Look at Your Data

In [32]:
import mglearn
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8, cmap=mglearn.cm3)
Out[32]:
[4x4 scatter-matrix plot of the four iris features, colored by y_train]
  • We can see that the three classes seem to be relatively well separated using the sepal and petal measurements. This means that a machine learning model will likely be able to learn to separate them.

5.4 Building Your First Model: k-Nearest Neighbors

  • In its simplest form (k = 1), KNN finds the point in the training set that is closest to the new point and assigns the label of that training point to the new data point. A from-scratch sketch of this rule follows.
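
A minimal from-scratch sketch of the k = 1 rule using NumPy, for intuition only (scikit-learn's KNeighborsClassifier below is what we actually use):

import numpy as np

def one_nearest_neighbor(X_train, y_train, x_new):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # the new point gets the label of the closest training point
    return y_train[np.argmin(distances)]

# e.g. one_nearest_neighbor(X_train, y_train, np.array([5, 2.9, 1, 0.2]))
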
In [33]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
In [34]:
knn.fit(X_train, y_train)
Out[34]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

5.5 Making Predictions

In [35]:
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape:", X_new.shape)
X_new.shape: (1, 4)

To make a prediction, we call the predict method of the knn object

In [36]:
prediction = knn.predict(X_new)
print("Prediction:", prediction)
print("Predicted target name:",
       iris_dataset['target_names'][prediction])
Prediction: [0]
Predicted target name: ['setosa']

5.6 Evaluating the Model

  • We can make a prediction for each iris in the test data and compare it against its label (the known species). We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted:
In [37]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n", y_pred)
Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
In [38]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Test set score: 0.97
In [39]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
Test set score: 0.97

6. Summary and Outlook

In [40]:
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
Test set score: 0.97

Key points to keep in mind from this chapter:

  • What machine learning is and where it is applied
  • The distinction between supervised and unsupervised learning
  • Splitting the dataset into a training set (to build the model) and a test set (to evaluate it)
  • Making predictions for new data points