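Getting to Know Machine Learning with the KNN Algorithm

Machine Learning is not that scary after all~

Note: this post was written 435 days ago and last modified 433 days ago, so some of the information may be out of date.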

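1. 🤯 Introduction

After what was probably my first meeting with my teacher Zhou, I was advised to start learning about Deep Learning, to lay a foundation for the formal research that begins once the semester starts. Back home, full of confidence, I bought Mu Li's Dive into Deep Learning (PyTorch edition) and set off on my DL journey.

Reality was harsh, though. I could just about cope with the matrix calculus and the linear regression model at the start, but once I reached Chapter 4 on perceptrons I was completely lost. My face went straight to: 🤯😖🥴😵💫.

I also have a habit when learning: I dislike understanding things only superficially, and I dislike muddling through as long as a result comes out without caring about the process. So, to learn Deep Learning properly and apply it to my own field, I decided to start from the basics and get to know Machine Learning first.

Python is currently a very popular language in machine learning, and I already have a decent grasp of its syntax. Within Python, scikit-learn is a very well-known ML library, so I chose it as my tool for learning ML.

KNN (the k-nearest neighbors algorithm) is one of the simplest and easiest-to-understand classification algorithms in ML. After studying it I found this to be true: middle-school mathematics is probably enough. That makes KNN a good fit for getting to know both the ML workflow and the scikit-learn package.

The .ipynb file with the code from this post is on GitHub: Study-for-Machine-Learning.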

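2. 🎤 Overview

KNN stands for K Nearest Neighbors. The name already hints at how the algorithm works. Since it is about the k nearest neighbours, the value of k is clearly crucial. And what about "nearest neighbours"? The idea is that, to predict the label of a new point x, KNN looks at the classes of the k points closest to x and assigns x to the class that dominates among them.

For example, in the figure, inside the solid black circle, the green ⚪ is x, and its k=3 nearest elements are one blue ■ and two red ▲. The probability that x is a ■ is 1/3 and that it is a ▲ is 2/3, so KNN classifies the unknown green ⚪ as ▲. With k=5, inside the dashed circle, there are three blue ■ and two red ▲, so the probability of ■ is 3/5 and of ▲ is 2/5, and KNN classifies the green ⚪ as ■.

Simple enough, right? That is the most basic explanation of KNN; further details are introduced step by step in the code implementation below.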
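
The k=3 versus k=5 example can be checked with a few lines of Python. This is only a toy sketch: the neighbour ordering below is made up so that it matches the counts in the description (the actual distances are not given), and vote is a hypothetical helper name.

```python
from collections import Counter

# Labels of the five nearest neighbours, nearest first.
# The order is assumed; only the counts (k=3: 1 square, 2 triangles;
# k=5: 3 squares, 2 triangles) come from the description.
neighbours = ['triangle', 'triangle', 'square', 'square', 'square']


def vote(labels, k):
    """Return the majority label among the first k labels and its share."""
    label, count = Counter(labels[:k]).most_common(1)[0]
    return label, count / k


print(vote(neighbours, 3))  # majority among the 3 nearest
print(vote(neighbours, 5))  # majority among the 5 nearest
```

The same query point gets a different label depending on k, which is why k has to be chosen with care.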

3. ⌨️ A bare-bones KNN in pure Python

3.1 Creating the data

First, create some data:

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create  data
raw_data_x = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# convert list object to np.array object
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

# test point
demo_point = np.array([8.093607318,3.365731514])

Plot the data:

plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
plt.plot(demo_point[0], demo_point[1], 'r*', markersize=10)

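3.2 About the data

In the data above, raw_data_x is the feature set and raw_data_y is the label set. The first column of the features, x_train[:, 0], is the x coordinate, and the second, x_train[:, 1], is the y coordinate. demo_point is the point we want to classify. From the plot it is easy to see that it is surrounded by yellow points, whose label is 1, so KNN should also predict 1 for demo_point. Let's implement a bare-bones KNN to check.

3.3 Implementing the bare-bones KNN

The concrete procedure for KNN is:

Loop over every point in x_train, compute its distance to demo_point, and store the distances in a list;
Sort the distances in ascending order and take the k points with the smallest distances;
Count how many of those k points have label 0 and how many have label 1; the label with the larger share is the prediction.

3.3.1 Computing the list of distances

There are many ways to measure the distance between two points; the different distance formulas are discussed in more detail later, when we look at the KNN implementation in scikit-learn. For now we use the simplest one, the Euclidean distance: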

$$ D_{(x, y)}=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots +(x_n-y_n)^2} =\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2 } $$

dis = []  # distances from demo_point to each training point

for x in x_train:  # Euclidean distance from each training point to demo_point
    dis.append(np.sqrt(np.sum((x - demo_point) ** 2)))

print(dis)  # print the computed distances
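
The loop can also be written without an explicit for, using NumPy broadcasting. This is a minimal sketch on a tiny made-up data set (points and query are hypothetical names, not the data from this post); it only shows that the two forms agree.

```python
import numpy as np

# Small hypothetical data set: three training points and one query point
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([0.0, 0.0])

# Loop version, same formula as in the text
loop_dis = [np.sqrt(np.sum((p - query) ** 2)) for p in points]

# Vectorised version: broadcasting subtracts query from every row,
# and norm(..., axis=1) takes the Euclidean length of each row
vec_dis = np.linalg.norm(points - query, axis=1)

print(loop_dis)
print(vec_dis)
```

For ten points the difference is negligible, but the vectorised form tends to matter once the training set grows.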

3.3.2 Sorting the distances in ascending order, taking the first k, and reading their labels

sort_dis = np.argsort(dis)  # sort the distance and get the index of distance

K = 6  # Define k
top_k_y = [y_train[i] for i in sort_dis[:K]]

Taking the first k=6 entries of the sorted distances gives the labels [1, 1, 1, 1, 1, 0].

3.3.3 Computing the share of each label among the first k and outputting the largest

from collections import Counter

votes = Counter(top_k_y).most_common(1)[0]
print(f'The result of KNN is {votes[0]}, probability is {votes[1]}/{K}.')

Printed output:

The result of KNN is 1, probability is 5/6.

So KNN with k=6 predicts label 1 for demo_point with probability 5/6: of the 6 points nearest to it, 5 have label 1.
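
With an even k such as 6, a tie between two labels is possible. The sketch below shows what the Counter-based vote does in that case, assuming Python 3.7 or later, where most_common returns equal counts in first-seen order. Because top_k_y is ordered by distance, the tie then goes to the tied label that appears nearest. The labels here are made up for illustration.

```python
from collections import Counter

# Hypothetical labels of the 4 nearest neighbours, nearest first: a 2-2 tie
top_k_y = [1, 0, 0, 1]

# most_common keeps first-seen order among equal counts (Python 3.7+),
# so the label of the nearest neighbour wins the tie
label, count = Counter(top_k_y).most_common(1)[0]
print(label, count)
```

Choosing an odd k is a common way to avoid ties when there are only two classes.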

3.3.4 Wrapping it in a function

The steps above can be wrapped in a function so that they are easier to call:

# Import packages
import numpy as np
from collections import Counter

# The function of KNN
def KNN_classify(k, x_train, y_train, find_point):
    """The function of KNN

    Args:
        k (int): Calculate the nearest K point
        x_train (np.array): Training data set
        y_train (np.array): Training data label
        find_point (np.array): The point you need to find
    """
    # Check whether the input data are valid
    assert 1 <= k <= x_train.shape[0], f'k={k} must be between 1 and the number of training samples!'
    assert x_train.shape[0] == y_train.shape[0], 'x_train and y_train must have the same number of samples!'
    assert find_point.shape[0] == x_train.shape[1], 'find_point must have the same number of features as x_train!'

    # Distance from find_point to every training point
    dis = [np.sqrt(np.sum((x - find_point) ** 2)) for x in x_train]
    sort_index = np.argsort(dis)
    top_k_y = [y_train[i] for i in sort_index[:k]]
    votes = Counter(top_k_y).most_common(1)[0]

    print(f'The result of KNN is {votes[0]}, probability is {votes[1]}/{k}.')


KNN_classify(6, x_train, y_train, demo_point)

3.3.5 Wrapping it in a class

We can also wrap this basic KNN in a class, which makes it easier to reuse and maintain:

# Import packages
import numpy as np
from collections import Counter

# Define KNN Object
class KNN:
    def __init__(self, k):
        assert k >= 1, 'k must >= 1!'
        self.k = k
        self._x_train = None
        self._y_train = None
        self._y_predict = None

    def fit(self, x_train, y_train):
        # Check whether the input data are valid
        assert 1 <= self.k <= x_train.shape[0], f'k={self.k} must be between 1 and the number of training samples!'
        assert x_train.shape[0] == y_train.shape[0], 'x_train and y_train must have the same number of samples!'
        self._x_train = x_train
        self._y_train = y_train
        return self

    def predict(self, x_predict):
        assert x_predict.shape[1] == self._x_train.shape[
            1], f'The feature number of {x_predict} must be equal to {self._x_train}'
        self._y_predict = [self._predict(i) for i in x_predict]
        return np.array(self._y_predict)

    def _predict(self, x_pre):
        # Distances from x_pre to the stored training points
        dis = [np.sqrt(np.sum((x - x_pre) ** 2)) for x in self._x_train]
        sort_index = np.argsort(dis)
        top_k_y = [self._y_train[i] for i in sort_index[:self.k]]
        votes = Counter(top_k_y).most_common(1)[0]

        return votes[0]

    def score(self, y_test):
        return np.sum(self._y_predict == y_test) / len(y_test)

    def __repr__(self) -> str:
        return f'KNN(K={self.k})'


KNN_classifier = KNN(6)
KNN_classifier.fit(x_train, y_train)
KNN_classifier.predict(demo_point.reshape(1, -1))

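3.3.6 Using KNN in scikit-learn

scikit-learn is a well-known machine learning package for Python. It naturally includes KNN, in a form far more complete, powerful and efficient than what we wrote ourselves above.

Here is how to reproduce the bare-bones KNN above with scikit-learn.

1. First import the class and create an object: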

# Import package
from sklearn.neighbors import KNeighborsClassifier
# Create object
KNN_classifier = KNeighborsClassifier(n_neighbors=6)

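Tips
The n_neighbors parameter is the number of nearest neighbours to use, that is, the k we set above.

2. Next, fit the model on the training data. Note that KNN is a lazy learner: there is essentially no training inside it. fit mainly stores the training set (scikit-learn may also build a search index over it), and the distance computations happen at prediction time. Training in other ML algorithms is far more involved.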

KNN_classifier.fit(x_train, y_train)

3. Then predict:

KNN_classifier.predict(demo_point.reshape(1, -1))

The output is array([1]), so KNN predicts label 1 for demo_point.

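4. 📦 Getting to know the scikit-learn ML workflow with KNN

Tips
Everything above was groundwork; this chapter is the core of this post!!!

Above we used KNN to get to know the most basic ML workflow in scikit-learn. There are still many small details to watch for, so here is a more careful walkthrough.

The scikit-learn ML workflow roughly consists of (as I currently understand it):

Data preprocessing;
Splitting the dataset;
Training the model on the training set;
Evaluating the model on the test set to see whether its prediction accuracy is reliable.

Each of these four steps is covered below.

4.1 Data preprocessing

Data preprocessing includes filling missing values, handling outliers, encoding categorical variables, and normalizing the data.

Most preprocessing can be done with NumPy and Pandas methods combined with some domain knowledge. A book I have been reading recently that I recommend: 《深入浅出Pandas》. For normalization, see this blog post: 数据中心化与标准化 (data centering and standardization).

Tips
Note that:
If the training set is normalized, the test set must be normalized as well, otherwise the model will perform terribly;
When normalizing the test set, use the mean and variance of the training set; do not recompute them from the test set.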
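
The second tip can be made concrete with a small NumPy sketch. The arrays below are made up for illustration, and the variable names are hypothetical. scikit-learn's StandardScaler follows the same pattern of fitting on the training set and then transforming both sets, but the code here uses plain NumPy only.

```python
import numpy as np

# Hypothetical feature matrices (rows are samples, columns are features)
x_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test_demo = np.array([[2.0, 40.0]])

# Statistics are computed from the training set only
mean = x_train_demo.mean(axis=0)
std = x_train_demo.std(axis=0)

# The same mean and std are applied to both sets
x_train_std = (x_train_demo - mean) / std
x_test_std = (x_test_demo - mean) / std

print(x_train_std)
print(x_test_std)
```

After this step the training features have mean 0 and standard deviation 1, while the test features are expressed on the same scale without ever influencing the statistics.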

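4.2 Splitting the dataset

4.2.1 Splitting the dataset by hand

This step splits the data into a training set and a test set. Put simply, the training set is used to train the model, and the test set is used to check whether the trained model is reliable.

Splitting the dataset amounts to shuffling the feature matrix together with its labels, taking a small fraction as the test set, and using the rest as the training set.

Take the iris dataset as an example: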

# Load the iris flower data set
from sklearn import datasets

iris = datasets.load_iris()
x, y = iris.data, iris.target
x.shape, y.shape

The output is ((150, 4), (150,)): the feature matrix x has 150 samples with 4 features each, and every sample has one label.

Running:

pd.DataFrame(y).value_counts()

shows that there are 3 label values, 0, 1 and 2, with 50 samples each.

Shuffling the feature matrix together with its labels is simple: just shuffle the row indices.

# Shuffle the indices
shuffle_index = np.random.permutation(x.shape[0])

Then compute the number of test samples from the test ratio:

test_ratio = 0.3
test_num = int(x.shape[0] * test_ratio)

Then take the test and training indices from the shuffled index array:

test_index = shuffle_index[:test_num]
train_index = shuffle_index[test_num:]

This gives the test set and the training set:

# Divide data set
x_train, x_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]
x_train.shape, x_test.shape, y_train.shape, y_test.shape

The output is ((105, 4), (45, 4), (105,), (45,)): the training set has 105 samples and the test set has 45, so the test set makes up 0.3 of all samples.

4.2.2 Wrapping it in a function

def train_test_split(x, y, test_ratio=0.3, seed=None):
    """Split features and labels into training and test sets.

    Args:
        x (np.array): Feature matrix
        y (np.array): Label vector
        test_ratio (float or int, optional): Fraction of samples used for testing if < 1,
            otherwise the absolute number of test samples. Defaults to 0.3.
        seed (int, optional): Random seed. Defaults to None.

    Returns:
        x_train, y_train, x_test, y_test (a different order from scikit-learn's function)
    """
    assert x.shape[0] == y.shape[0], 'The size of x must be equal to the size of y!'
    assert test_ratio > 0, 'The test ratio must be bigger than zero!'

    if seed is not None:  # a seed of 0 is also valid
        np.random.seed(seed)

    shuffle_index = np.random.permutation(x.shape[0])

    if test_ratio < 1:
        test_size = int(x.shape[0] * test_ratio)
    else:
        test_size = int(test_ratio)

    train_index = shuffle_index[test_size:]
    test_index = shuffle_index[:test_size]

    x_train, y_train, x_test, y_test = x[train_index], y[train_index], x[test_index], y[test_index]

    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test = train_test_split(x, y, 50)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

4.2.3 Using train_test_split in scikit-learn

scikit-learn already provides a function for splitting datasets, sklearn.model_selection.train_test_split; the official documentation describes its usage in more detail.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=666)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

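4.3 Evaluating the model

Tips
Since the result of training has to be evaluated and verified, let's cover model evaluation first.

Model evaluation means verifying the trained model: judging how accurate it is and whether it can be used in practice.

Evaluating a model is simple: pass the test features x_test to the model, compare its predictions with y_test, and divide the number of correctly predicted samples by the total number of samples in y_test to get the accuracy.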
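
As a minimal sketch of that calculation, with made-up labels and predictions (the array names are hypothetical and not tied to any model in this post):

```python
import numpy as np

# Hypothetical true labels and model predictions for six test samples
y_true_demo = np.array([0, 1, 2, 2, 1, 0])
y_pred_demo = np.array([0, 1, 2, 1, 1, 0])

# Accuracy = number of correct predictions / number of test samples
correct = np.sum(y_pred_demo == y_true_demo)
accuracy = correct / len(y_true_demo)

print(correct, accuracy)
```

Five of the six predictions match, so the accuracy is 5/6.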

Below we use KNN to classify handwritten digits 0~9 and show how to get the prediction accuracy in scikit-learn.

from sklearn import datasets

# Load handwritten digits dataset of 0~9
digits= datasets.load_digits()

# View data set information
print(digits['DESCR'])

Get the data:

x, y = digits['data'], digits['target']
x.shape, y.shape

Pick an arbitrary sample and see what it is:

# Take out a data example from it
demo_digit = x[666]
print(y[666])  # 0

import matplotlib
plt.imshow(demo_digit.reshape(8, 8), cmap=matplotlib.cm.binary)

The image shows that the sample at index 666 is the digit 0.

Split the dataset:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

Create the KNN classifier, fit it, and predict:

from sklearn.neighbors import KNeighborsClassifier

KNN_classifier = KNeighborsClassifier(n_neighbors=6)
KNN_classifier.fit(x_train, y_train)
y_predict = KNN_classifier.predict(x_test)

Compare the predictions with the true values to get the accuracy:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

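4.4 Training the model

In "3.3.6 Using KNN in scikit-learn" we passed n_neighbors=6 when creating the KNeighborsClassifier() object. This parameter means that the 6 nearest points are used to decide the class of an unknown point.

Parameters like this, which must be set before the learning algorithm runs, are called hyperparameters. The complementary concept is model parameters, which are learned by the algorithm itself.

KNN has no model parameters; k in KNN is a typical hyperparameter.

Good hyperparameters are usually found in one of three ways:

Domain knowledge: narrow down the range using knowledge of the field or of the mathematics;
Empirical values: reuse hyperparameters that worked best in past experience;
Experimental search: brute force, just try them one by one in a for loop~

4.4.1 Experimental search

Experimental search means trying hyperparameter values one by one in a for loop, recording the score of each, and choosing the value with the highest score for the model.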

# Find the best k for the KNN algorithm
best_score, best_k = 0, -1  # best score and best k found so far

for k in range(1, 20):  # try k from 1 to 19
    KNN_classifier = KNeighborsClassifier(n_neighbors=k)
    KNN_classifier.fit(x_train, y_train)
    y_predict = KNN_classifier.predict(x_test)
    score = accuracy_score(y_test, y_predict)

    if score > best_score:
        best_score = score
        best_k = k

print(f'The best k is {best_k}, the best score is {best_score}.')

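Printed output (from my run; the split above is random, so your numbers may differ): The best k is 3, the best score is 0.9866666666666667.

In practice, however, no ML algorithm has just one hyperparameter. Even KNN as described above has quite a few. Let's first take a quick look at the other hyperparameters of KNN in scikit-learn.

4.4.2 Other hyperparameters of KNN

The main hyperparameters of sklearn.neighbors.KNeighborsClassifier are:

n_jobs: the number of CPU cores used to run the neighbour search in parallel, which can speed things up; -1 means use all cores;
weights: one of {uniform, distance}; see the figure (a green query point whose three nearest neighbours are one red point at distance 1 and two blue points at distances 3 and 4) and the explanation below;

weights=uniform: blue wins, because there are 2 blue points, a share of 2/3, against 1 red point, a share of 1/3;
weights=distance: red wins, because the red point is at distance 1 from the green point, so its inverse distance is 1/1, while the inverse distances of the two blue points sum to 1/3+1/4=7/12.

metric: the distance formula. By default KNN uses the Euclidean distance we used above (the Minkowski metric with p=2), but it can be changed; see the official documentation for details;
p: the power parameter of the Minkowski distance. From the definition, p=1 gives the Manhattan distance and p=2 (the default) gives the Euclidean distance.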

$$ (\sum_{i=1}^{n}|x_i-y_i|^p)^{\frac{1}{p}} $$
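
The weights and p options can be illustrated numerically. The sketch below is not scikit-learn's implementation: weighted_vote and minkowski are hypothetical helpers, the neighbour set (one red point at distance 1, two blue points at distances 3 and 4) is the one from the weights example, and the sketch assumes no neighbour sits at distance 0.

```python
import numpy as np
from collections import defaultdict


def weighted_vote(labels, distances, weights='uniform'):
    """Vote among neighbours; 'distance' weights each vote by 1/distance."""
    totals = defaultdict(float)
    for label, d in zip(labels, distances):
        # Assumes d > 0; a zero distance would need special handling
        totals[label] += 1.0 if weights == 'uniform' else 1.0 / d
    return max(totals, key=totals.get)


def minkowski(a, b, p=2):
    """Minkowski distance between two vectors for a given p."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


# One red neighbour at distance 1, two blue neighbours at distances 3 and 4
labels = ['red', 'blue', 'blue']
distances = [1.0, 3.0, 4.0]
print(weighted_vote(labels, distances, 'uniform'))   # 2 votes vs 1 vote
print(weighted_vote(labels, distances, 'distance'))  # 1/1 vs 1/3 + 1/4

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # Manhattan: |3| + |4|
print(minkowski(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2)
```

Uniform voting picks blue and distance weighting picks red, while the same pair of points is 7 apart under p=1 and 5 apart under p=2.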

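With this many hyperparameters, hand-written for loops are clearly not a sensible approach. As you might expect, scikit-learn already provides a class for this that you can use directly. Let's look at grid search.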

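4.4.3 Grid search

The brute-force for loop above for finding the best hyperparameter is easy to understand. With two hyperparameters we just add another level of for loop. This nested-loop search over all combinations is called grid search.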
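
To see what "all combinations" means, here is a small sketch that only enumerates a grid, using itertools.product in place of hand-written nested loops. The grid is deliberately tiny and made up; in a real search each combination would be used to fit and score a model, which is left out here.

```python
from itertools import product

# Hypothetical, deliberately tiny grid of two hyperparameters
grid = {
    'weights': ['uniform', 'distance'],
    'n_neighbors': [1, 2, 3],
}

# Every combination of values: equivalent to one nested for loop per key
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

for params in combos:
    print(params)

print(len(combos))  # 2 weights x 3 values of n_neighbors
```

The number of combinations is the product of the list lengths, which is why grids grow quickly as hyperparameters are added.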

The GridSearchCV class in sklearn.model_selection implements grid search, and we can call it directly.

# Grid search in sklearn
from sklearn.model_selection import GridSearchCV

para_grid = [  # hyperparameter values to search over
    {
        'weights': ['uniform'],
        'n_neighbors': list(range(1, 20))
    },
    {
        'weights': ['distance'],
        'n_neighbors': list(range(1, 11)),
        'p': list(range(1, 6))
    }
]

KNN_classifier = KNeighborsClassifier()
grid_searcher = GridSearchCV(KNN_classifier, para_grid, n_jobs=-1, verbose=2)

grid_searcher.fit(x_train, y_train)

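Print the best hyperparameters and the best score (best_score_ is the mean cross-validation score on the training data):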

grid_searcher.best_params_, grid_searcher.best_score_

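Take the model with the best hyperparameters (by default GridSearchCV has already refit it on the training set) and predict: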

KNN_classifier = grid_searcher.best_estimator_
KNN_classifier.predict(x_test)

Check the classification accuracy on the test set:

KNN_classifier.score(x_test, y_test)

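5. 💨 Wrapping up

I spent a day using KNN to get a rough picture of the Machine Learning workflow and learned quite a lot. There may be some mistakes in here; I will fix them as I find them~

Keep going, 😆Yacan🥰~

----- END -----

Blog: 亚灿网志 (Yacan's Blog)
Link to this post: https://blog.manyacan.com/archives/2034/
License: this post is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.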
