Logistic Regression
Sigmoid Function

The sigmoid function is given by:

$$ f(x) = \frac{1}{1 + e^{-x}} $$

Its graph looks like this:

(figure: the S-shaped sigmoid curve, passing through (0, 0.5))

Since $e^{-x} \in (0, +\infty)$, we have $f(x) \in (0, 1)$. This lets us connect the output to a probability: $P \in [0, 1]$, where 0 means the event will not occur and 1 means it is certain to occur.

The sigmoid function has another important property, namely its derivative. Differentiating:

$$ f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = f(x)\,\bigl(1 - f(x)\bigr) $$
Principles of Logistic Regression
- From this we can see that to evaluate the sigmoid's derivative at a point, we do not need to differentiate and then compute; the derivative follows directly from the function value at that point.
- Because of these two properties, the (0, 1) range and the derivative being computable from the function value, the sigmoid is well suited to logistic regression. A quick numerical check of the derivative identity is sketched below.
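To see this property in action, here is a minimal sketch (my addition, not from the original post) that compares $f'(x) = f(x)(1 - f(x))$ with a central finite-difference approximation:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
analytic = sigmoid(x) * (1 - sigmoid(x))                  # derivative from the function value alone
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6  # central finite difference
print(np.max(np.abs(analytic - numeric)))                 # tiny (around 1e-10): the identity holds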
- Building the model
- Logistic regression performs binary classification: every sample belongs to one of two classes. We use 0 for the first class and 1 for the second, so the label y can only take the values 0 or 1. The model is $z = \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_n x_n + b$, where $z$ is the logit. We can simplify the model to vector form, $z = \omega^T x + b$, and then feed it into the sigmoid: $\hat{y} = f(z) = \dfrac{1}{1 + e^{-(\omega^T x + b)}}$.
- From the sigmoid we know that $f(x) > 0.5$ when $x > 0$ and $f(x) < 0.5$ when $x < 0$, so $f(x) = 0.5$ serves as the decision boundary. We assign samples with $\hat{y} < 0.5$ to the first class (0) and samples with $\hat{y} > 0.5$ to the second class (1). The parameters of the model $z = \omega^T x + b$ are $\omega$ and $b$; we need to find the best $\omega^*$ and $b^*$ so that the model performs as well as possible, and for that we need a loss function (a tiny worked example of the decision rule follows below).
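As a tiny worked example of the decision rule, with made-up weights and a made-up sample (none of these numbers come from the original post):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.array([1.0, -2.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
x = np.array([1.0, 1.0])    # hypothetical sample

z = w.dot(x) + b            # 1 - 2 + 0.5 = -0.5 < 0
y_hat = sigmoid(z)          # about 0.378 < 0.5
print(y_hat, 1 if y_hat > 0.5 else 0)   # predicted class: 0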
- Building the loss function

We know that when $y = 1$ we want $\hat{y}$ as close to 1 as possible, and when $y = 0$ we want $\hat{y}$ as close to 0 as possible. This gives the expression:

$$ \hat{y}^{\,y} (1 - \hat{y})^{1 - y} $$

When $y = 1$ this equals $\hat{y}$, and when $y = 0$ it equals $1 - \hat{y}$, so we want the expression to be as large as possible. Since we want a loss function that should be as small as possible, we attach a minus sign, bring in all of the samples, and, to make the computation easier, take the logarithm. This yields the loss function:

$$ L = -\sum_{i=1}^{m} \Bigl[ y^{(i)} \log \hat{y}^{(i)} + \bigl(1 - y^{(i)}\bigr) \log \bigl(1 - \hat{y}^{(i)}\bigr) \Bigr] $$
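A minimal sketch of evaluating this loss on hypothetical labels and predictions (the numbers are made up for illustration); note how a confident wrong prediction dominates the total:

import numpy as np

y = np.array([1.0, 0.0, 1.0])        # hypothetical labels
y_hat = np.array([0.9, 0.2, 0.3])    # hypothetical predicted probabilities
# per-sample cross-entropy from the formula above
per_sample = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(per_sample)        # [0.105, 0.223, 1.204]: the miss on the third sample costs the most
print(per_sample.sum())  # total loss L, about 1.53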
- Finding the minimum of the loss function

We already have the loss function and $\hat{y}$:

$$ L = -\sum_{i=1}^{m} \Bigl[ y^{(i)} \log \hat{y}^{(i)} + \bigl(1 - y^{(i)}\bigr) \log \bigl(1 - \hat{y}^{(i)}\bigr) \Bigr], \qquad \hat{y} = \frac{1}{1 + e^{-(\omega^T x + b)}} $$

- Taking the partial derivative with respect to $\omega_1$ for a single sample, using the chain rule together with $f'(z) = f(z)(1 - f(z))$:

$$ \frac{\partial L}{\partial \omega_1} = -\left( \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right) \hat{y} (1 - \hat{y})\, x_1 = (\hat{y} - y)\, x_1 $$

- Adding the $\sum$ back in and differentiating term by term:

$$ \frac{\partial L}{\partial \omega_1} = \sum_{i=1}^{m} \bigl( \hat{y}^{(i)} - y^{(i)} \bigr)\, x_1^{(i)} $$

- Updating the parameters with gradient descent, where $\eta$ is the learning rate (a numerical check of this gradient is sketched below):

$$ \omega_1 := \omega_1 - \eta \frac{\partial L}{\partial \omega_1}, \qquad b := b - \eta \frac{\partial L}{\partial b} $$
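Before trusting the update rule, the derived gradient can be verified numerically. This is a minimal sketch with toy data (everything here is my illustration, not part of the original derivation):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(w, b, X, y):
    y_hat = sigmoid(X.dot(w) + b)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # toy inputs, purely illustrative
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])      # toy labels
w = rng.normal(size=3)
b = 0.1

y_hat = sigmoid(X.dot(w) + b)
analytic = X.T.dot(y_hat - y)                # the derived gradient: sum of (y_hat - y) * x over samples

h = 1e-6
numeric = np.zeros(3)
for k in range(3):
    e = np.zeros(3)
    e[k] = h
    numeric[k] = (loss(w + e, b, X, y) - loss(w - e, b, X, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))    # tiny, so the derivation checks out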
Logistic Regression in Code

- Dataset: Titanic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline
dataset = pd.read_csv(r'D:\Demo\MachineLearning\LogisticRegression\dataset.csv')
dataset = dataset[['Pclass', 'Sex', 'Fare', 'Survived']]
n_dataset = dataset.shape[0]
# feature engineering
dataset['Fare'] = (dataset['Fare'] - dataset['Fare'].mean()) / dataset['Fare'].std()
dataset['Sex'] = dataset['Sex'].astype('category').cat.codes
train_data = dataset.iloc[0:600]
test_data = dataset.iloc[600:]
# print(dataset.head())
X_train = train_data.drop(columns='Survived').astype('float32')
y_train = train_data['Survived'].astype('float32')
n_train = train_data.shape[0]
X_test = test_data.drop(columns='Survived').astype('float32')
y_test = test_data['Survived'].astype('float32')
n_test = test_data.shape[0]
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.arange(-10, 10, 0.1)
y = sigmoid(x)
fig, ax = plt.subplots()
ax.scatter(0, sigmoid(0))
ax.plot(x, y)
plt.show()
# Build Logistic Regression Model
w1 = -0.5
w2 = -2
w3 = 0.3
b = 2
N = 1000
lr = 0.0001
for j in range(N):
    det_w1 = 0
    det_w2 = 0
    det_w3 = 0
    det_b = 0
    # accumulate the gradient over all training samples
    for i in range(n_train):
        x1 = X_train.iloc[i, 0]
        x2 = X_train.iloc[i, 1]
        x3 = X_train.iloc[i, 2]
        z = w1 * x1 + w2 * x2 + w3 * x3 + b
        y_hat = sigmoid(z)
        y = y_train.iloc[i]
        det_w1 += -(y - y_hat) * x1
        det_w2 += -(y - y_hat) * x2
        det_w3 += -(y - y_hat) * x3
        det_b += -(y - y_hat)
    # gradient-descent update after each full pass
    w1 = w1 - lr * det_w1
    w2 = w2 - lr * det_w2
    w3 = w3 - lr * det_w3
    b = b - lr * det_b
def get_accuracy(X, y, n):
    predicted_result = []
    total_loss = 0
    for i in range(n):
        x1 = X.iloc[i, 0]
        x2 = X.iloc[i, 1]
        x3 = X.iloc[i, 2]
        z = w1 * x1 + w2 * x2 + w3 * x3 + b
        y_hat = sigmoid(z)
        # threshold the predicted probability at 0.5
        if y_hat < 0.5:
            predicted_result.append(0)
        else:
            predicted_result.append(1)
    # (y - prediction)^2 is 1 exactly when a sample is misclassified
    for i in range(n):
        loss = (y.iloc[i] - predicted_result[i]) ** 2
        total_loss += loss
    accuracy = (n - total_loss) / n
    return accuracy
print(get_accuracy(X_train, y_train, n_train))
print(get_accuracy(X_test, y_test, n_test))
- The code above takes a long time to run. If we instead represent the dataset and the model parameters as matrices, the code becomes far more efficient.
Matrix Formulation of Logistic Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv(r'D:\Demo\MachineLearning\LogisticRegression\dataset.csv')
dataset = dataset[['Age', 'Pclass', 'Sex', 'Fare', 'SibSp', 'Parch', 'Survived']]
n_dataset = dataset.shape[0]
# feature engineering
dataset['Fare'] = (dataset['Fare'] - dataset['Fare'].mean()) / dataset['Fare'].std()
dataset['Sex'] = dataset['Sex'].astype('category').cat.codes
dataset['Pclass'] = (dataset['Pclass'] - dataset['Pclass'].mean()) / dataset['Pclass'].std()
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
dataset['Age'] = (dataset['Age'] - dataset['Age'].mean()) / dataset['Age'].std()
dataset['Sex'] = (dataset['Sex'] - dataset['Sex'].mean()) / dataset['Sex'].std()
train_data = dataset.iloc[0:600]
test_data = dataset.iloc[600:]
# print(dataset.head())
X_train = train_data.drop(columns='Survived').astype('float32')
y_train = train_data['Survived'].astype('float32')
n_train = train_data.shape[0]
X_test = test_data.drop(columns='Survived').astype('float32')
y_test = test_data['Survived'].astype('float32')
n_test = test_data.shape[0]
# print(X_train.head())
# print(y_train.head())
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Build Logistic Regression Model
n_features = X_train.shape[1]
w = np.zeros(n_features)
b = 0
N = 1000
lr = 0.0001
for j in range(N):
    # forward pass: predicted probabilities for the whole training set at once
    logits = w.dot(X_train.T) + b
    y_hat = sigmoid(logits)
    # vectorized gradients, summed over all samples
    det_w = -np.dot((y_train - y_hat), X_train)
    det_b = -np.sum(y_train - y_hat)
    # gradient-descent update
    w = w - lr * det_w
    b = b - lr * det_b
def get_accuracy(X, y, w, b):
    n_samples = X.shape[0]
    predict_result = []
    for i in range(n_samples):
        x = X.iloc[i]
        p = sigmoid(x.dot(w) + b)
        # threshold the predicted probability at 0.5
        if p > 0.5:
            predict_result.append(1)
        else:
            predict_result.append(0)
    # (y - prediction)^2 is 1 exactly when a sample is misclassified
    total_loss = 0
    for i in range(n_samples):
        total_loss += (y.iloc[i] - predict_result[i]) ** 2
    accuracy = (n_samples - total_loss) / n_samples
    return accuracy
print(get_accuracy(X_train, y_train, w, b))
print(get_accuracy(X_test, y_test, w, b))
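As a sanity check, assuming scikit-learn is available (this comparison is my addition, not part of the original post), the same train/test split can be scored with sklearn's LogisticRegression and the accuracies compared with ours:

from sklearn.linear_model import LogisticRegression

# reuses X_train, y_train, X_test, y_test from the script above
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))  # training accuracy
print(clf.score(X_test, y_test))    # test accuracy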
The complete code has been uploaded to GitHub; if you download it, a follow and a star would be much appreciated. Thanks!
Link: LogisticRegression