1. Multi-factor data analysis
2. Hypothesis testing:
Set up the null hypothesis H0 and the alternative H1, choose a test statistic,
fix the significance level (commonly 0.05) to determine the rejection region, compute the p-value, and decide.
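A minimal sketch of these steps with scipy (the two samples below are made up purely for illustration):

```python
import scipy.stats as ss

# H0: the two samples have equal means; H1: their means differ.
a = [49, 50, 39, 40, 43]
b = [28, 32, 30, 26, 34]
stat, p = ss.ttest_ind(a, b)   # test statistic and p-value

alpha = 0.05                   # significance level
reject_h0 = p < alpha          # p below alpha -> reject H0
```

Here the sample means are far apart, so p falls below 0.05 and H0 is rejected.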
3. Analysis of variance (F-test):
SST = sum of squared deviations of every value from the grand mean,
SSM = sum of squared deviations of each group mean from the grand mean (weighted by group size),
SSE = sum of squared deviations of every value from its own group mean,
F = (SSM / (m - 1)) / (SSE / (n - m))  (m groups, n values in total)
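A quick check of this decomposition by hand, assuming three small made-up groups (the same values the `f_oneway` example below uses), compared against `scipy.stats.f_oneway`:

```python
import numpy as np
import scipy.stats as ss

groups = [np.array([49, 50, 39, 40, 43]),
          np.array([28, 32, 30, 26, 34]),
          np.array([38, 40, 45, 42, 48])]
all_vals = np.concatenate(groups)
grand_mean = all_vals.mean()
m, n = len(groups), len(all_vals)

sst = ((all_vals - grand_mean) ** 2).sum()                        # total variation
ssm = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group

f = (ssm / (m - 1)) / (sse / (n - m))
f_scipy, p = ss.f_oneway(*groups)
```

Note that SST = SSM + SSE, and the hand-computed F matches scipy's.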
5. Pearson correlation coefficient:
Measures the linear correlation between two data series.
Subtract the mean from each series, multiply, take the expectation (the covariance), then divide by the two standard deviations;
equivalently, the mean of the products of the paired Z-scores.
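Both formulations give the same number; a sketch with two short made-up series:

```python
import numpy as np

x = np.array([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5])
y = np.array([0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1])

# covariance divided by the product of the standard deviations
r_cov = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# equivalently: mean of the products of paired Z-scores
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r_z = (zx * zy).mean()
```

Both agree with `np.corrcoef(x, y)[0, 1]`.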
6. Spearman correlation coefficient:
Measures the correlation of the rank differences: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the per-pair rank difference.
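The rank-difference formula, checked against `scipy.stats.spearmanr` on two made-up series without ties:

```python
import numpy as np
import scipy.stats as ss

x = np.array([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5])
y = np.array([0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1])

rx = ss.rankdata(x)            # rank of each value within its series
ry = ss.rankdata(y)
d = rx - ry                    # rank differences
n = len(x)
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))  # valid when there are no ties
rho_scipy = ss.spearmanr(x, y)[0]
```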
7. Linear regression:
Formula y = kx + b. Concept: two or more variables with a dependency relationship. Key indicators: the coefficient of determination (R^2) and the residuals.
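The two key indicators can be read off a fitted model; a sketch on synthetic data (slope 3 and intercept 4 are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * x + 4 + np.random.random((10, 1))  # linear trend plus small noise

lr = LinearRegression().fit(x, y)
r2 = lr.score(x, y)              # coefficient of determination R^2
residuals = y - lr.predict(x)    # residuals: observed minus fitted
```

Because the noise is tiny relative to the trend, R^2 comes out close to 1, and the residuals of an OLS fit with an intercept average to zero.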
8. PCA (principal component analysis):
Center the data and compute the covariance matrix of the features,
compute the eigenvalues and eigenvectors of the covariance matrix,
sort them and keep the top K,
project the samples onto those eigenvectors.
Its main purpose is dimensionality reduction.
9. Singular value decomposition (SVD):
A feature matrix A decomposes into an m*m unitary matrix U, an m*n rectangular diagonal matrix of singular values Sigma, and the transpose of an n*n unitary matrix V:
A = U * Sigma * V^T
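The decomposition can be verified with `np.linalg.svd` on a small made-up matrix:

```python
import numpy as np

A = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9]])
U, s, Vt = np.linalg.svd(A)           # U is m*m, Vt is n*n (full_matrices=True)
Sigma = np.zeros(A.shape)             # rectangular m*n singular-value matrix
Sigma[:len(s), :len(s)] = np.diag(s)
A_rec = U @ Sigma @ Vt                # reconstructs A
```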
12. Linear regression workflow (sklearn):
1. Import and instantiate LinearRegression
2. Train with fit
3. Predict with predict
4. Read the coefficients coef_ and the intercept intercept_
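The four steps above as a runnable sketch (the data is synthetic, slope 3 / intercept 4 chosen arbitrarily):

```python
import numpy as np
from sklearn.linear_model import LinearRegression as LR  # 1. import

x = np.arange(10, dtype=float).reshape((10, 1))
y = x * 3 + 4 + np.random.random((10, 1))

lr = LR()
lr.fit(x, y)                      # 2. train
predict_y = lr.predict(x)         # 3. predict
k, b = lr.coef_, lr.intercept_    # 4. coefficient and intercept
```

The fitted slope lands near 3 and the intercept near 4.5 (the uniform noise adds about 0.5 on average).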
13. PCA dimensionality reduction in Python:
1. Import the package
2. Set the target dimension: PCA(n_components = 1)
3. Variance explained: .explained_variance_ratio_
4. Transformed values: .fit_transform(data)
14. Composite analysis:
1. Cross analysis
2. Factor analysis
3. Grouping and drill-down
4. Correlation analysis
Feature engineering
# Multi-factor data analysis
import numpy as np
import scipy.stats as ss
norm_list = ss.norm.rvs(size = 20)
ss_test = ss.normaltest(norm_list)
# print(ss_test)
# Chi-square test on a contingency table
ss_chi = ss.chi2_contingency([[15, 95], [85, 5]])
# print(ss_chi)
ss_t = ss.ttest_ind(ss.norm.rvs(size = 10), ss.norm.rvs(size=20))
# print(ss_t)
ss_tt2 = ss.ttest_ind(ss.norm.rvs(size = 100), ss.norm.rvs(size = 200))
# print(ss_tt2)
ss_one = ss.f_oneway([49,50,39,40,43],[28,32,30,26,34],[38,40,45,42,48])
# print(ss_one)
# QQ plot: quantile scatter; for a normal sample the points fall on the 45-degree line
from statsmodels.graphics.api import qqplot
from matplotlib import pyplot as plt
# qqplot(ss.norm.rvs(size=100)); plt.show()
import pandas as pd
s1 = pd.Series([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5 ])
s2 = pd.Series([0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1 ])
key1 = s1.corr(s2,method = "spearman")
# print(key1)
df = pd.DataFrame(np.array([s1, s2]).T)
df_key = df.corr(method= 'spearman')
# print(df_key)
# Regression example
# x = np.arange(10).astype(float).reshape((10,1))  # np.float was removed in NumPy 1.24
# y = x * 3 +4 +np.random.random((10,1))
# from sklearn.linear_model import LinearRegression as LR
#
# lr = LR()
#
# data = lr.fit(x, y)
#
# predict_y = lr.predict(x)
# print(predict_y)
# print(data.intercept_,data.coef_)
# PCA transform
data = np.array([np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1]),
                 np.array([2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9])]).T
# print(data)
from sklearn.decomposition import PCA
lower_dim = PCA(n_components = 1)
fit_pca = lower_dim.fit(data)
# predict_pca_y = lower_dim.predict(data)
# print(lower_dim.explained_variance_ratio_)
# print(lower_dim.fit_transform(data))
def myPCA(data, n_components=1):
    mean_vals = np.mean(data, axis=0)
    mid = data - mean_vals                       # center the data
    cov_mat = np.cov(mid, rowvar=False)          # feature covariance matrix
    from scipy import linalg
    eig_vals, eig_vects = linalg.eig(cov_mat)
    eig_val_index = np.argsort(eig_vals)
    eig_val_index = eig_val_index[:-(n_components + 1):-1]  # top n_components eigenvalues
    eig_vects = eig_vects[:, eig_val_index]
    low_dim_mat = np.dot(mid, eig_vects)         # project the samples
    return low_dim_mat, eig_vals
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context(font_scale=3)
#######################################################
# 1. Use independent t-tests to compare the attrition rate (left) across departments
df = pd.read_csv("./data/HR.csv")
#
# arange = df["salary"].value_counts()
# # print(arange)
# # Independent t-test approach
# dp_indices = df.groupby(by = "department").indices
# # print(dp_indices)  # row indices for each department
# sales_values = df["left"].iloc[dp_indices["sales"]].values
# technical_values = df["left"].iloc[dp_indices["technical"]].values
# # print(ss.ttest_ind(sales_values,technical_values)[1])
# dp_keys = list(dp_indices.keys())
# dp_t_mat=np.zeros([len(dp_keys),len(dp_keys)])
# for i in range(len(dp_keys)):
# for j in range (len(dp_keys)):
# p_values = ss.ttest_ind(df["left"].iloc[dp_indices[dp_keys[i]]].values,\
# df["left"].iloc[dp_indices[dp_keys[j]]].values)[1]
# if p_values<0.05:
# dp_t_mat[i][j] = -1
# else:
# dp_t_mat[i][j] = p_values
# sns.heatmap(dp_t_mat,xticklabels=dp_keys,yticklabels=dp_keys)
# plt.show()
# piv_tb = pd.pivot_table(df,values="left",index=["promotion_last_5years","salary"],\
# columns = ["Work_accident"],aggfunc=np.mean)
# # print(piv_tb)
# sns.heatmap(piv_tb,vmax = 1, vmin = 0, cmap=sns.color_palette("Reds",n_colors = 256))
# #plt.show()
# # Grouping analysis and drill-down
# sns.barplot(x="salary", y="left", hue="department", data=df)
# plt.show()
# s1_s = df["satisfaction_level"]
# sns.barplot(x=list(range(len(s1_s))), y=s1_s.sort_values())
# plt.show()
# Correlation analysis measures how strongly two sets of samples are related
# sns.heatmap(df.corr(),vmin=-1,vmax=1,cmap=sns.color_palette("RdBu",n_colors=128))
# plt.show()










