pandas Basics

Python Data Analysis Library

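pandas is a tool built on NumPy and created for data analysis tasks. It incorporates a number of libraries and standard data models, and provides the tools needed to work efficiently with large structured datasets.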

Series

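A Series can be thought of as a one-dimensional array whose index labels you can set yourself. It resembles a fixed-length ordered dictionary, with an index and values.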

import numpy as np
import pandas as pd

# Create an empty Series
s=pd.Series()
print(s,type(s),s.dtype)

# Create a Series from an ndarray
ary=np.array([70,80,90,95])
s=pd.Series(ary)
print(s,type(s),s.dtype)

# Specify the row index labels when creating the Series
s=pd.Series(ary,index=['张三','李四','王五','赵六'])
print(s)
print(s['张三'])

# Create a Series from a dictionary
dic={'zs':80,'ww':75,'zl':95,'sd':100}
s=pd.Series(dic)
print(s)

# Create a Series from a scalar
s=pd.Series(1/5,index=np.arange(5))
print(s)

Accessing data in a Series:

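# Accessing data in a Series
# Positional access still works after custom labels are set; use .iloc for it
s=pd.Series([80,85,90,95,100],index=['zs','ls','ww','zl','tq'])
print(s.iloc[0])
print(s.iloc[[1,2,3]])
print(s.iloc[1:])    # positional slice: start included, stop excluded
# Access by label
print(s['zs'])
print(s[:'zl'])  # label slice: both endpoints included
# Access with a list of positions or a list of labels
idx=[0,1,2]
print(s.iloc[idx])
idx=['zs','ls','ww']
print(s[idx])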
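
The lists used in the code above select elements by position or by label. A boolean mask is a related technique: a sequence of True/False values, one per element, keeps the elements where the value is True. The following is a minimal sketch that reuses the same sample scores; the threshold of 90 is an arbitrary value chosen for illustration.

```python
import pandas as pd

s = pd.Series([80, 85, 90, 95, 100], index=['zs', 'ls', 'ww', 'zl', 'tq'])

# Comparing a Series with a scalar yields a boolean Series (the mask)
mask = s >= 90
# Indexing with the mask keeps only the elements where it is True
high = s[mask]
print(high)
```

The labels travel with the selected values, so the result is still indexed by 'ww', 'zl' and 'tq'.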

Common Series attributes:

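Attribute	Description
series.values	Returns the data as an ndarray
series.index	Returns the index
series.dtype	Returns the data type
series.size	Returns the number of elements
series.ndim	Returns the number of dimensions
series.shape	Returns the shape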

Date handling in pandas

Series of datetime type

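# Date handling in pandas
# Build a Series of datetime type
dates = pd.Series(['2011', '2011-02', '2011-03-01', '2011/04/01',
                   '2011/05/01 01:01:01', '01 Jun 2011','2022.03.03'])
# The strings use different formats; pandas 2.0+ needs format='mixed' to parse each one individually
dates=pd.to_datetime(dates,format='mixed')
print(dates.dt.month)
print(dates.dt.weekday)

# Date arithmetic
delta=dates-pd.to_datetime('2011-1-1')
print(delta)
# Convert the timedeltas to a number of days
print(delta.dt.days)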

Series.dt provides many date-related operations, listed below:

Series.dt.year					The year of the datetime.
Series.dt.month					The month as January=1, December=12.
Series.dt.day					The days of the datetime.
Series.dt.hour					The hours of the datetime.
Series.dt.minute				The minutes of the datetime.
Series.dt.second				The seconds of the datetime.
Series.dt.microsecond			The microseconds of the datetime.
Series.dt.isocalendar().week	The ISO week ordinal of the year (replaces Series.dt.week and Series.dt.weekofyear, which were removed in pandas 2.0).
Series.dt.dayofweek				The day of the week with Monday=0, Sunday=6.
Series.dt.weekday				The day of the week with Monday=0, Sunday=6.
Series.dt.dayofyear				The ordinal day of the year.
Series.dt.quarter				The quarter of the date.
Series.dt.is_month_start		Indicates whether the date is the first day of the month.
Series.dt.is_month_end			Indicates whether the date is the last day of the month.
Series.dt.is_quarter_start		Indicator for whether the date is the first day of a quarter.
Series.dt.is_quarter_end		Indicator for whether the date is the last day of a quarter.
Series.dt.is_year_start			Indicate whether the date is the first day of a year.
Series.dt.is_year_end			Indicate whether the date is the last day of the year.
Series.dt.is_leap_year			Boolean indicator if the date belongs to a leap year.
Series.dt.days_in_month			The number of days in the month.
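
To make the table concrete, here is a small sketch that applies a few of these accessors to three hand-picked dates. The dates and the column names of the summary frame are illustrative choices, not part of the original example.

```python
import pandas as pd

# Dates written in one consistent format, so parsing is unambiguous
dates = pd.to_datetime(pd.Series(['2011-01-01', '2012-02-29', '2022-03-31']))

summary = pd.DataFrame({
    'year': dates.dt.year,
    'quarter': dates.dt.quarter,
    'weekday': dates.dt.dayofweek,       # Monday=0, Sunday=6
    'leap_year': dates.dt.is_leap_year,
    'month_end': dates.dt.is_month_end,
})
print(summary)
```

Each accessor returns a Series aligned with the original dates, which is why they can be assembled into one DataFrame directly.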

DatetimeIndex

By specifying the number of periods and the frequency, the date_range() function creates a sequence of dates. By default the frequency is one day.

import pandas as pd

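# The default frequency is D (day): 7 consecutive days starting from March 1
datas=pd.date_range('2022/3/1',periods=7)
print(datas,type(datas))

# Month-end frequency ('M'; pandas 2.2+ prefers the alias 'ME')
datas=pd.date_range('2022/3/1',periods=7,freq='M')
print(datas)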

bdate_range() produces a range of business days. Unlike date_range(), it excludes Saturdays and Sundays.

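# Generate a business-day series (Monday to Friday)
datas=pd.bdate_range('2022/3/1',periods=7)
print(datas)
datas=pd.date_range('2022/3/1',periods=7,freq='B')
print(datas)

# Build a date series over a given interval
# (pd.datetime was removed in pandas 2.0; pd.Timestamp works instead)
start=pd.Timestamp(2017,11,1)
end=pd.Timestamp(2017,11,5)
datas=pd.date_range(start,end)
print(datas)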

DataFrame

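A DataFrame is a table-like data type that can be thought of as a two-dimensional array. It is indexed along two dimensions (rows and columns), and both indexes can be changed. A DataFrame has the following characteristics:

Columns can have different types
Its size is mutable
Both rows and columns support custom indexes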

import pandas as pd

df=pd.DataFrame()
print(df,type(df))

# Create a DataFrame from a list
data=[1,2,3,4,5]
df=pd.DataFrame(data)
print(df)

# Create a DataFrame from a two-dimensional list
data=[[80,81],[70,71],[90,91],[60,61]]
df=pd.DataFrame(data)
print(df)

# Set the column labels and the row index labels
df=pd.DataFrame(data,columns=['语文','数学'],index=['zs','ls','ww','zl'])
print(df)

# Create a DataFrame from a list of dicts (missing keys become NaN)
data=[{'a':1,'b':2},{'a':5,'b':8,'c':10}]
df=pd.DataFrame(data)
print(df)

# Create a DataFrame from a dictionary
data={'name':['Tom','Jack','xrj','xyy'],'age':[28,32,23,24]}
df=pd.DataFrame(data)
print(df)
df=pd.DataFrame(data,index=['a','b','c','d'])
print(df)

data={
    'one':pd.Series([1,2,3],index=['a','b','d']),
    'two':pd.Series([1,2,3,4],index=['d','e','f','a'])
}
df=pd.DataFrame(data)
print(df)

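Core data structure operations

Column access

A single column of a DataFrame is a Series. By definition a DataFrame is a labeled two-dimensional array, in which each label serves as the name of a column.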

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])
print(df[['one', 'two']])

Adding a column

Adding a column to a DataFrame is simple: create a new column label and assign data to it.

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)

Deleting a column

To delete a column, use either the del statement or the pop method that pandas provides:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print(df)

# Delete a column: one
del df['one']
print(df)

# Delete a column with the pop method (pop also returns the removed column)
df.pop('two')
print(df)

Row access

To access just a few rows of a DataFrame, use array-style slicing with ":":

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])       # positional slice: rows at positions 2 and 3
#    one  two
# c  3.0    3
# d  NaN    4
print(df['a':'b'])   # label slice: both endpoints included
#    one  two
# a  1.0    1
# b  2.0    2

The loc indexer selects rows by index label. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])
print(df.loc[['a', 'b']])

Unlike loc, iloc only accepts integer positions for the row and column indexes. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])
print(df.iloc[[2, 3]])
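
Both indexers also accept a column selector as a second argument, and they treat slice endpoints differently. The sketch below builds an equivalent frame from plain lists (an assumption made to keep it short) and selects the same cells both ways.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, None],
                   'two': [1, 2, 3, 4]},
                  index=['a', 'b', 'c', 'd'])

# loc: label-based; a label slice includes both endpoints
by_label = df.loc['a':'c', 'two']
# iloc: position-based; the stop position is excluded
by_position = df.iloc[0:3, 1]

print(by_label)
print(by_position)
print(by_label.equals(by_position))
```

The label slice 'a':'c' and the positional slice 0:3 cover the same three rows, which is easy to get wrong when switching between the two indexers.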

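Adding rows

import numpy as np
import pandas as pd

data=np.array([['zs',18],['ls',20]])
df=pd.DataFrame(data,columns=['name','age'])
print(df)
data=np.array([['ww',18],['zl',20]])
df2=pd.DataFrame(data,columns=['nick','score'])
print(df2)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead.
# The column names differ, so columns missing from either frame are filled with NaN
df=pd.concat([df,df2])
print(df)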

Deleting rows

Use index labels to delete rows from a DataFrame. If a label is duplicated, every row with that label is removed.

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])
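df = pd.concat([df, df2])   # DataFrame.append was removed in pandas 2.0
# Drop the rows whose index label is 0 (each frame contributes one)
df = df.drop(0)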
print(df)

Modifying data in a DataFrame

To change data in a DataFrame, select the part you want to change and assign new values to it.

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])
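df = pd.concat([df, df2])   # DataFrame.append was removed in pandas 2.0
# Assign through a single .loc call; chained indexing such as df['Name'][0] may modify a copy.
# The label 0 occurs twice after concatenation, so both rows are updated
df.loc[0, 'Name'] = 'Tom'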
print(df)
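
The same select-then-assign idea extends to conditions. As a sketch under assumed data (the four rows are combined into one frame up front, and the cutoff of 10 is arbitrary), a boolean condition picks the rows and a label picks the column:

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4], ['ww', 16], ['zl', 8]],
                  columns=['Name', 'Age'])

# Select rows by condition and a column by label in one .loc call,
# then assign; this modifies df itself rather than a temporary copy
df.loc[df['Age'] < 10, 'Age'] = 10
print(df)
```

Doing the selection and the assignment in one .loc call avoids the ambiguity of chained indexing, where the first step may return a copy.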

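Common DataFrame attributes

No.	Attribute or method	Description
1	`axes`	Returns a list of the row axis labels and the column axis labels.
2	`columns`	Returns the column labels.
3	`index`	Returns the row labels.
4	`dtypes`	Returns the data type of each column (a single column, being a Series, has `dtype`).
5	`empty`	Returns `True` if the DataFrame is empty.
6	`ndim`	Returns the number of dimensions, `2` for a DataFrame.
7	`size`	Returns the number of elements in the underlying data.
8	`values`	Returns the data as an `ndarray`.
9	`head()`	Returns the first `n` rows.
10	`tail()`	Returns the last `n` rows.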

Example code:

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)
print(df.axes)
print(df['Age'].dtype)
print(df.empty)
print(df.ndim)
print(df.size)
print(df.values)
print(df.head(3)) # first three rows of df
print(df.tail(3)) # last three rows of df

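Hierarchical index

Both the row index and the column index of a DataFrame can be hierarchical (a MultiIndex), which lets you record data from several perspectives.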

# Hierarchical index
import pandas as pd
import numpy as np

data=np.floor(np.random.normal(85,3,(6,3)))  # mean 85, standard deviation 3, 6 rows x 3 columns
df=pd.DataFrame(data)
print(df)
index=[('class A','F'),('class A','M'),('class B','F'),('class B','M'),('class C','F'),('class C','M')]
df.index=pd.MultiIndex.from_tuples(index)
columns=[('age','20'),('age','30'),('age','40')]
df.columns=pd.MultiIndex.from_tuples(columns)
print(df)

# Access rows (a tuple addresses several index levels at once)
print(df.loc['class A'])
print(df.loc[('class A','F')])
print(df.loc[['class A','class C']])
print(df.loc[('class A','M'),('age','20')])

# Access columns
print(df.age)
print(df.age['20'])
print(df['age']['20'])
print(df['age','20'])

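pandas Core

Descriptive statistics in pandas

Descriptive statistics for numeric data mainly cover completeness (the number of non-missing values), minimum, mean, median, maximum, quartiles, range, standard deviation, variance, covariance and so on. Some common statistical functions from NumPy can also be applied to a DataFrame.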

np.min	minimum
np.max	maximum
np.mean	mean
np.ptp	range (maximum minus minimum)
np.median	median
np.std	standard deviation
np.var	variance
np.cov	covariance

Example:

import pandas as pd
import numpy as np

# Create the DataFrame
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

df = pd.DataFrame(d)
print(df)
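# Try the descriptive statistics functions
# numeric_only=True skips the string column Name, which pandas 2.0+ no longer drops silently
print(df.sum(numeric_only=True))
print(df.sum(axis=1,numeric_only=True))	 # axis=1 sums across each row
print(df.mean(numeric_only=True))
print(df.mean(axis=1,numeric_only=True))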

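pandas provides the following statistical functions:

1	`count()`	Number of non-null observations
2	`sum()`	Sum of all values
3	`mean()`	Mean of all values
4	`median()`	Median of all values
5	`std()`	Standard deviation of the values
6	`min()`	Minimum of all values
7	`max()`	Maximum of all values
8	`abs()`	Absolute value
9	`prod()`	Product of the elements
10	`cumsum()`	Cumulative sum
11	`cumprod()`	Cumulative product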
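
How these functions treat missing values is worth seeing once. The sketch below uses a small made-up Series with one missing entry; the values are illustrative only.

```python
import pandas as pd

ratings = pd.Series([4.0, None, 3.0, 5.0])

n = ratings.count()          # counts non-null values only
total = ratings.sum()        # NaN is skipped by default
average = ratings.mean()     # the mean is taken over the non-null values
running = ratings.cumsum()   # running total; the NaN position stays NaN

print(n, total, average)
print(running)
```

Because NaN is skipped, the mean here is 12.0 / 3 rather than 12.0 / 4; pass skipna=False when a missing value should propagate to the result instead.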

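pandas also provides a describe method, which reports the non-null count, mean, standard deviation and other summary statistics for all numeric columns of a DataFrame in one call.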

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
print(df.describe(include=['object']))	# summary of the string (object) columns
print(df.describe(include=['number']))	# summary of the numeric columns

Sorting in pandas

pandas offers two kinds of sorting: by label and by actual value.

import pandas as pd
import numpy as np

unsorted_df=pd.DataFrame(np.random.randn(10,2),
                         index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print(unsorted_df)

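Sorting by row/column labels

Use the sort_index() method, passing the axis argument and the sort order, to sort a DataFrame by its labels. By default, the row labels are sorted in ascending order.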

import pandas as pd
import numpy as np

unsorted_df=pd.DataFrame(np.random.randn(10,2),
                         index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print(unsorted_df)
# Sort by row labels
print(unsorted_df.sort_index())
# Sort by column labels
print(unsorted_df.sort_index(axis=1))
# Control the sort order: ascending=False sorts in descending order
print(unsorted_df.sort_index(ascending=False))

Sorting by column values

Just as sort_index() sorts by labels, sort_values() sorts by values. It takes a by parameter naming the DataFrame column or columns to sort on.

import pandas as pd
import numpy as np

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)

# Sort by age
sortByAge=unsorted_df.sort_values(by='Age',ascending=False)  # descending
print(sortByAge)

# Sort by age first, then by rating
sort_age_ratting=unsorted_df.sort_values(by=['Age','Rating'],ascending=[True,False])    # age ascending, rating descending
print(sort_age_ratting)

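Grouping in pandas

In many situations we split the data into groups and apply some function to each subset. The applied function can perform the following operations:

Aggregation - compute summary statistics
Transformation - perform a group-specific operation
Filtration - discard data under some condition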

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)

Splitting data into groups

# Group by the Year field
print (df.groupby('Year'))
# Inspect the resulting groups
print (df.groupby('Year').groups)

Iterating over groups

A groupby object is iterable, so you can loop over it with for:

grouped = df.groupby('Year')
# Iterate over each group
for year,group in grouped:
    print (year)
    print (group)

Selecting a single group

grouped = df.groupby('Year')
print (grouped.get_group(2014))

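Group aggregation

An aggregation function returns one aggregated value per group. Once the groupby object has been created, you can compute the sum, the standard deviation and so on for each group.

# Group aggregation; the function names are passed as strings
print(grouped['Points'].agg('mean'))
print(grouped['Points'].agg(['mean','max','min']))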
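
Of the three operations listed for grouped data, only aggregation is demonstrated so far. The following sketch illustrates the other two, transformation and filtration, on a shortened version of the team data; the six-row frame and the "at least two records" rule are assumptions chosen for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'Royals'],
    'Points': [876, 789, 863, 673, 741, 701],
})

# Transformation: the result is aligned with the original rows;
# here every row receives the mean of its own team
team_mean = df.groupby('Team')['Points'].transform('mean')
diff = df['Points'] - team_mean

# Filtration: keep only the groups for which the function returns True;
# here, teams with at least two records
multi = df.groupby('Team').filter(lambda g: len(g) >= 2)

print(diff)
print(multi)
```

Unlike agg, which returns one row per group, transform returns one value per original row, and filter returns a subset of the original rows.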

Pivot tables and cross-tabulations in pandas

Given the following data:

import pandas as pd

data = pd.DataFrame({
    'student_id':[1,2,3,4,5,6,7,8,9,10],
    'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'class_id':[1,1,1,2,2,2,3,3,3,4],
    'gender':['M','F','M','M','M','F','F','F','F','F'],
    'score':[92,35,13,54,68,93,51,85,84,98],
    'age':[23,24,25,24,23,24,25,23,25,26]})

Pivot table

A pivot table is a data summarization tool commonly found in spreadsheet programs and other data analysis software. It groups the data by one or more keys and summarizes the data within each group.

# Group by class_id and gender and aggregate the numeric columns (the default aggregation is the mean).
# student_name is a string column and cannot be averaged, so the value columns are listed explicitly
print(data.pivot_table(index=['class_id', 'gender'], values=['student_id', 'score', 'age']))

# Group by class_id and gender, aggregating only the score column
print(data.pivot_table(index=['class_id', 'gender'], values=['score']))

# Same grouping on score, with one column per distinct value of age
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], 
                       columns=['age']))

# Same as above, adding row and column subtotals
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], 
                       columns=['age'], margins=True))

# Same as above, adding row and column subtotals
# aggfunc='max' takes the maximum instead of the mean
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], 
                       columns=['age'], margins=True, aggfunc='max'))

Cross-tabulation

A cross-tabulation (crosstab for short) is a special kind of pivot table that computes group frequencies:

# Group by class_id and count the rows for each gender
print(pd.crosstab(data.class_id, data.gender, margins=True))
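
To show what "group frequencies" means, the sketch below computes the same counts twice on a five-row excerpt (an assumed subset of the data): once with crosstab, and once by grouping on both keys and counting the rows.

```python
import pandas as pd

data = pd.DataFrame({
    'class_id': [1, 1, 1, 2, 2],
    'gender': ['M', 'F', 'M', 'M', 'F'],
})

# crosstab counts the rows for each (class_id, gender) combination
by_crosstab = pd.crosstab(data.class_id, data.gender)

# The same counts by hand: group on both keys, count, then move gender to the columns
by_group = data.groupby(['class_id', 'gender']).size().unstack(fill_value=0)

print(by_crosstab)
print(by_group)
```

Both tables have one row per class_id and one column per gender; crosstab is simply the shorter way to write it.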

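Joining tables in pandas

pandas offers full-featured, high-performance in-memory join operations that closely resemble those of relational databases such as SQL.

pandas provides a single merge() function as the entry point for all standard database join operations between DataFrame objects.

Merging two DataFrames: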

import pandas as pd
left = pd.DataFrame({
         'student_id':[1,2,3,4,5,6,7,8,9,10],
         'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
         'class_id':[1,1,1,2,2,2,3,3,3,4]})
right = pd.DataFrame(
         {'class_id':[1,2,3,5],
         'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
print (left)
print("========================================")
print (right)
print("========================================")
# Merge the two DataFrames (inner join by default, on the shared column class_id)
rs = pd.merge(left,right)
print(rs)

Merging DataFrames with the "how" parameter:

# Merge the two DataFrames (left join)
print("========================================")
rs = pd.merge(left, right, how='left')
print(rs)
# Right join
rs = pd.merge(left, right, how='right')
print(rs)
print("========================================")
# Outer join
rs = pd.merge(left, right, how='outer')
print(rs)

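The merge methods correspond to their SQL counterparts:

Merge method	SQL equivalent	Description
`left`	`LEFT OUTER JOIN`	Use the keys from the left object
`right`	`RIGHT OUTER JOIN`	Use the keys from the right object
`outer`	`FULL OUTER JOIN`	Use the union of the keys
`inner`	`INNER JOIN`	Use the intersection of the keys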
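
One way to see which keys each join keeps is the indicator option of merge(), which records the origin of every row. The sketch below uses trimmed-down versions of the two tables (three rows each, an assumption made for readability).

```python
import pandas as pd

left = pd.DataFrame({'student_name': ['Alex', 'Alice', 'Betty'],
                     'class_id': [1, 2, 4]})
right = pd.DataFrame({'class_id': [1, 2, 5],
                      'class_name': ['ClassA', 'ClassB', 'ClassE']})

# indicator=True adds a _merge column telling where each row came from:
# 'both', 'left_only' or 'right_only'
rs = pd.merge(left, right, on='class_id', how='outer', indicator=True)
print(rs)
```

Rows marked 'both' are what an inner join would return; a left join adds the 'left_only' rows, a right join adds the 'right_only' rows, and an outer join keeps all of them.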

Try it:

# Merge the two DataFrames (right join)
rs = pd.merge(left,right,on='class_id', how='right')
print(rs)
# Merge the two DataFrames (outer join)
rs = pd.merge(left,right,on='class_id', how='outer')
print(rs)
# Merge the two DataFrames (inner join)
rs = pd.merge(left,right,on='class_id', how='inner')
print(rs)

Visualization in pandas

Basic plotting: plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('2018/12/18',
                                                            periods=10), columns=list('ABCD'))
print(df)
df.plot()
plt.show()

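Besides the default line plot, the plot method supports a handful of other plot styles. They can be selected through the kind keyword argument of plot(). They include:

bar or barh for bar charts
hist for histograms
scatter for scatter plots

Bar chart

df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()
# df.plot.bar(stacked=True)
plt.show()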

Histogram

df = pd.DataFrame()
df['a'] = pd.Series(np.random.normal(0, 1, 1000)-1)
df['b'] = pd.Series(np.random.normal(0, 1, 1000))
df['c'] = pd.Series(np.random.normal(0, 1, 1000)+1)
print(df)
df.plot.hist(bins=20)
plt.show()

Scatter plot

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')
plt.show()

Pie chart

df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)
plt.show()

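Reading and writing data

Reading and writing CSV:

Parameter	Description
filepath_or_buffer	File path. The string may also be a URL; valid URL schemes include http, ftp and file.
sep	Delimiter. Defaults to "," for read_csv and to a tab for read_table.
header	Row to use as the column names. Defaults to 'infer', which detects it automatically (normally the first row); with header=None, supply the column names yourself.
names	List of column names to use when header=None.
index_col	Column to use as the row index. A list creates a hierarchical index.
dtype	Data type of the columns, given as a dict mapping column name to type.
usecols	Subset of columns to read, given as a list of column positions or names.
skiprows	Rows to skip: either the first n rows or a list of row indices.
engine	Parser engine, 'c' or 'python'. Defaults to 'c'.
nrows	int. Number of rows to read from the start of the file.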

pd.read_table(
    filepath_or_buffer, sep='\t', header='infer', names=None, 
    index_col=None, dtype=None, engine=None, nrows=None) 
pd.read_csv(
    filepath_or_buffer, sep=',', header='infer', names=None, 
    index_col=None, dtype=None, engine=None, nrows=None)

pd.read_csv('1.csv',header=None,index_col=0)

DataFrame.to_csv(path_or_buf=None, sep=',', header=True, index=True, index_label=None, mode='w', encoding=None)
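
The signatures above can be tried without touching the file system. The sketch below is a round trip through an in-memory string; the two-row frame is made up, and io.StringIO stands in for a real file such as the '1.csv' used earlier.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['zs', 'ls'], 'score': [80, 95]})

# With no path given, to_csv returns the CSV text as a string;
# index=False leaves the row index out of the output
text = df.to_csv(index=False)
print(text)

# read_csv accepts any file-like object, so wrap the string in StringIO
restored = pd.read_csv(io.StringIO(text), dtype={'score': 'int64'})
print(restored.equals(df))
```

Writing with index=False matters for the round trip: with the default index=True, the index would come back as an extra unnamed column unless index_col=0 is passed when reading.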

Reading and writing Excel:

# io: the file path.
# sheet_name: which sheet of the workbook to read. Defaults to 0 (the first sheet).
# header: int or sequence. The row(s) to use as the column names. Defaults to 0.
# names: list of column names to use when header=None.
# index_col: column to use as the row index. A list creates a hierarchical index.
# dtype: dict. Data types of the columns.
pandas.read_excel(io, sheet_name=0, header=0, index_col=None, names=None, dtype=None)

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', header=True, index=True, index_label=None)

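Reading and writing JSON:

# Read a JSON file into a DataFrame
pd.read_json('../ratings.json')

# Convert a DataFrame to JSON (df is an existing DataFrame)
df.to_json(orient='records')

# Parameter		Description
# path_or_buf	File path; if None, the JSON string is returned
# orient		Output format: 'split', 'records', 'index', 'columns', 'values' or 'table'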
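
As a runnable illustration of the two calls, the sketch below converts a small made-up frame to a JSON string and reads it back. io.StringIO is used in place of a file path such as '../ratings.json', which is an assumption made so that the example needs no file on disk.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['zs', 'ls'], 'score': [80, 95]})

# With no path given, to_json returns the JSON text as a string;
# orient='records' produces a list with one object per row
text = df.to_json(orient='records')
print(text)

# A literal JSON string is wrapped in StringIO before being passed to read_json
restored = pd.read_json(io.StringIO(text), orient='records')
print(restored)
```

The 'records' layout drops the row index, so it suits data whose index carries no information; 'index' or 'split' keep it.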