Data Mining and Analysis with Pandas
Introduction to Pandas
- A library created by Wes McKinney in 2008
- An open-source Python library dedicated to data mining and analysis
- Built on NumPy, leveraging NumPy's high computational performance
- Integrates with Matplotlib for convenient plotting
- Provides its own distinctive data structures
Why Use Pandas
NumPy can already process data for us, and combined with Matplotlib it covers part of the data-presentation story, so what do we gain from learning pandas?
Improved readability
Recall the student-score table we built in NumPy:
array([[92, 55, 78, 50, 50],
[71, 76, 50, 48, 96],
[45, 84, 78, 51, 68],
[81, 91, 56, 54, 76],
[86, 66, 77, 67, 95],
[46, 86, 56, 61, 99],
[46, 95, 44, 46, 56],
[80, 50, 45, 65, 57],
[41, 93, 90, 41, 97],
[65, 83, 57, 57, 40]])
If the data is displayed like this instead, it becomes much easier to read:
水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
同学0 92 55 78 50 50
同学1 71 76 50 48 96
同学2 45 84 78 51 68
同学3 81 91 56 54 76
同学4 86 66 77 67 95
同学5 46 86 56 61 99
同学6 46 95 44 46 56
同学7 80 50 45 65 57
同学8 41 93 90 41 97
同学9 65 83 57 57 40
- Convenient data-manipulation capabilities
- Easy file reading
- Wraps Matplotlib's plotting and NumPy's computation
Code example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: arr = np.array([[92, 55, 78, 50, 50],
...: [71, 76, 50, 48, 96],
...: [45, 84, 78, 51, 68],
...: [81, 91, 56, 54, 76],
...: [86, 66, 77, 67, 95],
...: [46, 86, 56, 61, 99],
...: [46, 95, 44, 46, 56],
...: [80, 50, 45, 65, 57],
...: [41, 93, 90, 41, 97],
...: [65, 83, 57, 57, 40]])
In [4]: index_name = [f"同学{i}" for i in range(10)]
In [5]: column_name = ['水工程施工', '建筑给排水', '给排水管网', '泵与泵站', '流体力学']
In [7]: data_frame = pd.DataFrame(arr, index=index_name, columns=column_name)
In [8]: data_frame
Out[8]:
水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
同学0 92 55 78 50 50
同学1 71 76 50 48 96
同学2 45 84 78 51 68
同学3 81 91 56 54 76
同学4 86 66 77 67 95
同学5 46 86 56 61 99
同学6 46 95 44 46 56
同学7 80 50 45 65 57
同学8 41 93 90 41 97
同学9 65 83 57 57 40
DataFrame Attributes
In [9]: data_frame.shape # dimensions (rows, columns)
Out[9]: (10, 5)
In [10]: data_frame.index # row index labels
Out[10]: Index(['同学0', '同学1', '同学2', '同学3', '同学4', '同学5', '同学6', '同学7', '同学8', '同学9'], dtype='object')
In [11]: data_frame.columns # column labels
Out[11]: Index(['水工程施工', '建筑给排水', '给排水管网', '泵与泵站', '流体力学'], dtype='object')
In [12]: data_frame.T # transpose
Out[12]:
同学0 同学1 同学2 同学3 同学4 同学5 同学6 同学7 同学8 同学9
水工程施工 92 71 45 81 86 46 46 80 41 65
建筑给排水 55 76 84 91 66 86 95 50 93 83
给排水管网 78 50 78 56 77 56 44 45 90 57
泵与泵站 50 48 51 54 67 61 46 65 41 57
流体力学 50 96 68 76 95 99 56 57 97 40
In [17]: data_frame.values # the underlying values as an ndarray
Out[17]:
array([[92, 55, 78, 50, 50],
[71, 76, 50, 48, 96],
[45, 84, 78, 51, 68],
[81, 91, 56, 54, 76],
[86, 66, 77, 67, 95],
[46, 86, 56, 61, 99],
[46, 95, 44, 46, 56],
[80, 50, 45, 65, 57],
[41, 93, 90, 41, 97],
[65, 83, 57, 57, 40]])
In [14]: data_frame.head(2) # first two rows
Out[14]:
水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
同学0 92 55 78 50 50
同学1 71 76 50 48 96
In [15]: data_frame.head() # first five rows (the default)
Out[15]:
水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
同学0 92 55 78 50 50
同学1 71 76 50 48 96
同学2 45 84 78 51 68
同学3 81 91 56 54 76
同学4 86 66 77 67 95
In [16]: data_frame.tail() # last five rows
Out[16]:
水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
同学5 46 86 56 61 99
同学6 46 95 44 46 56
同学7 80 50 45 65 57
同学8 41 93 90 41 97
同学9 65 83 57 57 40
The DataFrame Index
In [1]: import numpy as np, pandas as pd
In [2]: arr = np.array([[92, 55, 78, 50, 50],
...: [71, 76, 50, 48, 96],
...: [45, 84, 78, 51, 68],
...: [81, 91, 56, 54, 76],
...: [86, 66, 77, 67, 95],
...: [46, 86, 56, 61, 99],
...: [46, 95, 44, 46, 56],
...: [80, 50, 45, 65, 57],
...: [41, 93, 90, 41, 97],
...: [65, 83, 57, 57, 40]])
In [3]: index_name = [f"同学{i}" for i in range(10)]
In [4]: column_name = ['水工程施工', '建筑给排水', '给排水管网', '泵与泵站', '流体力学']
In [5]: data_frame = pd.DataFrame(arr, index=index_name, columns=column_name)
In [6]: data_frame.index = [f"同学_{i}" for i in range(10)] # reassign the whole index
In [7]: data_frame.index
Out[7]:
Index(['同学_0', '同学_1', '同学_2', '同学_3', '同学_4', '同学_5', '同学_6', '同学_7', '同学_8',
'同学_9'],
dtype='object')
In [8]: data_frame.index[0]
Out[8]: '同学_0'
In [9]: data_frame.index[0] = '同学_007' # a single index entry cannot be assigned; the whole index must be replaced
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-6b0524cf387a> in <module>
----> 1 data_frame.index[0] = '同学_007'
c:\users\myxc\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
4075
4076 def __setitem__(self, key, value):
-> 4077 raise TypeError("Index does not support mutable operations")
4078
4079 def __getitem__(self, key):
TypeError: Index does not support mutable operations
In [10]: data_frame.reset_index() # reset the index
Out[10]:
index 水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
0 同学_0 92 55 78 50 50
1 同学_1 71 76 50 48 96
2 同学_2 45 84 78 51 68
3 同学_3 81 91 56 54 76
4 同学_4 86 66 77 67 95
5 同学_5 46 86 56 61 99
6 同学_6 46 95 44 46 56
7 同学_7 80 50 45 65 57
8 同学_8 41 93 90 41 97
9 同学_9 65 83 57 57 40
In [12]: data_frame.reset_index(drop=True) # drop the old index and generate a fresh one
Out[12]:
水工程施工 建筑给排水 给排水管网 泵与泵站 流体力学
0 92 55 78 50 50
1 71 76 50 48 96
2 45 84 78 51 68
3 81 91 56 54 76
4 86 66 77 67 95
5 46 86 56 61 99
6 46 95 44 46 56
7 80 50 45 65 57
8 41 93 90 41 97
9 65 83 57 57 40
In [13]: data_frame.set_index(keys=['流体力学']) # use one column as the index
Out[13]:
水工程施工 建筑给排水 给排水管网 泵与泵站
流体力学
50 92 55 78 50
96 71 76 50 48
68 45 84 78 51
76 81 91 56 54
95 86 66 77 67
99 46 86 56 61
56 46 95 44 46
57 80 50 45 65
97 41 93 90 41
40 65 83 57 57
In [15]: data_frame.set_index(keys=['流体力学', '水工程施工']) # use several columns as the index
Out[15]:
建筑给排水 给排水管网 泵与泵站
流体力学 水工程施工
50 92 55 78 50
96 71 76 50 48
68 45 84 78 51
76 81 91 56 54
95 86 66 77 67
99 46 86 56 61
56 46 95 44 46
57 80 50 45 65
97 41 93 90 41
40 65 83 57 57
MultiIndex
A multi-level, or hierarchical, index object.
Index attributes
- names: the name of each level
- levels: the tuple of values at each level
In [16]: a = data_frame.set_index(keys=['流体力学', '水工程施工']) # keep the result so its index can be inspected below
In [17]: a
Out[17]:
建筑给排水 给排水管网 泵与泵站
流体力学 水工程施工
50 92 55 78 50
96 71 76 50 48
68 45 84 78 51
76 81 91 56 54
95 86 66 77 67
99 46 86 56 61
56 46 95 44 46
57 80 50 45 65
97 41 93 90 41
40 65 83 57 57
In [18]: a.index
Out[18]:
MultiIndex([(50, 92),
(96, 71),
(68, 45),
(76, 81),
(95, 86),
(99, 46),
(56, 46),
(57, 80),
(97, 41),
(40, 65)],
names=['流体力学', '水工程施工'])
In [19]: a.index.names
Out[19]: FrozenList(['流体力学', '水工程施工'])
In [20]: a.index.values
Out[20]:
array([(50, 92), (96, 71), (68, 45), (76, 81), (95, 86), (99, 46),
(56, 46), (57, 80), (97, 41), (40, 65)], dtype=object)
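Rows of a MultiIndex frame can be selected through .loc with a partial label or a full tuple; a minimal sketch on a small stand-in frame (the values are a subset of the table above):

```python
import pandas as pd

# Build a small frame and move two columns into a two-level index,
# mirroring the set_index example above.
df = pd.DataFrame({
    "流体力学": [50, 96, 68],
    "水工程施工": [92, 71, 45],
    "泵与泵站": [50, 48, 51],
})
a = df.set_index(["流体力学", "水工程施工"])

print(a.loc[50])        # all rows whose first level equals 50
print(a.loc[(50, 92)])  # the single row keyed by the full tuple
```

A partial label drops the matched level and returns a frame indexed by the remaining level; a full tuple returns a single row as a Series.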
Panel
class pandas.Panel
(data=None, items=None, major_axis=None, minor_axis=None, copy=False, dtype=None)
- A Panel structure for storing 3-dimensional arrays
p = pd.Panel(np.arange(24).reshape(4,3,2),
items=list('ABCD'),
major_axis=pd.date_range('20130101', periods=3),
minor_axis=['first', 'second'])
p
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: A to D
Major_axis axis: 2013-01-01 00:00:00 to 2013-01-03 00:00:00
Minor_axis axis: first to second
- items: axis 0; each item corresponds to one of the DataFrames contained inside.
- major_axis: axis 1; the index (rows) of each of those DataFrames.
- minor_axis: axis 2; the columns of each of those DataFrames.
Inspecting Panel data:
p[:,:,"first"]
p["B",:,:]
Note: Panel has been deprecated since pandas 0.20.0 and was removed in 1.0; the recommended way to represent 3D data is a MultiIndex on a DataFrame
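Since Panel is gone from current pandas, the same 4 x 3 x 2 data can live in a DataFrame whose rows carry a two-level index; a sketch of that recommended replacement (the item/date level names are our own choice):

```python
import numpy as np
import pandas as pd

# The 4 x 3 x 2 Panel above, rebuilt as a DataFrame: the rows carry a
# two-level (item, date) index and the columns are ['first', 'second'].
items = list("ABCD")
dates = pd.date_range("20130101", periods=3)
index = pd.MultiIndex.from_product([items, dates], names=["item", "date"])
df = pd.DataFrame(np.arange(24).reshape(12, 2),
                  index=index, columns=["first", "second"])

# Selecting one item recovers the DataFrame that p["B"] used to return
print(df.loc["B"])
```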
How would we fetch data for one stock at different times from a DataFrame? What does such a structure look like?
The Series Structure
What is a Series? It is a one-dimensional labeled array:
- a Series has only a row index
We transpose the earlier price-change data and then fetch all of the data for '股票_0':
# series
type(data['2017-01-02'])
pandas.core.series.Series
# this step uses the Series' row index to fetch a value
data['2017-01-02']['股票_0']
-0.18753158283513574
Creating a Series
From existing data
- specify the data, use the default index
pd.Series(np.arange(10))
- specify an index
pd.Series([6.7,5.6,3,10,2], index=[1,2,3,4,5])
From a dict
pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})
Getting a Series' attributes and values
- index
- values
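A quick sketch of reading both attributes, reusing the dict-built Series from above:

```python
import pandas as pd

color_count = pd.Series({'red': 100, 'blue': 200, 'green': 500, 'yellow': 1000})

print(color_count.index)   # the row labels: 'red', 'blue', 'green', 'yellow'
print(color_count.values)  # the underlying ndarray of values
```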
Basic Data Operations
To understand these basic operations better, we will read in a real stock dataset. File operations are covered later; for now we just use the API.
In [1]: import pandas as pd
In [2]: data = pd.read_csv("./stock_day.csv") # read the file
In [3]: data = data.drop(["ma5","ma10","ma20","v_ma5","v_ma10","v_ma20"], axis=1) # drop some columns to simplify the data for the operations that follow
In [4]: data.head()
Out[4]:
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-22 22.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58
Indexing
We already covered index selection and slicing in NumPy; pandas supports similar operations and also lets you use column names and row labels directly, or even combine them.
Direct row/column indexing (column first, then row)
Fetch the 'close' value for '2018-02-27':
In [6]: data['close']['2018-02-27']
Out[6]: 24.16
In [7]: data['2018-02-27']['close']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
c:\users\myxc\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
Indexing with loc and iloc
As the error above shows, direct bracket indexing only accepts the column label first; giving the row label first raises a KeyError. loc and iloc are the row-aware alternatives.
Fetch 'open' for the rows from '2018-02-27' to '2018-02-22':
In [8]: data.loc['2018-02-27':'2018-02-22', 'open'] # loc: only label-based row and column indexing
Out[8]:
2018-02-27 23.53
2018-02-26 22.80
2018-02-23 22.88
2018-02-22 22.25
Name: open, dtype: float64
iloc fetches by integer position; get the first two columns of the first 100 days:
In [9]: data.iloc[0:100, 0:2].head()
Out[9]:
open high
2018-02-27 23.53 25.88
2018-02-26 22.80 23.78
2018-02-23 22.88 23.37
2018-02-22 22.25 22.76
2018-02-14 21.49 21.99
Combined indexing with ix
Fetch rows for days 1 through 4 with the four columns ['open', 'close', 'high', 'low']:
This API has been removed!
# use loc and iloc instead
data.loc[data.index[0:4], ['open', 'close', 'high', 'low']]
data.iloc[0:4, data.columns.get_indexer(['open', 'close', 'high', 'low'])]
Assignment
In [14]: data['volume'] = 99999
In [15]: data
Out[15]:
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 99999 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 99999 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 99999 0.54 2.42 1.32
2018-02-22 22.25 22.76 22.28 22.02 99999 0.36 1.64 0.90
2018-02-14 21.49 21.99 21.92 21.48 99999 0.44 2.05 0.58
... ... ... ... ... ... ... ... ...
2015-03-06 13.17 14.48 14.28 13.13 99999 1.12 8.51 6.16
2015-03-05 12.88 13.45 13.16 12.87 99999 0.26 2.02 3.19
2015-03-04 12.80 12.92 12.90 12.61 99999 0.20 1.57 2.30
2015-03-03 12.52 13.06 12.70 12.52 99999 0.18 1.44 4.76
2015-03-02 12.25 12.67 12.52 12.20 99999 0.32 2.62 3.30
[643 rows x 8 columns]
In [16]: data.volume = 88888
In [17]: data.head()
Out[17]:
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 88888 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 88888 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 88888 0.54 2.42 1.32
2018-02-22 22.25 22.76 22.28 22.02 88888 0.36 1.64 0.90
2018-02-14 21.49 21.99 21.92 21.48 88888 0.44 2.05 0.58
Bracket assignment does not work for rows: using a row label this way silently creates a new column!
In [18]: data['2018-02-27'] = 7777
In [19]: data.head()
Out[19]:
open high close low volume price_change p_change turnover 2018-02-27
2018-02-27 23.53 25.88 24.16 23.53 88888 0.63 2.68 2.39 7777
2018-02-26 22.80 23.78 23.53 22.80 88888 0.69 3.02 1.53 7777
2018-02-23 22.88 23.37 22.82 22.71 88888 0.54 2.42 1.32 7777
2018-02-22 22.25 22.76 22.28 22.02 88888 0.36 1.64 0.90 7777
2018-02-14 21.49 21.99 21.92 21.48 88888 0.44 2.05 0.58 7777
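To actually modify a row rather than create a stray column, go through loc; a minimal sketch on a toy frame (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"open": [23.53, 22.80], "close": [24.16, 23.53]},
                  index=["2018-02-27", "2018-02-26"])

# Bracket assignment with a row label would add a new column;
# loc targets the row itself.
df.loc["2018-02-27"] = 0
print(df)
```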
Sorting
Sorting comes in two forms: sorting by the index and sorting by the content.
Using df.sort_values(by=, ascending=)
- sort by a single key or multiple keys, ascending by default
- ascending=False: descending
- ascending=True: ascending
In [20]: data = data.sort_values(by='p_change', ascending=False).head() # sort by price-change percentage; ascending sets the direction
In [21]: data.head()
Out[21]:
open high close low volume price_change p_change turnover 2018-02-27
2015-08-28 15.40 16.46 16.46 15.00 88888 1.50 10.03 4.03 7777
2015-05-21 27.50 28.22 28.22 26.50 88888 2.57 10.02 4.15 7777
2016-12-22 18.50 20.42 20.42 18.45 88888 1.86 10.02 3.77 7777
2015-08-04 16.20 17.35 17.35 15.80 88888 1.58 10.02 3.23 7777
2016-07-07 18.66 18.66 18.66 18.41 88888 1.70 10.02 1.67 7777
In [24]: data = data.sort_values(by=['p_change', 'price_change'], ascending=False) # sort by multiple keys
In [25]: data.head()
Out[25]:
open high close low volume price_change p_change turnover 2018-02-27
2015-08-28 15.40 16.46 16.46 15.00 88888 1.50 10.03 4.03 7777
2015-05-21 27.50 28.22 28.22 26.50 88888 2.57 10.02 4.15 7777
2016-12-22 18.50 20.42 20.42 18.45 88888 1.86 10.02 3.77 7777
2016-07-07 18.66 18.66 18.66 18.41 88888 1.70 10.02 1.67 7777
2015-08-04 16.20 17.35 17.35 15.80 88888 1.58 10.02 3.23 7777
Sort by the index with df.sort_index:
In [26]: data.sort_index()
...:
Out[26]:
open high close low volume price_change p_change turnover 2018-02-27
2015-05-21 27.50 28.22 28.22 26.50 88888 2.57 10.02 4.15 7777
2015-08-04 16.20 17.35 17.35 15.80 88888 1.58 10.02 3.23 7777
2015-08-28 15.40 16.46 16.46 15.00 88888 1.50 10.03 4.03 7777
2016-07-07 18.66 18.66 18.66 18.41 88888 1.70 10.02 1.67 7777
2016-12-22 18.50 20.42 20.42 18.45 88888 1.86 10.02 3.77 7777
Sorting with series.sort_values(ascending=True)
A Series has a single column, so no by parameter is needed
In [28]: demo = pd.Series(data['open'])
In [29]: demo
Out[29]:
2015-08-28 15.40
2015-05-21 27.50
2016-12-22 18.50
2016-07-07 18.66
2015-08-04 16.20
Name: open, dtype: float64
In [30]: demo.sort_values(ascending=True)
Out[30]:
2015-08-28 15.40
2015-08-04 16.20
2016-12-22 18.50
2016-07-07 18.66
2015-05-21 27.50
Name: open, dtype: float64
Sorting with series.sort_index()
Works the same as for a DataFrame
In [31]: demo.sort_index(ascending=True)
Out[31]:
2015-05-21 27.50
2015-08-04 16.20
2015-08-28 15.40
2016-07-07 18.66
2016-12-22 18.50
Name: open, dtype: float64
DataFrame Operations
Arithmetic
- add(other): arithmetically add a specific number
- or use the + operator directly
First method:
In [5]: data.head(2)
Out[5]:
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
In [6]: data['open'].add(1)
Out[6]:
2018-02-27 24.53
2018-02-26 23.80
2018-02-23 23.88
2018-02-22 23.25
2018-02-14 22.49
...
2015-03-06 14.17
2015-03-05 13.88
2015-03-04 13.80
2015-03-03 13.52
2015-03-02 13.25
Name: open, Length: 643, dtype: float64
Second method:
In [11]: data['open'] += 1
In [12]: data
Out[12]:
open high close low volume price_change p_change turnover
2018-02-27 24.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 23.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 23.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-22 23.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90
2018-02-14 22.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58
... ... ... ... ... ... ... ... ...
2015-03-06 14.17 14.48 14.28 13.13 179831.72 1.12 8.51 6.16
2015-03-05 13.88 13.45 13.16 12.87 93180.39 0.26 2.02 3.19
2015-03-04 13.80 12.92 12.90 12.61 67075.44 0.20 1.57 2.30
2015-03-03 13.52 13.06 12.70 12.52 139071.61 0.18 1.44 4.76
2015-03-02 13.25 12.67 12.52 12.20 96291.73 0.32 2.62 3.30
[643 rows x 8 columns]
Subtraction:
In [13]: data['m_price_change'] = data['close'].sub(data['open'])
In [14]: data.head()
Out[14]:
open high close low volume price_change p_change turnover m_price_change
2018-02-27 24.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 -0.37
2018-02-26 23.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53 -0.27
2018-02-23 23.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32 -1.06
2018-02-22 23.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90 -0.97
2018-02-14 22.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58 -0.57
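add and sub have the sibling methods mul and div; as an illustration, the daily high-low spread relative to the open can be sketched like this (the two-row frame is a stand-in for the stock data):

```python
import pandas as pd

df = pd.DataFrame({"open": [23.53, 22.80],
                   "high": [25.88, 23.78],
                   "low":  [23.53, 22.80]})

# (high - low) / open, using the method forms of the operators
spread = df["high"].sub(df["low"]).div(df["open"])
print(spread)
```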
Logical operations
The logical operators <, >, | and &
For example, filter for the dates where p_change > 2
- data['p_change'] > 2 returns a boolean result
In [15]: data['p_change'] > 2
Out[15]:
2018-02-27 True
2018-02-26 True
2018-02-23 True
2018-02-22 False
2018-02-14 True
...
2015-03-06 True
2015-03-05 True
2015-03-04 False
2015-03-03 False
2015-03-02 True
Name: p_change, Length: 643, dtype: bool
In [16]: data[data['p_change'] > 2] # the boolean result can be used as a filter
Out[16]:
open high close low volume price_change p_change turnover m_price_change
2018-02-27 24.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 -0.37
2018-02-26 23.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53 -0.27
2018-02-23 23.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32 -1.06
2018-02-14 22.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58 -0.57
2018-02-12 21.70 21.40 21.19 20.63 32445.39 0.82 4.03 0.81 -0.51
... ... ... ... ... ... ... ... ... ...
2015-03-13 15.13 14.50 14.47 14.08 61342.22 0.36 2.55 2.10 -0.66
2015-03-10 15.20 14.80 14.65 14.01 101213.51 0.34 2.38 3.46 -0.55
2015-03-06 14.17 14.48 14.28 13.13 179831.72 1.12 8.51 6.16 0.11
2015-03-05 13.88 13.45 13.16 12.87 93180.39 0.26 2.02 3.19 -0.72
2015-03-02 13.25 12.67 12.52 12.20 96291.73 0.32 2.62 3.30 -0.73
[183 rows x 9 columns]
- combining multiple conditions: filter for p_change > 2 and open > 15
In [17]: data[(data['p_change'] > 2) & (data['open'] > 15)]
Out[17]:
open high close low volume price_change p_change turnover m_price_change
2018-02-27 24.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 -0.37
2018-02-26 23.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53 -0.27
2018-02-23 23.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32 -1.06
2018-02-14 22.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58 -0.57
2018-02-12 21.70 21.40 21.19 20.63 32445.39 0.82 4.03 0.81 -0.51
... ... ... ... ... ... ... ... ... ...
2015-03-27 15.90 15.86 15.77 14.90 120352.13 0.84 5.63 4.12 -0.13
2015-03-17 15.90 15.44 15.18 14.63 158770.77 0.31 2.08 5.43 -0.72
2015-03-16 15.52 15.05 14.87 14.51 94468.30 0.40 2.76 3.23 -0.65
2015-03-13 15.13 14.50 14.47 14.08 61342.22 0.36 2.55 2.10 -0.66
2015-03-10 15.20 14.80 14.65 14.01 101213.51 0.34 2.38 3.46 -0.55
[174 rows x 9 columns]
Logical functions
query(expr)
- expr: a query string
query makes the filtering above simpler and more readable
In [19]: data.query("p_change > 0.2 & turnover > 0.5")
Out[19]:
open high close low volume price_change p_change turnover m_price_change
2018-02-27 24.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 -0.37
2018-02-26 23.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53 -0.27
2018-02-23 23.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32 -1.06
2018-02-22 23.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90 -0.97
2018-02-14 22.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58 -0.57
... ... ... ... ... ... ... ... ... ...
2015-03-06 14.17 14.48 14.28 13.13 179831.72 1.12 8.51 6.16 0.11
2015-03-05 13.88 13.45 13.16 12.87 93180.39 0.26 2.02 3.19 -0.72
2015-03-04 13.80 12.92 12.90 12.61 67075.44 0.20 1.57 2.30 -0.90
2015-03-03 13.52 13.06 12.70 12.52 139071.61 0.18 1.44 4.76 -0.82
2015-03-02 13.25 12.67 12.52 12.20 96291.73 0.32 2.62 3.30 -0.73
[324 rows x 9 columns]
- isin(values)
For example, test whether 'turnover' is 4.19 or 2.39
In [20]: data[data['turnover'].isin([4.19, 2.39])]
Out[20]:
open high close low volume price_change p_change turnover m_price_change
2018-02-27 24.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 -0.37
2017-07-25 24.07 24.20 23.70 22.64 167489.48 0.67 2.91 4.19 -0.37
2016-09-28 20.88 20.98 20.86 19.71 95580.75 0.98 4.93 2.39 -0.02
2015-04-07 17.54 17.98 17.54 16.50 122471.85 0.88 5.28 4.19 0.00
Statistics
describe()
Summary analysis: produces many statistics at once, including count, mean, std, min, and max
In [21]: data.describe() # count, mean, std, min/max and quartiles
Out[21]:
open high close low volume price_change p_change turnover m_price_change
count 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000
mean 22.272706 21.900513 21.336267 20.771835 99905.519114 0.018802 0.190280 2.936190 -0.936439
std 3.930973 4.077578 3.942806 3.791968 73879.119354 0.898476 4.079698 2.079375 0.800565
min 13.250000 12.670000 12.360000 12.200000 1158.120000 -3.520000 -10.030000 0.040000 -4.300000
25% 20.000000 19.500000 19.045000 18.525000 48533.210000 -0.390000 -1.850000 1.360000 -1.300000
50% 22.440000 21.970000 21.450000 20.980000 83175.930000 0.050000 0.260000 2.500000 -0.920000
75% 24.400000 24.065000 23.415000 22.850000 127580.055000 0.455000 2.305000 3.915000 -0.550000
max 35.990000 36.350000 35.210000 34.010000 501915.410000 3.030000 10.030000 12.560000 2.410000
Statistical functions
These were covered in detail with NumPy; here we demonstrate min, max, mean, median, var, std, and mode.
Function | Description |
---|---|
count | Number of non-NA observations |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute Value |
prod | Product of values |
std | Bessel-corrected sample standard deviation |
var | Unbiased variance |
idxmax | compute the index labels with the maximum |
idxmin | compute the index labels with the minimum |
When computing a single statistic, the axis defaults to the columns (axis=0); to compute across the rows instead, specify axis=1.
- max(), min()
In [22]: data.max(0)
Out[22]:
open 35.99
high 36.35
close 35.21
low 34.01
volume 501915.41
price_change 3.03
p_change 10.03
turnover 12.56
m_price_change 2.41
dtype: float64
In [23]: data.max(1)
Out[23]:
2018-02-27 95578.03
2018-02-26 60985.11
2018-02-23 52914.01
2018-02-22 36105.01
2018-02-14 23331.04
...
2015-03-06 179831.72
2015-03-05 93180.39
2015-03-04 67075.44
2015-03-03 139071.61
2015-03-02 96291.73
Length: 643, dtype: float64
- std(), var()
In [24]: data.var()
Out[24]:
open 1.545255e+01
high 1.662665e+01
close 1.554572e+01
low 1.437902e+01
volume 5.458124e+09
price_change 8.072595e-01
p_change 1.664394e+01
turnover 4.323800e+00
m_price_change 6.409037e-01
dtype: float64
In [25]: data.std()
Out[25]:
open 3.930973
high 4.077578
close 3.942806
low 3.791968
volume 73879.119354
price_change 0.898476
p_change 4.079698
turnover 2.079375
m_price_change 0.800565
dtype: float64
- median(): the median
The median is the middle value when the data is sorted from smallest to largest; if there is no single middle value, it is the average of the two middle values.
- idxmax(), idxmin()
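A short sketch of median, idxmax and idxmin on a toy Series:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], index=list("abcde"))

print(s.median())  # 3.0, the middle value after sorting: 1, 1, 3, 4, 5
print(s.idxmax())  # 'e', the label of the maximum (5)
print(s.idxmin())  # 'b', the label of the first minimum (1)
```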
Cumulative statistics
Function | Purpose |
---|---|
cumsum | sum of the first 1/2/3/…/n values |
cummax | maximum of the first 1/2/3/…/n values |
cummin | minimum of the first 1/2/3/…/n values |
cumprod | product of the first 1/2/3/…/n values |
So how are these cumulative functions used?
They can be applied to both Series and DataFrames
Here we accumulate chronologically, oldest first:
In [28]: stock_rise = data['p_change']
In [29]: stock_rise.cumsum()
Out[29]:
2015-03-02 2.62
2015-03-03 4.06
2015-03-04 5.63
2015-03-05 7.65
2015-03-06 16.16
...
2018-02-14 112.59
2018-02-22 114.23
2018-02-23 116.65
2018-02-26 119.67
2018-02-27 122.35
Name: p_change, Length: 643, dtype: float64
How can this running sum be displayed more clearly?
To use the plot function, matplotlib must be imported.
In [30]: import matplotlib.pyplot as plt
In [31]: stock_rise.cumsum().plot()
Out[31]: <AxesSubplot:>
In [32]: plt.show()
Custom operations
apply(func, axis=0)
- func: a custom function
- axis=0: columns by default; axis=1 applies across rows
- define a function computing max minus min per column
In [35]: data[['open', 'close']].max()
Out[35]:
open 35.99
close 35.21
dtype: float64
In [36]: data[['open', 'close']].min()
Out[36]:
open 13.25
close 12.36
dtype: float64
In [37]: data[['open', 'close']].apply(lambda x: x.max() - x.min(), axis=0)
Out[37]:
open 22.74
close 22.85
dtype: float64
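With axis=1 the function receives one row at a time instead of one column; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"open": [23.53, 22.80], "close": [24.16, 23.53]})

# Per-row difference: the lambda sees each row as a Series
row_spread = df.apply(lambda row: row["close"] - row["open"], axis=1)
print(row_spread)
```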
Plotting with Pandas
pandas.DataFrame.plot
DataFrame.plot(x=None, y=None, kind='line')
- x: label or position, default None
- y: label, position or list of labels/positions, default None
- allows plotting of one column versus another
- kind: str
- 'line': line plot (default)
- 'bar': vertical bar plot
- 'barh': horizontal bar plot
- 'hist': histogram
- 'pie': pie plot
- 'scatter': scatter plot
More parameter details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot
pandas.Series.plot
More parameter details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html?highlight=plot#pandas.Series.plot
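A minimal sketch of both call forms (the data is made up, and the Agg backend is selected only so the sketch runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

ax_line = df.plot(x="x", y="y", kind="line")  # DataFrame.plot: one column vs another
plt.figure()
ax_bar = df["y"].plot(kind="bar")             # Series.plot: same kind options
```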
Reading and Writing Files
Most of our data lives in files, so pandas supports rich IO operations; its API covers many file formats, such as CSV, SQL, XLS, JSON, and HDF5.
Note: the most commonly used are HDF5 and CSV
CSV
read_csv
pandas.read_csv(filepath_or_buffer, sep=',')
- filepath_or_buffer: path to the file
- usecols: list of column names to read
Read the stock data from before:
In [1]: import pandas as pd
In [2]: pd.read_csv("./stock_day.csv")
Out[2]:
open high close low volume price_change p_change ma5 ma10 ma20 v_ma5 v_ma10 v_ma20 turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 22.942 22.142 22.875 53782.64 46738.65 55576.11 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 22.406 21.955 22.942 40827.52 42736.34 56007.50 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 21.938 21.929 23.022 35119.58 41871.97 56372.85 1.32
2018-02-22 22.25 22.76 22.28 22.02 36105.01 0.36 1.64 21.446 21.909 23.137 35397.58 39904.78 60149.60 0.90
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 21.366 21.923 23.253 33590.21 42935.74 61716.11 0.58
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-03-06 13.17 14.48 14.28 13.13 179831.72 1.12 8.51 13.112 13.112 13.112 115090.18 115090.18 115090.18 6.16
2015-03-05 12.88 13.45 13.16 12.87 93180.39 0.26 2.02 12.820 12.820 12.820 98904.79 98904.79 98904.79 3.19
2015-03-04 12.80 12.92 12.90 12.61 67075.44 0.20 1.57 12.707 12.707 12.707 100812.93 100812.93 100812.93 2.30
2015-03-03 12.52 13.06 12.70 12.52 139071.61 0.18 1.44 12.610 12.610 12.610 117681.67 117681.67 117681.67 4.76
2015-03-02 12.25 12.67 12.52 12.20 96291.73 0.32 2.62 12.520 12.520 12.520 96291.73 96291.73 96291.73 3.30
[643 rows x 14 columns]
In [3]: pd.read_csv("./stock_day.csv", usecols=['open', 'close'])
Out[3]:
open close
2018-02-27 23.53 24.16
2018-02-26 22.80 23.53
2018-02-23 22.88 22.82
2018-02-22 22.25 22.28
2018-02-14 21.49 21.92
... ... ...
2015-03-06 13.17 14.28
2015-03-05 12.88 13.16
2015-03-04 12.80 12.90
2015-03-03 12.52 12.70
2015-03-02 12.25 12.52
[643 rows x 2 columns]
to_csv
DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
- path_or_buf: string or file handle, default None
- sep: character, default ','
- columns: sequence, optional
- mode: 'w' overwrite, 'a' append
- index: whether to write the row index
- header: boolean or list of string, default True; whether to write the column labels
Example
- save the 'open' column
# save 10 rows so the result is easy to inspect
data[:10].to_csv("./data/test.csv", columns=['open'])
- read it back and inspect the result
pd.read_csv("./data/test.csv")
Unnamed: 0 open
0 2018-02-27 23.53
1 2018-02-26 22.80
2 2018-02-23 22.88
3 2018-02-22 22.25
4 2018-02-14 21.49
5 2018-02-13 21.40
6 2018-02-12 20.70
7 2018-02-09 21.20
8 2018-02-08 21.79
9 2018-02-07 22.69
Notice that the index has been written into the file as its own column. To avoid this, specify the index parameter, delete the old file, and save again.
# index=False: do not write the row index as a separate column
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)
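On the reading side, the index_col parameter is an alternative: it turns that first unnamed column back into the row index instead of suppressing it at write time. A sketch using an in-memory stand-in for test.csv:

```python
import pandas as pd
from io import StringIO

# A stand-in for the test.csv written above: an unnamed date column, then 'open'
csv_text = ",open\n2018-02-27,23.53\n2018-02-26,22.80\n"

# index_col=0 restores the first column as the row index on read
df = pd.read_csv(StringIO(csv_text), index_col=0)
print(df)
```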
HDF5
read_hdf and to_hdf
Reading and writing HDF5 files requires a key; the value is the DataFrame to store
pandas.read_hdf(path_or_buf, key=None, **kwargs)
Read data from an h5 file
- path_or_buf: path to the file
- key: the key to read
- return: the selected object
- DataFrame.to_hdf(path_or_buf, key, **kwargs)
Example
- read a file
day_eps_ttm = pd.read_hdf("./data/stock_data/day/day_eps_ttm.h5")
If reading fails with an error, install the tables module so HDF5 files can be read
pip install tables
- store a file
day_eps_ttm.to_hdf("./data/test.h5", key="day_eps_ttm")
Reading it back requires the same key
new_eps = pd.read_hdf("./data/test.h5", key="day_eps_ttm")
JSON
JSON is a common data-interchange format, frequently used in front-end/back-end interaction and sometimes chosen for storage, so we need to know how pandas reads and writes it.
read_json
pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)
- converts JSON into a pandas DataFrame (the default)
orient: string, indication of the expected JSON string format
- 'split': dict like {index -> [index], columns -> [columns], data -> [values]}; keeps the index, columns, and data as three separate parts
- 'records': list like [{column -> value}, ... , {column -> value}]; outputs column: value records
- 'index': dict like {index -> {column -> value}}; outputs index: {column: value}
- 'columns': dict like {column -> {index -> value}}; the default format, outputs column: {index: value}
- 'values': just the values array
lines: boolean, default False
- read the file as one JSON object per line
typ: default 'frame'; the object type to convert to, Series or DataFrame
read_json example
- about the data
Here we use a news-headline sarcasm dataset in JSON format. is_sarcastic: 1 if sarcastic, otherwise 0; headline: the headline of the news article; article_link: link to the original news article. It is stored as:
{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
- reading
orient specifies the stored JSON format; lines turns each line into one sample
json_read = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
to_json
DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
- stores a pandas object in JSON format
- path_or_buf=None: file path
- orient: the JSON layout to store, one of {'split', 'records', 'index', 'columns', 'values'}
- lines: store each object on its own line
Example
- store the file
json_read.to_json("./data/test.json", orient='records')
Result
[{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0},{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1},{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/advancing-the-worlds-women_b_6810038.html","headline":"advancing the world's women","is_sarcastic":0},....]
- set the lines parameter to True
json_read.to_json("./data/test.json", orient='records', lines=True)
Result
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0}
{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1}
{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0}...
Extension
Prefer HDF5 for file storage
- HDF5 supports compression on write, via blosc, which is the fastest option and the pandas default
- compression improves disk utilization and saves space
- HDF5 is also cross-platform and can easily be migrated onto Hadoop