Pandas
学习莫烦python,非常感谢~记录自己在学习python过程中的点滴。
Pandas 安装
Anaconda安装
pip安装
Pandas 基本介绍
Series
DataFrame
dtype
:查看数据中的类型index
:查看队列的序号(行)columns
:查看数据的名称(列)values
:查看数据所有值describe
:查看数据的总结transpose
:翻转数据(转置)sort_index
:对数据的 index 进行排序并输出sort_values
:对数据 value 排序输出:
Series
Series
的字符串表现形式为:索引在左边,值在右边。由于我们没有为数据指定索引。于是会自动创建一个0到N-1(N为长度)的整数型索引。
参考代码:
import pandas as pdimport numpy as np s = pd.Series([1,3,6,np.nan,44,1]) print(s)""" 0 1.0 1 3.0 2 6.0 3 NaN 4 44.0 5 1.0 dtype: float64 """
DataFrame
DataFrame
是一个表格型的数据结构,它包含有一组有序的列,每列可以是不同的值类型(数值,字符串,布尔值等)。DataFrame
既有行索引也有列索引, 它可以被看做由Series
组成的大字典。
数据准备
dates = pd.date_range('20160101',periods=6) df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d']) print(df)""" a b c d 2016-01-01 -0.253065 -2.071051 -0.640515 0.613663 2016-01-02 -1.147178 1.532470 0.989255 -0.499761 2016-01-03 1.221656 -2.390171 1.862914 0.778070 2016-01-04 1.473877 -0.046419 0.610046 0.204672 2016-01-05 -1.584752 -0.700592 1.487264 -1.778293 2016-01-06 0.633675 -1.414157 -0.277066 -0.442545 """
挑选b
的元素
print(df['b'])""" 2016-01-01 -2.071051 2016-01-02 1.532470 2016-01-03 -2.390171 2016-01-04 -0.046419 2016-01-05 -0.700592 2016-01-06 -1.414157 Freq: D, Name: b, dtype: float64 """
创建一组没有给定行标签和列标签的数据 df1
,默认的从0
开始index
:
df1 = pd.DataFrame(np.arange(12).reshape((3,4))) print(df1)""" 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 """
另一种方式创建df2
:
df2 = pd.DataFrame({'A' : 1., 'B' : pd.Timestamp('20130102'), 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo'}) print(df2)""" A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo """
查看数据中的类型, 使用 dtype
属性:
print(df2.dtypes)""" df2.dtypes A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object """
查看队列的序号,使用 index
属性:
print(df2.index)# Int64Index([0, 1, 2, 3], dtype='int64')
查看数据的名称,使用 columns
属性:
print(df2.columns)# Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
查看所有df2
的值,使用 values
属性:
print(df2.values)""" array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object) """
查看数据的总结, 可以用 describe()
:
df2.describe() """ A C Dcount 4.0 4.0 4.0mean 1.0 1.0 3.0std 0.0 0.0 0.0min 1.0 1.0 3.025% 1.0 1.0 3.050% 1.0 1.0 3.075% 1.0 1.0 3.0max 1.0 1.0 3.0"""
翻转数据, transpose
:
print(df2.T)""" 0 1 2 \ A 1 1 1 B 2013-01-02 00:00:00 2013-01-02 00:00:00 2013-01-02 00:00:00 C 1 1 1 D 3 3 3 E test train test F foo foo foo 3 A 1 B 2013-01-02 00:00:00 C 1 D 3 E train F foo """
对数据的 index
进行排序并输出:
print(df2.sort_index(axis=1, ascending=False))""" F E D C B A 0 foo test 3 1.0 2013-01-02 1.0 1 foo train 3 1.0 2013-01-02 1.0 2 foo test 3 1.0 2013-01-02 1.0 3 foo train 3 1.0 2013-01-02 1.0 """# 对数据的 value 进行排序并输出:print(df2.sort_values(by='B'))""" A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo """
Pandas 选择数据
简单的筛选
loc
:根据标签来选择数据iloc
:根据位置来选择数据ix
:混合选择通过判断的筛选
数据准备:
dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
简单的筛选:
print(df['A']) print(df.A)""" 2013-01-01 0 2013-01-02 4 2013-01-03 8 2013-01-04 12 2013-01-05 16 2013-01-06 20 Freq: D, Name: A, dtype: int64 """
选择跨越多行或多列:
print(df[0:3])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 """print(df['20130102':'20130104'])""" A B C D 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 """
根据标签 loc
:
print(df.loc['20130102'])""" A 4 B 5 C 6 D 7 Name: 2013-01-02 00:00:00, dtype: int64 """print(df.loc[:,['A','B']]) """ A B 2013-01-01 0 1 2013-01-02 4 5 2013-01-03 8 9 2013-01-04 12 13 2013-01-05 16 17 2013-01-06 20 21 """print(df.loc['20130102',['A','B']])""" A 4 B 5 Name: 2013-01-02 00:00:00, dtype: int64 """
根据序列 iloc
:
print(df.iloc[3,1])# 13print(df.iloc[3:5,1:3])""" B C 2013-01-04 13 14 2013-01-05 17 18 """print(df.iloc[[1,3,5],1:3])""" B C 2013-01-02 5 6 2013-01-04 13 14 2013-01-06 21 22 """
混合选择 ix
:
print(df.ix[:3,['A','C']])""" A C 2013-01-01 0 2 2013-01-02 4 6 2013-01-03 8 10 """
通过判断的筛选:
print(df[df.A>8])""" A B C D 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
Pandas 设置值
根据位置设置
loc
和iloc
根据条件设置
按行或列设置
添加数据
数据准备:
dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
根据位置设置 loc
和iloc
df.iloc[2,2] = 1111df.loc['20130101','B'] = 2222""" A B C D 2013-01-01 0 2222 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 1111 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
根据条件设置:
df.B[df.A>4] = 0""" A B C D 2013-01-01 0 2222 2 3 2013-01-02 4 5 6 7 2013-01-03 8 0 1111 11 2013-01-04 12 0 14 15 2013-01-05 16 0 18 19 2013-01-06 20 0 22 23 """
按行或列设置:
df['F'] = np.nan""" A B C D F 2013-01-01 0 2222 2 3 NaN 2013-01-02 4 5 6 7 NaN 2013-01-03 8 0 1111 11 NaN 2013-01-04 12 0 14 15 NaN 2013-01-05 16 0 18 19 NaN 2013-01-06 20 0 22 23 NaN """
添加数据:
df['E'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130101',periods=6)) """ A B C D F E 2013-01-01 0 2222 2 3 NaN 1 2013-01-02 4 5 6 7 NaN 2 2013-01-03 8 0 1111 11 NaN 3 2013-01-04 12 0 14 15 NaN 4 2013-01-05 16 0 18 19 NaN 5 2013-01-06 20 0 22 23 NaN 6 """
Pandas 处理丢失数据
dropna()
: 去掉有NaN
的行或列fillna()
:将NaN
的值用其他值代替isnull()
:判断是否有缺失数据NaN
创建含 NaN
的矩阵:
dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D']) df.iloc[0,1] = np.nan df.iloc[1,2] = np.nan""" A B C D 2013-01-01 0 NaN 2.0 3 2013-01-02 4 5.0 NaN 7 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
去掉有 NaN
的行或列:
df.dropna( axis=0, # 0: 对行进行操作; 1: 对列进行操作 how='any' # 'any': 只要存在 NaN 就 drop 掉; 'all': 必须全部是 NaN 才 drop ) """ A B C D 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
将 NaN
的值用其他值代替:
df.fillna(value=0)""" A B C D 2013-01-01 0 0.0 2.0 3 2013-01-02 4 5.0 0.0 7 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
判断是否有缺失数据 NaN
:
df.isnull() """ A B C D 2013-01-01 False True False False 2013-01-02 False False True False 2013-01-03 False False False False 2013-01-04 False False False False 2013-01-05 False False False False 2013-01-06 False False False False """np.any(df.isnull()) == True # True
Pandas 导入导出
共同學(xué)習(xí),寫下你的評論
評論加載中...
作者其他優(yōu)質(zhì)文章