Data Cleaning with Pandas
Data used in the sections below:
Link: https://pan.baidu.com/s/1RAqFCxWcl4OEChlRtSBooA
Extraction code: 612s
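Introduction to Data Cleaning
Data cleaning is, in practice, a form of data quality analysis: you check the raw data for dirty data (values that do not meet requirements or cannot be analyzed directly) and then deal with it.
The common cases are:
- Missing values
- Outliers
- Duplicate data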
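Handling Missing Values
Pandas uses the floating-point value NaN (Not a Number) to represent missing values, and missing values show up in data all the time. One of the goals of Pandas is therefore to handle missing values **"painlessly"**.
Checking whether data is NaN
- pd.isnull(df): returns a boolean mask showing which values are missing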
- pd.notnull(df): returns the complement of isnull
Note
- Python's built-in None is also treated as NaN
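As a minimal sketch of the two checks above, the following builds a small frame that contains both NaN and None. The frame and its column names (`x`, `y`) are made up for illustration and do not come from the linked data set.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame: one NaN in a numeric column, one None in a text column
df = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", None]})

missing_mask = pd.isnull(df)    # True where a value is missing (NaN or None)
present_mask = pd.notnull(df)   # element-wise complement of missing_mask

print(missing_mask)
print(present_mask)
```

Both results have the same shape as `df`, so they can be used directly for boolean indexing or for counting missing values per column with `missing_mask.sum()`.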
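Filtering Out Missing Values
dropna(axis=0, how='any', inplace=False)
- axis: the axis to drop along; the default 0 means rows
- how: the default 'any' drops every row that contains a NaN; 'all' drops only rows in which every value is NaN
- inplace: modify the object the method is called on instead of returning a copy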
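The difference between `how='any'` and `how='all'`, and the effect of `inplace`, can be sketched as follows. The frame, its row labels (`r1` to `r3`) and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: r1 is complete, r2 has one NaN, r3 is entirely NaN
df = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]},
    index=["r1", "r2", "r3"],
)

any_dropped = df.dropna(how="any")   # drops r2 and r3 (each contains a NaN)
all_dropped = df.dropna(how="all")   # drops only r3 (every value is NaN)

# With inplace=True the frame itself is modified and the call returns None
cleaned = df.copy()
result = cleaned.dropna(inplace=True)

print(any_dropped)
print(all_dropped)
print(result)
```

Working on a copy before calling `dropna(inplace=True)` keeps the original frame available for comparison.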
Filling Missing Values (NaN)
df.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
- value: a scalar or dict used to fill the missing values
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
df.loc['a'] = np.nan
df.loc['b', 1] = np.nan
df.fillna(value=1, inplace=True)  # fill every NaN with 1
print(df)
```
```
      0     1     2     3
a   1.0   1.0   1.0   1.0
b   4.0   1.0   6.0   7.0
c   8.0   9.0  10.0  11.0
d  12.0  13.0  14.0  15.0
```
- method: the fill method, either 'ffill' (carry the previous valid value forward) or 'bfill' (pull the next valid value backward). The default is None. Recent pandas releases (2.1 and later) deprecate this parameter in favor of df.ffill() and df.bfill().
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
df.loc['a'] = np.nan
df.loc['b', 1] = np.nan
# Fill backward along the rows; on pandas >= 2.1 the non-deprecated
# equivalent is df.bfill(inplace=True)
df.fillna(method='bfill', inplace=True)
print(df)
```
```
      0     1     2     3
a   4.0   9.0   6.0   7.0
b   4.0   9.0   6.0   7.0
c   8.0   9.0  10.0  11.0
d  12.0  13.0  14.0  15.0
```
- axis: the axis to fill along; the default is 0
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
df.loc['a'] = np.nan
df.loc['b', 1] = np.nan
# Fill backward along axis 1 (within each row); on pandas >= 2.1 the
# non-deprecated equivalent is df.bfill(axis=1, inplace=True)
df.fillna(method='bfill', inplace=True, axis=1)
print(df)
```
```
      0     1     2     3
a   NaN   NaN   NaN   NaN
b   4.0   6.0   6.0   7.0
c   8.0   9.0  10.0  11.0
d  12.0  13.0  14.0  15.0
```
- inplace: modify the object the method is called on instead of returning a copy
- limit: the maximum number of values to fill when filling forward or backward
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
df.loc['a'] = np.nan
df.loc['b', 1] = np.nan
df.loc['c'] = np.nan
df.fillna(value=2, inplace=True, axis=1, limit=2)
print(df)
```
```
      0     1     2     3
a   2.0   2.0   2.0   2.0
b   4.0   2.0   6.0   7.0
c   2.0   NaN   2.0   2.0
d  12.0  13.0  14.0  15.0
```
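The parameter list above mentions that `value` may be a dict, which none of the examples show. The sketch below fills each column with its own value, and also shows `limit` together with a forward fill. It uses `df.ffill()` instead of `fillna(method='ffill')` on the assumption that a reasonably recent pandas is installed. The frame and the fill values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps of different lengths in each column
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
})

# A dict maps column name -> fill value for that column
by_column = df.fillna(value={"a": 0.0, "b": -1.0})

# Forward fill, but propagate each valid value over at most one NaN
forward = df.ffill(limit=1)

print(by_column)
print(forward)
```

In `forward`, column `a` keeps two NaN values because only the first gap after the value 1.0 is filled, and the leading NaN in column `b` stays because there is no earlier value to carry forward.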
Outliers
Dirty data also includes values that do not meet requirements, and such values cannot simply be filled in with fillna. replace is more flexible here.
df.replace(to_replace=None, value=None)
- to_replace: the value to be replaced
- value: the replacement value
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
df.loc['a'] = np.nan
df.loc['b', 1] = np.nan
# Replace every occurrence of the value stored at ('c', 2), i.e. 10.0, with 5
df.replace(to_replace=df.loc['c', 2], value=5, inplace=True)
print(df)
```
```
      0     1     2     3
a   NaN   NaN   NaN   NaN
b   4.0   NaN   6.0   7.0
c   8.0   9.0   5.0  11.0
d  12.0  13.0  14.0  15.0
```
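`replace` also accepts a list or a dict, which is handy when several sentinel values mark invalid entries. The sketch below assumes, purely for illustration, that -999 and -1 are such sentinels. The column names are hypothetical as well.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which -999 and -1 stand for "invalid measurement"
df = pd.DataFrame({"age": [25, -999, 40], "score": [88, 92, -1]})

# A list of values to replace, all mapped to the same replacement
as_missing = df.replace(to_replace=[-999, -1], value=np.nan)

# A dict maps each old value to its own replacement
mapped = df.replace({-999: 0, -1: 0})

print(as_missing)
print(mapped)
```

Turning sentinels into NaN first means the missing-value tools (`dropna`, `fillna`) can then be applied to them.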
Handling Duplicate Data
Detecting duplicates
df.duplicated(subset=None, keep='first') returns a boolean Series. By default it indicates whether each row is identical to a row that appeared earlier.
- subset: the columns to consider when looking for duplicates
- keep: 'first' (the default) treats the first occurrence as the one to keep and flags the later ones; 'last' treats the last occurrence as the one to keep
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
df.loc['b'] = df.loc['a']  # make row b a copy of row a
print(df.duplicated(keep='last'))
```
```
a     True
b    False
c    False
d    False
dtype: bool
```
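The example above compares whole rows. A short sketch of the `subset` parameter, using a made-up frame with hypothetical `name` and `city` columns, shows how the result changes when only some columns are compared:

```python
import pandas as pd

# Hypothetical frame: "ann" appears twice, but with different cities
df = pd.DataFrame({
    "name": ["ann", "bob", "ann"],
    "city": ["NY", "LA", "SF"],
})

full_row = df.duplicated()                                  # compares entire rows
by_name = df.duplicated(subset=["name"])                    # compares only "name"
by_name_last = df.duplicated(subset=["name"], keep="last")  # last occurrence is kept

print(full_row.tolist())
print(by_name.tolist())
print(by_name_last.tolist())
```

No complete row repeats, so `full_row` is all False, while the `name`-only comparison flags one of the two "ann" rows depending on `keep`.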
Dropping duplicates
df.drop_duplicates() returns a DataFrame. By default it drops duplicate rows.
- subset: the columns to consider when looking for duplicates
- keep: 'first' (the default) keeps the first occurrence; 'last' keeps the last occurrence
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), index=['a', 'b', 'c', 'd'])
# df.loc['a',2] = 1
df.loc['b'] = df.loc['a']  # make row b a copy of row a
# print(df)
# print(df.duplicated(keep='last'))
df.drop_duplicates(keep='last', inplace=True)  # row a is dropped, row b is kept
print(df)
```
```
    0   1   2   3
b   0   1   2   3
c   8   9  10  11
d  12  13  14  15
```
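A common variant is to deduplicate on a key column while keeping the other columns of whichever row survives. The sketch below assumes a hypothetical visit log with `user` and `visit` columns.

```python
import pandas as pd

# Hypothetical visit log: user "u1" appears twice
df = pd.DataFrame({
    "user": ["u1", "u2", "u1"],
    "visit": ["2021-01-01", "2021-01-02", "2021-01-03"],
})

first_visit = df.drop_duplicates(subset=["user"])               # keeps rows 0 and 1
last_visit = df.drop_duplicates(subset=["user"], keep="last")   # keeps rows 1 and 2

print(first_visit)
print(last_visit)
```

Without `inplace=True`, `df` itself is left unchanged and each call returns a new DataFrame.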
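Discretization
Discretization maps a finite set of values drawn from an unbounded space into a finite space, which improves the time and space efficiency of algorithms.
Put simply, discretization means dividing continuous values into intervals.
- pd.cut(x, bins): discretizes the continuous data x
  - x: the data to discretize
  - bins: how to group the data (an integer number of equal-width bins, or a sequence of bin edges)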
Sample output:
```
[(51.6, 59.0], (44.2, 51.6], (51.6, 59.0], (51.6, 59.0], (44.2, 51.6], ..., (29.4, 36.8], (21.963, 29.4], (29.4, 36.8], (36.8, 44.2], (29.4, 36.8]]
```
- pd.value_counts(cates): counts how many values fall into each interval
Sample output:
```
IntervalIndex([(21.963, 29.4], (29.4, 36.8], (36.8, 44.2], (44.2, 51.6], (51.6, 59.0]],
```
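Binning followed by counting can be sketched as below. The ages are invented and are not the values behind the sample output above. The counting step uses the `Series.value_counts()` method rather than the top-level `pd.value_counts`, on the assumption that the method form is the safer choice across pandas versions.

```python
import pandas as pd

# Hypothetical ages to bin
ages = pd.Series([22, 25, 27, 31, 44, 47, 52, 59, 58])

# Explicit bin edges produce the intervals (20, 30], (30, 40], (40, 50], (50, 60]
age_groups = pd.cut(ages, bins=[20, 30, 40, 50, 60])

# Count the values in each interval, ordered by interval rather than by count
counts = age_groups.value_counts().sort_index()

# An integer asks pd.cut for that many equal-width bins over the data range
equal_width = pd.cut(ages, bins=3)

print(age_groups)
print(counts)
print(equal_width.cat.categories)
```

Intervals are closed on the right by default, so a value of exactly 30 would fall into (20, 30] and not into (30, 40].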
Renaming Axis Indexes
Requirement: convert all row index labels of a DataFrame to uppercase.
Options:
- Index mapping: df.index.map()
- Index renaming: df.rename(index, columns)
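Both approaches can be sketched on a small made-up frame. The city labels and the `population` column are hypothetical, since the original data for this section is not available.

```python
import pandas as pd

# Hypothetical frame with lowercase row labels
df = pd.DataFrame(
    {"population": [100, 200]},
    index=["beijing", "shanghai"],
)

# Option 1: map a function over the index and assign the result back
mapped = df.copy()
mapped.index = mapped.index.map(str.upper)

# Option 2: rename returns a new frame; index and columns each accept
# a function or a dict of old label -> new label
renamed = df.rename(index=str.upper, columns=str.title)

print(mapped)
print(renamed)
```

`rename` leaves `df` untouched unless `inplace=True` is passed, whereas assigning to `.index` changes the frame it is applied to, which is why the first option works on a copy here.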
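Vectorized String Functions
Computing Dummy Variables
Converting a categorical variable into a "dummy" or "indicator" matrix is another transformation used in statistical modeling and machine learning. If a column of a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns whose values are 1 and 0.
- pd.get_dummies(): converts a categorical variable into a "dummy" or "indicator" matrix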
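A minimal sketch of `pd.get_dummies` on a single column follows. The frame with `key` and `data` columns is hypothetical. `dtype=int` is passed on the assumption that 0/1 integers are wanted, because newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical frame: "key" has k = 3 distinct values
df = pd.DataFrame({"key": ["b", "a", "c", "a"], "data": range(4)})

# One indicator column per distinct value of "key"
dummies = pd.get_dummies(df["key"], dtype=int)

# Prefix the indicator columns and attach them to the remaining data
combined = df[["data"]].join(dummies.add_prefix("key_"))

print(dummies)
print(combined)
```

Each row of `dummies` contains exactly one 1, in the column that matches that row's `key` value.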
However, when a single row of the DataFrame belongs to more than one category, things get more complicated.
- IMDB Movie data cleaning
Indicator columns of the result (one per genre):
```
Action Adventure Sci-Fi Mystery ... Western War Musical Sport
```
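When one row belongs to several categories, one possible approach is the vectorized string method `Series.str.get_dummies`, which splits each string on a separator and builds the indicator matrix in one step. This is a sketch, not necessarily the method used in the original post. The titles, the `Genre` column name and the comma separator are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical movie table; each movie may list several comma-separated genres
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action,Sci-Fi", "Mystery", "Action,Mystery"],
})

# Split each Genre string on "," and create one 0/1 column per distinct genre
genre_dummies = movies["Genre"].str.get_dummies(sep=",")

# Attach the indicator columns to the titles
result = movies[["Title"]].join(genre_dummies)

print(result)
```

Unlike the single-category case, a row here can contain more than one 1, one for each genre the movie belongs to.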