Pandas数据清洗

接下来会用到的数据:

链接:https://pan.baidu.com/s/1RAqFCxWcl4OEChlRtSBooA
提取码:612s

数据清洗介绍

数据清洗实际上也是数据质量分析,检查原始数据中是否存在脏数据(不符合要求,或者不能直接进行分析的数据),并且处理脏数据。

常见情况如下

  • 缺失值
  • 异常值
  • 重复数据

处理缺失值

Pandas使用浮点值NaN(not a Number)表示缺失值,并且缺失值在数据中时常出现。那么Pandas的目的之一就是**”无痛地”**处理缺失值。

判断数据是否为NaN

  • pd.isnull(df) 返回哪些值是缺失值的布尔值
1
2
3
4
5
6
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
# print(df)
print(pd.isna(df[:2]))

1
2
3
       0      1      2      3
a False False False False
b False False False False
  • pd.notnull(df) 返回值是isnull的反集
1
2
3
4
5
6
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
# print(df)
print(pd.isna(df))
print(pd.notna(df))
1
2
3
4
5
      0     1     2     3
a True True True True
b True True True True
c True True True True
d True True True True

注意

  • Python内建的None值也被当作NaN

过滤缺失值

  • dropna(axis=0,how=’any’,inplace=False)

    • axis 指定轴 默认为0 代表行
    • how 默认为any 代表删除含有NaN的行 当为all 时代表删除所有值为NaN的行
    • inplace 修改被调用的对象 而不是一个备份
1
2
3
4
5
6
7
8
9
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])

df.loc['a'] = np.nan
df.loc['b',1] = np.nan
print(df)
df.dropna(axis=0,how='any',inplace=True)
print(df)
1
2
3
4
5
6
7
8
      0     1     2     3
a NaN NaN NaN NaN
b 4.0 NaN 6.0 7.0
c 8.0 9.0 10.0 11.0
d 12.0 13.0 14.0 15.0
0 1 2 3
c 8.0 9.0 10.0 11.0
d 12.0 13.0 14.0 15.0
1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
# print(df)
# print(pd.isna(df))
# # print(pd.notna(df))
df.loc['a'] = np.nan
df.loc['b',1] = np.nan
print(df)
df.dropna(axis=0,how='all',inplace=True)
print(df)
1
2
3
4
5
6
7
8
9
10
      0     1     2     3
a NaN NaN NaN NaN
b 4.0 NaN 6.0 7.0
c 8.0 9.0 10.0 11.0
d 12.0 13.0 14.0 15.0
0 1 2 3
b 4.0 NaN 6.0 7.0
c 8.0 9.0 10.0 11.0
d 12.0 13.0 14.0 15.0

补全缺失值(NaN)

  • df.fillna(value=None,method=None,axis=None,inplace=False,limit=None)

    • value 标量或字典对象用于填充缺失值
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    df.loc['a'] = np.nan
    df.loc['b',1] = np.nan
    
    df.fillna(value=1,inplace=True)
    print(df)
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">        0     1     2     3</span><br><span class="line">  a   1.0   1.0   1.0   1.0</span><br><span class="line">  b   4.0   1.0   6.0   7.0</span><br><span class="line">  c   8.0   9.0  10.0  11.0</span><br><span class="line">  d  12.0  13.0  14.0  15.0</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->
    • method 插值方法 默认为”ffill”
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    df.loc['a'] = np.nan
    df.loc['b',1] = np.nan
    
    df.fillna(method='bfill',inplace=True) #使用bfill进行插值
    print(df)
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">        0     1     2     3</span><br><span class="line">  a   4.0   9.0   6.0   7.0</span><br><span class="line">  b   4.0   9.0   6.0   7.0</span><br><span class="line">  c   8.0   9.0  10.0  11.0</span><br><span class="line">  d  12.0  13.0  14.0  15.0</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->
    • axis 需填充的轴 默认为0
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    df.loc['a'] = np.nan
    df.loc['b',1] = np.nan
    
    df.fillna(method='bfill',inplace=True,axis=1) #填充轴为1
    print(df)
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">        0     1     2     3</span><br><span class="line">  a   NaN   NaN   NaN   NaN</span><br><span class="line">  b   4.0   6.0   6.0   7.0</span><br><span class="line">  c   8.0   9.0  10.0  11.0</span><br><span class="line">  d  12.0  13.0  14.0  15.0</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->
    • inplace 修改被调用的对象 而不是一个备份
    • limit 用于向前或向后填充时最大的填充范围.
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    df.loc['a'] = np.nan
    df.loc['b',1] = np.nan
    
    df.loc['c'] = np.nan
    df.fillna(value=2,inplace=True,axis=1,limit=2)
    print(df)
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">        0     1     2     3</span><br><span class="line">  a   2.0   2.0   2.0   2.0</span><br><span class="line">  b   4.0   2.0   6.0   7.0</span><br><span class="line">  c   2.0   NaN   2.0   2.0</span><br><span class="line">  d  12.0  13.0  14.0  15.0</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->
    
    
    
    

异常值

脏数据也包含不符合要求的数据,那么对这块数据处理不能直接使用fillna填充。使用replace更加灵活。

  • df.replace(to_replace=None,value=None)

    • to_replace 去替换的值
    • value 替换的值
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    df.loc['a'] = np.nan
    df.loc['b',1] = np.nan
    
    df.replace(to_replace=df.loc['c',2],value=5,inplace=True) #将c,2替换为5
    print(df)
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">        0     1     2     3</span><br><span class="line">  a   NaN   NaN   NaN   NaN</span><br><span class="line">  b   4.0   NaN   6.0   7.0</span><br><span class="line">  c   8.0   9.0   5.0  11.0</span><br><span class="line">  d  12.0  13.0  14.0  15.0</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->
    
    
    
    

处理重复数据

判断重复值

  • df.duplicated(subset=None, keep=’first’) 返回的一个布尔值Series 默认反映的是每一行是否与之前出现过的行相同

    • subset 指定子列判断重复
    • keep 默认为first保留首个出现的 last保留最后出现的
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    df.loc['b'] = df.loc['a']
    print(df.duplicated(keep='last'))
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">  a     True</span><br><span class="line">  b    False</span><br><span class="line">  c    False</span><br><span class="line">  d    False</span><br><span class="line">  dtype: bool</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->

删除重复值

  • df.drop_duplicates() 返回的是DataFrame 默认删除重复行

    • subset 指定的数据任何子集是否有重复
    • keep 默认为first保留首个出现的 last保留最后出现的
  • import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'])
    
    # df.loc['a',2] = 1
    df.loc['b'] = df.loc['a']
    # print(df)
    # print(df.duplicated(keep='last'))
    df.drop_duplicates(keep='last',inplace=True)
    print(df)
    <!--hexoPostRenderEscape:<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">- &#96;&#96;&#96;</span><br><span class="line">      0   1   2   3</span><br><span class="line">  b   0   1   2   3</span><br><span class="line">  c   8   9  10  11</span><br><span class="line">  d  12  13  14  15</span><br></pre></td></tr></table></figure>:hexoPostRenderEscape-->
    

离散化

离散化是把无限空间中有限的个体映射到有限的空间中去,以此提高算法的时空效率。

可以简单的理解为离散化就是将连续值进行分区间

  • pd.cut(x,bins) 将连续数据x进行离散化
    • x 要进行离散化的数据
    • bins 分组
1
2
3
4
5
6
7
8
9
import pandas as pd
import numpy as np

age_arree = np.random.randint(20,60,size=20)
bins = 5 #bins可以传递一个列表
cates = pd.cut(age_arree,bins)
print(cates) #返回组
print(cates.codes) #返回索引

1
2
3
4
[(51.6, 59.0], (44.2, 51.6], (51.6, 59.0], (51.6, 59.0], (44.2, 51.6], ..., (29.4, 36.8], (21.963, 29.4], (29.4, 36.8], (36.8, 44.2], (29.4, 36.8]]
Length: 20
Categories (5, interval[float64]): [(21.963, 29.4] < (29.4, 36.8] < (36.8, 44.2] < (44.2, 51.6] <(51.6, 59.0]]
[4 3 4 4 3 4 1 3 0 2 1 4 4 4 0 1 0 1 2 1]
  • pd.value_counts(cates) 统计每个区间的数值分布
1
2
3
4
5
6
7
8
9
10
import pandas as pd
import numpy as np

age_arree = np.random.randint(20,60,size=20)
bins = 5
cates = pd.cut(age_arree,bins)

print(cates.categories) #返回区间
a = pd.value_counts(cates) #返回区间数值分布
print(a)
1
2
3
4
5
6
7
8
9
IntervalIndex([(21.963, 29.4], (29.4, 36.8], (36.8, 44.2], (44.2, 51.6], (51.6, 59.0]],
closed='right',
dtype='interval[float64]')
(51.6, 59.0] 7
(29.4, 36.8] 5
(44.2, 51.6] 3
(21.963, 29.4] 3
(36.8, 44.2] 2
dtype: int64

重命名轴索引

需求:如下数据,将行索引名字全部转为大写

1
2
3
4
5
6
7
data = {
"names":["bob","amy","cheney","jerry"],
"grades":np.random.randint(60,100,size=4)
}

df2 = pd.DataFrame(np.random.randint(60,100,size=(4,2)),index=["bob","amy","cheney","jerry"])
df2

使用:

  • 索引映射:df.index.map()
  • 索引重命名:df.rename(index,columns)
1
2
3
4
5
6
7
8
9
10
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.random.randint(60,100,size=(4,2)),index=["bob","amy","cheney","jerry"])

# print(df2)
f = lambda x:x.upper()
df2.index.map(f)
print(df2.index.map(f))

print(df2.rename(index=str.upper))

向量化字符串函数

import33.png

计算虚拟变量

分类变量转换为”虚拟”或”指标”矩阵是另一种用于统计建模或机器学习的转换操作。如果DataFrame中的一列有k个不同的值,则可以衍生一个K列的值为1和0的矩阵或DataFrame。

  • pd.get_dummies() 将分类变量转换为”虚拟”或”指标”矩阵

但是,如果说DataFrame中的一行属于多个类别,情况就会比较复杂。如下图

image.png

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
import numpy as np

df1=pd.DataFrame(
{
'key':['b','b','a','c','c','b'],
'data':range(6)
}
)
print(df1)
print(pd.get_dummies(df1['key']))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
  key  data
0 b 0
1 b 1
2 a 2
3 c 3
4 c 4
5 b 5
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 0 1
5 0 1 0
  • IMDB Movie 数据清洗
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd
import numpy as np

df = pd.read_csv('IMDB-Movie-Data.csv')
# print(r.head())

genre = []
for gen in df.Genre:
genre.extend(gen.split(','))

agenre = pd.unique(genre)
# print(agenre)
zero_arr = pd.DataFrame(np.zeros((len(df),len(agenre))),columns=agenre)
# print(zero_arr)
for index,genres in enumerate(df.Genre):
indice = zero_arr.columns.get_indexer(genres.split(','))
zero_arr.iloc[index,indice] = 1

print(zero_arr.head())
1
2
3
4
5
6
7
8
   Action  Adventure  Sci-Fi  Mystery  ...  Western  War  Musical  Sport
0 1.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
1 0.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
4 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

[5 rows x 20 columns]