Time Series

Data used in what follows:

Link: https://pan.baidu.com/s/1RAqFCxWcl4OEChlRtSBooA
Extraction code: 612s

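Introduction to Time Series

Time series data is an important form of structured data in many fields, such as finance, neuroscience, ecology, and physics. Any data observed at multiple points in time forms a time series. A time series can have a fixed frequency or be irregular.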

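Common forms

Timestamps (specific instants in time)
Fixed periods (such as a whole month or a whole year)
Time intervals (marked by a start and an end timestamp)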
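To make the three forms above concrete, here is a minimal sketch of how each one might be represented in pandas. The variable names and dates are made up for illustration; pd.Timestamp, pd.Period, and pd.Interval are the pandas objects assumed to correspond to the three forms.

```python
import pandas as pd

# A timestamp: a single instant in time.
stamp = pd.Timestamp("2020-01-01 12:00")

# A fixed period: the whole of January 2020, treated as one unit.
period = pd.Period("2020-01", freq="M")

# An interval: a span marked by a start and an end timestamp.
interval = pd.Interval(pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-10"))

print(stamp)
print(period, period.start_time)
print(interval.length)    # duration between the two endpoints
print(stamp in interval)  # membership test against the span
```

The rest of this post works almost entirely with timestamps, since that is what a DatetimeIndex holds.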

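Time Series Basics

Introduction to Time Series

The basic kind of time series in pandas is a Series indexed by timestamps. Outside pandas, those timestamps are usually represented as Python strings or datetime objects.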

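Note

datetime objects can be used as an index; pandas stores them in a DatetimeIndex
The <M8[ns] dtype (datetime64[ns]) is a nanosecond-resolution timestamp
Each element of the time series index is a Timestamp object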
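A quick way to check the notes above is to build a small Series from datetime objects and inspect its index. This is only a sketch with made-up values; the exact unit printed for the dtype may depend on the pandas version installed.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2020, 5, 14), datetime(2020, 5, 15)]
ts = pd.Series([1, 2], index=dates)

print(type(ts.index))     # the index class pandas chose for datetime labels
print(ts.index.dtype)     # a datetime64 dtype; the unit may vary by pandas version
print(type(ts.index[0]))  # the type of a single index element
```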

Functions for generating time series

pd.date_range(start=None,end=None,periods=None,freq=None,tz=None,normalize=False)
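- start: start of the range
- end: end of the range
- periods: number of timestamps to generate
- freq: date offset (frequency)
- tz: time zone
- normalize: reset the time of day of start/end to midnight (00:00:00)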

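import pandas as pd

# Daily timestamps from start to end (both endpoints included)
d1 = pd.date_range(start="20200101",end="20200201")
d1

# 5 evenly spaced timestamps between start and end
d2 = pd.date_range(start="20200101",end="20200201",periods=5)
d2

# 5 timestamps, 10 days apart
d3 = pd.date_range(start="20200101",periods=5,freq="10D")
d3

# normalize=True resets the time of day to midnight
d4 = pd.date_range(start="2020-01-01 12:59:59",periods=5,freq="10D",normalize=True)
d4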
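Since the four calls above are shown without their output, the following sketch checks a few properties of each result instead of printing the full index. It reuses the same arguments; the checks themselves are just one way to see what each parameter combination does.

```python
import pandas as pd

d1 = pd.date_range(start="20200101", end="20200201")
d2 = pd.date_range(start="20200101", end="20200201", periods=5)
d3 = pd.date_range(start="20200101", periods=5, freq="10D")
d4 = pd.date_range(start="2020-01-01 12:59:59", periods=5, freq="10D", normalize=True)

print(len(d1))            # one entry per day, both endpoints included
print(d2[1] - d2[0])      # spacing chosen so that 5 points span start..end evenly
print(d3[-1])             # start + 4 steps of 10 days
print((d4 == d3).all())   # normalize=True drops the 12:59:59 time of day
```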

import pandas as pd
import numpy as np
from datetime import datetime

ts = pd.Series(np.random.randint(1,10,5),index=pd.date_range(start='2020-05-15',periods=5,freq='10D')) # start='20200515' also works
print(ts)
print(ts.index)
print(ts['2020-05'])

2020-05-15    7
2020-05-25    2
2020-06-04    3
2020-06-14    1
2020-06-24    6
Freq: 10D, dtype: int32
DatetimeIndex(['2020-05-15', '2020-05-25', '2020-06-04', '2020-06-14',
               '2020-06-24'],
              dtype='datetime64[ns]', freq='10D')
2020-05-15    7
2020-05-25    2
Freq: 10D, dtype: int32

For the available frequency settings (offset aliases), see: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
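As a small illustration of offset aliases, the sketch below tries three of them with date_range. The aliases "B", "MS", and "W-MON" come from the pandas alias table linked above; the start date and variable names are arbitrary choices for this example.

```python
import pandas as pd

# "B": business days (Monday to Friday), weekends are skipped
business_days = pd.date_range("2020-01-01", periods=5, freq="B")

# "MS": month start, the first calendar day of each month
month_starts = pd.date_range("2020-01-01", periods=3, freq="MS")

# "W-MON": weekly, anchored on Mondays
mondays = pd.date_range("2020-01-01", periods=3, freq="W-MON")

print(business_days)
print(month_starts)
print(mondays)
```

A number can be placed in front of an alias, as in the "10D" used earlier, to get a multiple of the base frequency.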

Indexing and selecting data in a time series

import pandas as pd
import numpy as np

ts = pd.Series(np.random.randint(1,100,size=800),index=pd.date_range("20180101",periods=800))

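# Select all data for 2020
print(ts['2020'])
print('------------------------------')
# Select data for January 2020
print(ts['2020-01'])
print('------------------------------')
# Select a slice (both endpoints included); the series ends on 2020-03-10,
# so the slice has to fall inside that range to return any rows
print(ts['2020-02-01':'2020-02-10'])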
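The string-based selection used here can also be written with .loc, or spelled out as an explicit boolean mask on the index. The sketch below compares the two on the same 800-day series; the variable names are hypothetical, and the mask version is shown only to make clear what the partial date string selects.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randint(1, 100, size=800),
               index=pd.date_range("20180101", periods=800))

# Partial string indexing: every row whose timestamp falls in January 2020
jan_2020 = ts.loc["2020-01"]

# The same selection written as an explicit boolean mask
mask = (ts.index.year == 2020) & (ts.index.month == 1)
same = ts[mask]
print(len(jan_2020), len(same))

# String slices include both endpoints
first_ten = ts.loc["2020-01-01":"2020-01-10"]
print(len(first_ten))

# Checking the covered range helps avoid slicing outside the data
print(ts.index.min(), ts.index.max())
```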

Time series with duplicate indices

df.index.is_unique checks whether the index is unique

import pandas as pd
import numpy as np
from datetime import datetime


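dates = [datetime(2020,5,14),datetime(2020,5,14),datetime(2020,5,14),datetime(2020,5,15)]
ts = pd.Series(np.arange(4),index=dates)
print(ts)

# False here, because 2020-05-14 appears three times
print(ts.index.is_unique)

# Selecting a duplicated date returns every matching row
print(ts['20200514'])

# Aggregate the rows that share a timestamp
print(ts.groupby(ts.index).mean())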

2020-05-14    0
2020-05-14    1
2020-05-14    2
2020-05-15    3
dtype: int32

False

2020-05-14    0
2020-05-14    1
2020-05-14    2
dtype: int32

2020-05-14    1.0
2020-05-15    3.0
dtype: float64
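Besides averaging, there are other ways to deal with repeated timestamps. The sketch below, built on the same four-row series, shows two options that seem common: keeping only the first row per timestamp, and counting how many rows each timestamp has. Which one is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2020, 5, 14)] * 3 + [datetime(2020, 5, 15)]
ts = pd.Series(np.arange(4), index=dates)

# True for the second and later occurrences of a timestamp
dup_mask = ts.index.duplicated(keep="first")
print(dup_mask)

# Keep only the first row for each timestamp
deduped = ts[~dup_mask]
print(deduped)

# Count how many rows share each timestamp
counts = ts.groupby(level=0).size()
print(counts)
```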

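Shifting dates

"Shifting" means moving data forward or backward through time. Both Series and DataFrame have a shift method for simple forward or backward shifts; it moves the values and leaves the index unchanged.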

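ts.shift(2)   # shift values forward by 2 positions (the first 2 become NaN)

ts.shift(-2)  # shift values backward by 2 positions (the last 2 become NaN)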

import pandas as pd
import numpy as np
from datetime import datetime
dates = [datetime(2020,5,14),datetime(2020,5,14),datetime(2020,5,14),datetime(2020,5,15)]
ts = pd.Series(np.arange(4),index=dates)
print(ts)

ts1 = ts.shift(2)
print(ts1)
ts2 = ts.shift(-2)
print(ts2)

2020-05-14    0
2020-05-14    1
2020-05-14    2
2020-05-15    3
dtype: int32
2020-05-14    NaN
2020-05-14    NaN
2020-05-14    0.0
2020-05-15    1.0
dtype: float64
2020-05-14    2.0
2020-05-14    3.0
2020-05-14    NaN
2020-05-15    NaN
dtype: float64
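Two follow-ups to shift may be worth a sketch. First, passing freq moves the timestamps rather than the values, which is the opposite of the default behaviour. Second, a typical use of shift is computing the change from one observation to the next. The series below is made up for illustration and uses a unique daily index to keep the result easy to read.

```python
import pandas as pd

ts = pd.Series([10.0, 11.0, 12.1],
               index=pd.date_range("2020-05-14", periods=3, freq="D"))

# Default: the index stays put, values slide down, NaN fills the gap
moved_values = ts.shift(1)
print(moved_values)

# With freq: values stay put, every timestamp moves one day later
moved_index = ts.shift(1, freq="D")
print(moved_index)

# Relative change from the previous observation
change = ts / ts.shift(1) - 1
print(change)
```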

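Resampling

Introduction to resampling

Resampling is the process of converting a time series from one frequency to another. Converting higher-frequency data to a lower frequency is called downsampling; converting lower-frequency data to a higher frequency is called upsampling.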

import pandas as pd
import numpy as np

ts = pd.DataFrame(np.random.randint(100,200,size=100),index=pd.date_range(start="20200101",periods=100))

print(ts)
ts1 = ts.resample('M').mean()  # downsample to monthly means; 'M' is month end (pandas 2.2+ prefers 'ME')
print(ts1)

              0
2020-01-01  166
2020-01-02  126
2020-01-03  140
2020-01-04  106
2020-01-05  186
...         ...
2020-04-05  197
2020-04-06  145
2020-04-07  179
2020-04-08  100
2020-04-09  140

[100 rows x 1 columns]
                     0
2020-01-31  151.032258
2020-02-29  140.896552
2020-03-31  156.548387
2020-04-30  152.000000
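The example above is a downsample (daily to monthly). For the other direction, here is a minimal sketch of upsampling a weekly series to daily frequency. The values are invented, and forward fill and linear interpolation are just two possible ways of filling the new rows; neither is implied by the original example.

```python
import pandas as pd

weekly = pd.Series([100, 107, 114],
                   index=pd.date_range("2020-01-05", periods=3, freq="7D"))

# Upsampling creates the missing days; without a fill rule they are NaN
daily_gaps = weekly.resample("D").asfreq()
print(daily_gaps.isna().sum())

# Forward fill: each new day repeats the last known value
daily_ffill = weekly.resample("D").ffill()
print(daily_ffill.head(8))

# Linear interpolation between the known weekly points
daily_interp = weekly.resample("D").asfreq().interpolate()
print(daily_interp.head(8))
```

Downsampling needs an aggregation such as mean(), while upsampling needs a rule for filling the gaps.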

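That covers the basics of time series. Now consider a common scenario: the dates arrive as plain values rather than timestamps and have to be converted to a time series with pd.to_datetime().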

import pandas as pd
import numpy as np

# The row labels are plain integers such as 20200101, not timestamps
df = pd.DataFrame(np.random.randint(1000,4000,size=(4,4)),index=[20200101,20200102,20200103,20200104],columns=["北京","上海","广州","深圳"])
df.reset_index(inplace=True)

print(df)
# Parse the integer labels as dates, then use them as the index again
df['index'] = pd.to_datetime(df['index'],format='%Y%m%d')
df.set_index('index',inplace=True)
print(df,df.loc['20200102'])

      index    北京    上海    广州    深圳
0  20200101  3857  2599  2456  3002
1  20200102  3301  3937  3178  2281
2  20200103  2035  1310  2356  2908
3  20200104  3979  2067  2903  2531
              北京    上海    广州    深圳
index                             
2020-01-01  3857  2599  2456  3002
2020-01-02  3301  3937  3178  2281
2020-01-03  2035  1310  2356  2908
2020-01-04  3979  2067  2903  2531 北京    3301
上海    3937
广州    3178
深圳    2281
Name: 2020-01-02 00:00:00, dtype: int32
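The format argument used in the conversion above can describe other layouts as well. The sketch below is an illustration with made-up data: it parses integer dates as in the example, then day-first strings, and uses errors="coerce" (an option not used in the original example) so that values that do not match the format become NaT instead of raising an error.

```python
import pandas as pd

# Integer dates laid out as year, month, day
ints = pd.Series([20200101, 20200102])
from_ints = pd.to_datetime(ints, format="%Y%m%d")
print(from_ints)

# Day-first strings; the last value does not match the format
text = pd.Series(["02/01/2020", "31/01/2020", "bad value"])
from_text = pd.to_datetime(text, format="%d/%m/%Y", errors="coerce")
print(from_text)
```

Stating the format explicitly removes the ambiguity in strings like "02/01/2020", which could otherwise be read as either 2 January or 1 February.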

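The original row index is not a datetime type, so to use the time series features we first have to convert it with pd.to_datetime(). Its format parameter describes the layout of the input dates; commonly used format codes are as follows: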