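1. Problem description
While doing time series analysis I called adtk.data.validate_series, and it raised the error named in this post's title. The ADTK documentation (Detector — ADTK 0.6.2 documentation) describes validate_series as follows: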
Validate time series
Function validate_series checks some common critical issues that may cause problems if anomaly detection is performed to time series without fixing them. The function will automatically fix some of them, while it will raise errors when detect others.
Issues will be checked and automatically fixed include:
- Time index is not monotonically increasing;
- Time index contains duplicated time stamps (fix by keeping first values);
- (optional) Time index attribute freq is missed;
- (optional) Time series include categorical (non-binary) label columns (to fix by converting categorical labels into binary indicators).
Issues will be checked and raise error include:
- Wrong type of time series object (must be pandas Series or DataFrame);
- Wrong type of time index object (must be pandas DatetimeIndex).
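In short, before running the actual time series analysis, you use this function to check whether the input is a valid time series. Some problems are fixed automatically by the function; those that cannot be fixed raise an error. The error mentioned in this post's title is the second of the latter kind (Wrong type of time index object): the index of the pandas DataFrame is not a DatetimeIndex.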
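To make the checks listed above concrete, here is a minimal pandas-only sketch of what such a validation could look like. The helper sketch_validate is hypothetical and is not ADTK's implementation; it only mirrors the two error checks and the first two automatic fixes from the quoted documentation, and the sample data is made up.

```python
import pandas as pd


def sketch_validate(ts):
    """Hypothetical helper mirroring the documented checks; not ADTK's code."""
    # Error check 1: the object must be a pandas Series or DataFrame.
    if not isinstance(ts, (pd.Series, pd.DataFrame)):
        raise TypeError("time series must be a pandas Series or DataFrame")
    # Error check 2: the index must be a pandas DatetimeIndex.
    if not isinstance(ts.index, pd.DatetimeIndex):
        raise TypeError("time index must be a pandas DatetimeIndex")
    # Fix 1: drop duplicated time stamps, keeping the first value.
    ts = ts[~ts.index.duplicated(keep="first")]
    # Fix 2: make the time index monotonically increasing.
    return ts.sort_index()


# Made-up sample: unsorted, with one duplicated time stamp.
idx = pd.to_datetime(["2021-01-02", "2021-01-01", "2021-01-02"])
s = pd.Series([2.0, 1.0, 9.0], index=idx)
clean = sketch_validate(s)
print(clean)
```

A series whose index holds plain strings would fail the second check, which is the situation this post is about.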
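2. Origin of the problem
I read a set of data from a CSV file with pandas.
The first column does hold time stamps, but its type does not meet the requirement of ADTK time series analysis. Checking its type: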
print(type(data_df_train.index))
<class 'pandas.core.indexes.base.Index'>
By default, pandas treats this time stamp column as an ordinary Index object rather than the DatetimeIndex that time series analysis requires.
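The situation can be reproduced without the original file. The sketch below assumes a small made-up CSV (the column names timestamp and value are illustrative, not the author's data) and reads it from memory with the first column as the index.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the author's file.
csv_text = (
    "timestamp,value\n"
    "2021-01-01 00:00:00,1.0\n"
    "2021-01-01 01:00:00,2.0\n"
)

# index_col=0 uses the first column as the index, but without date parsing
# the time stamps stay plain strings.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(type(df.index))
print(isinstance(df.index, pd.DatetimeIndex))  # False
```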
3. Solution
Some people online suggest the following:
data_df_train.index = data_df_train.index.to_datetime()
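However, this did not work when I tried it and raised an error (Index.to_datetime has been removed from recent pandas releases).
The solution that worked for me is: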
data_df_train.index = pd.to_datetime(data_df_train.index)
print(type(data_df_train.index))
Out: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
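After conversion with pd.to_datetime(), the index becomes a datetimes.DatetimeIndex object, which meets the requirement for time series analysis. Running the original time series analysis program again then works.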
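As a possible alternative to converting afterwards, pandas can parse the index while reading the file. The sketch below, again using a made-up in-memory CSV rather than the author's data, compares the pd.to_datetime conversion with read_csv's parse_dates=True option; both should give a DatetimeIndex.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the author's file.
csv_text = (
    "timestamp,value\n"
    "2021-01-01 00:00:00,1.0\n"
    "2021-01-01 01:00:00,2.0\n"
)

# Option 1: read first, then convert the index (the fix used in this post).
converted = pd.read_csv(io.StringIO(csv_text), index_col=0)
converted.index = pd.to_datetime(converted.index)

# Option 2: ask read_csv to parse the index column as dates while reading.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(converted.index))
print(type(parsed.index))
```

Either frame can then be passed to validate_series. Parsing at read time saves a step, while the explicit conversion also works for a DataFrame that was loaded some other way.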