Guide To Sktime – Python Library For Time Series Data (Compatible With scikit-learn)
Time series data is widely used across industries to analyse trends and seasonality over time. Sktime is a unified Python framework that provides an API for machine learning with time series data, along with scikit-learn compatible tools to analyse, visualize, tune and validate time series learning models for tasks such as forecasting, time series regression and time series classification. Sktime was presented at NeurIPS in the research paper ‘sktime: A Unified Interface for Machine Learning with Time Series’ by a group of researchers: Markus Loning and Franz J Kiraly from the Alan Turing Institute, Anthony Bagnall and Jason Lines from the University of East Anglia, and Sajaysurya Ganesh and Viktor Kazakov from University College London.
Sktime combines popular time series algorithms with the scikit-learn interface. It supports reduction, in which a time series task is recast as a tabular learning problem so that scikit-learn algorithms can be applied. Other features include time series regression, classification (multivariate and univariate), time series clustering, time series annotation, forecasting, estimation, transformation, datasets, feature tools and utility functions (preprocessing and plotting).
Time series transformations are divided into Panel transformers and Series transformers. Panel transformers include Shapelet, Segment, Reduce, Rocket, PCA, Matrix profile, Compose, Summarize and tsfresh. Series transformers include Detrend, Adapt, Box-Cox, AutoCorrelation and Cosine. Popular forecasting models available through sktime include ARIMA, AutoARIMA and fbprophet. Sktime expects input data as a pandas DataFrame. For more information, check the documentation.
The main aims of the library are to provide:
- A standard interface for different types of time series learning tasks, built on scikit-learn features
- A range of reduction algorithms
- Tools for model composition, model evaluation and comparative benchmarking
- An interface for handling varied time series data
Installation:
pip install sktime
Forecasting
from sktime.forecasting.all import *

y = load_airline()
y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)
forecaster = ThetaForecaster(sp=12)  # monthly seasonal periodicity
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
smape_loss(y_test, y_pred)
0.08661468139978168
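The value printed above is a symmetric mean absolute percentage error (sMAPE). As a rough sketch, assuming the common definition mean(2|y - y_hat| / (|y| + |y_hat|)), which is what we understand `smape_loss` to compute, the metric can be reproduced by hand. The helper name `smape` and the toy numbers below are ours and are not part of sktime.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric MAPE: mean of 2*|y - y_hat| / (|y| + |y_hat|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = 2.0 * np.abs(y_true - y_pred)
    denominator = np.abs(y_true) + np.abs(y_pred)
    return float(np.mean(numerator / denominator))


# Toy values, made up for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
score = smape(y_true, y_pred)
print(score)
```

A score of 0 means a perfect forecast; under this definition the score grows as the forecast moves away from the observed values, up to a maximum of 2.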
Time Series Classification
from sktime.classification.all import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = TimeSeriesForest()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)
0.8679245283018868
Univariate Time Series Classification with sktime
In univariate time series classification, each instance consists of a single time series variable and a corresponding label. The aim is to find a classifier that learns the relationship between the time series and their labels, and can then predict the label of a new series.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sktime.classification.compose import TimeSeriesForestClassifier
from sktime.datasets import load_arrow_head
from sktime.utils.slope_and_trend import _slope
# Loading data
In this notebook, we use the arrowhead problem.
The arrowhead dataset is a time series dataset containing outlines of images of arrowheads. The classification of projectile points is an important topic in anthropology. The classes are based on shape distinctions, e.g. the presence and location of a notch in the arrow.
The shapes of the projectile points are converted into sequences using the angle-based method. For more details, check this blog post about converting images into time series data for data mining.
# Data representation
X, y = load_arrow_head(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(158, 1) (158,) (53, 1) (53,)
# univariate time series input data
X_train.head()
     dim_0
25   0 -1.6320  1 -1.6301  2 -1.6075  3 …
105  0 -1.6758  1 -1.6742  2 -1.6674  3 …
18   0 -2.1138  1 -2.0918  2 -2.0488  3 …
167  0 -1.7471  1 -1.7295  2 -1.7300  3 …
174  0 -1.6307  1 -1.6299  2 -1.6206  3 …
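The shape `(158, 1)` printed above reflects a nested layout: one row per instance, one column per variable, and an entire time series stored inside each cell. The following is a minimal sketch of that layout using only pandas and NumPy. The toy values and the name `X_nested` are ours; this illustrates the data structure and is not sktime's own loading code.

```python
import numpy as np
import pandas as pd

# Three toy series of length 5; the values are made up for illustration.
rng = np.random.default_rng(0)
series_list = [pd.Series(rng.normal(size=5)) for _ in range(3)]

# Nested layout: one row per instance, one column per variable ("dim_0"),
# and each cell stores an entire pd.Series.
X_nested = pd.DataFrame({"dim_0": pd.Series(series_list)})

print(X_nested.shape)  # rows = instances, columns = variables

# Pulling a single cell out gives back the full series for that instance.
first_series = X_nested.loc[0, "dim_0"]
print(type(first_series).__name__, len(first_series))
```

This is why the number of columns stays at 1 for a univariate problem regardless of how long each series is.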
# multi-class target variable (three classes)
labels, counts = np.unique(y_train, return_counts=True)
print(labels, counts)
['0' '1' '2'] [60 54 44]
fig, ax = plt.subplots(1, figsize=plt.figaspect(0.25))
for label in labels:
    X_train.loc[y_train == label, "dim_0"].iloc[0].plot(ax=ax, label=f"class {label}")
plt.legend()
ax.set(title="Example time series", xlabel="Time");
[Text(0.5, 1.0, 'Example time series'), Text(0.5, 0, 'Time')]
Time series forest
Time series forest adapts the random forest algorithm to the time series setting by:
- Splitting the series into multiple random intervals,
- Extracting features (mean, standard deviation and slope) from each interval,
- Training a decision tree on the extracted features,
- Repeating steps 1 to 3 and ensembling the resulting trees.
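The first two steps can be sketched with plain NumPy. This is only an illustration under our own assumptions: the helper names `slope` and `random_interval_features`, the minimum interval length of 3 and the way intervals are sampled are all hypothetical, and sktime's own extractor may differ in these details.

```python
import numpy as np


def slope(y):
    """Least-squares slope of y against time steps 0, 1, ..., len(y) - 1."""
    t = np.arange(len(y))
    return np.polyfit(t, y, deg=1)[0]


def random_interval_features(series, n_intervals, rng, min_length=3):
    """Return [mean, std, slope] for each of n_intervals random intervals."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    features = []
    for _ in range(n_intervals):
        start = rng.integers(0, n - min_length + 1)    # inclusive start
        end = rng.integers(start + min_length, n + 1)  # exclusive end
        window = series[start:end]
        features.extend([window.mean(), window.std(), slope(window)])
    return np.array(features)


rng = np.random.default_rng(42)
series = 2.0 * np.arange(20) + 1.0  # a straight line with slope 2
feats = random_interval_features(series, n_intervals=4, rng=rng)
print(feats.reshape(4, 3))  # one row per interval: mean, std, slope
```

Because the toy series is a straight line, the slope column is 2 for every interval, while the mean and standard deviation depend on where each interval happens to fall. The resulting feature vector is an ordinary tabular row, which is what allows a standard decision tree to be trained on it.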
from sktime.transformations.panel.summarize import RandomIntervalFeatureExtractor

steps = [
    (
        "extract",
        RandomIntervalFeatureExtractor(
            n_intervals="sqrt", features=[np.mean, np.std, _slope]
        ),
    ),
    ("clf", DecisionTreeClassifier()),
]
time_series_tree = Pipeline(steps)
We can directly fit and evaluate the single time series tree (which is simply a pipeline).
time_series_tree.fit(X_train, y_train)
time_series_tree.score(X_test, y_test)
0.8113207547169812
# For the time series forest classifier, we can simply use the single tree as the base estimator in the forest ensemble.
tsf = TimeSeriesForestClassifier(
    estimator=time_series_tree,
    n_estimators=100,
    criterion="entropy",
    bootstrap=True,
    oob_score=True,
    random_state=1,
    n_jobs=-1,
)
# Fitting and obtaining the out-of-bag score:
tsf.fit(X_train, y_train)
if tsf.oob_score:
    print(tsf.oob_score_)
0.8417721518987342
tsf = TimeSeriesForestClassifier()
tsf.fit(X_train, y_train)
tsf.score(X_test, y_test)
0.8867924528301887
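The out-of-bag score reported earlier relies on bootstrapping: each tree is trained on a sample drawn with replacement, and the instances that were never drawn for that tree act as a built-in validation set. The sketch below shows only how the out-of-bag instances for a single tree can be identified. It is a generic illustration of the idea, not the code used inside `TimeSeriesForestClassifier`, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 158  # size of the training set used above

# Bootstrap: draw n_samples indices with replacement for one tree.
in_bag = rng.integers(0, n_samples, size=n_samples)

# Out-of-bag instances are those never drawn for this tree.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False
oob_indices = np.flatnonzero(oob_mask)

# On average roughly a third of the instances end up out of bag.
print(len(oob_indices) / n_samples)
```

Each instance is then scored only by the trees for which it was out of bag, which gives an estimate of generalisation performance without a separate hold-out set.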
# Plotting feature importances over time for the different features and intervals.
fi = tsf.feature_importances_
fig, ax = plt.subplots(1, figsize=plt.figaspect(0.25))
fi.plot(ax=ax)
ax.set(xlabel="Time", ylabel="Feature importance");
/usr/local/lib/python3.6/dist-packages/pandas/plotting/_matplotlib/core.py:584: UserWarning: The handle <matplotlib.lines.Line2D object at 0x7f829afd6b70> has a label of '_slope' which cannot be automatically added to the legend.
  ax.legend(handles, labels, loc="best", title=title)
[Text(0.5, 0, 'Time'), Text(0, 0.5, 'Feature importance')]
For more examples, you can try out the interactive Jupyter notebooks directly on Binder, without installing any dependencies, from here.
Sktime-dl is an extension library to sktime that applies deep learning algorithms to time series data. The repository aims to provide Keras networks that can be used with sktime's machine learning pipeline and strategy tools and, by extension, with scikit-learn, for use in applications and research. The interface provides implementations of neural networks for time series analysis.
Neural Networks for time-series Classification
The current toolkit provides an interface to dl-4-tsc and implements the following network architectures: Multilayer perceptron (MLP), Fully convolutional neural network (FCNN), Time convolutional neural network (CNN), Time Le-Net (TLeNet), Encoder (Encoder), Residual network (ResNet), Multi-scale convolutional neural network (MCNN), Multi-channel deep convolutional neural network (MCDCNN) and Time warping invariant echo state network (TWIESN). It also provides an interface to InceptionTime.
Regression
Most of the classifier architectures can also serve as regressors. These are: Time convolutional neural network (CNN), Fully convolutional neural network (FCNN), Multilayer perceptron (MLP), Encoder (Encoder), Time Le-Net (TLeNet), Residual network (ResNet) and InceptionTime (Inception).
Forecasting
The regression networks can be adapted for time series forecasting through sktime’s reduction strategies. Support for RNN/LSTM networks within sktime is expected in the future.
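To give a feel for what reduction means here, the sketch below turns a forecasting problem into a tabular regression problem with a sliding window. It is a minimal illustration under our own assumptions: the helper `sliding_window` is hypothetical, ordinary least squares stands in for whichever regressor (for example a regression network) would be plugged in, and sktime's actual reduction utilities are not used.

```python
import numpy as np


def sliding_window(y, window_length):
    """Turn a 1-D series into (X, target) pairs for tabular regression."""
    y = np.asarray(y, dtype=float)
    n_rows = len(y) - window_length
    X = np.array([y[i:i + window_length] for i in range(n_rows)])
    target = y[window_length:]
    return X, target


# Toy series in which each value is the sum of the previous two.
y = np.array([1, 1, 2, 3, 5, 8, 13, 21], dtype=float)
X, target = sliding_window(y, window_length=2)

# Any tabular regressor could be fitted on (X, target);
# ordinary least squares stands in for it here.
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

# One-step-ahead forecast from the last observed window.
forecast = float(y[-2:] @ coef)
print(X.shape, target.shape, forecast)
```

Each row of `X` holds the most recent observations and `target` holds the value that followed them, so the regressor never needs to know it is working with a time series.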
Hyper-parameter tuning is done through scikit-learn’s RandomizedSearchCV and GridSearchCV tools. Ensembling methods include ensembling over different random initialisations for stability. These act as wrapper classes around the neural networks and can be used in high-level data pipelines within sktime.
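As a rough illustration of the tuning workflow, the sketch below runs scikit-learn's `GridSearchCV` over a small parameter grid. A plain decision tree on synthetic tabular data stands in for a sktime-dl network, on the assumption (taken from the text above) that the network wrappers follow the scikit-learn estimator interface; the data and the parameter grid are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic tabular data: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] > 0).astype(int)

# Stand-in estimator and grid; a network wrapper would take their place.
param_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of parameter settings instead of trying every combination, which is usually the more practical choice when each fit is as expensive as training a neural network.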
NOTE: sktime-dl is currently not maintained and replies to issues and PRs may be slow. We’re looking for a new maintainer to help us maintain sktime-dl.
EndNotes
There is a separate repository for beginners to learn time series analysis with sktime, with notebooks and video lectures. Sktime-m4 has been created to replicate and extend the M4 study using sktime. The project is under constant development and aims to support real-world, real-time applications as well as advanced research.