Sktime: a Unified Python Library for Time Series Machine Learning

Image by geralt at Pixabay

Solving data science problems with time series data in Python is challenging.

Why? Existing tools are not well suited to time series tasks and do not integrate easily with one another. Methods in the scikit-learn package assume that data is structured in a tabular format and each column is i.i.d., assumptions that do not hold for time series data. Packages containing time series learning modules, such as statsmodels, do not integrate well with each other. Further, many essential time series operations, such as splitting data into train and test sets across time, are not available in existing Python packages.

To address these challenges, sktime was created.

Logo of the sktime library (GitHub: https://github.com/alan-turing-institute/sktime)

sktime is an open-source Python toolbox for machine learning with time series. It is a community-driven project funded by the UK Economic and Social Research Council, the Consumer Data Research Centre, and The Alan Turing Institute.

sktime extends the scikit-learn API to time series tasks. It provides the algorithms and transformation tools needed to solve time series regression, forecasting, and classification tasks efficiently. The library includes dedicated time series learning algorithms and transformation methods not readily available in other common libraries.

sktime was designed to interoperate with scikit-learn, easily adapt algorithms for interrelated time series tasks, and build composite models. How? Many time series tasks are related. An algorithm that can solve for one task can often be re-used to help solve a related one. This idea is called reduction. For example, a model for time series regression (use a series to predict an output value) can be re-used for a time series forecasting task (the predicted output value is a future value).

Mission statement: “sktime enables understandable and composable machine learning with time series. It provides scikit-learn compatible algorithms and model composition tools, supported by a clear taxonomy of learning tasks, with instructive documentation and a friendly community.”

In the rest of this article, I highlight some of the unique features of sktime.

Proper Data Model for Time Series

Sktime uses a nested data structure for time series in pandas data frames.

Each row in a typical data frame contains i.i.d. observations and columns represent different variables. For sktime methods, each cell in the pandas data frame can now contain an entire time series. This format is flexible for multivariate, panel, and heterogeneous data and allows the reuse of methods in both pandas and scikit-learn.

In the table below, each row is an observation that contains a time series array in column X and class value in column y. sktime estimators and transformers can operate on such series.

Native time series data structure, compatible with sktime.
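To make the nested layout concrete, here is a minimal sketch that uses only pandas and NumPy. The column names X and y follow the table above; the values are random numbers I made up for illustration, not the data shown in the article.

```python
import numpy as np
import pandas as pd

# Three toy observations; each one is an entire series of 251 time points.
rng = np.random.default_rng(0)
series = [pd.Series(rng.normal(size=251)) for _ in range(3)]

# Nested layout: one row per observation, one cell per whole series.
nested = pd.DataFrame({"X": series, "y": ["a", "b", "a"]})

first_cell = nested["X"].iloc[0]
print(nested.shape)       # rows x columns of the nested frame
print(type(first_cell))   # the cell itself is a pandas Series
print(len(first_cell))    # number of time points stored in that cell
```

The frame stays at two columns no matter how long each series is, which is the point of the nested format.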

In the next table, each element in the series X has been separated into an individual column, as required by methods in scikit-learn. The dimensionality is quite high — 251 columns! Further, the time-order of the columns is ignored by tabular learning algorithms (but used by time series classification and regression algorithms).

Time series data structure required by scikit-learn.
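For comparison, the flattening step can be sketched in plain pandas and NumPy. sktime ships its own conversion utilities, which I am not using here; the column names t0, t1, ... and the random data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nested = pd.DataFrame({
    "X": [pd.Series(rng.normal(size=251)) for _ in range(3)],
    "y": ["a", "b", "a"],
})

# Stack the equal-length series row by row: one column per time point.
tabular = pd.DataFrame(
    np.vstack([s.to_numpy() for s in nested["X"]]),
    index=nested.index,
    columns=[f"t{i}" for i in range(251)],
)

print(tabular.shape)  # one row per observation, 251 feature columns
```

Note that this only works directly when all series have the same length; the nested format has no such restriction.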

For problems that model multiple co-occurring series together, the native series data structure compatible with sktime is the better fit. Models trained on data in the tabular format expected by scikit-learn would become bogged down in a large number of features.

What can sktime do?

Per the GitHub page, sktime currently provides:

  • State-of-the-art algorithms for time series classification, regression, and forecasting (ported from the Java-based tsml toolkit),
  • Transformers for time series: single-series transformations (e.g. detrending or deseasonalization), series-as-features transformations (e.g. feature extractors), and tools to compose different transformers,
  • Pipelining for transformers and models,
  • Model tuning,
  • Ensembling of models — e.g. a fully customizable random forest for time-series classification and regression; ensembling for multivariate problems.

The sktime API

As mentioned earlier, sktime follows the basic scikit-learn API with fit, predict, and transform class methods.

For estimator (aka model) classes, sktime provides a fit method for model training and a predict method to generate new predictions.

The estimators in sktime extend scikit-learn’s regressors and classifiers to their time series counterparts. Sktime also includes new estimators specific to time series tasks.

For transformer classes, sktime provides fit and transform methods to transform series data. There are several types of transformations available:

  • tabular data transformers like PCA that operate over i.i.d. instances;
  • series-to-primitives transformers that convert the time series in each row into primitive numbers (e.g. feature extraction);
  • series-to-series transformers that convert a series into a different series (e.g. Fourier transform of a series);
  • detrending transformers that return a detrended time series in the same domain as the input series (e.g. seasonal detrending).
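To show what a series-to-primitives transformer with fit and transform methods might look like, here is a small sketch. The class SummaryFeatureExtractor is hypothetical and is not part of sktime; the choice of mean, standard deviation, and slope as features is my own assumption for illustration.

```python
import numpy as np
import pandas as pd


class SummaryFeatureExtractor:
    """Hypothetical series-to-primitives transformer (not an sktime class)."""

    def fit(self, X, y=None):
        # These features need no training; return self so calls can be chained.
        return self

    def transform(self, X):
        rows = []
        for s in X["X"]:
            values = np.asarray(s, dtype=float)
            t = np.arange(len(values))
            slope = np.polyfit(t, values, deg=1)[0]  # least-squares trend
            rows.append(
                {"mean": values.mean(), "std": values.std(), "slope": slope}
            )
        # One row of primitive numbers per input series.
        return pd.DataFrame(rows, index=X.index)


nested = pd.DataFrame({
    "X": [pd.Series([1.0, 2.0, 3.0, 4.0]), pd.Series([5.0, 5.0, 5.0, 5.0])]
})
features = SummaryFeatureExtractor().fit(nested).transform(nested)
print(features)
```

The output is an ordinary tabular frame, so it could be passed straight to any scikit-learn estimator.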

Code Examples

Time Series Forecasting

The following example is an adaptation of the forecasting tutorial on GitHub. The series in this example (the Box-Jenkins airline data set) shows the number of international airline passengers per month from 1949 to 1960.

First, load the data, split it into train and test sets, and plot the result. sktime provides two handy functions for this: temporal_train_test_split for splitting a dataset by time and plot_ys for plotting the train and test series values.
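To illustrate what a split by time means, here is a hand-rolled sketch in pandas. It does not call sktime; temporal_split is a hypothetical helper, and the series is a synthetic stand-in with the same monthly 1949 to 1960 index as the airline data, not the real passenger counts.

```python
import numpy as np
import pandas as pd


def temporal_split(y, test_size):
    """Hold out the last `test_size` observations, preserving time order."""
    return y.iloc[:-test_size], y.iloc[-test_size:]


# Synthetic monthly series standing in for the airline data.
index = pd.period_range("1949-01", "1960-12", freq="M")
y = pd.Series(np.arange(len(index), dtype=float), index=index)

y_train, y_test = temporal_split(y, test_size=36)
print(len(y_train), len(y_test))
print(y_train.index.max(), y_test.index.min())
```

Unlike a random shuffle split, every training observation comes before every test observation, so the model never sees the future.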

Before you create any sophisticated forecasts, it is helpful to compare your forecast to a naïve baseline, since a good model must beat it. sktime provides the NaiveForecaster class, with different “strategies”, to generate baseline forecasts.

The code and chart below demonstrate two naïve forecasts. The forecaster with strategy = “last” always predicts the last observed value of the series. The forecaster with strategy = “seasonal_last” predicts the last value of the series observed in the given season. Seasonality in the example is specified as “sp=12”, or 12 months.
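The logic behind the two strategies can be sketched in a few lines of NumPy. These functions are my own illustration of the idea, not the NaiveForecaster implementation, and the toy series uses a season length of 4 rather than 12 to keep it short.

```python
import numpy as np


def naive_last(y_train, horizon):
    """Repeat the last observed value for every future step."""
    return np.full(horizon, y_train[-1], dtype=float)


def naive_seasonal_last(y_train, horizon, sp):
    """Repeat the last observed full season of length `sp`."""
    last_season = np.asarray(y_train[-sp:], dtype=float)
    return np.array([last_season[h % sp] for h in range(horizon)])


# Two seasons of a toy series with seasonal period 4.
y_train = np.array([10, 20, 30, 40, 12, 22, 32, 42], dtype=float)

pred_last = naive_last(y_train, horizon=3)
pred_seasonal = naive_seasonal_last(y_train, horizon=6, sp=4)
print(pred_last)      # [42. 42. 42.]
print(pred_seasonal)  # [12. 22. 32. 42. 12. 22.]
```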

The next forecast snippet shows how existing sklearn regressors can be adapted to forecasting tasks easily and correctly. Below, the sktime ReducedRegressionForecaster class forecasts the series using the sklearn RandomForestRegressor model. Internally, sktime splits the training data into windows of length 12 for the regressor to train on.
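To show roughly what this reduction does under the hood, here is a NumPy-only sketch. It is not sktime's implementation: the helper names are hypothetical, I swap the random forest for ordinary least squares to avoid extra dependencies, and I assume a recursive strategy in which each prediction is fed back in as the newest lag. The series is synthetic.

```python
import numpy as np


def make_windows(y, window_length):
    """Turn a series into a table: each row holds `window_length` lags,
    and the target is the value that immediately follows them."""
    X = np.array(
        [y[i:i + window_length] for i in range(len(y) - window_length)]
    )
    target = y[window_length:]
    return X, target


def fit_linear(X, target):
    """Ordinary least squares with an intercept (stand-in regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef


def forecast(y, coef, window_length, horizon):
    """Recursive strategy: feed each prediction back in as the newest lag."""
    history = list(y[-window_length:])
    preds = []
    for _ in range(horizon):
        window = np.append(history[-window_length:], 1.0)  # lags + intercept
        preds.append(float(window @ coef))
        history.append(preds[-1])
    return np.array(preds)


# Toy monthly series: linear trend plus a 12-month seasonal pattern.
t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)

X, target = make_windows(y, window_length=12)
coef = fit_linear(X, target)
y_pred = forecast(y, coef, window_length=12, horizon=12)
print(X.shape, y_pred.round(2))
```

Any regressor with fit and predict methods could replace the least-squares step, which is what makes the reduction approach reusable.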

sktime also contains native forecasting methods, such as AutoARIMA.

For a more comprehensive dive into sktime’s forecasting features, check out the sktime forecasting tutorial. To learn about temporal cross validation for forecasting, check out the following article.

Time Series Classification

Finally, sktime can be used to classify time series into different groups.

In the code example below, classifying a single time series is as straightforward as classification in scikit-learn. The only difference is the nested time series data structure discussed above.

Example code borrowed from https://pypi.org/project/sktime/

The data passed into the TimeSeriesForestClassifier.

For a more comprehensive dive into time series classification, check out my article on time series classification linked below and the sktime univariate and multivariate classification tutorials.

Additional sktime Resources

To learn more about sktime, visit the following links for in-depth documentation and examples.
