sktime: A toolbox for data science with time series

Introduction

A time series is a sequence of data points indexed in time order. Time series are a key motif in modern data science and AI, but they introduce complexity wherever they appear. As a result, data science tools for time series usually focus on a specific task or model class. The ‘sktime’ project aims to provide an integrated toolset for easy construction and success control (that is, evaluation) of algorithms in the time series context.

Explaining the science

High-level interface

The ‘sktime’ high-level interface aims to create a unified interface for different learning tasks (partially inspired by the APIs of mlr and openML) through the following two objects:

  1. ‘Task’ objects that encapsulate meta-data from a dataset and the necessary information about the particular supervised learning task, e.g. the instructions on how to derive the target/labels for classification from the data
  2. ‘Strategy’ objects that wrap low-level estimators and provide fit and predict methods, which take data and a task object
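
As a rough illustration of this division of labour, the sketch below implements a toy task and strategy in plain Python. It is not the sktime API: the class names Task, Strategy and MeanCentroidClassifier, their method signatures and the data are all hypothetical, chosen only to show how a task can tell a strategy which column holds the series and which holds the target.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Task:
    """Hypothetical task: names the series column and the target column."""
    series: str
    target: str


class MeanCentroidClassifier:
    """Toy low-level estimator: assigns the class whose average
    series mean is closest to the mean of the new series."""

    def fit(self, X, y):
        grouped = {}
        for series, label in zip(X, y):
            grouped.setdefault(label, []).append(mean(series))
        self.centroids_ = {label: mean(v) for label, v in grouped.items()}
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda c: abs(self.centroids_[c] - mean(s)))
            for s in X
        ]


class Strategy:
    """Hypothetical strategy: wraps an estimator and uses the task
    to pull features and target out of the raw data."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, task, data):
        self.task_ = task
        X = [row[task.series] for row in data]
        y = [row[task.target] for row in data]
        self.estimator.fit(X, y)
        return self

    def predict(self, data):
        X = [row[self.task_.series] for row in data]
        return self.estimator.predict(X)


# Made-up data: each row holds one series and its label.
train = [
    {"signal": [0.1, 0.2, 0.0], "label": "low"},
    {"signal": [0.0, 0.1, 0.2], "label": "low"},
    {"signal": [5.0, 5.2, 4.9], "label": "high"},
    {"signal": [4.8, 5.1, 5.0], "label": "high"},
]
test = [{"signal": [0.3, 0.1, 0.2]}, {"signal": [5.5, 4.7, 5.1]}]

task = Task(series="signal", target="label")
strategy = Strategy(MeanCentroidClassifier()).fit(task, train)
predictions = strategy.predict(test)
```

The estimator never sees column names. Only the strategy consults the task, so the same estimator could be reused with a different task on the same data.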

Low-level interface

The low-level interface extends the standard scikit-learn API to handle time series and panel data. Currently, the package implements:

  • Various state-of-the-art approaches to supervised learning with time series features
  • Transformation of time series, including series-to-series transforms (e.g. Fourier transform) and series-to-primitives transforms, also known as feature extractors (e.g. mean, variance), sub-divided into fittable transforms (fitted on the whole table) and row-wise transforms (applied to each row independently)
  • Pipelining, which allows multiple transformers to be chained with a final estimator
  • Meta-learning strategies including tuning and ensembling, accepting pipelines as the base estimator
  • Off-the-shelf composite strategies, such as a fully customisable random forest for time series classification, with interval segmentation and feature extraction
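
To make the transformer and pipeline ideas above concrete, here is a minimal sketch in plain Python. It does not use sktime or scikit-learn: the functions difference and summary_features and the classes NearestCentroid and Pipeline are hypothetical stand-ins, and the data are made up. The sketch chains a series-to-series transform, a series-to-primitives transform and a final estimator.

```python
from statistics import mean, pvariance


def difference(series):
    """Series-to-series transform: first differences."""
    return [b - a for a, b in zip(series, series[1:])]


def summary_features(series):
    """Series-to-primitives transform: mean and variance of one series."""
    return [mean(series), pvariance(series)]


class NearestCentroid:
    """Toy final estimator working on tabular feature rows."""

    def fit(self, rows, labels):
        grouped = {}
        for row, label in zip(rows, labels):
            grouped.setdefault(label, []).append(row)
        # Column-wise mean of the feature rows in each class.
        self.centroids_ = {
            label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()
        }
        return self

    def predict(self, rows):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: sq_dist(self.centroids_[c], row))
            for row in rows
        ]


class Pipeline:
    """Applies row-wise transforms in order, then the final estimator."""

    def __init__(self, transforms, estimator):
        self.transforms = transforms
        self.estimator = estimator

    def _transform(self, panel):
        for transform in self.transforms:
            panel = [transform(series) for series in panel]
        return panel

    def fit(self, panel, labels):
        self.estimator.fit(self._transform(panel), labels)
        return self

    def predict(self, panel):
        return self.estimator.predict(self._transform(panel))


# Made-up panel: two steadily rising series, two oscillating series.
train_panel = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [0, 4, 0, 4, 0], [1, 5, 1, 5, 1]]
train_labels = ["steady", "steady", "volatile", "volatile"]

pipeline = Pipeline([difference, summary_features], NearestCentroid())
pipeline.fit(train_panel, train_labels)
predicted = pipeline.predict([[10, 11, 12, 13, 14], [3, 7, 3, 7, 3]])
```

Both transforms here are row-wise: they need no fitting and act on each series independently. A fittable transform would additionally learn parameters from the training panel before being applied.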

Project aims

The ‘sktime’ project aims to implement an open source time series toolbox within the PyData ecosystem.

Eventually, the project should support, via a unified interface, multiple different time series related modelling tasks, including:

  • Time series classification and regression
  • Classical forecasting
  • Supervised/panel forecasting
  • Time series segmentation
  • Time-to-event and event risk modelling
  • Unsupervised tasks such as motif discovery and anomaly detection, and diagnostic visualisation
  • On-line and streaming tasks, e.g. variations of the above

Applications

Modelling is a key part of the modern data science and AI workflow. Modelling toolboxes with a unified interface (one task, many solutions, one interface), such as Weka, mlr and scikit-learn, have become a core part of the modern data scientist’s knowledge and tool base.

Typical functionality of AI toolbox packages usually includes:

  1. A unified model specification and model execution interface, for training and applying the models to data
  2. Model composition and model tuning functionality, for manual or automated construction of improved strategies out of simpler ones
  3. Success control functionality checking usefulness of the modelling strategies, often in the form of semi-automated benchmarking and evaluation workflows
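
As one possible illustration of success control for time series, the sketch below compares two baseline forecasters by their one-step-ahead error. It is plain Python, not an sktime workflow: the function names naive_last, naive_mean and rolling_origin_mae, the min_train parameter and the data are assumptions made for this example.

```python
from statistics import mean


def naive_last(history):
    """Forecast the next value as the last observed value."""
    return history[-1]


def naive_mean(history):
    """Forecast the next value as the mean of all observed values."""
    return mean(history)


def rolling_origin_mae(series, forecaster, min_train=3):
    """Mean absolute error of one-step-ahead forecasts.

    Each forecast only sees observations before the point being
    predicted, so no future information leaks into the forecast.
    """
    errors = [
        abs(forecaster(series[:t]) - series[t])
        for t in range(min_train, len(series))
    ]
    return mean(errors)


series = [10, 12, 14, 16, 18, 20, 22]  # made-up trending data
strategies = {"last value": naive_last, "historical mean": naive_mean}

# Same data, same metric, same interface for every strategy.
scores = {name: rolling_origin_mae(series, f) for name, f in strategies.items()}
best = min(scores, key=scores.get)
```

Because every strategy is called in the same way, adding a further candidate to the comparison only requires a new entry in the strategies dictionary.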

Distinct gaps exist in a number of use cases involving time series, and this project aims to address them. A toolbox is, by definition, a tool that facilitates applications broadly. Example use cases include:

  • Time series forecasting: predict tomorrow given today, e.g. extrapolate past observations in financial data to the future, or predict the weather tomorrow given past weather
  • Time series classification: given a time series, assign a label to it, e.g. identify a spoken word from a recorded audio sequence, or identify a type of motion from a video recording
  • Panel data prediction: given some time series, predict the values in others, e.g. predict the healthcare trajectory of a hospital patient, having previously observed other, similar patients
