Analysis by: Rohit Sisir Sahoo and Pratik Satish Hotchandani
Paper Link: Long-term Forecasting with TiDE: Time-series Dense Encoder
Our Implementation Link: GitHub (CS7150 Deep Learning Final Project)
Abstract: Recent work has shown that simple linear models can outperform several Transformer-based approaches in long-term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, Time-series Dense Encoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve a near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer-based model.
Abhimanyu Das
Research Scientist at Google
Prior: Researcher at Microsoft, Yahoo! Labs
PhD from University of Southern California
Bachelors from Indian Institute of Technology, Delhi
Weihao Kong
Research Scientist at Google
Prior: Researcher at University of Washington
PhD from Stanford (2019)
Bachelors from Shanghai Jiao Tong University
Andrew Leach
Machine Learning Engineer at Google Cloud
PhD from University of Arizona
BS from University at Buffalo: The State University of New York
Shaan Mathur
Machine Learning Engineer at Google Cloud
BS and MS from University of California at Los Angeles (UCLA)
Rajat Sen
Research Scientist at Google
PhD from University of Texas at Austin (UT Austin)
Bachelors from Indian Institute of Technology, Kharagpur
Rose Yu
Professor at UC San Diego
Prior: Visiting Researcher at Google Cloud
Prior: Professor at Khoury College of Computer Sciences, Northeastern University (taught CS 7180, CS 6140, CS 7140)
PhD from University of Southern California
BS from Zhejiang University
Models for long-term forecasting can be broadly divided into multivariate models and univariate models.
Multivariate Time Series Forecasting:
Vector Autoregression (VAR) is a fundamental model in econometrics that captures the linear interdependencies among multiple time series. The model considers the history of all variables in the system to predict future values. It's particularly useful in situations where the variables influence each other.
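As a concrete illustration of the idea, the sketch below fits a VAR(1) model, where each series' next value is a linear function of the previous values of all series, using plain least squares. The toy data, variable names, and numpy-only approach are our own illustration, not from the paper.

```python
# Minimal VAR(1) sketch: fit y_t ~ y_{t-1} @ A by least squares (numpy only).
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))            # 200 time steps of 3 interdependent series

X_prev, X_next = Y[:-1], Y[1:]           # pairs (y_{t-1}, y_t)
A, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)   # 3x3 coefficient matrix

y_forecast = Y[-1] @ A                   # one-step-ahead forecast for all 3 series
print(y_forecast.shape)                  # (3,)
```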
The LongTrans model introduces a way to handle long sequences of data without overwhelming computational requirements. It uses an attention mechanism that is specifically designed to be space- and compute-efficient: the "LogSparse" design lets each position attend only to positions at exponentially spaced distances, which captures local patterns in the data while keeping the computational complexity manageable.
The Informer model addresses the problem of processing long sequences in time series data by using a special kind of attention mechanism called ProbSparse self-attention. This method selectively focuses on the most important parts of the data, which allows it to scale well with longer time contexts without a significant increase in computation, hence achieving sub-quadratic complexity.
The Autoformer model decomposes time series data into trend and seasonal components before applying a specialized auto-correlation based attention mechanism. By doing so, the model can focus on long-term dependencies and recurring patterns in the data, which is beneficial for capturing complex temporal dynamics with reduced computational demands.
Univariate Time Series Forecasting:
The Autoregressive (AR) model is a basic time series model that predicts future data based on past values of the same series. The idea is that past data points have a "memory" effect that can help forecast future points. ARIMA stands for Autoregressive Integrated Moving Average. It's an extension of the AR model that includes differencing (to make the data more stationary) and a moving average component to smooth out random fluctuations or noise in the data.
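For a concrete (and hedged) example, the snippet below fits an ARIMA(2, 1, 1) model, i.e. two autoregressive lags, first-order differencing, and one moving-average term, using statsmodels on a synthetic series; the order and the series are chosen purely for illustration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic univariate series: a linear trend plus a smooth cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(300)
series = 0.03 * t + np.sin(t / 12) + rng.normal(scale=0.3, size=300)

model = ARIMA(series, order=(2, 1, 1))   # (p, d, q): AR lags, differencing, MA terms
fitted = model.fit()
forecast = fitted.forecast(steps=24)     # predict the next 24 points
print(forecast[:3])
```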
DeepAR is a probabilistic forecasting model that uses recurrent neural networks (RNNs). It's designed to capture complex patterns in the data by learning from similar time series and can provide uncertainty estimates for its forecasts.
This study examines the effectiveness of transformer models, which are known for their powerful self-attention mechanisms, in the context of time series forecasting. It suggests that simpler linear global univariate models may outperform transformers for long-term forecasting tasks. DLinear is a model that directly learns a linear relationship from past data to future predictions. It challenges the idea that complex models with approximate self-attention mechanisms are necessary, showing that sometimes a simpler, direct linear approach is sufficient.
Instead of looking at the time series data point by point, PatchTST divides the time series into contiguous segments, or "patches." These patches are similar to "words" in the language context for which transformers were originally designed. These patches are then fed into the transformer as if they were tokens. This method allows the model to process segments of the series at a time, capturing the local information within each patch. This approach is shown to be competitive and even superior to more complex models in certain long-term forecasting scenarios.
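The reshaping at the heart of the patching idea is simple; the sketch below splits a look-back window into non-overlapping patches that a transformer could treat as tokens. (The actual PatchTST uses a stride with overlap and a per-patch linear projection; the lengths here are illustrative.)

```python
import torch

L, patch_len = 336, 16                    # look-back length, patch length
series = torch.randn(32, L)               # batch of 32 univariate look-back windows

n_patches = L // patch_len                # 21 patches of 16 points each
patches = series[:, : n_patches * patch_len].reshape(32, n_patches, patch_len)
print(patches.shape)                      # torch.Size([32, 21, 16]) - "tokens" for the transformer
```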
There are N time series in the dataset. The look-back of the i-th time series is denoted by y^(i)_{1:L}, and the horizon by y^(i)_{L+1:L+H}. The task of the forecaster is to predict the horizon time-points given access to the look-back. Static attributes of a time series, denoted by a^(i), are features that do not change with time, such as attributes of a product in retail.
Dense Encoder: TiDE first encodes the past of a time series along with any associated covariates: it stacks and flattens the projected past and future covariates, concatenates them with the static attributes and the past of the time series, and maps the result to an embedding using an encoder built from multiple residual blocks. These covariates can be any external factors or indicators that influence the time series. The encoding is performed by dense MLPs; each layer acts as a transformation that captures increasingly abstract representations of the input. By the end of the encoding phase, the model has distilled the past time series and covariates into a dense hidden representation, a vector of learned features that describe the data's patterns and relationships.
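To make the data flow concrete, here is a rough PyTorch sketch of a residual block and the dense encoder as we read them from the paper. The layer counts, hidden sizes, dropout rate, and variable names (y_past, x_proj, a_static) are our illustrative assumptions, not the authors' exact configuration; see their code for the real thing.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Dense -> ReLU -> Dense with dropout, plus a linear skip and layer norm."""
    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.1, use_norm=True):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.Dropout(dropout),
        )
        self.skip = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim) if use_norm else nn.Identity()

    def forward(self, x):
        return self.norm(self.mlp(x) + self.skip(x))

# Toy shapes: batch B, look-back L, horizon H, projected-covariate dim r_proj,
# static-attribute dim s, hidden width 256.
B, L, H, r_proj, s, hidden = 32, 720, 96, 4, 2, 256
y_past   = torch.randn(B, L)                   # past of the series
x_proj   = torch.randn(B, L + H, r_proj)       # projected covariates over past and horizon
a_static = torch.randn(B, s)                   # static attributes

# Flatten the covariates, concatenate everything, and encode with residual blocks.
enc_in  = torch.cat([y_past, x_proj.flatten(1), a_static], dim=1)
encoder = nn.Sequential(
    ResidualBlock(enc_in.shape[1], hidden, hidden),
    ResidualBlock(hidden, hidden, hidden),
)
e = encoder(enc_in)                            # dense hidden representation, shape (B, 256)
```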
Dense Decoder: The decoder maps the encoded hidden representation into future predictions of the time series. The first decoding unit is a stack of several residual blocks, like the encoder, with the same hidden layer sizes. This phase is responsible for predicting future values based on the extracted features: another set of dense MLPs takes the hidden representation and generates one decoded vector per future time step.
Temporal Decoder: A unique component of TiDE is the temporal decoder. While the dense decoder produces a per-step representation of the forecast, the temporal decoder refines the prediction at each time step by combining it with that step's future covariates. This is crucial because, in real-world scenarios, external influences can cause drastic shifts in time series data; the temporal decoder ensures that these potential future changes are considered, allowing for more accurate and adaptable forecasts. This operation adds a "highway" from the future covariates to the prediction, which can be useful when some covariates have a strong direct effect on a particular time step's actual value. For instance, in retail demand forecasting, a holiday like Mother's Day might strongly affect the sales of certain gift items. Such signals can be lost, or take longer for the model to learn, in the absence of such a highway.
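Continuing the sketch above (re-using ResidualBlock, the encoding e, y_past, x_proj, and the toy shapes), the snippet below renders the dense decoder, the temporal-decoder "highway", and a global linear residual from the look-back to the horizon (one of the residual connections ablated later). The per-step dimension d and everything else remain illustrative assumptions.

```python
d = 8                                          # per-time-step decoded vector size

# Dense decoder: map the encoding to one d-dimensional vector per horizon step.
decoder = nn.Sequential(
    ResidualBlock(hidden, hidden, hidden),
    ResidualBlock(hidden, hidden, H * d),
)
g = decoder(e).reshape(B, H, d)                # (B, H, d)

# Temporal decoder: combine each step's decoded vector with that step's future
# covariates (layer norm is skipped since the output is a single scalar per step).
x_future = x_proj[:, L:, :]                    # (B, H, r_proj)
temporal_decoder = ResidualBlock(d + r_proj, hidden, 1, use_norm=False)
per_step = temporal_decoder(torch.cat([g, x_future], dim=-1)).squeeze(-1)

# Global look-back-to-horizon linear residual, added to give the final forecast.
lookback_skip = nn.Linear(L, H)
y_hat = per_step + lookback_skip(y_past)       # forecast, shape (B, H)
```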
Results:
Multivariate long-term forecasting results for TiDE, with prediction horizon T ∈ {96, 192, 336, 720} for all datasets. The best results, including those that cannot be statistically distinguished from the best mean numbers, are shown in bold. Standard-error intervals for TiDE are computed over 5 runs.
Results from Demand Forecasting Experiment (M5 forecasting competition):
In Figure 2, the bars show how long each model takes to process one batch of data during inference (i.e., when making predictions). TiDE has significantly lower inference times than PatchTST across all look-back lengths, indicating that it is more efficient at making predictions.
In Figure 3, in the time instances following the event, the model without the temporal decoder is thrown off, possibly because it has not yet readjusted its past to what it should have been without the event. Figure 4 shows results for multiple horizon lengths; in all cases, performance improves with increasing context size, as expected. Table 5 compares TiDE (no res), i.e., TiDE with all residual connections removed, against the full model: there is a statistically significant drop in performance without the residual connections.
Positive Impacts:
Negative Impacts:
Integration of TiDE (Time-series Dense Encoder) into Google's Vertex AI
Vertex AI is Google Cloud's unified AI platform for building and deploying machine learning and generative AI applications, spanning ready-made AI solutions, Search and Conversation, and access to more than 100 foundation models.
Other potential Industrial Applications:
A potential future academic research direction could be to analyze Multi-Layer Perceptrons (MLPs) and Transformer architectures (including their non-linearity aspects) under a simple mathematical model for time-series data. The model would simulate time-series data that reflects key characteristics like seasonality (cyclical patterns) and trends (upward or downward movements over time).
It would likely include a variety of scenarios, from simple (e.g., linear trends or regular cycles) to complex (e.g., irregular patterns, abrupt changes).
The model should be flexible enough to vary the level of noise, periodicity, and trend complexity to test the architectures under different data conditions.
This research would aim to quantify the advantages and disadvantages of these architectures for different levels of seasonality and trend in time-series forecasting. The study would also consider the fact that Transformers are generally more parameter-efficient than MLPs, albeit more memory- and compute-intensive. This exploration could provide valuable insights into optimizing these models for specific forecasting scenarios, balancing efficiency and accuracy.
Time-series data often exhibit seasonality (patterns that repeat over known, fixed periods) and trends (long-term increase or decrease in the data). Understanding how MLPs and Transformers capture these elements is crucial for accurate forecasting.
The proposed research would involve creating mathematical models that simulate time-series data with varying levels of seasonality and trend complexities. These models would then be used to test the performance of MLPs and Transformers.
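A minimal sketch of the kind of synthetic generator such a study could use is below: additive trend, seasonality, and noise, each with its own knob. The function name and defaults are hypothetical, purely for illustration.

```python
import numpy as np

def make_series(T=1000, trend_slope=0.01, period=24, season_amp=1.0,
                noise_std=0.5, seed=0):
    """Generate trend + seasonality + noise with adjustable complexity."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    trend = trend_slope * t                                # long-term drift
    season = season_amp * np.sin(2 * np.pi * t / period)   # regular cycle
    noise = rng.normal(scale=noise_std, size=T)            # irregular fluctuations
    return trend + season + noise

# Easy regime (smooth, strongly seasonal) vs. hard regime (noisy, weakly seasonal)
# for benchmarking MLPs against Transformers under controlled conditions.
easy = make_series(noise_std=0.1)
hard = make_series(noise_std=2.0, season_amp=0.3)
```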
Summary: The authors propose a new MLP-based encoder-decoder model for long-term time-series forecasting that combines the simplicity and speed of linear models with the ability to handle covariates and non-linear dependencies. They theoretically prove that their formulation obtains a near-optimal rate for linear dynamical systems, and their empirical results demonstrate that the proposed algorithm matches or outperforms benchmark algorithms on popular time-series benchmarks while having computational advantages in training and inference over state-of-the-art Transformer models.
Strengths:
Weakness:
Questions:
Soundness: 4
Presentation: 3
Contribution: 4
Overall: 8: Strong Accept: Technically strong paper with novel ideas and excellent impact on the time-series domain.
Confidence: 4
Summary: The paper presents an all-MLP model for time-series forecasting called TiDE. First, they apply a linear projection to reduce the input dimensionality of the time series (independently for each time step). Then, they flatten all the inputs and apply a stack of MLP residual blocks. Finally, they concatenate the initial (reduced) features with a reshaped output and apply a decoding step (also an MLP) to get the predictions. They compare TiDE to alternative transformer-based approaches, achieving competitive results.
Strength:
Weakness:
Questions:
Soundness: 3
Presentation: 4
Contribution: 4
Overall: 8: Strong Accept: Technically strong paper with excellent evaluation.
Confidence: 4
Our team successfully implemented the architecture and code provided by Google Research for the TiDE (Time-series Dense Encoder) model. We conducted tests on weather data, verifying the model's performance against expected outcomes. Our initial findings were promising: after just a single training epoch, the TiDE model achieved a Mean Squared Error (MSE) of 0.35. This improved markedly with continued training, dropping to an MSE of 0.23 after six epochs, over a forecast horizon of 96 time points.
Additionally, we extended our testing to include the Australian Beer Dataset. This choice was inspired by a comparative study in the original TiDE paper, where the authors benchmarked the TiDE model against N-HiTS (Neural Hierarchical Interpolation for Time Series Forecasting). This comparative analysis is crucial for understanding the efficacy of TiDE in diverse forecasting scenarios. Our tests aimed to replicate and scrutinize these comparisons, offering a deeper insight into the model's versatility and accuracy across different types of time series data.
Results:
[1] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019.
[2] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[3] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
[4] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
[5] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[6] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023.
[7] Cristian Challu, Kin G. Olivares, Boris N. Oreshkin, Federico Garza, Max Mergenthaler, and Artur Dubrawski. NHITS: Neural Hierarchical Interpolation for Time Series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2023), 2023.
[8] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. Advances in Neural Information Processing Systems, 30, 2017.
1. Rohit Sisir Sahoo (MS Computer Science, Spring 2023, Northeastern University, sahoo.ro@northeastern.edu)
2. Pratik Satish Hotchandani (MS Data Science, Spring 2023, Northeastern University, hotchandani.p@northeastern.edu)