Analysis by: Rohit Sisir Sahoo and Pratik Satish Hotchandani
Paper Link: Long-term Forecasting with TiDE: Time-series Dense Encoder
Our Implementation Link: GitHub (CS7150 Deep Learning Final Project)
Abstract: Recent work has shown that simple linear models can outperform several Transformer-based approaches in long-term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, Time-series Dense Encoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve a near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer-based model.
Abhimanyu Das
Research Scientist at Google
Prior: Researcher at Microsoft, Yahoo! Labs
PhD from University of Southern California
Bachelors from Indian Institute of Technology, Delhi
Weihao Kong
Research Scientist at Google
Prior: Researcher at University of Washington
PhD from Stanford (2019)
Bachelors from Shanghai Jiao Tong University
Andrew Leach
Machine Learning Engineer at Google Cloud
PhD from University of Arizona
BS from University at Buffalo: The State University of New York
Shaan Mathur
Machine Learning Engineer at Google Cloud
BS and MS from University of California at Los Angeles (UCLA)
Rajat Sen
Research Scientist at Google
PhD from University of Texas at Austin (UT Austin)
Bachelors from Indian Institute of Technology, Kharagpur
Rose Yu
Professor at UC San Diego
Prior: Visiting Researcher at Google Cloud
Prior: Professor at Khoury College of Computer Sciences, Northeastern University (taught CS 7180, CS 6140, CS 7140)
PhD from University of Southern California
BS from Zhejiang University
Models for long-term forecasting can be broadly divided into multivariate models and univariate models.
Multivariate Time Series Forecasting:
Vector Autoregression (VAR) is a fundamental model in econometrics that captures the linear interdependencies among multiple time series. The model considers the history of all variables in the system to predict future values. It's particularly useful in situations where the variables influence each other.
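As a concrete illustration of the idea, the sketch below fits a VAR(1) model, where each series' next value is a linear function of the previous values of all series, using plain least squares. The toy data, variable names, and numpy-only approach are our own illustration, not from the paper.

```python
# Minimal VAR(1) sketch: fit y_t ~ y_{t-1} @ A by least squares (numpy only).
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))            # 200 time steps of 3 interdependent series

X_prev, X_next = Y[:-1], Y[1:]           # pairs (y_{t-1}, y_t)
A, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)   # 3x3 coefficient matrix

y_forecast = Y[-1] @ A                   # one-step-ahead forecast for all 3 series
print(y_forecast.shape)                  # (3,)
```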
The LongTrans model introduces a way to handle long sequences of data without overwhelming computational requirements. It uses an attention mechanism that is specifically designed to be space- and compute-efficient: the "LogSparse" design lets each position attend only to positions at exponentially spaced distances, which captures local patterns in the data while keeping the computational complexity manageable.
The Informer model addresses the problem of processing long sequences in time series data by using a special kind of attention mechanism called ProbSparse self-attention. This method selectively focuses on the most important parts of the data, which allows it to scale well with longer time contexts without a significant increase in computation, hence achieving sub-quadratic complexity.
The Autoformer model decomposes time series data into trend and seasonal components before applying a specialized auto-correlation based attention mechanism. By doing so, the model can focus on long-term dependencies and recurring patterns in the data, which is beneficial for capturing complex temporal dynamics with reduced computational demands.
Univariate Time Series Forecasting:
The Autoregressive (AR) model is a basic time series model that predicts future data based on past values of the same series. The idea is that past data points have a "memory" effect that can help forecast future points. ARIMA stands for Autoregressive Integrated Moving Average. It's an extension of the AR model that includes differencing (to make the data more stationary) and a moving average component to smooth out random fluctuations or noise in the data.
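For a concrete (and hedged) example, the snippet below fits an ARIMA(2, 1, 1) model, i.e. two autoregressive lags, first-order differencing, and one moving-average term, using statsmodels on a synthetic series; the order and the series are chosen purely for illustration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic univariate series: a linear trend plus a smooth cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(300)
series = 0.03 * t + np.sin(t / 12) + rng.normal(scale=0.3, size=300)

model = ARIMA(series, order=(2, 1, 1))   # (p, d, q): AR lags, differencing, MA terms
fitted = model.fit()
forecast = fitted.forecast(steps=24)     # predict the next 24 points
print(forecast[:3])
```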
DeepAR is a probabilistic forecasting model that uses recurrent neural networks (RNNs). It's designed to capture complex patterns in the data by learning from similar time series and can provide uncertainty estimates for its forecasts.
This study examines the effectiveness of transformer models, which are known for their powerful self-attention mechanisms, in the context of time series forecasting. It suggests that simpler linear global univariate models may outperform transformers for long-term forecasting tasks. DLinear is a model that directly learns a linear relationship from past data to future predictions. It challenges the idea that complex models with approximate self-attention mechanisms are necessary, showing that sometimes a simpler, direct linear approach is sufficient.
Instead of looking at the time series data point by point, PatchTST divides the time series into contiguous segments, or "patches." These patches are similar to "words" in the language context for which transformers were originally designed. These patches are then fed into the transformer as if they were tokens. This method allows the model to process segments of the series at a time, capturing the local information within each patch. This approach is shown to be competitive and even superior to more complex models in certain long-term forecasting scenarios.
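The reshaping at the heart of the patching idea is simple; the sketch below splits a look-back window into non-overlapping patches that a transformer could treat as tokens. (The actual PatchTST uses a stride with overlap and a per-patch linear projection; the lengths here are illustrative.)

```python
import torch

L, patch_len = 336, 16                    # look-back length, patch length
series = torch.randn(32, L)               # batch of 32 univariate look-back windows

n_patches = L // patch_len                # 21 patches of 16 points each
patches = series[:, : n_patches * patch_len].reshape(32, n_patches, patch_len)
print(patches.shape)                      # torch.Size([32, 21, 16]) - "tokens" for the transformer
```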
There are N time series in the dataset. The look-back of the i-th time series is denoted by y^(i)_{1:L}, and the horizon by y^(i)_{L+1:L+H}. The task of the forecaster is to predict the horizon time-points given access to the look-back. Static attributes of a time series, denoted by a^(i), are features that do not change with time, such as attributes of a product in retail.
Dense Encoder: TiDE first encodes the past of a time series along with any associated covariates: it stacks and flattens the projected past and future covariates, concatenates them with the static attributes and the past of the time series, and maps the result to an embedding using an encoder built from multiple residual blocks. These covariates can be any external factors or indicators that influence the time series. The encoding is performed by dense MLPs; each layer acts as a transformation that captures increasingly abstract representations of the input. By the end of the encoding phase, the model has distilled the past time series and covariates into a dense hidden representation, a vector of learned features that describe the data's patterns and relationships.
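To make the data flow concrete, here is a rough PyTorch sketch of a residual block and the dense encoder as we read them from the paper. The layer counts, hidden sizes, dropout rate, and variable names (y_past, x_proj, a_static) are our illustrative assumptions, not the authors' exact configuration; see their code for the real thing.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Dense -> ReLU -> Dense with dropout, plus a linear skip and layer norm."""
    def __init__(self, in_dim, hidden_dim, out_dim, dropout=0.1, use_norm=True):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.Dropout(dropout),
        )
        self.skip = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim) if use_norm else nn.Identity()

    def forward(self, x):
        return self.norm(self.mlp(x) + self.skip(x))

# Toy shapes: batch B, look-back L, horizon H, projected-covariate dim r_proj,
# static-attribute dim s, hidden width 256.
B, L, H, r_proj, s, hidden = 32, 720, 96, 4, 2, 256
y_past   = torch.randn(B, L)                   # past of the series
x_proj   = torch.randn(B, L + H, r_proj)       # projected covariates over past and horizon
a_static = torch.randn(B, s)                   # static attributes

# Flatten the covariates, concatenate everything, and encode with residual blocks.
enc_in  = torch.cat([y_past, x_proj.flatten(1), a_static], dim=1)
encoder = nn.Sequential(
    ResidualBlock(enc_in.shape[1], hidden, hidden),
    ResidualBlock(hidden, hidden, hidden),
)
e = encoder(enc_in)                            # dense hidden representation, shape (B, 256)
```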
Dense Decoder: The decoder maps the encoded hidden representation into future predictions of the time series. The first decoding unit is a stack of several residual blocks, like the encoder, with the same hidden layer sizes. This phase is responsible for predicting future values based on the extracted features: another set of dense MLPs takes the hidden representation and generates one decoded vector per future time step.
Temporal Decoder: A unique component of TiDE is the temporal decoder. While the dense decoder produces a per-step representation of the forecast, the temporal decoder refines the prediction at each time step by combining it with that step's future covariates. This is crucial because, in real-world scenarios, external influences can cause drastic shifts in time series data; the temporal decoder ensures that these potential future changes are considered, allowing for more accurate and adaptable forecasts. This operation adds a "highway" from the future covariates to the prediction, which can be useful when some covariates have a strong direct effect on a particular time step's actual value. For instance, in retail demand forecasting, a holiday like Mother's Day might strongly affect the sales of certain gift items. Such signals can be lost, or take longer for the model to learn, in the absence of such a highway.
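Continuing the sketch above (re-using ResidualBlock, the encoding e, y_past, x_proj, and the toy shapes), the snippet below renders the dense decoder, the temporal-decoder "highway", and a global linear residual from the look-back to the horizon (one of the residual connections ablated later). The per-step dimension d and everything else remain illustrative assumptions.

```python
d = 8                                          # per-time-step decoded vector size

# Dense decoder: map the encoding to one d-dimensional vector per horizon step.
decoder = nn.Sequential(
    ResidualBlock(hidden, hidden, hidden),
    ResidualBlock(hidden, hidden, H * d),
)
g = decoder(e).reshape(B, H, d)                # (B, H, d)

# Temporal decoder: combine each step's decoded vector with that step's future
# covariates (layer norm is skipped since the output is a single scalar per step).
x_future = x_proj[:, L:, :]                    # (B, H, r_proj)
temporal_decoder = ResidualBlock(d + r_proj, hidden, 1, use_norm=False)
per_step = temporal_decoder(torch.cat([g, x_future], dim=-1)).squeeze(-1)

# Global look-back-to-horizon linear residual, added to give the final forecast.
lookback_skip = nn.Linear(L, H)
y_hat = per_step + lookback_skip(y_past)       # forecast, shape (B, H)
```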
Results:
Multivariate long-term forecasting results for TiDE, with prediction horizon T ∈ {96, 192, 336, 720} for all datasets. The best results, including those that cannot be statistically distinguished from the best mean numbers, are shown in bold. Standard-error intervals for TiDE are computed over 5 runs.
Results from Demand Forecasting Experiment (M5 forecasting competition):
In Figure 2, the bars show how long each model takes to process one batch of data during inference (i.e., when making predictions). TiDE has significantly lower inference times than PatchTST across all look-back lengths, indicating that it is more efficient at making predictions.
In Figure 3, in the time instances following the event, the model without the temporal decoder is thrown off, possibly because it has not yet readjusted its past to what it should have been without the event. Figure 4 shows results for multiple horizon lengths; in all cases, performance improves with increasing context size, as expected. Table 5 compares TiDE (no res), i.e., TiDE with all residual connections removed, against the full model: there is a statistically significant drop in performance without the residual connections.
Positive Impacts:
Negative Impacts:
Integration of TiDE (Time-series Dense Encoder) into Google's Vertex AI
Vertex AI is Google Cloud's unified AI platform for building and deploying machine learning and generative AI applications, spanning ready-made AI solutions, Search and Conversation, and access to more than 100 foundation models.
Other potential Industrial Applications:
A potential future academic research direction could be to analyze Multi-Layer Perceptrons (MLPs) and Transformer architectures (including their non-linearity aspects) under a simple mathematical model for time-series data. The model would simulate time-series data that reflects key characteristics like seasonality (cyclical patterns) and trends (upward or downward movements over time).
It would likely include a variety of scenarios, from simple (e.g., linear trends or regular cycles) to complex (e.g., irregular patterns, abrupt changes).
The model should be flexible enough to vary the level of noise, periodicity, and trend complexity to test the architectures under different data conditions.
This research would aim to quantify the advantages and disadvantages of these architectures for different levels of seasonality and trend in time-series forecasting. The study would also consider the fact that Transformers are generally more parameter-efficient than MLPs, albeit more memory- and compute-intensive. This exploration could provide valuable insights into optimizing these models for specific forecasting scenarios, balancing efficiency and accuracy.
Time-series data often exhibit seasonality (patterns that repeat over known, fixed periods) and trends (long-term increase or decrease in the data). Understanding how MLPs and Transformers capture these elements is crucial for accurate forecasting.
The proposed research would involve creating mathematical models that simulate time-series data with varying levels of seasonality and trend complexities. These models would then be used to test the performance of MLPs and Transformers.
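A minimal sketch of the kind of synthetic generator such a study could use is below: additive trend, seasonality, and noise, each with its own knob. The function name and defaults are hypothetical, purely for illustration.

```python
import numpy as np

def make_series(T=1000, trend_slope=0.01, period=24, season_amp=1.0,
                noise_std=0.5, seed=0):
    """Generate trend + seasonality + noise with adjustable complexity."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    trend = trend_slope * t                                # long-term drift
    season = season_amp * np.sin(2 * np.pi * t / period)   # regular cycle
    noise = rng.normal(scale=noise_std, size=T)            # irregular fluctuations
    return trend + season + noise

# Easy regime (smooth, strongly seasonal) vs. hard regime (noisy, weakly seasonal)
# for benchmarking MLPs against Transformers under controlled conditions.
easy = make_series(noise_std=0.1)
hard = make_series(noise_std=2.0, season_amp=0.3)
```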
Summary: The authors propose a new MLP-based encoder-decoder model for long-term time-series forecasting that combines the simplicity and speed of linear models with the ability to handle covariates and non-linear dependencies. They theoretically prove that their formulation obtains a near-optimal rate for linear dynamical systems, and their empirical results demonstrate that the proposed algorithm matches or outperforms benchmark algorithms on popular time-series benchmarks while having computational advantages in training and inference over state-of-the-art Transformer models.
Strengths:
Weakness:
Questions:
Soundness: 4
Presentation: 3
Contribution: 4
Overall: 8: Strong Accept: Technically strong paper with novel ideas and excellent impact on the time-series domain.
Confidence: 4
Summary: The paper presents an all-MLP model for time-series forecasting called TiDE. First, they apply a linear projection to reduce the input dimensionality of the time series (independently for each time step). Then, they flatten all the inputs and apply a stack of MLP residual blocks. Finally, they concatenate the initial (reduced) features with a reshaped output and apply a decoding step (also an MLP) to get the predictions. They compare TiDE to alternative transformer-based approaches, achieving competitive results.
Strength:
Weakness:
Questions:
Soundness: 3
Presentation: 4
Contribution: 4
Overall: 8: Strong Accept: Technically strong paper with excellent evaluation.
Confidence: 4
Our team successfully implemented the architecture and code provided by Google Research for the TiDE (Time-series Dense Encoder) model. We conducted tests on weather data, verifying the model's performance against expected outcomes. Our initial findings were promising: after just a single training epoch, the TiDE model achieved a Mean Squared Error (MSE) of 0.35. This improved markedly with continued training, dropping to an MSE of 0.23 after six epochs, over a forecast horizon of 96 time points.
Additionally, we extended our testing to include the Australian Beer Dataset. This choice was inspired by a comparative study in the original TiDE paper, where the authors benchmarked the TiDE model against N-HiTS (Neural Hierarchical Interpolation for Time Series Forecasting). This comparative analysis is crucial for understanding the efficacy of TiDE in diverse forecasting scenarios. Our tests aimed to replicate and scrutinize these comparisons, offering a deeper insight into the model's versatility and accuracy across different types of time series data.
Results:
[1] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019.
[2] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[3] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
[4] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
[5] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[6] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023.
[7] Cristian Challu, Kin G. Olivares, Boris N. Oreshkin, Federico Garza, Max Mergenthaler, and Artur Dubrawski. NHITS: Neural Hierarchical Interpolation for Time Series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2023), 2023.
[8] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. Advances in Neural Information Processing Systems, 30, 2017.
1. Rohit Sisir Sahoo (MS Computer Science, Spring 2023, Northeastern University, sahoo.ro@northeastern.edu)
2. Pratik Satish Hotchandani (MS Data Science, Spring 2023, Northeastern University, hotchandani.p@northeastern.edu)