Time Series Forecasting Using Foundation Models
Marco Peixeiro · 2025
In a sentence
A hands-on practitioner's guide to understanding, applying, fine-tuning, and comparing foundation models—from TimeGPT to LLM-based approaches—for time-series forecasting and anomaly detection.
Time Series Forecasting Using Foundation Models demystifies the rapidly evolving world of large time models, guiding readers from first principles through real-world deployment. Author Marco Peixeiro, an active developer of TimeGPT at Nixtla, begins by unpacking the transformer architecture through a forecasting lens, then walks readers through building their own tiny foundation model with N-BEATS so they experience pretraining, transfer learning, and fine-tuning firsthand. From there, the book covers the major open and proprietary foundation forecasting models (TimeGPT, Lag-Llama, Chronos, Moirai, and TimesFM), explaining each model's architecture, pretraining corpus, hyperparameters, and best-suited use cases before applying it to a consistent weekly store-sales benchmark. The book then moves into LLM territory, showing how Flan-T5 and Llama-3.2 can be prompted for forecasting and how Time-LLM adapts LLMs through patch reprogramming and Prompt-as-Prefix. A capstone project ties everything together by comparing all methods, including classical SARIMA, on real blog-traffic data in terms of both accuracy and inference latency, leaving readers with a rigorous, reusable model-selection protocol they can apply to any forecasting problem.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A causal framework describing how design levers (model architecture choices, pretraining corpus properties, fine-tuning decisions, input configuration, and prompt engineering) combine with contextual conditions (dataset characteristics, hardware resources) to shape intermediate psychological and behavioral practitioner states, which in turn drive forecasting accuracy and operational outcomes.
Pretraining Corpus Diversity · design lever
The breadth and heterogeneity of time series used to pretrain a foundation model, encompassing the number of domains, range of sampling frequencies, total data volume (tokens), and variety of temporal patterns (trend types, seasonal structures, noise profiles). Higher diversity exposes the model to more pattern types, expanding its zero-shot generalization range.
Pretraining Horizon Range · design lever
The minimum and maximum forecast horizons used during a foundation model's pretraining phase, which constrain the model's reliable forecast horizon at inference time. Models trained on short horizons tend to degrade when tasked with long-horizon forecasting beyond their training range.
Model Architecture Type · design lever
The structural design choice for the foundation model's backbone, including whether the model uses a full encoder-decoder transformer, encoder-only transformer, decoder-only transformer, or hybrid architecture with components such as mixture-of-experts layers, patching strategies, or residual blocks. Architecture type determines the model's capacity, inference mode (autoregressive vs. single-shot), and suitability for probabilistic versus deterministic output.
Model Parameter Count · design lever
The total number of trainable parameters in the foundation model, which is a proxy for model capacity and directly influences the model's ability to capture complex temporal patterns. Larger parameter counts generally improve performance but increase memory requirements, inference latency, and fine-tuning computational cost.
Patching Strategy · design lever
The method by which raw time-series input is grouped into multi-step tokens (patches) before being fed to the transformer backbone. Patching parameters include patch length, overlap policy (overlapping vs. nonoverlapping), and whether patch length varies by frequency. Effective patching reduces token count, lowers computational burden, and improves local semantic learning by grouping contextually related time steps.
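As a rough illustration of the patch length and overlap policy described above, the following sketch groups a series into patches in plain Python. The function name `make_patches` and the choice to drop an incomplete trailing patch are assumptions made for this example; real models may pad instead, and none of this is taken from a specific model's code.

```python
from typing import List, Sequence


def make_patches(series: Sequence[float], patch_len: int, stride: int) -> List[List[float]]:
    """Group consecutive time steps into patches (multi-step tokens).

    stride == patch_len gives nonoverlapping patches;
    stride < patch_len gives overlapping patches.
    Trailing steps that do not fill a whole patch are dropped
    (an assumption for this sketch; padding is another option).
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [list(series[i:i + patch_len]) for i in range(0, last_start + 1, stride)]


series = list(range(12))  # 12 time steps would be 12 tokens without patching
nonoverlapping = make_patches(series, patch_len=4, stride=4)
overlapping = make_patches(series, patch_len=4, stride=2)

print(len(nonoverlapping), nonoverlapping)  # 3 patches
print(len(overlapping), overlapping)        # 5 patches
```

With a patch length of 4, the transformer sees 3 tokens instead of 12 in the nonoverlapping case, which is where the reduction in computational burden comes from.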
Output Distribution Type · design lever
The mechanism used to generate predictions: either deterministic point forecasts or probabilistic outputs. Probabilistic outputs are based on either a single parametric distribution (e.g., Student's t in Lag-Llama) or a mixture of distributions (e.g., Moirai's four-component mixture). Richer output distributions allow better modeling of asymmetric, skewed, or multimodal prediction uncertainty.
Fine-Tuning Steps / Epochs · design lever
The number of gradient update steps or full-data passes used to adapt a pretrained foundation model to a specific target dataset. Controlling fine-tuning steps balances specialization (lower error on the target task) against overfitting (loss of generalization). Too few steps yield marginal improvement; too many steps can erase pretrained knowledge.
Fine-Tuning Depth · design lever
The proportion or subset of model parameters that are updated during fine-tuning, ranging from updating only the final output layers to updating all parameters. Greater fine-tuning depth increases specialization to the target dataset but also increases risk of overfitting and requires more computation per fine-tuning step.
Context Length (Input Window Length) · design lever
The number of historical time steps provided to the model at inference time. Longer context windows expose the model to more historical patterns and can improve forecasting accuracy, especially for series with long seasonal cycles or slowly evolving trends. However, exceeding the model's maximum supported context length can cause degradation or truncation.
Exogenous Feature Quality · contextual condition
The degree to which externally provided covariates are informative, accurately measured (or accurately forecasted for future steps), and causally related to the target series. High-quality features include calendar indicators known with certainty (e.g., holiday flags); low-quality features include predicted covariates whose own forecast errors propagate into the target forecast.
Dataset Frequency · contextual condition
The temporal resolution at which the target time series is recorded, such as per second, per minute, hourly, daily, weekly, monthly, or yearly. Frequency determines the dominant seasonal cycles, the magnitude of high-frequency noise, and whether the model's pretraining corpus included similar frequencies, which is a primary moderator of zero-shot generalization quality.
Dataset Stationarity and Trend Strength · contextual condition
The degree to which the target time series exhibits a stable mean and variance (stationary) versus a strong directional trend (nonstationary). Strong trends challenge models like Chronos that use fixed-vocabulary tokenization with bounded bin ranges, causing predictions to plateau rather than extrapolate the trend. Stationarity is also relevant to the validity of mean-scaling in Chronos.
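The plateau effect can be seen in a toy version of mean scaling followed by bounded binning. This is a simplified sketch and not Chronos's actual implementation: the bin range (-3 to 3), the bin count (12), and all function names are hypothetical values chosen so the arithmetic is easy to follow.

```python
from typing import List, Sequence, Tuple

LOW, HIGH, N_BINS = -3.0, 3.0, 12  # hypothetical range and vocabulary size


def mean_scale(context: Sequence[float]) -> Tuple[List[float], float]:
    """Divide the context by its mean absolute value."""
    scale = sum(abs(x) for x in context) / len(context)
    scale = scale if scale > 0 else 1.0  # guard against an all-zero context
    return [x / scale for x in context], scale


def to_bin(value: float) -> int:
    """Map a scaled value to a bin index; values outside the range are clipped."""
    clipped = min(max(value, LOW), HIGH)
    width = (HIGH - LOW) / N_BINS
    return min(int((clipped - LOW) / width), N_BINS - 1)


def from_bin(index: int) -> float:
    """Return the center of a bin in scaled units."""
    width = (HIGH - LOW) / N_BINS
    return LOW + (index + 0.5) * width


context = [10.0, 20.0, 30.0, 40.0]  # upward trend
_, scale = mean_scale(context)      # scale is 25.0

for future in (60.0, 100.0, 200.0):
    token = to_bin(future / scale)
    decoded = from_bin(token) * scale
    print(f"true={future:6.1f}  token={token:2d}  decoded={decoded:6.2f}")
```

In this toy setup, both 100 and 200 fall above the top of the bin range after scaling, so they map to the same top token and decode to the same value. A representable ceiling of this kind is one way a strong trend can turn into a flat forecast.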
Prompt Engineering Quality · design lever
For LLM-based forecasting approaches, the degree to which the input prompt is structured to elicit accurate and consistently formatted numerical predictions. High-quality prompting includes appropriate few-shot examples, chain-of-thought reasoning steps, explicit output format instructions, and contextual description of the time series domain (as in Time-LLM's Prompt-as-Prefix).
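A minimal sketch of how few-shot examples, a domain description, and an explicit output format instruction might be combined into one prompt, together with a strict parser for the reply. The prompt wording and the names `build_prompt` and `parse_forecast` are hypothetical and are not taken from the book; no LLM is called here.

```python
from typing import List, Sequence, Tuple


def _fmt(values: Sequence[float]) -> str:
    """Render numbers compactly, e.g. 112.0 -> '112'."""
    return ", ".join(f"{v:g}" for v in values)


def build_prompt(
    history: Sequence[float],
    horizon: int,
    domain: str,
    examples: Sequence[Tuple[Sequence[float], Sequence[float]]],
) -> str:
    """Assemble a few-shot forecasting prompt (hypothetical wording)."""
    lines = [f"You are forecasting {domain}."]
    for past, future in examples:  # few-shot demonstrations
        lines.append(f"History: {_fmt(past)}")
        lines.append(f"Forecast: {_fmt(future)}")
    lines.append(f"History: {_fmt(history)}")
    lines.append(f"Reply with exactly {horizon} comma-separated numbers and nothing else.")
    lines.append("Forecast:")
    return "\n".join(lines)


def parse_forecast(response: str, horizon: int) -> List[float]:
    """Parse the last non-empty line of the reply; fail loudly on a format violation."""
    non_empty = [line for line in response.splitlines() if line.strip()]
    if not non_empty:
        raise ValueError("empty response")
    values = [float(token) for token in non_empty[-1].split(",")]
    if len(values) != horizon:
        raise ValueError(f"expected {horizon} values, got {len(values)}")
    return values


prompt = build_prompt(
    history=[112, 118, 132],
    horizon=2,
    domain="weekly store sales",
    examples=[([100, 104, 109], [113, 118])],
)
print(prompt)
print(parse_forecast("129, 121", horizon=2))
```

Failing on a malformed reply, rather than silently extracting whatever numbers appear, makes inconsistent formatting visible during evaluation.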
Patch Reprogramming Training (Time-LLM specific) · design lever
The process of training the patch-reprogramming and output-projection components of Time-LLM while keeping the LLM backbone frozen, enabling the model to translate time-series patches into textual prototypes the LLM understands. Longer and more targeted reprogramming training improves the alignment between the numeric and text modalities, yielding better forecasts.
Available Hardware Resources · contextual condition
The computational resources available to the practitioner for model loading, inference, and fine-tuning, including GPU availability, GPU memory, CPU RAM, and storage. Hardware resources act as a binding constraint on model size selection, inference latency, and fine-tuning feasibility, moderating the relationship between model parameter count and achievable performance.
Zero-Shot Generalization Capability · psychological state
The model's ability to produce accurate forecasts on datasets and scenarios it has never seen during pretraining, without any fine-tuning. This is an emergent property arising from large, diverse pretraining corpora and expressive architectures. It is the primary value proposition of foundation forecasting models and the first thing practitioners test before investing in fine-tuning.
Model Specialization to Target Dataset · psychological state
The degree to which the model's parameters have been adapted to the distributional properties of the specific target dataset, achieved through fine-tuning. Higher specialization reduces systematic bias and variance in forecasts for the target task but may reduce the model's ability to generalize to other datasets or time periods.
Uncertainty Quantification Accuracy · psychological state
The degree to which the model's prediction intervals are well-calibrated, meaning that the specified coverage level (e.g., 80%, 99%) matches the empirical fraction of true values falling within the interval. Well-calibrated intervals enable reliable anomaly detection and risk-informed decision-making. Miscalibrated intervals cause errors in both directions: intervals that are too narrow produce false positives, and intervals that are too wide miss true anomalies.
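Calibration in this sense can be checked by comparing the nominal coverage level with the empirical fraction of observations that fall inside the interval. The sketch below assumes the interval bounds have already been produced by some model; the numbers are made up for illustration and `empirical_coverage` is a hypothetical helper name.

```python
from typing import Sequence


def empirical_coverage(
    actuals: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Fraction of actual values that fall inside [lower, upper]."""
    if not actuals or not (len(actuals) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    inside = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return inside / len(actuals)


# Illustrative values only: a nominal 80% interval evaluated on 10 points.
nominal = 0.80
actuals = [10, 12, 11, 15, 9, 14, 13, 20, 11, 12]
lower = [8, 10, 9, 10, 8, 11, 10, 10, 9, 13]
upper = [12, 14, 13, 14, 12, 15, 15, 16, 13, 16]

coverage = empirical_coverage(actuals, lower, upper)
print(f"nominal={nominal:.2f}  empirical={coverage:.2f}")
```

Here 7 of the 10 points fall inside the interval, so the empirical coverage sits below the nominal level, which would suggest intervals that are too narrow. Ten points is far too few to draw a conclusion in practice.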
Inference Latency · outcome metric
The wall-clock time required to complete forecasting inference across a defined set of cross-validation windows under given hardware conditions. Inference latency is a direct operational cost that determines whether a foundation model is viable in latency-sensitive production pipelines. It is influenced by model size, hardware, whether the model is accessed via API (offloading compute) or run locally, and the complexity of the inference procedure.
Forecast Accuracy (MAE and sMAPE) · outcome metric
The primary outcome measuring how closely the model's point forecasts match actual observed values, quantified by mean absolute error (MAE) and symmetric mean absolute percentage error (sMAPE) computed over multiple cross-validation windows. Lower values indicate better accuracy. sMAPE is preferred when comparing across series with different scales; MAE is preferred when interpretability in original units is needed.
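The two metrics can be sketched in a few lines. Several sMAPE variants are in use; the version below (0 to 200 scale, with zero-denominator terms counted as zero error) is one common definition and is an assumption here, since the summary does not state which variant the book uses.

```python
from typing import Sequence


def mae(actuals: Sequence[float], forecasts: Sequence[float]) -> float:
    """Mean absolute error, in the original units of the series."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def smape(actuals: Sequence[float], forecasts: Sequence[float]) -> float:
    """Symmetric MAPE in percent (0 to 200); one common variant, assumed here."""
    total = 0.0
    for a, f in zip(actuals, forecasts):
        denom = abs(a) + abs(f)
        if denom > 0:  # treat 0/0 as zero error
            total += 2.0 * abs(a - f) / denom
    return 100.0 * total / len(actuals)


# The same relative errors at two different scales.
small_actuals, small_forecasts = [100.0, 200.0], [110.0, 180.0]
large_actuals = [10 * a for a in small_actuals]
large_forecasts = [10 * f for f in small_forecasts]

print(mae(small_actuals, small_forecasts), mae(large_actuals, large_forecasts))
print(smape(small_actuals, small_forecasts), smape(large_actuals, large_forecasts))
```

MAE grows tenfold with the scale of the series while sMAPE stays the same, which is why sMAPE is the more natural choice for cross-series comparison.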
Anomaly Detection Performance (F1, Precision, Recall) · outcome metric
The effectiveness of the model in correctly identifying anomalous data points in a time series, measured by precision (fraction of flagged points that are true anomalies), recall (fraction of true anomalies that are flagged), and F1 score (the harmonic mean of precision and recall). Performance depends on calibration of prediction intervals, interval width (narrower intervals detect more anomalies but increase false positives), and the model's forecasting accuracy on normal data.
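The link between prediction intervals and these metrics can be sketched as follows: a point is flagged when it falls outside its interval, and the flags are then scored against known labels. The data, the fixed interval bounds, and the function names are illustrative assumptions, not output from any model.

```python
from typing import List, Sequence, Tuple


def flag_anomalies(
    actuals: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> List[bool]:
    """Flag every point that falls outside its prediction interval."""
    return [not (lo <= y <= hi) for y, lo, hi in zip(actuals, lower, upper)]


def precision_recall_f1(
    flags: Sequence[bool], truth: Sequence[bool]
) -> Tuple[float, float, float]:
    """Score flags against ground-truth labels; undefined ratios return 0.0."""
    tp = sum(1 for f, t in zip(flags, truth) if f and t)
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative data: a constant interval of [5, 20] around eight observations.
actuals = [10, 30, 11, 12, 25, 10, 2, 13]
lower = [5] * len(actuals)
upper = [20] * len(actuals)
truth = [False, True, False, False, False, False, True, True]

flags = flag_anomalies(actuals, lower, upper)
precision, recall, f1 = precision_recall_f1(flags, truth)
print(flags)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example there are two true positives, one false positive, and one missed anomaly. Narrowing the interval would tend to raise recall at the expense of precision.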
Practitioner Model Selection Quality · outcome metric
The degree to which the practitioner chooses the model best suited to their specific forecasting task, considering model capabilities (frequency support, exogenous features, horizon limits, probabilistic vs. deterministic output), available hardware, latency requirements, and cost. Higher selection quality results in better downstream forecast accuracy relative to the best achievable on the task.
How they connect
- pretraining corpus diversity → predicts zero shot generalization
- pretraining horizon range → moderates zero shot generalization
- model parameter count → predicts zero shot generalization
- model architecture type → predicts uncertainty quantification accuracy
- patching strategy → predicts forecast accuracy
- fine tuning steps → predicts model specialization
- fine tuning depth → predicts model specialization
- model specialization → predicts forecast accuracy
- context length → predicts forecast accuracy
- exogenous feature quality → predicts forecast accuracy
- dataset frequency → moderates zero shot generalization
- dataset stationarity → negatively moderates forecast accuracy
- prompt engineering quality → predicts forecast accuracy
- patch reprogramming training → predicts model specialization
- hardware resources → negatively predicts inference latency
- hardware resources → moderates model parameter count
- model parameter count → predicts inference latency
- zero shot generalization → predicts forecast accuracy
- uncertainty quantification accuracy → predicts anomaly detection performance
- forecast accuracy → influences practitioner model selection quality
- inference latency → negatively influences practitioner model selection quality
The story
The reader
A data scientist or ML practitioner who already knows classical forecasting methods (ARIMA, seasonal models) and Python, and wants to understand and deploy the new generation of foundation models for time series without getting lost in research-paper mathematics.
External problem
They need accurate, production-ready forecasts across diverse datasets but face a fragmented landscape of new foundation models—each with different APIs, data formats, pretraining assumptions, and capabilities—making it hard to know which to use or how to use it well.
Internal problem
They feel overwhelmed and behind, worried that colleagues or competitors are already leveraging powerful pretrained models while they are still hand-tuning ARIMA parameters, and anxious about wasting time on a model that turns out to be wrong for their use case.
Philosophical problem
It is wrong that powerful forecasting technology exists but remains inaccessible to most practitioners because no single resource explains the landscape clearly, shows how the models actually work under the hood, and provides honest, reproducible benchmarks.
The plan
- Understand the transformer architecture from a forecasting perspective and internalize key concepts (embedding, positional encoding, attention, autoregression).
- Build a tiny foundation model with N-BEATS to experience pretraining, transfer learning, and fine-tuning firsthand and appreciate the challenges at scale.
- Master each major foundation forecasting model (TimeGPT, Lag-Llama, Chronos, Moirai, TimesFM) by studying its architecture, pretraining protocol, and optimal use cases, then applying it to a consistent benchmark dataset.
- Learn to fine-tune each model safely, controlling depth and steps to improve accuracy without overfitting.
- Understand when and how to include exogenous features, and how to handle features whose future values must be predicted.
- Explore LLM-based forecasting through prompting (Flan-T5, Llama-3.2) and reprogramming (Time-LLM), understanding both the possibilities and the limitations.
- Design and execute a rigorous cross-validation evaluation protocol to compare all models honestly on a new dataset, considering both accuracy and inference latency, and select the best model for the scenario.
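The last step of the plan, evaluating over several cross-validation windows while tracking inference time, could look roughly like the sketch below. It uses a naive last-value forecaster as a stand-in for any model; the function names, the window layout (consecutive, nonoverlapping test windows at the end of the series), and the synthetic data are all assumptions for illustration rather than the book's protocol.

```python
import time
from typing import Callable, List, Sequence, Tuple

Forecaster = Callable[[Sequence[float], int], List[float]]


def rolling_windows(n_obs: int, horizon: int, n_windows: int) -> List[Tuple[int, int]]:
    """Return (cutoff, end) index pairs for consecutive test windows at the end of the series."""
    first_cutoff = n_obs - horizon * n_windows
    if first_cutoff <= 0:
        raise ValueError("series too short for the requested windows")
    return [(cutoff, cutoff + horizon) for cutoff in range(first_cutoff, n_obs, horizon)]


def naive_last_value(train: Sequence[float], horizon: int) -> List[float]:
    """Placeholder model: repeat the last observed value."""
    return [train[-1]] * horizon


def cross_validate(
    series: Sequence[float], forecaster: Forecaster, horizon: int, n_windows: int
) -> Tuple[List[float], float]:
    """Return the MAE of each window and the total wall-clock time in seconds."""
    maes: List[float] = []
    start = time.perf_counter()
    for cutoff, end in rolling_windows(len(series), horizon, n_windows):
        forecast = forecaster(series[:cutoff], horizon)  # only data before the cutoff
        actual = series[cutoff:end]
        maes.append(sum(abs(a - f) for a, f in zip(actual, forecast)) / horizon)
    elapsed = time.perf_counter() - start
    return maes, elapsed


series = [float(t) for t in range(20)]  # synthetic linear trend
maes, seconds = cross_validate(series, naive_last_value, horizon=4, n_windows=3)
print(rolling_windows(len(series), 4, 3))
print(maes, f"{seconds:.6f}s")
```

Any model can be compared under the same windows by wrapping it in a function with the `Forecaster` signature, so that accuracy and latency are measured on an equal footing.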
Success
- The reader can confidently select, configure, fine-tune, and evaluate any major foundation forecasting model for a new business problem in hours rather than weeks.
- The reader produces more accurate forecasts faster, using pretrained models as powerful baselines instead of starting from scratch every time.
- The reader has a reusable, rigorous experimental protocol for model comparison that earns credibility with stakeholders.
- The reader understands each model's limitations (horizon caps, frequency constraints, exogenous feature support) and avoids costly deployment mistakes.
- The reader can read new foundation model papers, map the architecture to familiar concepts, and implement the model without waiting for a tutorial.
At stake
- The reader continues spending weeks building and tuning data-specific models for each new dataset, missing accuracy improvements and shipping forecasts late.
- The reader blindly adopts a foundation model without understanding its pretraining assumptions, leading to poor production performance and loss of confidence in ML.
- The reader misuses exogenous features (providing predicted rather than known-future values), silently inflating forecast errors.
- The reader relies on a single test window for model evaluation, reaching wrong conclusions about which model is best and making suboptimal deployment decisions.