
FinFusion

Temporal Fusion Transformers for S&P 500 return forecasting, with 450+ experiments and documented negative results.

59.1%

Directional accuracy · weekly · 9-fold rolling

450+

Experiments across 11 phases

±8.3pp

Rolling-window standard deviation

PyTorch Lightning · pytorch-forecasting · Python · FRED API · yfinance

Overview

FinFusion systematically characterises Temporal Fusion Transformers (TFTs) as forecasters of S&P 500 returns. It benchmarks the architecture against ARIMAX and a 3-layer LSTM across 11 experimental phases and 450+ training runs, using a 9-fold rolling evaluation spanning 2016 through 2024.

The headline finding is negative. The TFT encoder learns interpretable regime structure, but the output layer fails to translate that structure into stable next-step predictions, a gradient-collapse failure mode documented in detail across the repository.

Research question

Temporal Fusion Transformers set state-of-the-art results on many multivariate, quasi-stationary forecasting benchmarks. Equity-index returns are neither: the signal is swamped by microstructure noise, regimes shift abruptly around macro events, and the near-random-walk behaviour of returns makes even marginal directional accuracy hard to achieve.

The question this project asks is straightforward: do TFTs generalise to this setting, and if not, where specifically do they fail? The evaluation was designed to answer that without cherry-picking. Rather than a single fixed split, every configuration is scored across a 9-fold rolling window that advances one year per fold, training on roughly five years and testing on the next.
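The rolling-window construction described above can be sketched as follows. `rolling_folds` and its year-based arguments are illustrative names, not the repository's actual API:

```python
import pandas as pd

def rolling_folds(dates, train_years=5, test_years=1, n_folds=9):
    """Yield boolean (train, test) masks over a date Series, advancing
    the window one year per fold: ~5 years of training, 1 year of test.

    Hypothetical sketch; the repository's split logic may differ in detail.
    """
    start = int(dates.dt.year.min())
    for k in range(n_folds):
        tr_lo = start + k                # first training year of this fold
        tr_hi = tr_lo + train_years      # first test year (exclusive for training)
        te_hi = tr_hi + test_years
        train = (dates.dt.year >= tr_lo) & (dates.dt.year < tr_hi)
        test = (dates.dt.year >= tr_hi) & (dates.dt.year < te_hi)
        yield train, test
```

With data starting in 2011, fold 0 trains on 2011–2015 and tests on 2016, while fold 8 tests on 2024, matching the 2016–2024 evaluation span.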

Data

Index returns are computed on S&P 500 levels at daily and weekly frequencies. Macro conditioning includes CPI, unemployment, the 10-year Treasury yield, VIX, and related FRED series joined on aligned dates. Market data (open, high, low, close, volume) is pulled from yfinance.
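One way the return construction could look, assuming simple returns and Friday-aligned weekly bars (neither of which the text above pins down):

```python
import pandas as pd

def index_returns(close, freq="W"):
    """Simple returns on S&P 500 closing levels at daily ('D') or
    weekly ('W') frequency. Weekly bars use the last close of each
    Friday-ending week. Sketch only; the repository may use log
    returns or a different weekly alignment.
    """
    if freq == "W":
        close = close.resample("W-FRI").last()
    return close.pct_change().dropna()
```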

Each fold is trained and evaluated independently, with no information leakage across folds. Feature engineering, scaling, and horizon construction are fit on the training portion of the current fold only.
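A minimal sketch of the leakage-free preprocessing: statistics come from the training slice only and are then applied unchanged to the test slice. `FoldScaler` is an illustrative name, not the project's class:

```python
import numpy as np

class FoldScaler:
    """Standardise features with statistics fitted on the training fold
    only, so nothing from the test period leaks into preprocessing."""

    def fit(self, X_train):
        self.mean_ = X_train.mean(axis=0)
        self.std_ = X_train.std(axis=0) + 1e-8  # guard against constant columns
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_
```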

Methodology

Baselines span three model families. ARIMAX provides a classical benchmark with exogenous macro inputs. A 3-layer LSTM provides a neural baseline of comparable capacity but without attention. TFT variants include both the stock pytorch-forecasting implementation and custom extensions: regime-aware attention heads that condition on discrete regime labels, MSE-variance hybrid losses, and checkpoint-selection procedures that prioritise prediction diversity over raw validation loss.

Experiments are organised into 11 phases, each structured around a hypothesis. For example, one phase tested whether multi-horizon forecasting improves single-step accuracy (it degrades it). Another tested whether loss weighting fixes the gradient-collapse failure (it does not). Every run is logged with its configuration so that conclusions rest on comparisons within a phase rather than on cherry-picked numbers.

Results

The best single configuration is the weekly TFT at 59.1 ± 8.3% directional accuracy over the 9-fold rolling window, approximately one percentage point below the naive always-long baseline over the same period. The daily TFT trails at 53.3 ± 5.2%. Multi-horizon variants (h = 10) degrade to -3.6 pp of excess directional accuracy in rolling validation.
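Directional accuracy here means sign agreement between predicted and realised returns. A minimal version (tie-handling is a convention choice):

```python
import numpy as np

def directional_accuracy(y_true, y_pred):
    """Fraction of periods where the predicted return has the same sign
    as the realised return. Exact zeros agree only with other zeros
    under np.sign; conventions vary."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))
```

The always-long baseline corresponds to `y_pred` being all ones, so its accuracy is simply the fraction of up periods.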

Validation loss is anti-correlated with directional accuracy at r = -0.46. In plain terms, conventional early stopping on MSE selects checkpoints that predict worse in directional terms. The project documents a checkpoint-selection procedure based on prediction diversity that recovers a portion of the gap.
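The diversity-based selection can be sketched as picking, among saved checkpoints, the one with the most variable validation predictions rather than the lowest validation loss. The dict keys here are assumptions, not the repository's schema:

```python
import numpy as np

def select_by_diversity(checkpoints):
    """Pick the checkpoint whose validation predictions vary the most,
    instead of the one with the lowest validation MSE. A sketch of the
    idea above; the actual criterion may combine both signals."""
    return max(checkpoints, key=lambda c: float(np.var(c["val_preds"])))
```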

Every model family fails during the 2022 Federal Reserve tightening cycle, collapsing to approximately 40% directional accuracy regardless of architecture, horizon, or loss. This is the single most important result. The failure is structural, not a matter of model capacity or regularisation.

The gradient-collapse finding

Attribution analysis on the TFT shows the encoder successfully learns interpretable regime structure. Attention weights cluster around known macro events. Variable-selection networks up-weight VIX and the yield curve exactly when you would expect. What fails is the output layer: gradients from the MSE term dominate the gradients from any regime-aware auxiliary signal, and the head collapses toward a near-mean predictor.
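The imbalance can be made concrete on a linear stand-in for the output head: compare the gradient norm of the MSE term against that of a variance-encouraging auxiliary term (standing in for the regime-aware signal). Everything here is illustrative, not the repository's code:

```python
import numpy as np

def head_grad_norms(W, X, y):
    """Gradient norms on a linear head pred = X @ W for (a) the MSE term
    and (b) an auxiliary term -Var(pred) that rewards prediction spread,
    echoing the MSE-variance hybrid losses above. When the MSE norm
    dominates by orders of magnitude, the head drifts toward predicting
    the mean. Illustrative diagnostic only."""
    n = len(y)
    pred = X @ W
    g_mse = (2.0 / n) * X.T @ (pred - y)             # d/dW of mean((pred - y)^2)
    g_aux = (-2.0 / n) * X.T @ (pred - pred.mean())  # d/dW of -Var(pred)
    return np.linalg.norm(g_mse), np.linalg.norm(g_aux)
```

Logging this ratio per step during training is one cheap way to watch for the collapse as it happens.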

Interventions tried across later phases (auxiliary heads, frozen-encoder fine-tuning, regime-gated outputs) produce partial improvements on training folds but do not transfer to rolling validation. The project stops short of claiming the architecture cannot work on financial returns; it claims that without an explicit mechanism to prevent the collapse, the output layer will keep swallowing the regime signal the encoder extracted.

Why it matters

Published financial deep-learning work rarely reports rolling-window evaluation and almost never reports negative results at this granularity. FinFusion establishes that off-the-shelf TFTs do not transfer to S&P 500 return prediction in a deployable way, characterises the specific failure mode, and documents the interventions that do and do not help. That record is useful to anyone considering this architecture for financial time series.

Tech stack

PyTorch Lightning
Training loop, checkpointing, deterministic rolling-fold evaluation.
pytorch-forecasting
Stock TFT implementation and variable-selection networks.
FRED API
Macro conditioning series (CPI, unemployment, 10Y yield, VIX).
yfinance
Daily and weekly S&P 500 OHLCV data.
pandas / NumPy
Feature engineering and rolling-window split logic.
matplotlib / seaborn
Per-fold evaluation plots and attribution figures.

References

  1. Lim, B., Arik, S. O., Loeff, N., & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting.
  2. Federal Reserve Economic Data (FRED), Federal Reserve Bank of St. Louis.