Logistics & Supply Chain • June 22, 2026 • 10 min read

Temporal Fusion Transformer vs. LSTM: Pick One for Demand Forecasting

For: A senior ML engineer at a Series B logistics or e-commerce company who owns the demand forecasting pipeline, runs an LSTM in production that degrades at every quarter-end spike, and has been handed a Temporal Fusion Transformer paper by their CTO to 'evaluate'.

Your LSTM degrades every quarter-end. Your CTO drops a Temporal Fusion Transformer paper on your desk. You read three blog posts comparing the two, and every one of them quotes RMSE on the M5 competition dataset, calls TFT 'state-of-the-art,' and waves at interpretability. None of them tell you whether your team should actually swap architectures, or what breaks when you try.

This is the comparison I wish someone had handed me. Two architectures, the dimensions that matter when you run forecasting at SKU scale, and an honest read on where each one fails.

Why your LSTM degrades at quarter-end (and it's not capacity)

The standard diagnosis is wrong. People assume LSTM struggles with quarter-end because of long-range dependencies or vanishing gradients. That's rarely the actual failure mode in a well-tuned demand model.

The real issue: LSTM encodes time implicitly through sequence order. Any signal that says this week is structurally different — a promotion, a holiday shift, a carrier capacity constraint, a known stockout upstream — has to be learned through recurrence. You can feed those covariates in as inputs, but the model has no native mechanism to say 'this future timestep is special, weight it differently.' It learns the average response and smooths over the spike.

Temporal Fusion Transformer's structural advantage isn't accuracy. It's that known-future inputs (promotional calendars, holiday flags, planned capacity) get their own pathway into the decoder, so the model sees them at the exact timesteps they apply to and can learn how heavily to weight them there. That's the architectural fit for quarter-end behavior. Not 'transformers are better.'

Hold that distinction in your head. It changes how you evaluate the tradeoff.
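To make the distinction concrete, here is a minimal sketch of the data contract, not of either model. The names `build_window`, `observed`, and `known_future` are hypothetical, and the numbers are made up. The point it illustrates: at forecast time, observed inputs stop at the forecast origin, while known-future inputs extend across the whole horizon.

```python
from typing import Dict, List


def build_window(
    observed: Dict[str, List[float]],      # e.g. past sales; unknown after the origin
    known_future: Dict[str, List[float]],  # e.g. promo or holiday flags; known for every step
    origin: int,                           # index of the first step to forecast
    encoder_len: int,
    horizon: int,
) -> Dict[str, Dict[str, List[float]]]:
    """Split inputs into an encoder slice (history) and a decoder slice (future)."""
    start = origin - encoder_len
    if start < 0:
        raise ValueError("not enough history for the requested encoder length")

    # History window: both observed and known-future inputs are available here.
    encoder = {name: vals[start:origin] for name, vals in observed.items()}
    encoder.update({name: vals[start:origin] for name, vals in known_future.items()})

    # Forecast window: only known-future inputs exist beyond the origin.
    decoder = {name: vals[origin:origin + horizon] for name, vals in known_future.items()}
    for name, vals in decoder.items():
        if len(vals) < horizon:
            raise ValueError(f"{name} does not cover the full horizon")
    return {"encoder": encoder, "decoder": decoder}


sales = [10, 12, 11, 13, 12, 14]        # observed through step 5
promo = [0, 0, 0, 0, 0, 0, 0, 1, 1]     # planned through step 8
window = build_window({"sales": sales}, {"promo": promo}, origin=6, encoder_len=4, horizon=3)
print(window["decoder"])  # {'promo': [0, 1, 1]}
```

If your pipeline cannot produce the decoder slice reliably (promo plans that arrive late, holiday calendars that change), neither architecture can use the signal, and that is worth knowing before you evaluate anything.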

The dimensions that actually matter in production

Benchmark RMSE is a vanity metric for forecasting teams. Here's what determines whether a model survives contact with your pipeline:

Mixed-frequency covariates — daily sales, weekly promo plans, monthly macro indicators, irregular events
Known-future inputs — things you know about future timesteps (holidays, planned promos) versus things you don't
Cold-start SKUs — new products with no history
Retraining cadence — how often you have to retrain to keep WAPE stable
Serving latency at SKU count — inference cost when you forecast 50k+ SKUs nightly
Interpretability for ops teams — can a planner trust and override the forecast?
Team skill ceiling — what your ML team can actually maintain

Head-to-head: LSTM vs Temporal Fusion Transformer

Dimension	LSTM (with covariates)	Temporal Fusion Transformer
Known-future inputs	Concatenated as features; no structural priority. Smooths spikes.	Native handling: a dedicated known-future input channel feeds the decoder at each forecast step.
Mixed-frequency covariates	Requires manual alignment/resampling. Static features awkward.	Has explicit static, known-future, and observed-input channels.
Cold-start SKUs	Weak without heavy feature engineering and embedding tricks.	Better via static covariate encoders and global learning across SKUs.
Retraining cadence	Often monthly or per-quarter; degrades on regime shifts.	Similar cadence in practice; fine-tuning is cheaper than full retrain.
Serving latency (per SKU)	Low. Easy to batch. Runs on CPU for most SKU counts.	Higher. Attention layers and quantile heads add overhead. GPU strongly preferred.
Interpretability	Black box. SHAP works but is post-hoc and expensive.	Built-in variable importance and attention weights. Planners can audit.
Quantile forecasts	Need to add quantile loss manually; one model per quantile is common.	Native multi-quantile output in a single forward pass.
Training complexity	Well-understood. Lots of production reference code.	Many hyperparameters. Sensitive to encoder length and attention heads.
Library maturity	PyTorch/Keras native. Battle-tested.	PyTorch Forecasting and Darts are good but younger. Edge cases exist.
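
For the quantile row in the table above, this is the loss both approaches end up optimizing, written out as plain arithmetic. It is a sketch in pure Python rather than any library's implementation; the function names and the sample numbers are mine.

```python
from typing import Dict, List


def pinball_loss(actual: List[float], predicted: List[float], q: float) -> float:
    """Mean quantile (pinball) loss for a single quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        diff = y - y_hat
        # Under-forecasts are penalised by q, over-forecasts by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(actual)


def multi_quantile_loss(actual: List[float], forecasts: Dict[float, List[float]]) -> float:
    """Average pinball loss across all quantile levels."""
    per_quantile = [pinball_loss(actual, preds, q) for q, preds in forecasts.items()]
    return sum(per_quantile) / len(per_quantile)


actual = [100.0, 120.0, 300.0]
forecasts = {
    0.1: [80.0, 100.0, 150.0],
    0.5: [100.0, 115.0, 200.0],
    0.9: [130.0, 150.0, 320.0],
}
loss = multi_quantile_loss(actual, forecasts)
```

An LSTM trained with one quantile per model minimizes `pinball_loss` three times over; a multi-quantile head minimizes something like `multi_quantile_loss` in one pass. The evaluation metric is the same either way, which keeps the comparison fair.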

When LSTM is still the right answer

Keep your LSTM if:

Your covariates are mostly observed history (lagged sales, weather backfill), not known-future events.
Your spike behavior is driven by autoregressive dynamics, not external signals you have advance knowledge of.
You forecast a small SKU count (under ~5k) and your serving budget is tight.
Your team has one ML engineer maintaining the pipeline and you cannot afford architecture risk.
You haven't yet exhausted the easy wins: per-segment models, hierarchical reconciliation, better holiday features, or a separate spike classifier feeding into the LSTM.

I've seen teams swap a working LSTM for TFT and get worse WAPE for two quarters because they underestimated how much hyperparameter tuning TFT needs. Encoder length, attention head count, hidden size, and dropout interact in ways LSTM tuning does not prepare you for.

When TFT earns the swap

Move to TFT when:

You have a rich known-future input channel: promotional calendars, planned price changes, holiday flags, planned inbound shipments, scheduled stockouts.
Your quarter-end and event-driven spikes are the dominant source of business pain — not baseline accuracy.
You forecast across many SKUs and want one global model with static covariates (category, region, supplier lead time) instead of one model per segment.
Your ops or planning team has been asking 'why did the model predict that?' and 'can I override the promo signal?' — TFT's variable importance gives you something to show them.
You need quantile forecasts (P10/P50/P90) for safety stock decisions, and you're tired of training three models.

The serving cost problem nobody talks about

This is where TFT proposals die in review. LSTM inference at 50k SKUs on a nightly batch runs comfortably on CPU. TFT at the same SKU count, with attention over a 90-day encoder and 28-day decoder, will push you to GPU inference or aggressive batching strategies.

Before you commit, benchmark TFT inference on your actual SKU count and forecast horizon. Use the PyTorch Forecasting reference implementation, not the paper's numbers. Measure end-to-end pipeline time including data loading and quantile post-processing. I have seen TFT proposals shelved at this exact step because the team realized they'd need to rebuild their serving infrastructure to make nightly batches fit the window.

If you're already on a GPU-backed feature store and inference cluster — fine. If you're running forecasting on the same Airflow box as your ETL, plan for that migration as part of the project.
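
The benchmark described above can be sketched as a small timing harness. Everything here is a placeholder: `predict_fn` stands in for whatever wraps your data loading, model call, and quantile post-processing, and the linear extrapolation to the full catalogue is an assumption that ignores warm-up and memory limits.

```python
import time
from typing import Callable, List, Sequence


def benchmark_batches(
    predict_fn: Callable[[Sequence[int]], object],  # placeholder for your end-to-end batch predict
    sku_ids: List[int],
    batch_size: int,
    total_skus: int,
) -> dict:
    """Time predict_fn over a sample of SKUs and extrapolate to the full catalogue."""
    start = time.perf_counter()
    for i in range(0, len(sku_ids), batch_size):
        predict_fn(sku_ids[i:i + batch_size])
    elapsed = time.perf_counter() - start
    per_sku = elapsed / len(sku_ids)
    return {
        "sample_seconds": elapsed,
        "seconds_per_sku": per_sku,
        # Linear extrapolation from the sample; treat it as a rough estimate.
        "projected_seconds": per_sku * total_skus,
    }


def fits_window(projected_seconds: float, window_seconds: float, headroom: float = 0.5) -> bool:
    """True if the projected run uses at most `headroom` of the batch window."""
    return projected_seconds <= window_seconds * headroom


def fake_predict(batch: Sequence[int]) -> List[float]:
    """Stand-in so the sketch runs without a model."""
    return [0.0 for _ in batch]


report = benchmark_batches(fake_predict, list(range(1_000)), batch_size=256, total_skus=50_000)
ok = fits_window(report["projected_seconds"], window_seconds=2 * 60 * 60)
```

Run it once with the LSTM and once with the TFT candidate on the same hardware. The `headroom` default of 0.5 is an arbitrary choice of mine; pick whatever margin your on-call rotation can live with.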

The hybrid path most teams ignore

You don't have to pick one. A pattern I've seen work in logistics and e-commerce:

Keep LSTM as the baseline forecaster for steady-state demand.
Train TFT specifically on event-heavy periods (quarter-end, promotional weeks, holiday windows) using your known-future inputs.
Route forecasts through a simple rule: if the upcoming window has known events flagged, use TFT; otherwise use LSTM.

This is uglier than one elegant model. It also lets you ship the TFT win on quarter-end spikes without rebuilding your entire serving stack. The teams I've seen running marketplace logistics — companies in the space Vahak operates in, for example — often need this kind of staged rollout because they can't take a forecasting outage during peak season.
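
The routing step above can be as small as this. It is a sketch under assumptions: the flag names in `EVENT_FLAGS` are hypothetical, and the rule is a plain threshold on flagged steps rather than a trained classifier.

```python
from typing import Dict, List

EVENT_FLAGS = ("quarter_end", "promo", "holiday")  # hypothetical flag names


def choose_model(known_future: Dict[str, List[int]], min_event_steps: int = 1) -> str:
    """Return 'tft' if the forecast window has enough flagged steps, else 'lstm'."""
    horizon = max((len(v) for v in known_future.values()), default=0)
    flagged_steps = sum(
        1
        for t in range(horizon)
        # A step counts once, however many event flags are set on it.
        if any(
            t < len(known_future.get(flag, [])) and known_future[flag][t]
            for flag in EVENT_FLAGS
        )
    )
    return "tft" if flagged_steps >= min_event_steps else "lstm"


print(choose_model({"promo": [0, 0, 1, 0], "holiday": [0, 0, 0, 0]}))  # tft
print(choose_model({"promo": [0, 0, 0, 0]}))                           # lstm
```

Keeping the rule this dumb is deliberate: when a forecast looks wrong, you want to know which model produced it without debugging a third model first.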

Evaluation protocol before you commit

If you're going to do this evaluation properly, set it up like this:

Hold out the last two quarter-ends as your test windows. Not random splits. Not the last 30 days. The actual periods where your LSTM degrades.
Report WAPE and quantile loss per SKU segment (high-volume, long-tail, new), not aggregate.
Measure spike-window WAPE separately from baseline-window WAPE. This is where TFT should win if it's going to win.
Benchmark training time and inference latency on your actual hardware, not the paper's A100 numbers.
Run a planner usability test on TFT's variable importance output. If your ops team can't read it, the interpretability advantage is theoretical.

If TFT beats LSTM on spike-window WAPE by less than ~10% and costs you a serving infrastructure migration, the math probably doesn't work. If it beats by 20%+ and you already have GPU inference, it's a clear call.
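
Here is a minimal sketch of the metrics in that protocol: WAPE split by SKU segment and by spike versus baseline window, plus the improvement figure the thresholds above refer to. I am assuming those thresholds mean relative improvement in WAPE; the `Row` structure and the sample numbers are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Row(NamedTuple):
    segment: str      # e.g. "high-volume", "long-tail", "new"
    is_spike: bool    # True if the row falls inside a held-out spike window
    actual: float
    forecast: float


def wape(rows: Iterable[Row]) -> float:
    """Weighted absolute percentage error: sum|actual - forecast| / sum|actual|."""
    rows = list(rows)
    denom = sum(abs(r.actual) for r in rows)
    if denom == 0:
        raise ValueError("WAPE is undefined when total actual demand is zero")
    return sum(abs(r.actual - r.forecast) for r in rows) / denom


def wape_by_group(rows: Iterable[Row]) -> Dict[Tuple[str, str], float]:
    """WAPE keyed by (segment, 'spike' or 'baseline')."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r.segment, "spike" if r.is_spike else "baseline")].append(r)
    return {key: wape(group) for key, group in groups.items()}


def relative_improvement(baseline_wape: float, candidate_wape: float) -> float:
    """Fraction by which the candidate reduces WAPE relative to the baseline."""
    return (baseline_wape - candidate_wape) / baseline_wape


# Made-up numbers for one segment, one spike row and one baseline row per model.
lstm_rows = [Row("high-volume", True, 200, 140), Row("high-volume", False, 100, 95)]
tft_rows = [Row("high-volume", True, 200, 155), Row("high-volume", False, 100, 94)]
lstm, tft = wape_by_group(lstm_rows), wape_by_group(tft_rows)
gain = relative_improvement(lstm[("high-volume", "spike")], tft[("high-volume", "spike")])
```

In this toy example spike-window WAPE drops from 0.30 to 0.225, a 25% relative gain, while baseline-window WAPE gets slightly worse. An aggregate number would hide exactly that split.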

What I'd actually do in your seat

You have an LSTM that works except at quarter-end. Before you swap architectures, I'd run one experiment: add explicit quarter-end and event flags as time-varying inputs that extend across the forecast horizon, retrain with a custom loss that overweights spike windows, and see how much of the gap closes. Two weeks of work. If it closes 60% of the gap or more, you've bought yourself another year on LSTM and can plan a proper TFT migration without time pressure.

If it closes 20% or less, your covariate signal is real and your architecture genuinely can't use it. That's the moment TFT earns the swap. Build the evaluation harness above, run it honestly, and make the call on spike-window WAPE plus serving cost — not on benchmark accuracy.
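
The spike-weighted loss in that experiment is simple arithmetic. This sketch shows it in plain Python so the weighting is visible; in a real training loop you would express the same thing with your framework's tensor operations. The default `spike_weight` of 3.0 is a hypothetical starting point, not a recommendation.

```python
from typing import List


def spike_weighted_mae(
    actual: List[float],
    predicted: List[float],
    is_spike: List[bool],
    spike_weight: float = 3.0,  # hypothetical default; tune on validation data
) -> float:
    """Weighted mean absolute error that counts spike-window steps more heavily."""
    if not actual:
        raise ValueError("inputs must not be empty")
    if not (len(actual) == len(predicted) == len(is_spike)):
        raise ValueError("inputs must have the same length")
    weights = [spike_weight if spike else 1.0 for spike in is_spike]
    weighted_error = sum(w * abs(y - p) for w, y, p in zip(weights, actual, predicted))
    # Normalise by total weight so the loss stays on the same scale as plain MAE.
    return weighted_error / sum(weights)


loss = spike_weighted_mae([100, 100, 300], [100, 90, 200], [False, False, True])
print(loss)  # 62.0
```

Plain MAE on the same three points is about 36.7, so the weighted version pushes the optimizer toward the spike step at the cost of the quiet ones. Watch baseline-window WAPE while you tune the weight.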

The teams that get this wrong are the ones who pick the architecture first and justify it after. The teams that get it right define the failure mode precisely, then choose the tool that addresses that specific failure. Yours is known-future input handling at event windows. Now you know what to evaluate.

Frequently Asked Questions

Is Temporal Fusion Transformer always more accurate than LSTM for demand forecasting?

No. On steady-state demand with weak known-future signals, a well-tuned LSTM frequently matches or beats TFT on WAPE while costing far less to serve. TFT's advantage shows up specifically when you have rich known-future covariates (promotions, holidays, planned events) and event-driven spikes. On clean benchmarks with strong covariates TFT often wins; on production data without those signals, the gap narrows or disappears.

Can I use TFT for cold-start SKUs with no sales history?

Better than LSTM, but not magic. TFT's static covariate encoder lets it learn from category, region, supplier, and other attributes across the SKU population, so a new SKU inherits behavior from similar ones. You still need meaningful static features for this to work. If your only SKU metadata is an ID, neither model will save you — you need feature engineering first.

How often should I retrain a TFT demand forecasting model in production?

Most teams settle on monthly full retraining with weekly fine-tuning on recent data, but the right cadence depends on your demand volatility and regime shift frequency. The signal to watch is rolling WAPE on the most recent week — when it drifts beyond your tolerance, retrain. Avoid daily retraining; TFT is sensitive enough to short-term noise that you can degrade a good model.
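
A minimal sketch of that drift check, assuming you log one (actual, forecast) pair per day. The window, the reference WAPE, and the tolerance are placeholders to replace with values from your own validation history.

```python
from typing import List, Tuple


def rolling_wape(history: List[Tuple[float, float]], window: int) -> float:
    """WAPE over the most recent `window` (actual, forecast) pairs."""
    recent = history[-window:]
    denom = sum(abs(actual) for actual, _ in recent)
    if denom == 0:
        raise ValueError("WAPE is undefined when total actual demand is zero")
    return sum(abs(actual - forecast) for actual, forecast in recent) / denom


def should_retrain(
    history: List[Tuple[float, float]],
    window: int = 7,               # most recent week of daily pairs
    reference_wape: float = 0.20,  # placeholder: WAPE measured at the last retrain
    tolerance: float = 0.05,       # placeholder: allowed drift in absolute WAPE points
) -> bool:
    """Flag a retrain when recent WAPE exceeds the reference by more than the tolerance."""
    return rolling_wape(history, window) > reference_wape + tolerance


stable = [(100.0, 90.0)] * 14
drifted = stable + [(100.0, 60.0)] * 7
print(should_retrain(stable), should_retrain(drifted))  # False True
```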

What infrastructure do I need to serve TFT at scale?

Plan for GPU inference if you forecast more than ~10k SKUs nightly with a reasonable encoder length. You'll also want a feature store that can serve aligned mixed-frequency covariates and a batch orchestrator that can fan out inference jobs. If your current stack is CPU-only forecasting on Airflow, factor the infrastructure migration into your evaluation.

Should I consider alternatives like N-BEATS, DeepAR, or boosted trees before committing to TFT?

Yes. N-BEATS is simpler than TFT and often competitive on univariate forecasting. DeepAR handles probabilistic forecasting natively and is cheaper to serve. LightGBM with lag features and good covariate engineering still wins production bake-offs more often than ML Twitter admits. Run at least one of these as a baseline in your evaluation harness — if a gradient boosted model with proper holiday features matches TFT on your spike windows, that's your answer.
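
For the boosted-tree baseline, most of the work is the feature table, not the model. This is a sketch of one feature row for one-step-ahead forecasting; the lag set, the lead window, and the feature names are my assumptions, and the model fitting itself is left out on purpose.

```python
from typing import Dict, List, Optional, Tuple


def make_features(
    sales: List[float],
    holiday: List[int],            # known-future flag, defined past the end of `sales`
    t: int,                        # index of the target step
    lags: Tuple[int, ...] = (1, 7, 28),
    lead: int = 7,
) -> Optional[Dict[str, float]]:
    """Feature row for target step t, or None if there is not enough history."""
    if t - max(lags) < 0:
        return None
    # Lagged sales: only values strictly before t, so nothing leaks from the target.
    row: Dict[str, float] = {f"lag_{k}": sales[t - k] for k in lags}
    row["holiday_today"] = holiday[t]
    # Lead feature: is any holiday flagged in the next `lead` steps?
    row[f"holiday_next_{lead}"] = int(any(holiday[t + 1:t + 1 + lead]))
    return row


sales = [float(i) for i in range(30)]
holiday = [0] * 40
holiday[32] = 1
features = make_features(sales, holiday, t=29)
```

The lead feature is the part people skip. It is the tree model's only way to see a known-future event coming, which is the same signal this whole comparison turns on.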
