[2602.17206] SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
Summary
The paper presents SoftDTW-CUDA-Torch, an open-source PyTorch library that enhances Soft Dynamic Time Warping (SoftDTW) by improving memory efficiency and numerical stability on GPUs.
Why It Matters
This library matters to machine-learning researchers and practitioners working with time-series data: it removes critical limitations of existing SoftDTW implementations, enabling longer sequences, more memory-efficient computation, and broader applicability.
Key Takeaways
- Introduces a memory-efficient implementation of SoftDTW for PyTorch.
- Eliminates the hard sequence-length cap of 1024 through tiled kernel execution.
- Prevents numerical instability with a log-space backward pass.
- Achieves up to 98% memory reduction compared to previous methods.
- Supports arbitrary sequence lengths and full PyTorch autograd integration.
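To make the takeaways concrete, here is a minimal NumPy sketch of the SoftDTW dynamic program the library accelerates: the classic DTW recursion with the hard minimum replaced by a smoothed soft-minimum controlled by a parameter gamma. This is an illustrative reference implementation, not the paper's tiled CUDA kernel; the function names are ours.

```python
import numpy as np

def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))).
    Computed with a max-shift so the exponentials cannot overflow."""
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    """O(N*M) SoftDTW dynamic program over squared-Euclidean costs:
    R[i, j] = D[i, j] + softmin(R[i-1, j], R[i, j-1], R[i-1, j-1])."""
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = float((x[i - 1] - y[j - 1]) ** 2)
            R[i, j] = d + softmin(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]
```

As gamma approaches 0, soft_dtw approaches the ordinary (hard) DTW distance. The anti-diagonal tiling described in the paper exploits the fact that every cell R[i, j] on a given anti-diagonal depends only on the two previous anti-diagonals, so each diagonal can be computed in parallel on the GPU.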
Computer Science > Machine Learning
arXiv:2602.17206 (cs) [Submitted on 19 Feb 2026]
Title: SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
Authors: Ron Shapira Weber, Oren Freifeld
Abstract: We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and SoftDTW barycenter computation. Code is available at this https URL.
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2602.17206 [cs.LG] (or arXiv:2602.17206v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.17206
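The abstract's second contribution, a log-space backward pass, addresses a concrete failure mode: the gradient of the soft-minimum is a softmax over the candidate costs, and for small gamma the naive exponentials overflow to inf, producing NaN gradients. The sketch below (our own illustration, not the library's kernel code) contrasts a naive softmax-weight computation with a log-space version that shifts by the maximum before exponentiating.

```python
import numpy as np

def softmin_grad_naive(values, gamma):
    """d softmin / d v_k = exp(-v_k / gamma) / sum_j exp(-v_j / gamma).
    Overflows to inf (hence nan after division) for small gamma."""
    w = np.exp(-np.asarray(values, dtype=float) / gamma)
    return w / w.sum()

def softmin_grad_stable(values, gamma):
    """The same softmax weights, computed in log-space: subtracting the
    maximum exponent keeps every exp() argument at most 0."""
    z = -np.asarray(values, dtype=float) / gamma
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()
```

For well-scaled inputs both functions agree; with large cost values and a small gamma the naive version returns NaN while the stable one still yields valid weights. The paper's log-space backward pass applies the same principle across the full SoftDTW gradient recursion rather than a single soft-minimum.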