[P] Fused MoE Dispatch in Pure Triton: Beating CUDA-Optimized Megablocks at Inference Batch Sizes
I built a fused MoE dispatch kernel in pure Triton that handles the full forward pass for Mixture-of-Experts models. No CUDA, no vendor-s...
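A fused kernel that "handles the full forward pass" of an MoE layer has to cover four steps: routing, grouping tokens by expert, running each expert, and combining the weighted outputs. The sketch below shows those steps as a naive, unfused pure-Python reference. It is an illustrative assumption, not the author's Triton kernel, whose details are cut off in the excerpt above. The function name `moe_forward`, the top-k softmax router, and the two-layer ReLU experts are all hypothetical choices.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major, shape [in_dim][out_dim]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a single row vector x."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def moe_forward(tokens: List[Vector], router_w: Matrix,
                experts_w1: List[Matrix], experts_w2: List[Matrix],
                top_k: int = 2) -> List[Vector]:
    """Hypothetical unfused MoE layer: route -> dispatch -> expert FFN -> combine."""
    num_experts = len(experts_w1)

    # 1. Routing: softmax gate per token, keep the top-k experts, renormalize.
    assignments = []  # (token_idx, expert_idx, gate_weight)
    for t, x in enumerate(tokens):
        probs = softmax(matvec(x, router_w))
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in top)
        for e in top:
            assignments.append((t, e, probs[e] / norm))

    # 2. Dispatch: group token copies by expert (the permutation step).
    buckets = {e: [] for e in range(num_experts)}
    for t, e, g in assignments:
        buckets[e].append((t, g))

    # 3. Expert compute and 4. combine: weighted scatter-add back to each token.
    d_model = len(tokens[0])
    out = [[0.0] * d_model for _ in tokens]
    for e, items in buckets.items():
        for t, g in items:
            h = [max(0.0, v) for v in matvec(tokens[t], experts_w1[e])]  # ReLU
            y = matvec(h, experts_w2[e])
            for j in range(d_model):
                out[t][j] += g * y[j]
    return out


if __name__ == "__main__":
    I = [[1.0, 0.0], [0.0, 1.0]]
    D = [[2.0, 0.0], [0.0, 2.0]]
    router = [[0.0, 1.0], [0.0, 1.0]]  # favors expert 1 for positive inputs
    print(moe_forward([[1.0, 2.0]], router, [I, I], [I, D], top_k=1))
```

A reference like this materializes every intermediate (gate probabilities, per-expert buckets, hidden activations) as a separate pass over memory. Fusing dispatch with the expert computation is what lets a GPU kernel avoid those round trips, which matters most at the small batch sizes the title targets.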
Driven by labor shortages, Japan is pushing physical AI from pilot projects into real-world deployment.
I built a pipeline that takes ternary-quantized CNNs from PyTorch training all the way to bare-metal inference on an ESP32-S3 microcontro...
This paper presents an optimized cascaded Nepali-English speech-to-text translation system that mitigates structural noise from ASR, enha...
This article evaluates the adversarial robustness of deep learning models for thyroid nodule segmentation in ultrasound images, highlight...
The paper presents a novel framework, MMA-RAG^T, for enhancing the security of multimodal agentic retrieval-augmented generation systems ...
This paper presents a safety filtering framework for generative models, ensuring generated samples meet hard constraints while minimizing...
The paper presents FedVG, a novel gradient-guided aggregation framework for federated learning that enhances model performance by address...
This study explores the use of small language models for extracting clinical information from low-resource languages, focusing on a priva...
MrBERT introduces a family of multilingual encoders optimized for various domains, achieving state-of-the-art results in specific tasks w...
This paper presents a method for certifying the reliability of black-box AI systems using self-consistency sampling and conformal calibra...
This paper presents a general equilibrium theory for orchestrated AI agent systems, modeling large language model (LLM) agents within a p...
This systematic review explores automated red teaming methodologies for enhancing the security of AI applications, addressing the limitat...
AgenticTyper is a novel AI-driven tool that automates the typing of legacy JavaScript projects, significantly reducing manual effort and ...
AngelSlim introduces a versatile toolkit for large model compression, integrating advanced algorithms for efficient deployment and improv...
The paper presents Budget-Aware Agentic Routing, a method for optimizing the use of large language models in autonomous agents by balanci...
This paper explores architecture-agnostic curriculum learning for document understanding, demonstrating efficiency gains in training time...
The paper presents a novel approach to speculative decoding in large language models (LLMs), focusing on reusing discarded draft tokens t...
This paper presents a novel framework for dynamic LoRA adapter composition using similarity retrieval in vector databases, enabling effic...
The paper introduces Latent Context Compilation, a novel framework that enhances long-context LLM deployment by distilling long contexts ...
This paper presents Sparse Inference-time Alignment (SIA), a novel approach to enhance alignment in large language models by intervening ...
The paper presents the 2-Step Agent framework, which models the interaction between decision makers and AI decision support systems, high...
The paper presents fEDM+, an enhanced fuzzy ethical decision-making framework that improves explainability and validation by integrating ...