[2602.16545] Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
Summary
The paper introduces a zero-shot method for editing video classifiers, refining coarse categories into finer subcategories without additional training data and thereby enabling finer-grained video understanding.
Why It Matters
As video recognition tasks evolve, traditional classifiers struggle to adapt to new distinctions without costly retraining. This research presents a solution that improves classification accuracy while minimizing the need for new data, making it relevant for advancing machine learning applications in video analysis.
Key Takeaways
- Introduces category splitting for refining video classifications.
- Proposes a zero-shot editing method leveraging existing classifier structures.
- Demonstrates improved accuracy on newly defined categories without sacrificing overall performance.
- Shows that low-shot fine-tuning, while simple, is highly effective and benefits from the zero-shot initialization.
- Presents new benchmarks for evaluating category splitting in video recognition.
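To make the core idea of category splitting concrete, here is a minimal sketch of editing a classifier head so that one coarse class becomes several finer subclasses while all other classes are untouched. This is an illustration only: the blending rule, the `alpha` parameter, and the use of subcategory "direction" vectors (e.g. text embeddings) are assumptions for the sketch, not the paper's published method.

```python
import numpy as np

def split_category(W, coarse_idx, sub_dirs, alpha=0.5):
    """Split one coarse class of a linear classifier head into subclasses.

    W          : (C, D) weight matrix, one row per class.
    coarse_idx : index of the coarse class to split.
    sub_dirs   : list of K direction vectors (D,), one per new subcategory
                 (hypothetically, text embeddings of the subcategory names).
    alpha      : blend weight toward each subcategory direction (assumed).

    Returns a (C - 1 + K, D) matrix: the untouched rows of the other
    classes, followed by K new subclass rows derived from the coarse row.
    """
    coarse = W[coarse_idx]
    keep = np.delete(W, coarse_idx, axis=0)  # all other classes, unchanged
    new_rows = []
    for d in sub_dirs:
        # Nudge the coarse weight toward the subcategory direction...
        row = (1.0 - alpha) * coarse + alpha * np.asarray(d)
        # ...then rescale so the new row keeps the coarse row's norm,
        # leaving the classifier's logit scale roughly intact.
        row = row * np.linalg.norm(coarse) / (np.linalg.norm(row) + 1e-8)
        new_rows.append(row)
    return np.vstack([keep, np.stack(new_rows)])
```

For example, splitting class 2 of a 5-class head into 3 subclasses yields a 7-class head whose other four rows are byte-identical to the original. Low-shot fine-tuning, as the paper notes, can then start from this edited head rather than from scratch.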
arXiv:2602.16545 [cs.CV]
Submitted on 18 Feb 2026
Title: Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
Authors: Kaiting Liu, Hazel Doughty
Abstract: Video recognition models are typically trained on fixed taxonomies that are often too coarse, collapsing distinctions in object, manner, or outcome under a single label. As tasks and definitions evolve, such models cannot capture emerging distinctions, and collecting new annotations and retraining to accommodate these changes is costly. To address this, we introduce category splitting, a new task in which an existing classifier is edited to refine a coarse category into finer subcategories while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest.