Falcon Perception

A blog post by the Technology Innovation Institute (tiiuae) on Hugging Face · Published April 1, 2026

TL;DR — Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural-language prompts. The model processes image patches and text in a single sequence using a hybrid attention mask, and produces a variable number of instances through a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3), with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense, crowded long-context scenes. Finally, we release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while delivering the highest throughput of any open-source OCR model.

This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.

The problem: why do perception systems end up as pipelines?

Many open-vocabulary perception systems are built as modular pipelines: an (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family...
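The hybrid attention mask is the piece that lets a single Transformer serve both modalities in one sequence. Below is a minimal PyTorch sketch of one plausible realization, assuming bidirectional attention among image patches, causal attention over text tokens, and text tokens attending to every patch; the excerpt above does not spell out the exact pattern Falcon Perception uses, so the sequence layout and the helper name `hybrid_attention_mask` are illustrative only.

```python
import torch

def hybrid_attention_mask(num_patches: int, num_text: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for a sequence laid out as
    [patch_0 .. patch_{P-1}, text_0 .. text_{T-1}].

    Assumed pattern (illustrative, not the confirmed Falcon Perception mask):
      - image patches attend bidirectionally to all patches,
      - text tokens attend causally to earlier text tokens,
      - text tokens attend to every image patch,
      - image patches do not attend to text.
    """
    n = num_patches + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Patches <-> patches: full bidirectional attention.
    mask[:num_patches, :num_patches] = True

    # Text -> patches: every text token sees the whole image.
    mask[num_patches:, :num_patches] = True

    # Text -> text: causal (lower-triangular) attention.
    causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    mask[num_patches:, num_patches:] = causal

    return mask

# Example: 4 image patches followed by 3 text tokens. A boolean mask with
# True meaning "may attend" is the convention expected by
# torch.nn.functional.scaled_dot_product_attention's attn_mask argument.
m = hybrid_attention_mask(4, 3)
print(m.int())
```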
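On the evaluation side, the two headline numbers are Macro-F1 on SA-Co and Matthews correlation (MCC) for presence calibration, i.e. deciding whether the prompted concept appears in the image at all. The full SA-Co protocol also involves matching predicted instances to ground truth, but the metrics themselves reduce to standard classification scores; here is a small scikit-learn sketch on hypothetical per-(image, prompt) presence labels.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical per-(image, prompt) presence labels:
# 1 = prompted concept present in the image, 0 = absent.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Macro-F1 averages the F1 of the positive and negative classes equally,
# so it rewards handling "absent" prompts as well as "present" ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# MCC summarizes the whole confusion matrix in [-1, 1]; the post uses it
# to quantify presence calibration (0.64 for Falcon Perception vs. 0.82 for SAM 3).
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Macro-F1: {macro_f1:.3f}, MCC: {mcc:.3f}")
```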
