[2602.15862] Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
Summary
This paper presents a framework for generating recipes from food images that explicitly predicts and validates actions and ingredients before writing instructions, targeting outputs that score well on lexical metrics yet are semantically wrong.
Why It Matters
A generated recipe can score well on lexical metrics such as BLEU and ROUGE and still name the wrong action or ingredient, which undermines user trust and usability. The paper addresses this with a two-stage pipeline (supervised fine-tuning followed by reinforcement fine-tuning) aimed at the semantic fidelity of generated recipes, which is particularly relevant for applications in culinary AI and food technology.
Key Takeaways
- Introduces a semantically grounded framework for recipe generation.
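- Combines supervised fine-tuning (SFT) for foundational accuracy with reinforcement fine-tuning (RFT) that uses frequency-aware rewards for long-tail actions and ingredient generalization.
- Uses a Semantic Confidence Scoring and Rectification (SCSR) module to filter and correct predictions.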
- Achieves state-of-the-art performance on the Recipe1M dataset.
- Addresses common issues of semantic inaccuracy in AI-generated recipes.
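The gap between lexical and semantic quality mentioned above can be illustrated with a minimal sketch. It is not the paper's evaluation protocol: the `ACTION_VERBS` vocabulary, the two sentences, and both helper functions are hypothetical and chosen only to show how a single wrong action barely moves a word-overlap score.

```python
from collections import Counter

# Hypothetical, tiny action vocabulary used only for this illustration.
ACTION_VERBS = {"bake", "boil", "chop", "fry", "mix", "stir"}


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, the 1-gram building block of BLEU."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


def action_f1(candidate: str, reference: str) -> float:
    """F1 between the sets of action verbs found in each text."""
    cand = set(candidate.lower().split()) & ACTION_VERBS
    ref = set(reference.lower().split()) & ACTION_VERBS
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "boil the pasta in salted water for ten minutes"
candidate = "fry the pasta in salted water for ten minutes"

lexical = unigram_precision(candidate, reference)   # 8 of 9 tokens match
semantic = action_f1(candidate, reference)          # the only action is wrong

print(f"unigram precision: {lexical:.3f}")
print(f"action F1:         {semantic:.3f}")
```

Here eight of the nine candidate tokens appear in the reference, so the lexical score is high, while the action-level score is zero because "fry" replaced "boil".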
arXiv:2602.15862 (cs) [Submitted on 26 Jan 2026]
Title: Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
Authors: Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.15862 [cs.CL] (or arXiv:2602.15862v1 [cs.C...
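The abstract does not give the form of the frequency-aware reward, so the following is only a sketch of one plausible shape, under stated assumptions: rare actions get a larger weight through inverse log frequency, and wrongly predicted actions incur a flat penalty. The functions `build_action_weights` and `frequency_aware_reward`, the `fp_penalty` parameter, and the toy training data are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def build_action_weights(train_actions: list[list[str]]) -> dict[str, float]:
    """Assign each action a weight that grows as the action gets rarer."""
    counts = Counter(a for recipe in train_actions for a in recipe)
    total = sum(counts.values())
    # Inverse log frequency: an assumed weighting, not the paper's formula.
    return {action: math.log(1 + total / n) for action, n in counts.items()}


def frequency_aware_reward(
    predicted: list[str],
    gold: list[str],
    weights: dict[str, float],
    fp_penalty: float = 0.5,
    default_weight: float = 1.0,
) -> float:
    """Weighted recall over gold actions minus a penalty for wrong predictions."""
    pred, ref = set(predicted), set(gold)
    if not ref:
        return 0.0
    gained = sum(weights.get(a, default_weight) for a in pred & ref)
    possible = sum(weights.get(a, default_weight) for a in ref)
    # Fraction of predicted actions that are not in the gold set.
    wrong_fraction = len(pred - ref) / max(len(pred), 1)
    return gained / possible - fp_penalty * wrong_fraction


# Toy corpus: "mix" is frequent, "blanch" is a long-tail action.
train = [["mix", "bake"], ["mix", "stir"], ["mix", "blanch"]]
weights = build_action_weights(train)

gold = ["mix", "blanch"]
reward_common_only = frequency_aware_reward(["mix"], gold, weights)
reward_rare_only = frequency_aware_reward(["blanch"], gold, weights)
reward_perfect = frequency_aware_reward(["mix", "blanch"], gold, weights)

print(reward_common_only, reward_rare_only, reward_perfect)
```

Under this weighting, recovering only the rare action earns more than recovering only the common one, which is the kind of pressure toward long-tail actions that the abstract attributes to RFT.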