[2602.20658] Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
Summary
This paper explores the use of vision-language models (VLMs) for non-invasive ergonomic assessment of manual lifting tasks, estimating horizontal and vertical hand distances from RGB video.
Why It Matters
The study addresses the challenge of accurately assessing ergonomic risks in manual lifting, a significant contributor to musculoskeletal disorders. By leveraging VLMs, the research offers a novel approach that could enhance workplace safety and efficiency, making ergonomic assessments more accessible and practical in real-world settings.
Key Takeaways
- Vision-language models can effectively estimate horizontal and vertical hand distances in lifting tasks.
- Segmentation-based pipelines significantly reduce estimation errors compared to detection-only methods.
- The study demonstrates the feasibility of using RGB video for ergonomic assessments in real-world environments.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.20658 (cs)
[Submitted on 24 Feb 2026]
Authors: Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum
Abstract: Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual fea...
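For context on how the estimated H and V feed into the RNLE, the equation's published metric multipliers are HM = 25/H and VM = 1 − 0.003·|V − 75| (both distances in cm). A minimal sketch of that mapping is below; the function names and the exact clamping behavior at the boundary values are illustrative assumptions, not taken from the paper:

```python
def horizontal_multiplier(h_cm: float) -> float:
    """RNLE horizontal multiplier, HM = 25 / H (H in cm).

    H below 25 cm is treated as 25 cm (HM = 1.0); H beyond 63 cm is
    outside the equation's scope, so the multiplier is set to 0.
    """
    if h_cm <= 25.0:
        return 1.0
    if h_cm > 63.0:
        return 0.0
    return 25.0 / h_cm


def vertical_multiplier(v_cm: float) -> float:
    """RNLE vertical multiplier, VM = 1 - 0.003 * |V - 75| (V in cm).

    V above 175 cm is outside the equation's scope (multiplier 0).
    """
    if v_cm > 175.0:
        return 0.0
    return 1.0 - 0.003 * abs(v_cm - 75.0)


# Example: hands 40 cm from the ankle midpoint, 75 cm above the floor.
print(horizontal_multiplier(40.0))  # 0.625
print(vertical_multiplier(75.0))    # 1.0
```

This is why estimation error in H is ergonomically consequential: HM varies inversely with H, so a few centimeters of error near the 25 cm optimum shifts the recommended weight limit noticeably.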