Ventura, Mor; Toker, Michael

Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart

Technion - Israel Institute of Technology
Google-Research


Abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

Abstract Image Editing

Figure (main example of emotional abstract editing): Evaluating Abstract Image Editing. Given a context image and an abstract instruction from our AbstractEdit test set, our Entity-Rubrics framework assesses the edited image by decomposing the scene at a granular level, yielding a per-edit rank that determines the final instruction-following score. As shown, some entity edits lead to better prompt alignment (e.g., the man's expression), while others introduce unnecessary changes that hurt preservation (e.g., the crowd).
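To make the entity-level idea concrete, here is a minimal Python sketch of how per-entity judgments might be rolled up into alignment, preservation, and overall scores. It is an illustration only: the `EntityJudgment` fields, the [0, 1] score range, and the plain averaging are assumptions made for this example, not the aggregation Entity-Rubrics actually uses.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EntityJudgment:
    """One entity-level assessment (hypothetical schema, for illustration)."""
    entity: str          # e.g. "man's expression"
    should_change: bool  # assumed flag: does the abstract intent call for editing it?
    score: float         # assumed rating in [0, 1]; 1.0 means the entity was handled well


def summarize(judgments: list[EntityJudgment]) -> dict[str, Optional[float]]:
    """Average entity scores into alignment, preservation, and overall values."""
    if not judgments:
        raise ValueError("at least one entity judgment is required")
    edited = [j.score for j in judgments if j.should_change]
    kept = [j.score for j in judgments if not j.should_change]
    return {
        # How well the entities that should change follow the instruction.
        "alignment": mean(edited) if edited else None,
        # How well the entities that should stay untouched were preserved.
        "preservation": mean(kept) if kept else None,
        # Simple unweighted mean over all entities (an assumption).
        "overall": mean(j.score for j in judgments),
    }


example = [
    EntityJudgment("man's expression", should_change=True, score=1.0),
    EntityJudgment("crowd", should_change=False, score=0.25),
    EntityJudgment("background", should_change=False, score=0.75),
]
summary = summarize(example)
```

Keeping alignment and preservation separate mirrors the caption's point: a well-edited expression can coexist with an unnecessarily altered crowd, and a single averaged number would hide that trade-off.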

Human Intent Mapping in Image Editing

Figure (human intents): Taxonomy of Image Editing Intent. We formalize intent along two orthogonal axes: Identification ("what to edit") and Specificity ("how to edit"). Intent is categorized by the mapping between these axes: explicit and implicit editing provide a one-to-one mapping, with high consensus on both axes, whereas abstract editing introduces a one-to-many relationship.

Entity-Rubrics Evaluation

AbstractEdit Dataset Curation Pipeline

Figure (dataset generation pipeline): The AbstractEdit Automatic Curation Pipeline. (A) Sourcing: context images from OpenImages (azure) are compiled alongside manual category examples (purple) and diverse personas (blue). (B) Instruction Generation: after the images are filtered for category relevance, an LLM, prompted with few-shot examples and a random persona, generates paired abstract and explicit instructions (beige). (C) Editing: both instructions are applied to the context image to produce the final edited pairs (yellow). Bottom: a generated AbstractEdit example.

AbstractEdit Benchmark

Figure (sunburst chart of the abstract editing taxonomy): Composition and Evaluation of the AbstractEdit Benchmark. The middle panel shows the benchmark's composition across four primary domains and 12 subcategories. Surrounding it are representative samples from each domain, pairing context images with candidate edits produced by various models. Each output is overlaid with our Entity-Rubrics automatic entity-level evaluation, color-coded from red (incorrect) to green (correct).

Main Results

Model Diversity Analysis

Figure (Vendi score comparison across models): Evaluating generative diversity across leading models to highlight the struggle between fulfilling intent and preserving the original image.
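For readers unfamiliar with the metric, the Vendi score of n samples is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is a positive semidefinite similarity matrix with ones on the diagonal. The sketch below computes it with NumPy from feature vectors using cosine similarity. Which embeddings and which similarity kernel were used for the comparison above is not stated here, so both choices are assumptions made for illustration.

```python
import numpy as np


def vendi_score(embeddings) -> float:
    """Vendi score of a set of feature vectors under cosine similarity.

    The input is assumed to be an (n, d) array of non-zero embeddings,
    e.g. image features of the edits a model produced; the feature
    extractor itself is outside this sketch.
    """
    x = np.asarray(embeddings, dtype=float)
    # Normalize rows so that K = x @ x.T has ones on the diagonal.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    n = x.shape[0]
    k = x @ x.T
    # Eigenvalues of K/n are non-negative and sum to one.
    eigvals = np.linalg.eigvalsh(k / n)
    # Drop numerically zero eigenvalues; they contribute 0 * log(0) = 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Three identical outputs behave like a single distinct sample,
# while three mutually orthogonal outputs count as three.
identical = vendi_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
orthogonal = vendi_score(np.eye(3))
```

The score can be read as an effective number of distinct samples: it ranges from 1 (all outputs equivalent) to n (all outputs mutually dissimilar), which is why it suits a comparison between models that barely change the image and models that change it in many different ways.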
