Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Technion - Israel Institute of Technology
Google Research

Abstract

Humans naturally communicate through abstract concepts like “mood”. However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

Abstract Image Editing

Main example of emotional abstract editing
Evaluating Abstract Image Editing. Given a context image and an abstract instruction from our ABSTRACTEDIT test set, our ENTITY-RUBRICS framework assesses the edited image by decomposing the scene at a granular level, yielding a per-edit rank that determines the final instruction-following score. As shown, some entity edits lead to better prompt alignment (e.g., the man’s expression), while others introduce unnecessary changes that hurt preservation (the crowd).

Human Intent Mapping in Image Editing

Human-Intents
Taxonomy of Image Editing Intent. We formalize intent across two orthogonal axes: Identification (“what to edit”) and Specificity (“how to edit”). Intent is categorized based on the mapping between these axes: Explicit and Implicit editing provide a one-to-one mapping through high consensus on both axes, whereas abstract editing introduces a one-to-many relationship.

Entity-Rubrics Evaluation

Entity Rubrics evaluation scheme
Breaking down complex abstract intents into atomic, entity-level assessments for precise and verifiable evaluation. Overview of the three-stage VLM-based ENTITY-RUBRICS evaluation framework. (A) Entity Detection identifies relevant entities. (B) Entity Ranking assigns an expected transformation to each entity (Change, Optional, Preserve) and measures its execution alignment in the edited image. (C) Final Scoring aggregates these into a global rank and rationale. Results are visualized directly on-image via a red (incorrect) to green (correct) scale.
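To make the three stages concrete, here is a minimal sketch of how per-entity assessments could be combined into a global score and a failure profile. It is an illustration only: in the framework described above, detection, ranking, and final scoring are performed by a VLM, whereas the names `EntityAssessment`, `aggregate_score`, `failure_profile`, `alignment_to_rgb`, the 0 to 1 alignment scale, the `optional_weight` parameter, and the weighted-mean rule are all assumptions introduced here, and the example values are made up.

```python
from dataclasses import dataclass
from enum import Enum


class Expected(Enum):
    """Expected transformation assigned to an entity in stage (B)."""
    CHANGE = "change"
    OPTIONAL = "optional"
    PRESERVE = "preserve"


@dataclass
class EntityAssessment:
    name: str
    expected: Expected
    alignment: float  # hypothetical scale: 0.0 (incorrect) to 1.0 (correct)


def aggregate_score(assessments, optional_weight=0.5):
    """Weighted mean of alignments (an assumed rule, not the paper's VLM-based scoring)."""
    weights = {
        Expected.CHANGE: 1.0,
        Expected.PRESERVE: 1.0,
        Expected.OPTIONAL: optional_weight,
    }
    total = sum(weights[a.expected] for a in assessments)
    if total == 0:
        return 0.0
    return sum(weights[a.expected] * a.alignment for a in assessments) / total


def failure_profile(assessments):
    """Negative values suggest under-editing, positive values over-editing."""
    def mean_shortfall(kind):
        vals = [1.0 - a.alignment for a in assessments if a.expected is kind]
        return sum(vals) / len(vals) if vals else 0.0

    # Shortfall on Preserve entities signals over-editing;
    # shortfall on Change entities signals under-editing.
    return mean_shortfall(Expected.PRESERVE) - mean_shortfall(Expected.CHANGE)


def alignment_to_rgb(alignment):
    """Linear red (incorrect) to green (correct) overlay color."""
    a = min(max(alignment, 0.0), 1.0)
    return (round(255 * (1.0 - a)), round(255 * a), 0)


# Made-up values for illustration only.
example = [
    EntityAssessment("man's expression", Expected.CHANGE, 0.9),
    EntityAssessment("crowd", Expected.PRESERVE, 0.3),
    EntityAssessment("sky", Expected.OPTIONAL, 0.8),
]
print(aggregate_score(example), failure_profile(example))
```

In this toy case the Change entity is handled well while the Preserve entity is not, so the profile comes out positive, which corresponds to over-editing.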

AbstractEdit Dataset Curation Pipeline

Dataset generation pipeline
The ABSTRACTEDIT Automatic Curation Pipeline. (A) Sourcing: Context images (Azure) from OpenImages are compiled alongside manual category examples (Purple) and diverse personas (Blue). (B) Instruction Generation: After filtering the images for category relevancy, an LLM, prompted by few-shot examples and a random persona, generates paired abstract and explicit instructions (Beige). (C) Editing: Both instructions are applied to the context image to produce the final edited pairs (Yellow). Bottom: A generated ABSTRACTEDIT example.
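Stage (B) can be sketched as a prompt-assembly step that combines a category, few-shot examples, and a randomly drawn persona. This is a rough illustration under stated assumptions: the function name `build_instruction_prompt`, the prompt wording, and the sample category, persona, and few-shot strings are all hypothetical, not the prompts used to build the dataset.

```python
import random


def build_instruction_prompt(category, few_shot, personas, rng):
    """Assemble an LLM prompt asking for a paired abstract/explicit instruction.

    few_shot is a list of (abstract, explicit) example pairs for the category;
    one persona is drawn at random to diversify the phrasing.
    """
    persona = rng.choice(personas)
    shots = "\n".join(
        f"- abstract: {abstract} | explicit: {explicit}"
        for abstract, explicit in few_shot
    )
    return (
        f"You are {persona}.\n"
        f"Category: {category}\n"
        f"Examples:\n{shots}\n"
        "For the attached image, write one abstract instruction and one "
        "explicit instruction that expresses the same edit literally."
    )


# Hypothetical inputs for illustration only.
rng = random.Random(0)  # seeded so the persona draw is reproducible
prompt = build_instruction_prompt(
    category="Emotional",
    few_shot=[("make it feel lonelier", "remove the other people from the bench")],
    personas=["a wedding photographer", "a high-school teacher"],
    rng=rng,
)
print(prompt)
```

The resulting abstract and explicit instructions would then each be applied to the same context image in stage (C), which yields the paired edits.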

AbstractEdit Benchmark

Sunburst chart of abstract editing taxonomy
Composition and Evaluation of the ABSTRACTEDIT Benchmark. The composition of the ABSTRACTEDIT benchmark is illustrated across four primary domains and 12 subcategories in the middle panel. Surrounding this distribution are representative samples from each domain, featuring context images paired with candidate edits produced by various models. Each output is overlaid with our ENTITY-RUBRICS automatic granular evaluation, providing a visual layout of performance that ranges from red (incorrect) to green (correct) at the entity level.

Main Results

Abstract Instruction Following Performance. Left Table: Open-source, open-source with thinking modes, and closed models ranked by Abstract ENTITY-RUBRICS scores and human annotations. Failure Profile: (Under-editing) ← | → (Over-editing). Domains: Emotional, Logical, Physical, and Social. Right Fig. 5: Prompt Type Comparison: Explicit (blue) and Abstract (striped bordeaux).

Model Diversity Analysis

Vendi score comparison across models
Evaluating generative diversity across leading models to highlight the struggle between fulfilling intent and preserving the original image.
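For reference, the Vendi score of a set of n outputs is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is a similarity matrix with ones on the diagonal. The sketch below computes it with a cosine-similarity kernel over feature vectors. Which image features and kernel were used for the comparison here is not stated, so the `features` input and the cosine kernel are assumptions.

```python
import numpy as np


def vendi_score(features):
    """Vendi score with a cosine-similarity kernel.

    features: array-like of shape (n, d), one non-zero embedding per output.
    Returns a value between 1 (all outputs identical) and n (all orthogonal).
    """
    x = np.asarray(features, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    k = x @ x.T                                       # similarity matrix, K_ii = 1
    n = k.shape[0]
    eigvals = np.linalg.eigvalsh(k / n)               # K/n is symmetric PSD
    eigvals = eigvals[eigvals > 1e-12]                # treat 0 * log(0) as 0
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


identical = [[1.0, 0.0, 0.0]] * 3  # three copies of the same embedding
orthogonal = np.eye(3)             # three mutually orthogonal embeddings
print(vendi_score(identical), vendi_score(orthogonal))
```

The score can be read as an effective number of distinct outputs: a model that returns near-copies of the context image scores close to 1, while one that produces widely varying edits scores closer to n.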