ArXiv 2026

Appearance Pointers: MultiModal Region Control of Diffusion Transformers

Rahul Sajnani¹,² Yulia Gryaditskaya² Radomír Měch² Srinath Sridhar¹ Matheus Gadelha²

¹ Brown University

² Adobe Research

Capabilities

Appearance Pointers

Overview

Appearance Pointers performs region-conditioned image generation and editing by injecting regional tokens into a pre-trained FLUX Kontext generation model. The Appearance Pointer tokens (\({}^{I}AP, {}^{T}AP\)) guide generation by linking each region to its respective condition, enabling precise multi-modal region control over the generated content.

Region Linking

Region Linking processes each region separately. For the \(i^{th}\) region, it fuses multi-modal signals from text, image, and mask to obtain region tokens for the image stream (\({}^IM_i\)) and the text stream (\({}^TM_i\)) of the FLUX transformer.

Region Aggregation

Region Aggregation combines the region tokens from the image and text streams of all the regions to generate a unified representation for each region. The image below shows this process, in which the aggregation is performed independently at each spatial patch. This preserves the spatial cues in the Appearance Pointer tokens (\({}^{I}AP, {}^{T}AP\)).
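The per-patch aggregation described above can be sketched as follows. This is only an illustration: the page does not state which aggregation operator is used, so the mask-weighted mean here is an assumption, and the function name `aggregate_regions`, the array shapes, and the `eps` parameter are all hypothetical.

```python
import numpy as np


def aggregate_regions(region_tokens, region_masks, eps=1e-8):
    """Combine per-region tokens into one token per spatial patch.

    ASSUMPTION: a mask-weighted mean over regions, computed independently
    at every patch. The actual operator used by the method is not given.

    region_tokens: array of shape (R, P, D), one D-dim token per region and patch.
    region_masks:  array of shape (R, P), region membership in [0, 1] per patch.
    Returns an array of shape (P, D).
    """
    tokens = np.asarray(region_tokens, dtype=np.float64)
    masks = np.asarray(region_masks, dtype=np.float64)
    if tokens.ndim != 3 or masks.shape != tokens.shape[:2]:
        raise ValueError("expected tokens (R, P, D) and masks (R, P)")

    weights = masks[..., None]               # (R, P, 1), broadcast over D
    summed = (weights * tokens).sum(axis=0)  # (P, D), sum over regions
    coverage = weights.sum(axis=0)           # (P, 1), regions covering each patch

    # Patches covered by no region get a zero token (numerator is already 0).
    return summed / np.maximum(coverage, eps)
```

Because each patch is reduced on its own, a token at a given spatial location only mixes regions that cover that location, which is one way the spatial layout of the regions can survive aggregation.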

Attention Map for Appearance Pointer

Once trained, the Appearance Pointer points to the corresponding region condition, as shown below. Here, the query comes from the Appearance Pointer token and the keys come from the image condition tokens. The heat map displays the region of the condition image that corresponds to the query location, which is highlighted by the yellow point.
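A heat map of this kind can be computed from one query and a grid of keys with standard scaled dot-product attention. The sketch below assumes a single attention head and no positional encoding; how the real visualization selects or averages heads and layers is not stated on this page, and the function name `attention_heatmap` and its arguments are hypothetical.

```python
import numpy as np


def attention_heatmap(query, keys, grid_hw):
    """Softmax attention of one query over a grid of key tokens.

    ASSUMPTION: single head, plain scaled dot-product attention.

    query:   array of shape (D,), e.g. one Appearance Pointer token.
    keys:    array of shape (H * W, D), e.g. image condition tokens in row-major order.
    grid_hw: (H, W) patch grid of the condition image.
    Returns an (H, W) array of attention weights that sums to 1.
    """
    q = np.asarray(query, dtype=np.float64)
    k = np.asarray(keys, dtype=np.float64)
    h, w = grid_hw
    if q.ndim != 1 or k.shape != (h * w, q.shape[0]):
        raise ValueError("expected query (D,) and keys (H * W, D)")

    logits = k @ q / np.sqrt(q.shape[0])  # (H * W,) scaled dot products
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()     # softmax over all key positions
    return weights.reshape(h, w)
```

The resulting array can be upsampled to the resolution of the condition image and overlaid on it to inspect where a given Appearance Pointer token attends.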

In The Wild Qualitative Results with Heterogeneous Region Conditions

Image Generation and Editing Gallery: Appearance Pointers can perform regional edits and generation in the wild using heterogeneous conditions. Each region is labeled with a number that links it to its condition.

Dataset

Dataset: We provide an extensive region control dataset with the following sources of information for each image:

Region Segments
Region Captions
Region Pose Variation Edits
Region Texture on Sphere
Score for Pose and Texture Consistency

Citation

@misc{sajnani2026AppearancePointers,
  title={Appearance Pointers: MultiModal Region Control of Diffusion Transformers}, 
  author={Rahul Sajnani and Yulia Gryaditskaya and Radomír Měch and Srinath Sridhar and Matheus Gadelha},
  year={2026},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgements

Part of this work was done during Rahul’s internship at Adobe.

Contact

Rahul Sajnani rahul_sajnani@brown.edu