ArXiv 2026
Appearance Pointers: MultiModal Region Control of Diffusion Transformers
1 Brown University
2 Adobe Research
Capabilities

Appearance Pointers
Overview
Appearance Pointers is a method for region-conditioned image generation and editing that injects regional tokens into a pre-trained FLUX Kontext generation model. The Appearance Pointer tokens (\({}^{I}AP, {}^{T}AP\)) guide the generation process by linking each region to its respective condition, enabling precise multi-modal region control over the generated content.

Region Linking
Region Linking processes each region separately. For the \(i^{th}\) region, it fuses multi-modal signals from text, image, and mask to obtain region tokens \({}^IM_i\) for the image stream and \({}^TM_i\) for the text stream of the FLUX transformer.

Region Aggregation
Region Aggregation combines the region tokens from the image and text streams of all the regions to generate a unified representation for each region. The image below shows this aggregation process, in which the aggregation is performed at each spatial patch independently. This preserves the spatial cues in the Appearance Pointer tokens (\({}^{I}AP, {}^{T}AP\)).
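The per-patch nature of the aggregation can be illustrated with a small sketch. The aggregation operator itself is not specified here, so the code below assumes a simple mask-weighted average over the regions that cover each patch; the function name `aggregate_regions`, the binary-mask input, and the zero vector for uncovered patches are all hypothetical choices for illustration, not the method's actual implementation.

```python
def aggregate_regions(region_tokens, region_masks):
    """Combine per-region tokens into one token per spatial patch.

    region_tokens: list over regions, each a [num_patches][dim] nested list.
    region_masks:  list over regions, each a [num_patches] list of 0/1 values.
    Assumption (not from the source): patches are averaged over the regions
    whose mask covers them; uncovered patches stay as zero vectors.
    """
    num_patches = len(region_masks[0])
    dim = len(region_tokens[0][0])
    aggregated = []
    for p in range(num_patches):
        acc = [0.0] * dim
        weight = 0.0
        # Each patch is handled independently of every other patch.
        for tokens, mask in zip(region_tokens, region_masks):
            m = float(mask[p])
            if m > 0.0:
                weight += m
                for d in range(dim):
                    acc[d] += m * tokens[p][d]
        if weight > 0.0:
            acc = [v / weight for v in acc]
        aggregated.append(acc)
    return aggregated


# Two regions over four patches; patch 1 is covered by both, patch 3 by none.
tokens_a = [[1.0, 2.0]] * 4
tokens_b = [[3.0, 4.0]] * 4
result = aggregate_regions([tokens_a, tokens_b], [[1, 1, 0, 0], [0, 1, 1, 0]])
```

Because every patch is computed from the tokens at that same patch index only, the output keeps the spatial layout of the inputs, which is the property the paragraph above describes.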

Attention Map for Appearance Pointer
Once trained, the Appearance Pointer points to the corresponding region condition, as shown below. Here, the query comes from an Appearance Pointer token and the keys come from the image condition tokens. The heat map displays the attention over the condition image for the query location, which is highlighted by the yellow point.
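A heat map of this kind can be sketched with standard scaled dot-product attention: one query vector is scored against every key, and the softmax of the scores gives one weight per condition token. This is a minimal single-head sketch; the function name `attention_map` is hypothetical, and how heads or layers are selected or averaged for the visualization is not stated here.

```python
import math


def attention_map(query, keys):
    """Softmax attention weights of a single query over a list of keys.

    query: [dim] vector (here, from an Appearance Pointer token).
    keys:  [num_keys][dim] vectors (here, from image condition tokens).
    Returns one weight per key; the weights sum to 1.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# The query aligns with the first key, so that key receives the most attention.
weights = attention_map([1.0, 0.0], [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
```

Reshaping the returned weights to the patch grid of the condition image would give a heat map like the one described above.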

In The Wild Qualitative Results with Heterogeneous Region Conditions
Image Generation and Editing Gallery: Appearance Pointers can perform regional edits and generation using heterogeneous conditions in the wild. Each region is accompanied by a number which relates the condition to the region.


Dataset
Dataset: We provide an extensive region control dataset with the following sources of information for each image:
- Region Segments
- Region Captions
- Region Pose Variation Edits
- Region Texture on Sphere
- Score for Pose and Texture Consistency

Citation
@misc{sajnani2026AppearancePointers,
title={Appearance Pointers: MultiModal Region Control of Diffusion Transformers},
author={Rahul Sajnani and Yulia Gryaditskaya and Radomír Měch and Srinath Sridhar and Matheus Gadelha},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgements
Part of this work was done during Rahul’s internship at Adobe.
Contact
Rahul Sajnani rahul_sajnani@brown.edu