ArXiv 2026

Appearance Pointers: MultiModal Region Control of Diffusion Transformers

1 Brown University
2 Adobe Research

Capabilities

Appearance Pointers

Overview

Appearance Pointers is a method for region-conditioned image generation and editing that injects regional tokens into a pre-trained FLUX Kontext generation model. The Appearance Pointer tokens (\({}^{I}AP, {}^{T}AP\)) guide generation by linking each region to its condition, enabling precise multi-modal region control over the generated content.

Region Linking

Region Linking processes each region separately, fusing multi-modal signals from text, image, and mask to obtain, for the \(i^{th}\) region, region tokens for the image stream \({}^IM_i\) and the text stream \({}^TM_i\) of the FLUX transformer.
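To make the per-region inputs concrete, the following is a minimal sketch of how one region's heterogeneous conditions might be bundled before fusion. It is not the authors' implementation: the `RegionCondition` class, its field names, and the `validate` helper are hypothetical, and the fusion step itself is not shown because the page does not describe it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegionCondition:
    """Hypothetical container for one region's inputs (names are illustrative)."""
    mask: List[List[int]]             # binary mask, 1 inside the region
    text: Optional[str] = None        # optional region caption
    image_path: Optional[str] = None  # optional appearance reference image

    def modalities(self) -> List[str]:
        """List the signals available for this region."""
        present = ["mask"]
        if self.text is not None:
            present.append("text")
        if self.image_path is not None:
            present.append("image")
        return present


def validate(regions: List[RegionCondition]) -> bool:
    """Assumed rule: every region needs a mask plus at least one condition."""
    for index, region in enumerate(regions, start=1):
        if len(region.modalities()) < 2:
            raise ValueError(f"region {index} has a mask but no text or image condition")
    return True


# Two regions with different condition types, processed one at a time.
regions = [
    RegionCondition(mask=[[1, 0], [0, 0]], text="a red leather chair"),
    RegionCondition(mask=[[0, 0], [0, 1]], image_path="reference.png"),
]
```

Keeping text and image optional mirrors the idea that each region may carry a different mix of conditions.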

Region Aggregation

Region Aggregation combines the region tokens from the image and text streams of all the regions to generate a unified representation for each region. The figure below shows this process, in which aggregation is performed independently at each spatial patch. This preserves the spatial cues in the Appearance Pointer tokens (\({}^{I}AP, {}^{T}AP\)).

Region Aggregation
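The idea of aggregating independently at each spatial patch can be illustrated with a small sketch. The aggregation operator is not specified on this page, so the code below assumes a mask-weighted mean purely for illustration; `aggregate_regions` and its argument layout are hypothetical.

```python
def aggregate_regions(region_tokens, region_masks):
    """Combine per-region tokens patch by patch.

    region_tokens[i][p] is the feature vector of region i at patch p.
    region_masks[i][p] is the weight (0 to 1) of region i at patch p.
    Assumption: a mask-weighted mean; the actual operator may differ.
    """
    num_patches = len(region_masks[0])
    dim = len(region_tokens[0][0])
    aggregated = []
    for p in range(num_patches):
        weight = sum(mask[p] for mask in region_masks)
        if weight == 0:
            # No region covers this patch, so it receives a zero token.
            aggregated.append([0.0] * dim)
            continue
        aggregated.append([
            sum(mask[p] * tokens[p][d]
                for tokens, mask in zip(region_tokens, region_masks)) / weight
            for d in range(dim)
        ])
    return aggregated


# Two regions over four patches with 2-dimensional features.
tokens_a = [[2.0, 0.0]] * 4
tokens_b = [[0.0, 4.0]] * 4
mask_a = [1, 0, 1, 0]
mask_b = [0, 1, 1, 0]
result = aggregate_regions([tokens_a, tokens_b], [mask_a, mask_b])
```

Because each patch is handled on its own, the output keeps one token per spatial location, which is what lets the spatial layout of the regions survive aggregation.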

Attention Map for Appearance Pointer

Once trained, the Appearance Pointer points to the corresponding region condition, as shown below. Here, the query comes from an Appearance Pointer token and the keys come from the image condition tokens. The heat map displays the attention for the query location, which is marked by the yellow point.

Appearance Pointer Attention Map
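A heat map of this kind can be computed with standard scaled dot-product attention: one query vector is compared against every key, and the softmax weights are reshaped to the spatial grid of the condition image. The sketch below uses toy vectors and a hypothetical `attention_heatmap` helper; how the weights are extracted from the model (which layer or head) is not stated here and is left out.

```python
import math


def attention_heatmap(query, keys, height, width):
    """Softmax attention of one query over height*width keys, as a 2D grid."""
    if len(keys) != height * width:
        raise ValueError("expected one key per spatial patch")
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Reshape the flat weights into rows of the patch grid.
    return [weights[r * width:(r + 1) * width] for r in range(height)]


# Toy example: a 2x2 grid of condition patches with 2-dimensional features.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [4.0, 0.0]]
query = [3.0, 0.0]
heatmap = attention_heatmap(query, keys, height=2, width=2)
```

In this toy case the bottom-right patch has the key most aligned with the query, so it receives the largest weight in the grid.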

In-the-Wild Qualitative Results with Heterogeneous Region Conditions

Image Generation and Editing Gallery: Appearance Pointers can perform regional edits and generation using heterogeneous conditions in the wild. Each region is labeled with a number that links it to its condition.

Results 1

Results 2

Results 3

Dataset

Dataset: We provide an extensive region control dataset with the following sources of information for each image:

  1. Region Segments
  2. Region Captions
  3. Region Pose Variation Edits
  4. Region Texture on Sphere
  5. Score for Pose and Texture Consistency
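
The five sources listed above could be represented per region roughly as follows. This is only a sketch: the `RegionRecord` fields, the file paths, and the `filter_consistent` helper are hypothetical, and it assumes a single consistency score where higher means more consistent, which the page does not confirm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionRecord:
    """Hypothetical per-region entry; field names are illustrative only."""
    segment_path: str          # 1. region segment (mask file)
    caption: str               # 2. region caption
    pose_edit_paths: List[str] # 3. pose variation edits of the region
    sphere_texture_path: str   # 4. region texture rendered on a sphere
    consistency_score: float   # 5. pose and texture consistency (assumed: higher is better)


def filter_consistent(records: List[RegionRecord], min_score: float) -> List[RegionRecord]:
    """Keep records whose consistency score meets an assumed threshold."""
    return [record for record in records if record.consistency_score >= min_score]


records = [
    RegionRecord("seg_0.png", "a wooden table", ["pose_0a.png"], "tex_0.png", 0.9),
    RegionRecord("seg_1.png", "a striped sofa", ["pose_1a.png"], "tex_1.png", 0.4),
]
kept = filter_consistent(records, min_score=0.5)
```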

Citation

@misc{sajnani2026AppearancePointers,
  title={Appearance Pointers: MultiModal Region Control of Diffusion Transformers},
  author={Rahul Sajnani and Yulia Gryaditskaya and Radomír Měch and Srinath Sridhar and Matheus Gadelha},
  year={2026},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgements

Part of this work was done during Rahul’s internship at Adobe.

Contact

Rahul Sajnani rahul_sajnani@brown.edu