Rewrite AuraFlowPatchEmbed.pe_selection_index_based_on_dim to be torch.compile compatible #11297

Merged

Conversation

AstraliteHeart
Contributor

What does this PR do?

Updates AuraFlowPatchEmbed.pe_selection_index_based_on_dim so that AuraFlowTransformer2DModel can be fully torch.compile'd.

The old and new code generate the same images, but I am not expert enough to know whether this has any negative performance impact or hidden caveats.
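In essence, the rewrite swaps the torch.narrow-based center crop of the positional-embedding index grid for plain slicing. A minimal self-contained sketch of the two equivalent selections (condensed from the comparison script later in this thread):

import torch

grid_size, h_p, w_p = 96, 64, 64  # 96x96 index grid, 64x64 patch window
start_h = (grid_size - h_p) // 2
start_w = (grid_size - w_p) // 2
pe_grid = torch.arange(grid_size * grid_size).view(grid_size, grid_size)

# Old: center-crop the index grid with torch.narrow
old = pe_grid.narrow(0, start_h, h_p).narrow(1, start_w, w_p).flatten()

# New: the same crop via slicing arithmetic, which torch.compile traces cleanly
new = pe_grid[start_h : start_h + h_p, start_w : start_w + w_p].flatten()

assert torch.equal(old, new)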

I've noticed some weirdness while fixing this issue:

AuraFlowTransformer2DModel in the docs has

pos_embed_max_size (int, defaults to 4096): Maximum positions to embed from the image latents.

and in the code

pos_embed_max_size: int = 1024,

but AFAIK for AuraFlow 0.3 it should actually be something like:

pos_embed_max_size=9216,
sample_size=96

Fixes # Originally filed in torch - (issue)

Before submitting

Who can review?

@cloneofsimo @sayakpaul @yiyixuxu

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member

pos_embed_max_size: int = 1024,

Feel free to update the docs, including the 0.3 note :)

The old and new code generate the same images, but I am not expert enough to know whether this has any negative performance impact or hidden caveats.

I think if we get the same numerical outputs with and without the changes, that should be more than enough.

@StrongerXi, wanna investigate this?

Member

@sayakpaul sayakpaul left a comment

Thanks a lot for your efforts here and also for testing torch.compile() with this.

Just as an FYI, we're working on #11085 to have better testing for torch.compile.

For my understanding, does this PR solve the recompilation issues for higher resolutions?

@AstraliteHeart
Contributor Author

AstraliteHeart commented Apr 12, 2025

@sayakpaul

Feel free to update the docs, including the 0.3 note :)

Should I update both pos_embed_max_size=9216 and sample_size=96? I think these are the right values based on the VAE/patch size. I may be confused here, but it looks like the model is configured for AF 0.2, so I am not sure why it works for 0.3, and whether updating them might break 0.2.

So maybe we should have some detection, or at least update the docs? I know some people prefer 0.2 right now.

I think if we get the same numerical outputs with and without the changes, that should be more than enough.

Are the current tests sufficient to confirm this, or should I add something extra first? Any comparisons I should run on my end?

For my understanding, does this PR solve the recompilation issues for higher resolutions?

Correct, with this change I am no longer seeing recompilations with AF loaded via GGUF.

@anijain2305
Contributor

Thanks for taking so much effort to enable torch.compile here. This workstream is truly amazing!

Cc @bobrenjc93 @laithsakka for review of the dynamic-shape-guards-related rewrite. Might be a good rewrite to document in the dynamic shapes manual.

@AstraliteHeart
Contributor Author
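Here is a standalone script comparing the original narrow-based selection against the new slicing-based selection across a range of resolutions: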

import torch
import rich

class TestPosEmbed:
    def __init__(self, pos_embed_max_size: int = 9216, embed_dim: int = 768, device='cpu'):
        """
        Initialize with a dummy positional embedding parameter.
        pos_embed_max_size must be a perfect square (here, 9216 yields a 96x96 grid).
        """
        self.pos_embed = torch.randn(1, pos_embed_max_size, embed_dim, device=device)

    def pe_selection_index_based_on_dim_orig(self, h_p: int, w_p: int) -> torch.Tensor:
        """
        Original implementation using torch.narrow.
        h_p: number of patches in height.
        w_p: number of patches in width.
        """
        total_pe = self.pos_embed.shape[1]
        grid_size = int(total_pe ** 0.5)
        assert grid_size * grid_size == total_pe, "pos_embed_max_size must be a perfect square"

        # Create a grid of indices.
        original_pe_indexes = torch.arange(total_pe, device=self.pos_embed.device).view(grid_size, grid_size)
        
        # Compute starting indices using Python arithmetic.
        starth = (grid_size - h_p) // 2
        startw = (grid_size - w_p) // 2
        
        # Use narrow to select the center region.
        narrowed = original_pe_indexes.narrow(0, starth, h_p)
        narrowed = narrowed.narrow(1, startw, w_p)
        return narrowed.flatten()

    def pe_selection_index_based_on_dim_new(self, h_p: int, w_p: int) -> torch.Tensor:
        """
        New implementation using inlined slicing arithmetic.
        h_p: number of patches in height.
        w_p: number of patches in width.
        """
        total_pe = int(self.pos_embed.shape[1])
        grid_size = int(total_pe ** 0.5)
        assert grid_size * grid_size == total_pe, "pos_embed_max_size must be a perfect square"

        # Compute starting indices (using Python ints).
        start_h = (grid_size - h_p) // 2
        start_w = (grid_size - w_p) // 2

        # Create a grid of indices.
        pe_grid = torch.arange(total_pe, device=self.pos_embed.device).view(grid_size, grid_size)
        # Select the central region and flatten.
        selected_pe = pe_grid[start_h: start_h + h_p, start_w: start_w + w_p].flatten()
        return selected_pe

def run_tests():
    torch.manual_seed(42)
    
    patch_size = 16
    # Use pos_embed_max_size = 9216 -> a 96x96 grid.
    pos_embed_max_size = 9216
    embed_dim = 768

    tester = TestPosEmbed(pos_embed_max_size=pos_embed_max_size, embed_dim=embed_dim, device='cpu')

    resolutions = [
        (224, 224),
        (224, 256),
        (256, 224),
        (256, 256),
        (256, 320),
        (320, 256),
        (384, 384),
        (384, 512),
        (512, 384),
        (512, 512),
        (640, 640),
        (768, 768),
        (1024, 1024),
        (1280, 1280),
        (1536, 1536),
        (1536, 1024),
        (1024, 1536),
        (1280, 768),
        (1536, 768),
        (768, 1536),
        (1536, 896),
        (896, 1536),
    ]

    for (height, width) in resolutions:
        h_p = height // patch_size
        w_p = width // patch_size

        orig_indices = tester.pe_selection_index_based_on_dim_orig(h_p, w_p)
        new_indices = tester.pe_selection_index_based_on_dim_new(h_p, w_p)
        
        match = torch.equal(orig_indices, new_indices)
        
        match_color = "green" if match else "red"
        rich.print(f"[cyan]Resolution: {height} x {width}[/cyan] | [yellow]Patch grid: {h_p} x {w_p}[/yellow] | Match: [{match_color}]{match}[/{match_color}]")

if __name__ == "__main__":
    run_tests()
[Screenshot, 2025-04-11: script output showing Match: True for each listed resolution]
@sayakpaul
Member

AFK currently. Please allow me some time to get back to you.

@sayakpaul
Member

The tests in #11297 (comment) are sufficient. Thanks!

Should I update both pos_embed_max_size=9216 and sample_size=96? I think these are the right values based on the VAE/patch size. I may be confused here, but it looks like the model is configured for AF 0.2, so I am not sure why it works for 0.3, and whether updating them might break 0.2.

Well, when from_pretrained() is called, the configs are read from the config.json. Similarly, for from_single_file(), these configs are automatically constructed from the state dict and the equivalent diffusers repository's config.json. So just updating the docs is fine IMO.

Does this answer your question?
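To illustrate (a minimal sketch; the subfolder="transformer" layout is the standard diffusers repository structure, assumed here rather than quoted from this thread):

import torch
from diffusers import AuraFlowTransformer2DModel

# The config values come from the checkpoint's config.json,
# not from the Python defaults in the class signature.
transformer = AuraFlowTransformer2DModel.from_pretrained(
    "fal/AuraFlow-v0.3", subfolder="transformer", torch_dtype=torch.bfloat16
)
print(transformer.config.pos_embed_max_size)  # value from config.json, not the 1024 default
print(transformer.config.sample_size)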

@sayakpaul
Member

sayakpaul commented Apr 14, 2025

What I would also do is the following (perhaps in a separate PR):

Add a new test class/method in https://github.com/huggingface/diffusers/blob/main/tests/pipelines/aura_flow/test_pipeline_aura_flow.py that checks no recompilation is triggered when we go for higher resolutions. I believe we won't need a pre-trained checkpoint for this. We could use the dummy model from that file (transformer = AuraFlowTransformer2DModel(...)) and write our test case accordingly.

I can work on this and, when ready, ask for a review from you and @anijain2305. WDYT? Let me also know if this test case makes sense; a rough sketch of the idea follows below.
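A rough sketch of what such a test could look like (the tiny config values below are illustrative stand-ins for the dummy model in test_pipeline_aura_flow.py, and torch._dynamo.config.error_on_recompile turns any recompilation into a hard failure):

import torch
import torch.fx.experimental._config as fx_config
from diffusers import AuraFlowTransformer2DModel

torch._dynamo.config.error_on_recompile = True  # any recompile raises instead of happening silently
fx_config.use_duck_shape = False  # do not tie height/width symbols together

# Tiny illustrative config; the real test would reuse the dummy model
# from test_pipeline_aura_flow.py.
transformer = AuraFlowTransformer2DModel(
    sample_size=32,
    patch_size=2,
    in_channels=4,
    num_mmdit_layers=1,
    num_single_dit_layers=1,
    attention_head_dim=8,
    num_attention_heads=4,
    joint_attention_dim=32,
    caption_projection_dim=32,
    out_channels=4,
    pos_embed_max_size=256,  # 16x16 grid -> max latent size 32 with patch_size=2
).eval()

compiled = torch.compile(transformer, fullgraph=True, dynamic=True)

prompt_embeds = torch.randn(1, 8, 32)
timestep = torch.tensor([1.0])
with torch.no_grad():
    for height, width in [(16, 16), (24, 16), (32, 24)]:  # growing, non-square latent sizes
        latents = torch.randn(1, 4, height, width)
        compiled(hidden_states=latents, encoder_hidden_states=prompt_embeds, timestep=timestep)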

Also, @AstraliteHeart, if possible it would be great to update the AuraFlow docs with a section on avoiding recompilations when using torch.compile() at higher resolutions. Alternatively, if you update the PR description with a code snippet, I can do it.

@sayakpaul sayakpaul requested a review from yiyixuxu April 14, 2025 04:49
@sayakpaul
Member

@yiyixuxu could you also review this PR? It helps make AuraFlow more compatible with torch.compile(), especially at higher resolutions, so that it doesn't trigger recompilations.

Collaborator

@yiyixuxu yiyixuxu left a comment

thanks!


@bobrenjc93 bobrenjc93 left a comment

Looks great, thanks!

cc @laithsakka you probably want to include a section on reducing recompiles by rewriting tensor indexing operations, as in this PR, in your recompilations guide. Also, you should probably either write a separate OSS version or publish the internal version of https://docs.google.com/document/d/1QgQLVBNKSYMeNbG5sEz_pwffL9PlKRHKXMI4ft3H9gA/edit?tab=t.0#heading=h.a37bpg8ay2f4

@sayakpaul
Member

@AstraliteHeart just waiting for you to confirm a few of my comments above when you have time. We will then merge :)

@AstraliteHeart
Contributor Author

@sayakpaul

Feel free to update the docs, including the 0.3 note :)

Updated the docs to reflect the correct default values. I don't think we need a 0.3 note; I had assumed the values were not read from the model, which was incorrect (see below).

Well, when from_pretrained() is called, the configs are read from the config.json

I rechecked the values populated from the config, and you are correct.

I can work on this and when ready ask for a review you and @anijain2305. WDYT? LM also know if this test case makes sense.

I would never say "no" to someone volunteering to write tests, but lmk if you want me to work on that.

Also, @AstraliteHeart if possible, it would be great to update the docs of AuraFlow with a section on no recompilations when using torch.compile() on higher resolutions.

For the compilation example, I believe the only special thing needed right now is disabling torch.fx.experimental._config.use_duck_shape (duck shaping reuses a single symbol for dimensions that happen to have equal sizes, which can tie height and width together and trigger recompiles once they differ), and all resolutions then just work (with my fix).

import torch
import torch.fx.experimental._config
from diffusers import AuraFlowPipeline, AuraFlowTransformer2DModel, GGUFQuantizationConfig

torch.fx.experimental._config.use_duck_shape = False

transformer = AuraFlowTransformer2DModel.from_single_file(
    "https://huggingface.co/city96/AuraFlow-v0.3-gguf/blob/main/aura_flow_0.3-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipeline = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow-v0.3",
    torch_dtype=torch.bfloat16,
    transformer=transformer,
).to("cuda")

pipeline.transformer = torch.compile(pipeline.transformer, fullgraph=True, dynamic=True)

@AstraliteHeart just waiting for you to provide some confirmations to my comments above when you have time. We will then merge :)

Happy to get this merged (but please check the doc update just in case).

@yiyixuxu @bobrenjc93 thank you for having a look.

@sayakpaul
Member

Thank you!

Can I push directly to your branch to include the snippet in #11297 (comment) in the AuraFlow pipeline docs?
