# DFlash: 3x LLM inference speed – guide and code

DFlash is a new block-diffusion based speculative decoding technique that speeds up large language model (LLM) inference by predicting multiple tokens in parallel. Unlike traditional autoregressive drafters, it uses a single forward pass and bidirectional attention to generate 8-16 tokens simultaneously, achieving up to 3x speedups. We look at how this architecture works, the data required to train a custom draft model, and how to run the resulting accelerated models efficiently.

## What DFlash is, and how to train a draft model for faster LLM inference

DFlash is a speculative decoding technique that uses a lightweight block diffusion draft model to propose several future tokens in one parallel step, while the larger target LLM still verifies the output so generation remains lossless ([DFlash paper on arXiv](https://arxiv.org/abs/2602.06036)). LLM decoding is sequential: each token depends on the tokens before it, so decode latency and GPU memory bandwidth quickly become the bottleneck in production systems.

In practice, DFlash gives us a way to keep the quality of an autoregressive model while making the drafting phase much less sequential, especially when we can train a draft model on the same prompts and completions used in our real workload.

## Why decoding is the bottleneck

LLMs are fast at large matrix multiplications, but in standard autoregressive decoding generation still happens one token at a time. If we strip the jargon away, every new token requires another trip through the model, and this is why low-latency inference is hard even when the GPU itself is powerful.

Speculative decoding improves this pattern by pairing a large target model with a smaller draft model: the draft model proposes several candidate tokens, and the target model verifies them in parallel. EAGLE and its successors made this idea practical by using hidden states from the target model, but EAGLE-style drafting is still autoregressive, so each proposed token still needs its own draft forward pass.
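
To make the draft-and-verify cycle concrete, here is a minimal sketch with greedy acceptance. The names `speculative_step`, `toy_draft`, and `toy_target` are hypothetical, and the "models" are plain functions over integer tokens. A real target model scores all drafted positions in one batched forward pass; this toy checks them one by one only to keep the logic visible.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify cycle and return the newly accepted tokens."""
    # Draft phase: an autoregressive drafter needs k sequential calls.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verify phase: keep drafted tokens while they match the target's choice.
    accepted: List[int] = []
    for proposed in draft:
        expected = target_next(prefix + accepted)
        if proposed != expected:
            accepted.append(expected)  # the target's correction replaces the miss
            return accepted
        accepted.append(proposed)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def greedy_decode(prefix: List[int], target_next: NextToken, n: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only."""
    out = list(prefix)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prefix):]


def speculative_decode(prefix: List[int], draft_next: NextToken,
                       target_next: NextToken, n: int, k: int) -> List[int]:
    """Generate n tokens by repeating draft-then-verify cycles."""
    out = list(prefix)
    while len(out) - len(prefix) < n:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[len(prefix):len(prefix) + n]


def toy_target(seq: List[int]) -> int:
    """Toy target: count upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy drafter: agrees with the target except right after token 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

Because the target decides every accepted token, `speculative_decode` returns the same sequence as `greedy_decode`. The drafter only changes how many target passes are needed to get there.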

That is the ceiling DFlash tries to break. Instead of asking the drafter to predict token 1, then token 2, then token 3, DFlash asks a block diffusion draft model to fill a whole masked block at once.

## What DFlash changes

**DFlash keeps the useful part of speculative decoding:** the target model remains the verifier, so accepted tokens are still checked by the model we actually want to serve. The change is in the draft phase, where DFlash uses bidirectional attention inside a block so the draft model can predict multiple masked tokens in a single forward pass.
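
The difference between causal and in-block bidirectional attention is easiest to see as a visibility matrix. The sketch below is an illustration only, not DFlash's actual masking code. `block_visibility` is a hypothetical helper that marks which positions each slot of a draft block may attend to, assuming the verified prefix is always fully visible.

```python
from typing import List


def block_visibility(prefix_len: int, block_size: int,
                     bidirectional: bool) -> List[List[int]]:
    """Return a block_size x (prefix_len + block_size) matrix of 0/1 flags.

    Row i describes draft slot i: 1 means the slot may attend to that position.
    """
    rows = []
    for i in range(block_size):
        row = [1] * prefix_len  # every slot sees the whole verified prefix
        for j in range(block_size):
            # Causal: slot i sees block slots 0..i. Bidirectional: it sees all.
            row.append(1 if (bidirectional or j <= i) else 0)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for name, flag in (("causal", False), ("bidirectional", True)):
        print(name)
        for row in block_visibility(prefix_len=2, block_size=3, bidirectional=flag):
            print(" ", row)
```

With the causal pattern, slot 2 can only be computed after slots 0 and 1 are known. With the bidirectional pattern no slot waits on another, which is what allows a single forward pass to fill the block.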

The draft model is not an independent small LLM guessing from scratch. DFlash conditions it on hidden features extracted from the target model, then injects those fused features into the Key/Value projections of every draft layer, rather than only feeding them into the first layer.

In practice, that means the draft model can be deeper and more expressive without losing the main speed advantage. Baseten explains the trade-off clearly: a single DFlash draft pass can be slower than a single EAGLE draft pass, but DFlash predicts 8 to 16 tokens at once while EAGLE predicts one token per draft pass.
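
A back-of-the-envelope cost model shows why one slower draft pass can still win. This is a rough sketch under stated assumptions: costs are expressed relative to one target forward pass, verification costs one target pass, and the numbers in the example are made up for illustration, not measurements. `estimated_speedup` is a hypothetical helper.

```python
def estimated_speedup(accepted_per_cycle: float, draft_passes: int,
                      draft_pass_cost: float, verify_cost: float = 1.0) -> float:
    """Estimate tokens per unit time relative to plain decoding.

    Plain decoding yields 1 token per target pass (cost 1.0). A speculative
    cycle yields `accepted_per_cycle` tokens and costs the draft passes plus
    one verification pass.
    """
    cycle_cost = draft_passes * draft_pass_cost + verify_cost
    return accepted_per_cycle / cycle_cost


if __name__ == "__main__":
    # Hypothetical inputs: same acceptance length, different draft patterns.
    sequential = estimated_speedup(4.0, draft_passes=8, draft_pass_cost=0.05)
    one_shot = estimated_speedup(4.0, draft_passes=1, draft_pass_cost=0.15)
    print(f"8 cheap sequential draft passes: {sequential:.2f}x")
    print(f"1 heavier block draft pass:      {one_shot:.2f}x")
```

In this toy setting the single pass is three times more expensive than one sequential pass, yet the cycle is cheaper overall because the cost is paid once rather than eight times. Real numbers depend on hardware, batch size, and acceptance length.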

![](https://regolo.ai/wp-content/uploads/2026/05/image-1-1024x377.png)

## Open DFlash draft models to start from

Z Lab maintains a DFlash collection on Hugging Face with draft models for Qwen, Llama, Gemma, Kimi, and GPT-OSS families ([Hugging Face DFlash collection](https://huggingface.co/collections/z-lab/dflash)). These are draft components, so each one must be paired with its intended target model rather than served as a normal standalone chat model ([z-lab/Qwen3-8B-DFlash-b16](https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16)).

To skip training, you can pick one of the models in the [Z Lab DFlash collection](https://huggingface.co/collections/z-lab/dflash):

![](https://regolo.ai/wp-content/uploads/2026/06/Screenshot-2026-06-03-alle-13.43.56-1024x743.png)

## Guide and code

This guide shows you how to train a supported model with DFlash, publish it to your account, create a GPU deployment, and call it through an OpenAI-style chat completions API in a few minutes.

Replace `REGOLO_API_KEY`, `REGOLO_BASE_URL`, `GPU_TYPE_PLACEHOLDER`, `REGION_PLACEHOLDER`, `CUSTOM_MODEL_ENDPOINTS`, and deployment fields with values from the latest Regolo.ai documentation. The endpoint paths below are intentionally explicit placeholders because Custom Models APIs, GPU names, runtime names, and deployment schemas can change.

### Step 1: Install dependencies 

```
pip install transformers accelerate bitsandbytes flash-attn --no-build-isolation
# Qwen2 support requires transformers >= 4.37
pip install --upgrade transformers accelerate
```

> `flash-attn` requires a compatible CUDA version and GPU (sm\_80+ for Ampere/Ada). Build may take a few minutes.
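
Before starting a long `flash-attn` build, it can help to confirm the GPU meets the sm_80 requirement. The following is a small sketch: `meets_sm80` and `local_gpu_capability` are hypothetical helper names, and the second one returns `None` when PyTorch or CUDA is not available.

```python
from typing import Optional, Tuple


def meets_sm80(capability: Tuple[int, int]) -> bool:
    """Return True when a (major, minor) compute capability is sm_80 or newer."""
    return tuple(capability) >= (8, 0)


def local_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return the first CUDA device's compute capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return (major, minor)


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA device visible from PyTorch.")
    else:
        verdict = "OK" if meets_sm80(cap) else "too old"
        print(f"Compute capability sm_{cap[0]}{cap[1]}: {verdict} for flash-attn")
```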

### Step 2: Custom Qwen model with DFlash (Dynamic FlashAttention)

```
import torch
from torch import nn
from transformers import Qwen2Model, Qwen2Config
from transformers.models.qwen2.modeling_qwen2 import Qwen2SdpaAttention
from flash_attn import flash_attn_func  # public attention entry point of the flash-attn 2.x package

class DFlashQwen2Attention(nn.Module):
    """
    Qwen2Attention with Dynamic FlashAttention (varlen support).
    Replaces Qwen2SdpaAttention in the model.
    """
    def __init__(self, config: Qwen2Config, layer_idx: int = None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.layer_idx = layer_idx

        # Qwen2 uses biases on the Q/K/V projections but not on the output projection
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

        self.rotary_emb = None  # Will be set by model

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor = None,
        position_ids: torch.Tensor = None,
        past_key_value = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        **kwargs,
    ):
        batch_size, q_len, _ = hidden_states.size()

        # Project to Q, K, V
        query_states = self.q_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim)
        key_states = self.k_proj(hidden_states).view(batch_size, q_len, self.num_key_value_heads, self.head_dim)
        value_states = self.v_proj(hidden_states).view(batch_size, q_len, self.num_key_value_heads, self.head_dim)

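        # Apply RoPE. This sketch assumes `self.rotary_emb` is a callable that returns the
        # rotated (query, key) pair. Stock Hugging Face rotary modules return (cos, sin)
        # instead, so wrap them with `apply_rotary_pos_emb` before assigning them here.
        if self.rotary_emb is not None:
            query_states, key_states = self.rotary_emb(query_states, key_states, position_ids)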

        # 🔁 DYNAMIC FLASHATTENTION SETUP
        # Convert to [batch, seqlen, num_heads, head_dim] → [total_tokens, num_heads, head_dim]
        # Requires packing: stack all sequences (skip padding)
        # This is only efficient for variable-length batching (e.g., in training)
        if attention_mask is not None and self.training:
            # Padded training batches need the varlen path: unpad with
            # `flash_attn.bert_padding.unpad_input`, call `flash_attn_varlen_func`,
            # then restore the layout with `pad_input`.
            # Here we assume packed input (already flattened) for DFlash usage.
            raise NotImplementedError("Full DFlash with padding requires unpack/pack helpers. See example below.")

        # For inference (single-sequence) or packed training, use flash_attn_func directly.
        # It expects q/k/v as [batch_size, seqlen, nheads, headdim] in fp16/bf16, which is
        # the layout produced by the .view(...) calls above, so no transpose is needed.
        # Grouped-query attention (fewer K/V heads than Q heads) is handled by the kernel.
        attn_output = flash_attn_func(
            query_states.contiguous(),
            key_states.contiguous(),
            value_states.contiguous(),
            # Apply attention dropout only while training
            dropout_p=self.config.attention_dropout if self.training else 0.0,
            softmax_scale=None,  # defaults to 1 / sqrt(head_dim)
            causal=True,  # Qwen is a causal decoder
        )

        # [B, S, H, D] -> [B, S, H*D]
        attn_output = attn_output.reshape(batch_size, q_len, self.hidden_size)

        # Output projection
        attn_output = self.o_proj(attn_output)

        return attn_output, None, None  # (no attention weights, no past_key for flash)
```

### Step 3: Patch the model

```
def apply_dflash_to_qwen(model: Qwen2Model):
    """
    Replace all attention layers in Qwen2Model with DFlashQwen2Attention.
    """
    # Snapshot the module list first so replacing modules does not disturb iteration
    for name, module in list(model.named_modules()):
        if isinstance(module, Qwen2SdpaAttention):
            parent_name, _, attr_name = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model

            # Attention modules are named like "layers.3.self_attn",
            # so the layer index is the second-to-last path component
            layer_idx = int(name.split(".")[-2])

            # Create new DFlash layer
            new_layer = DFlashQwen2Attention(model.config, layer_idx=layer_idx)

            # Reuse the trained projection weights; strict=False skips any keys
            # the new layer does not define
            new_layer.load_state_dict(module.state_dict(), strict=False)
            new_layer.to(device=module.q_proj.weight.device, dtype=module.q_proj.weight.dtype)

            # Copy rotary emb from original layer
            new_layer.rotary_emb = module.rotary_emb

            # Replace
            setattr(parent, attr_name, new_layer)

    print("✅ Replaced all attention layers with Dynamic FlashAttention.")
    return model
```

### Step 4: Create the main script

```
if __name__ == "__main__":
    from transformers import AutoTokenizer

    # Load model (e.g., Qwen2-7B-Instruct)
    model_name = "Qwen/Qwen2-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    config = Qwen2Config.from_pretrained(model_name)
    model = Qwen2Model.from_pretrained(model_name, config=config, torch_dtype=torch.bfloat16)

    # Apply DFlash
    model = apply_dflash_to_qwen(model)
    model = model.to("cuda").eval()

    # Prepare input
    texts = ["Write a poem about AI.", "Explain quantum computing."]
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
        outputs = model(**inputs)
        print("✅ DFlash inference successful!")
        print("Output shape:", outputs.last_hidden_state.shape)
```

## Now, deploy it on Regolo Custom Models 

After training, follow the next steps to deploy the model and set up your infrastructure in a few minutes. If you prefer to skip training, choose a model [from this list](https://huggingface.co/collections/z-lab/dflash) and paste its URL in step 2.

### Step 1: Push the model to Hugging Face 

Push the DFlash model to your Hugging Face account, using [the official upload guide](https://huggingface.co/docs/huggingface_hub/guides/upload) as a reference.
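
For a scripted upload, a minimal sketch with the `huggingface_hub` client could look like the following. The function name `push_model_folder` and the `config.json` sanity check are assumptions added for illustration. `HfApi.create_repo` and `HfApi.upload_folder` are the library calls doing the work.

```python
from pathlib import Path
from typing import Optional


def push_model_folder(folder: str, repo_id: str, private: bool = True,
                      token: Optional[str] = None) -> str:
    """Upload a saved model folder to the Hugging Face Hub and return its URL."""
    path = Path(folder)
    # Sanity check (an assumption of this sketch): saved models ship a config.json.
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"{path} has no config.json; is this a saved model folder?")

    # Imported lazily so the check above also works where the client is not installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)  # falls back to the locally cached login when token is None
    api.create_repo(repo_id=repo_id, repo_type="model", private=private, exist_ok=True)
    api.upload_folder(folder_path=str(path), repo_id=repo_id, repo_type="model")
    return f"https://huggingface.co/{repo_id}"
```

The returned URL is the value to paste into the Custom Models form in the next step. For a private repository, the same form also needs an access token.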

### Step 2: Deploy the model on Regolo infrastructure 

Log in, [click **Custom Models**](https://dashboard.regolo.ai/custom-models) in the left sidebar menu, then click "**Add your first model**":

![](http://regolo.ai/wp-content/uploads/2026/02/Screenshot-2026-02-22-alle-19.15.26-1024x487.png)

Then fill in the form with the Hugging Face URL where your model is pushed, plus your API key if required (the model can be public or private):

![](https://regolo.ai/wp-content/uploads/2026/06/custom-models-huggingface-api-1024x611.png)

Next, choose the GPU for the deployment:

![](http://regolo.ai/wp-content/uploads/2026/03/custom-models-choose-gpu-1024x855.png)

Once the process completes, your model is up and running within a few minutes and you can start using it:

```
curl "$REGOLO_DFLASH_ENDPOINT/v1/chat/completions" \
  -H "Authorization: Bearer $REGOLO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-dflash-api",
    "messages": [
      {
        "role": "user",
        "content": "Give me three latency checks to run before moving a DFlash deployment to production."
      }
    ],
    "temperature": 0.2,
    "max_tokens": 256
  }'
```
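
The same call can be made from Python with only the standard library. This is a sketch that mirrors the curl request above. `build_chat_request` is a hypothetical helper, and the response parsing assumes the endpoint returns the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import os
import urllib.request


def build_chat_request(endpoint: str, api_key: str, prompt: str,
                       model: str = "qwen3-8b-dflash-api") -> urllib.request.Request:
    """Build (but do not send) a chat completions request for the deployment."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    endpoint = os.environ.get("REGOLO_DFLASH_ENDPOINT")
    api_key = os.environ.get("REGOLO_API_KEY")
    if endpoint and api_key:
        request = build_chat_request(
            endpoint, api_key,
            "Give me three latency checks to run before moving a DFlash deployment to production.",
        )
        with urllib.request.urlopen(request, timeout=60) as response:
            body = json.load(response)
        # Assumes an OpenAI-style response body.
        print(body["choices"][0]["message"]["content"])
    else:
        print("Set REGOLO_DFLASH_ENDPOINT and REGOLO_API_KEY first.")
```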

---

## Additional Notes on "Dynamic" (Varlen) FlashAttention

1. **Full varlen support** requires:
     - `cu_seqlens_q` and `cu_seqlens_k` (cumulative sequence lengths)
     - `max_seqlen_q`, `max_seqlen_k`
     - Using `flash_attn_varlen_func` instead of `flash_attn_func`
     - Preprocessing with `unpad_input` / `pad_input` from `flash_attn.bert_padding`
2. **For training with padding** (e.g., `padding_side='left'`), you need to:
     - Unpad inputs using `unpad_input`, then project to \[total\_tokens, H, D\]
     - Compute `cu_seqlens` from attention mask
     - Pass to `flash_attn_varlen_func`
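
To show what "cumulative sequence lengths" means, here is a small pure-Python sketch that derives them from a padding mask. `cu_seqlens_from_mask` is a hypothetical helper. In real code these values live in an int32 tensor of length `batch_size + 1`, and `unpad_input` computes them for you.

```python
from itertools import accumulate
from typing import List, Tuple


def cu_seqlens_from_mask(attention_mask: List[List[int]]) -> Tuple[List[int], int]:
    """Return (cumulative sequence lengths, longest sequence) for a 0/1 mask.

    attention_mask has one row per sequence, with 1 for real tokens and 0 for padding.
    """
    lengths = [sum(row) for row in attention_mask]
    # Sequence i occupies rows cu_seqlens[i]:cu_seqlens[i + 1] of the packed tensor.
    cu_seqlens = [0] + list(accumulate(lengths))
    return cu_seqlens, max(lengths)


if __name__ == "__main__":
    mask = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 1],
    ]
    print(cu_seqlens_from_mask(mask))  # ([0, 3, 5, 9], 4)
```

The three sequences hold 3, 2, and 4 real tokens, so the packed tensor has 9 rows and the boundaries fall at 0, 3, 5, and 9.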

### Example varlen helper (for advanced use)

```
from flash_attn import flash_attn_varlen_func
from flash_attn.bert_padding import unpad_input, pad_input

def forward_varlen(self, hidden_states, attention_mask, **kwargs):
    # hidden_states: [B, S, D]
    # attention_mask: [B, S] (1 = keep, 0 = pad)
    batch_size, seqlen = hidden_states.shape[:2]

    # Unpad. Newer flash-attn releases return an extra trailing value,
    # so anything past the first four results is collected into `_`.
    hidden_states_unpad, indices, cu_seqlens, max_seqlen, *_ = unpad_input(
        hidden_states, attention_mask
    )
    # hidden_states_unpad: [total_unpadded_tokens, D]

    # Project Q/K/V (same as before)
    query_unpad = self.q_proj(hidden_states_unpad).view(-1, self.num_heads, self.head_dim)
    key_unpad = self.k_proj(hidden_states_unpad).view(-1, self.num_key_value_heads, self.head_dim)
    value_unpad = self.v_proj(hidden_states_unpad).view(-1, self.num_key_value_heads, self.head_dim)

    # Apply RoPE to unpad (need to handle position_ids → indices mapping)
    # ... (requires position_ids mapping)

    # Flash varlen: sequences are packed along dim 0 and delimited by cu_seqlens
    attn_output_unpad = flash_attn_varlen_func(
        query_unpad, key_unpad, value_unpad,
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen,
        max_seqlen_k=max_seqlen,
        dropout_p=0.0,
        causal=True
    )

    # Repad to [B, S, H, D], then merge heads before the output projection
    attn_output = pad_input(attn_output_unpad, indices, batch_size, seqlen)
    return self.o_proj(attn_output.reshape(batch_size, seqlen, -1))
```

---

## Recommendations

- **For inference or training**, use the simpler `flash_attn_func` with `causal=True` if sequences are not heavily padded.
- **For high efficiency with padding** (e.g., training with dynamic batch sizes), use `flash_attn_varlen_func` with the unpad/pad workflow.
- **Qwen models (v2+) already support** `attn_implementation="flash_attention_2"` in `from_pretrained`:

```python
import torch
from transformers import Qwen2ForCausalLM

model = Qwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

---

## FAQ

### Is DFlash a replacement for the target LLM?

No. DFlash is a draft model used inside speculative decoding, and the target LLM still verifies the proposed tokens before they are accepted. That is why DFlash can improve speed without changing the model whose output we trust.

### Why does DFlash use target-model hidden states?

The hidden states give the draft model a representation that is already aligned with the target model’s internal reasoning. Z Lab describes extracting hidden features from layers sampled across the target model, fusing them, and injecting them into every draft layer through KV projections.

### Do we need a different DFlash drafter for every target model?

In practice, yes. The public DFlash checkpoints are draft components paired to specific target models or model families, such as Qwen3-8B or Llama 3.1 8B Instruct ([z-lab/Qwen3-8B-DFlash-b16](https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16)). A mismatched drafter can reduce acceptance and may remove the performance benefit.

### Can we train DFlash with private data?

Yes, the training recipe can use our own tokenized sequences and target-model hidden states, but we should avoid storing raw prompts or completions in audit logs. On Regolo.ai, the natural pattern is to run the GPU job in an EU-controlled environment, store only content-free audit events, and verify current zero-retention and data-handling settings in the latest documentation.

### What is the main trade-off?

DFlash adds system complexity because we now manage a target model, a draft model, hidden-state extraction for training, and runtime support for speculative decoding. The trade-off is worthwhile only if acceptance length and latency measurements improve on our actual workload, not just on public benchmarks.
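
One way to check this on our own workload is to log how many tokens each verification step accepted, time the same prompts with and without speculative decoding, and summarize the result. The sketch below assumes we already have those measurements. `summarize_run` and its field names are hypothetical, not part of any serving runtime.

```python
from statistics import mean
from typing import Dict, List


def summarize_run(accepted_per_step: List[int], spec_seconds: float,
                  baseline_seconds: float) -> Dict[str, float]:
    """Summarize one benchmark run that produced the same tokens both ways.

    accepted_per_step: tokens accepted at each target verification step.
    spec_seconds: wall-clock time with speculative decoding enabled.
    baseline_seconds: wall-clock time for plain decoding of the same output.
    """
    total_tokens = sum(accepted_per_step)
    return {
        "mean_accepted_per_step": mean(accepted_per_step),
        "spec_tokens_per_s": total_tokens / spec_seconds,
        "baseline_tokens_per_s": total_tokens / baseline_seconds,
        "speedup": baseline_seconds / spec_seconds,
    }


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    report = summarize_run([4, 6, 2, 4], spec_seconds=2.0, baseline_seconds=5.0)
    for key, value in report.items():
        print(f"{key}: {value:.2f}")
```

If the speedup on real prompts stays close to 1.0, the extra moving parts are probably not paying for themselves.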

---

**Start your free 30-day trial at [regolo.ai](https://regolo.ai/) and deploy LLMs with complete privacy by design.**

👉 [Talk with our Engineers](https://regolo.ai/contacts/) or [Start your 30 days free →](https://regolo.ai/pricing)

---

- [Discord](https://discord.gg/ZzZvuR2y) - Share your thoughts
- [GitHub Repo](https://github.com/regolo-ai/) - Code of blog articles ready to start
- Follow Us on X [@regolo\_ai](https://x.com/regolo_ai)
- Open discussion on our [Subreddit Community](https://www.reddit.com/r/regolo_ai/)

---

*Built with ❤️ by the Regolo team. Questions? [regolo.ai/contact](https://regolo.ai/contact)* or chat with us on [Discord](https://discord.gg/ZzZvuR2y)