arxiv:2605.07210

DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

Published on May 8 · Submitted by shuai wang on May 11

Abstract

AI-generated summary: DiffRetriever enables efficient multi-token retrieval using diffusion language models by generating representations in parallel rather than sequentially, achieving superior performance over autoregressive methods.

PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding. We show that the bottleneck is sequential generation, not the multi-token idea itself. DiffRetriever is a representative-token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps, the encoder-style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine-tuned single-vector RepLLaMA. A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at https://github.com/ielab/diffretriever.

Community

Paper submitter

TL;DR: prior work on multi-token LLM retrievers (PromptReps, ColBERT-style variants on autoregressive LLMs) found that going from K=1 to K>1 representative tokens doesn't reliably help, despite a decoding cost that grows linearly in K. We show that the bottleneck wasn't multi-token retrieval itself; it was sequential autoregressive generation.
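
For concreteness, the standard way to score a query against a passage when each side has K representative vectors is ColBERT-style MaxSim. This page doesn't spell out DiffRetriever's exact scoring function, so the sketch below is an illustrative assumption (all names and shapes are ours), not the paper's implementation:

# Illustrative ColBERT-style MaxSim scoring for multi-vector retrieval.
# Not DiffRetriever's confirmed scoring function; names/shapes are assumptions.
import torch

def maxsim_score(query_vecs: torch.Tensor, passage_vecs: torch.Tensor) -> torch.Tensor:
    # query_vecs: [Kq, d], passage_vecs: [Kp, d], both L2-normalized.
    # Each query vector is matched to its most similar passage vector,
    # and the per-vector maxima are summed into one relevance score.
    sim = query_vecs @ passage_vecs.T          # [Kq, Kp] cosine similarities
    return sim.max(dim=1).values.sum()

# Toy usage: 4 query representatives vs. 8 passage representatives, d=128.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
print(maxsim_score(q, p).item())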

Method. DiffRetriever queries a diffusion language model (Dream-7B, LLaDA-8B) in the form it was pretrained on: append K [MASK] positions to a retrieval prompt and read K dense + K sparse representations from a single bidirectional forward pass. Encoding cost stays roughly constant in K instead of scaling with it.
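
A minimal sketch of that encoding step, assuming a Hugging Face checkpoint that loads with trust_remote_code, a tokenizer that defines a mask token, and the last hidden layer as the dense representation. None of these details are confirmed by this page, so treat it as an illustration rather than the official implementation (see the repo for the real code):

# Sketch of parallel representative-token encoding with a diffusion LM.
# Assumptions: the checkpoint id, mask-token handling, and use of the final
# hidden layer are illustrative guesses, not the official DiffRetriever code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "Dream-org/Dream-v0-Instruct-7B"  # assumed repo id for the Dream-7B backbone

tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def encode(text: str, k: int = 8) -> torch.Tensor:
    # Append k [MASK] positions to the prompt and read all k representatives
    # from a single bidirectional forward pass; cost is roughly constant in k.
    prompt_ids = tok(text, return_tensors="pt").input_ids            # [1, L]
    masks = torch.full((1, k), tok.mask_token_id, dtype=torch.long)  # [1, k]
    input_ids = torch.cat([prompt_ids, masks], dim=1)                # [1, L+k]

    out = model(input_ids=input_ids, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -k:, :]                         # [k, d] dense vectors
    # Sparse (vocabulary-space) representations would come from the LM-head
    # logits at the same k positions; omitted here for brevity.
    return torch.nn.functional.normalize(dense.float(), dim=-1)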

Findings.

  • Multi-token helps on every diffusion backbone we test and on every benchmark (MS MARCO, TREC DL'19/'20, BEIR-7). Autoregressive multi-token stays flat or worse, despite paying roughly 15× the latency in the zero-shot setting.
  • After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps (Qwen2.5 / LLaMA3), encoder-style DiffEmbed on the same diffusion backbones, and contrastively fine-tuned RepLLaMA.
  • Cleanest control: Dream is initialized from Qwen2.5, so the architecture and initial weights are identical and only the training objective differs. The K=1 vs. K>1 ordering inverts between the two, so the gain tracks the decoding strategy, not the backbone.
  • A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget on every backbone×benchmark pair, pointing to adaptive budget selection as future work (see the sketch after this list).
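
To make the oracle concrete: the sketch below assumes hypothetical retrieve(query, k) and metric(query, ranking) callables standing in for the real retrieval and evaluation code. For each query it simply keeps the best-scoring budget, which by construction upper-bounds any fixed-K strategy with the same maximum budget:

# Hypothetical per-query oracle over representative-token budgets K.
# retrieve() and metric() are placeholders, not the paper's actual API.
from typing import Callable, Iterable, List

def per_query_oracle(
    queries: Iterable[str],
    budgets: Iterable[int],
    retrieve: Callable[[str, int], List[str]],
    metric: Callable[[str, List[str]], float],
) -> float:
    # For each query, evaluate every candidate budget and keep the best,
    # then average across queries.
    budgets = list(budgets)
    scores = [max(metric(q, retrieve(q, k)) for k in budgets) for q in queries]
    return sum(scores) / len(scores)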

Code: https://github.com/ielab/diffretriever
