arxiv:2605.07210

DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

Published on May 8 · Submitted by shuai wang on May 11

Abstract

AI-generated summary: DiffRetriever enables efficient multi-token retrieval using diffusion language models by generating representations in parallel rather than sequentially, achieving superior performance over autoregressive methods.

PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding. We show that the bottleneck is sequential generation, not the multi-token idea itself. DiffRetriever is a representative-token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps, the encoder-style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine-tuned single-vector RepLLaMA. A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at https://github.com/ielab/diffretriever.

Community

Paper submitter

TL;DR: prior work on multi-token LLM retrievers (PromptReps, ColBERT-style variants on autoregressive LLMs) found that going from K=1 to K>1 representative tokens doesn't reliably help, despite a decoding cost that grows linearly in K. We show that the bottleneck wasn't multi-token retrieval itself; it was sequential autoregressive generation.
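
For concreteness, the standard way to score a query against a passage when each side has K representative vectors is ColBERT-style MaxSim. This page doesn't spell out DiffRetriever's exact scoring function, so the sketch below is an illustrative assumption (all names and shapes are ours), not the paper's implementation:

# Illustrative ColBERT-style MaxSim scoring for multi-vector retrieval.
# Not DiffRetriever's confirmed scoring function; names/shapes are assumptions.
import torch

def maxsim_score(query_vecs: torch.Tensor, passage_vecs: torch.Tensor) -> torch.Tensor:
    # query_vecs: [Kq, d], passage_vecs: [Kp, d], both L2-normalized.
    # Each query vector is matched to its most similar passage vector,
    # and the per-vector maxima are summed into one relevance score.
    sim = query_vecs @ passage_vecs.T          # [Kq, Kp] cosine similarities
    return sim.max(dim=1).values.sum()

# Toy usage: 4 query representatives vs. 8 passage representatives, d=128.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
print(maxsim_score(q, p).item())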

Method. DiffRetriever queries a diffusion language model (Dream-7B, LLaDA-8B) in the form it was pretrained on: append K [MASK] positions to a retrieval prompt and read K dense + K sparse representations from a single bidirectional forward pass. Encoding cost stays roughly constant in K instead of scaling with it.
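
A minimal sketch of that encoding step, assuming a Hugging Face checkpoint that loads with trust_remote_code, a tokenizer that defines a mask token, and the last hidden layer as the dense representation. None of these details are confirmed by this page, so treat it as an illustration rather than the official implementation (see the repo for the real code):

# Sketch of parallel representative-token encoding with a diffusion LM.
# Assumptions: the checkpoint id, mask-token handling, and use of the final
# hidden layer are illustrative guesses, not the official DiffRetriever code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "Dream-org/Dream-v0-Instruct-7B"  # assumed repo id for the Dream-7B backbone

tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def encode(text: str, k: int = 8) -> torch.Tensor:
    # Append k [MASK] positions to the prompt and read all k representatives
    # from a single bidirectional forward pass; cost is roughly constant in k.
    prompt_ids = tok(text, return_tensors="pt").input_ids            # [1, L]
    masks = torch.full((1, k), tok.mask_token_id, dtype=torch.long)  # [1, k]
    input_ids = torch.cat([prompt_ids, masks], dim=1)                # [1, L+k]

    out = model(input_ids=input_ids, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -k:, :]                         # [k, d] dense vectors
    # Sparse (vocabulary-space) representations would come from the LM-head
    # logits at the same k positions; omitted here for brevity.
    return torch.nn.functional.normalize(dense.float(), dim=-1)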

Findings.

  • Multi-token helps on every diffusion backbone we test and on every benchmark (MS MARCO, TREC DL'19/'20, BEIR-7). Autoregressive multi-token stays flat or worse, despite paying roughly 15× the latency in the zero-shot setting.
  • After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps (Qwen2.5 / LLaMA3), encoder-style DiffEmbed on the same diffusion backbones, and contrastively fine-tuned RepLLaMA.
  • Cleanest control: Dream is initialized from Qwen2.5, so the architecture and initial weights are identical and only the training objective differs. The K=1 vs. K>1 ordering inverts between the two, so the gain tracks the decoding strategy, not the backbone.
  • A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget on every backbone×benchmark pair, pointing to adaptive budget selection as future work (see the sketch after this list).
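
To make the oracle concrete: the sketch below assumes hypothetical retrieve(query, k) and metric(query, ranking) callables standing in for the real retrieval and evaluation code. For each query it simply keeps the best-scoring budget, which by construction upper-bounds any fixed-K strategy with the same maximum budget:

# Hypothetical per-query oracle over representative-token budgets K.
# retrieve() and metric() are placeholders, not the paper's actual API.
from typing import Callable, Iterable, List

def per_query_oracle(
    queries: Iterable[str],
    budgets: Iterable[int],
    retrieve: Callable[[str, int], List[str]],
    metric: Callable[[str, List[str]], float],
) -> float:
    # For each query, evaluate every candidate budget and keep the best,
    # then average across queries.
    budgets = list(budgets)
    scores = [max(metric(q, retrieve(q, k)) for k in budgets) for q in queries]
    return sum(scores) / len(scores)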

Code: https://github.com/ielab/diffretriever
