WeDLM-8B-Instruct is our flagship instruction-tuned diffusion language model. It is fine-tuned from WeDLM-8B and performs parallel decoding under standard causal attention.

For the base (pretrained) version, see WeDLM-8B, which is built on Qwen3-8B-Base.
Paper (Coming Soon) | Project Page | GitHub
| Attribute | Value |
|---|---|
| Base Model | WeDLM-8B |
| Parameters | 8B |
| Context Length | 32,768 |
For fast inference, use the wedlm engine:
```bash
pip install git+https://github.com/tencent/WeDLM.git
```
```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams

# Load the model with the wedlm engine and the matching tokenizer
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

# Build a chat-formatted prompt
prompt = "Solve step by step: A store sells apples for $2 each and oranges for $3 each. Tom bought 5 apples and 4 oranges. How much did he spend?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate and print the response
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=512))
print(outputs[0]["text"])
```
Multi-turn conversation:

```python
# Pass the full message history to continue a conversation
messages = [
    {"role": "user", "content": "What is the derivative of x^2?"},
    {"role": "assistant", "content": "The derivative of x² is 2x."},
    {"role": "user", "content": "What about x^3?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([text], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0]["text"])
```
Batch generation:

```python
# Generate responses for several prompts in one batched call
prompts = [
    "Explain quantum entanglement simply.",
    "Write a Python function to check if a number is prime.",
    "What are the main causes of climate change?"
]
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages_batch]
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))

for i, output in enumerate(outputs):
    print(f"=== Response {i+1} ===\n{output['text']}\n")
```
For training or simple forward passes:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Run a single forward pass on a chat-formatted input
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
⚠️ Note: The Hugging Face interface is provided for training and forward-pass convenience. For optimized inference throughput, use the wedlm engine above.
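The Hugging Face path can also be used for fine-tuning. The snippet below is a minimal sketch of a training-style step, assuming the model's remote code follows the usual AutoModelForCausalLM convention of accepting a `labels` argument and returning a `loss`; check the repository's modeling code for the actual diffusion training objective before relying on it.

```python
# Minimal training-style step; reuses `model` and `tokenizer` loaded above.
# Assumes the remote code accepts `labels` and returns `loss` like a standard
# causal-LM head (verify against the model's own modeling code).
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

model.train()
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss   # scalar loss, if exposed by the remote code
loss.backward()       # gradients for an optimizer step
```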
Benchmark results:

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|---|---|---|
| ARC-C (0-shot) | 91.47 | 92.92 |
| GSM8K (3-shot) | 89.91 | 92.27 |
| MATH (4-shot) | 69.60 | 64.80 |
| HumanEval (4-shot) | 71.95 | 80.49 |
| MMLU (5-shot) | 71.52 | 75.14 |
| GPQA-Diamond (5-shot) | 41.41 | 44.95 |
| Average | 75.12 | 77.53 |
Speedup varies by task characteristics (measured against vLLM-optimized Qwen3-8B-Instruct):
| Scenario | Speedup | Notes |
|---|---|---|
| Math Reasoning (GSM8K) | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
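For a rough throughput number on your own workload, you can time `llm.generate` directly. The sketch below reuses the `llm` and `tokenizer` objects from the quick-start example; the tokens-per-second figure is approximate, since it re-tokenizes the generated text to count output tokens.

```python
import time

# Rough wall-clock throughput measurement; reuses `llm` and `tokenizer` from above.
# Results depend heavily on hardware, batch size, and prompt/output length.
prompts = ["Explain quantum entanglement simply."] * 8
messages_batch = [[{"role": "user", "content": p}] for p in prompts]
texts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
         for m in messages_batch]

start = time.perf_counter()
outputs = llm.generate(texts, SamplingParams(temperature=0.2, max_tokens=512))
elapsed = time.perf_counter() - start

# Approximate output-token count obtained by re-tokenizing the generated text
n_tokens = sum(len(tokenizer.encode(o["text"])) for o in outputs)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```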
License: Apache 2.0