Kim Young Min
Dawaii
AI & ML interests
None yet
Recent Activity
reacted to SeaWolf-AI's post with โค๏ธ 9 minutes ago
๐ Adding a GPU without building one
AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere โ how efficiently you use the GPUs you already have.
Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU โ the effect of plugging in one more "virtual GPU."
VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):
Qwen3.5-35B-A3B (MoE): 25.7 โ 601 tok/s (23.4ร)
Darwin-36B-Opus (in-house MoE): 25.0 โ 280.8 (11.2ร)
10,000+ tok/s peak aggregate under concurrency
The key: it's reproducible โ model + serving shipped as one container.
docker pull vidraft/qwen35-vkae:601
Don't take our word for it โ run it yourself. The mechanism will be released as a paper.
๐ Leaderboard & demo ๐ https://huggingface.co/spaces/VIDraft/vkae
Articles ๐ https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard reacted to SeaWolf-AI's post with ๐ 9 minutes ago
๐ Adding a GPU without building one
AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere โ how efficiently you use the GPUs you already have.
Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU โ the effect of plugging in one more "virtual GPU."
VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):
Qwen3.5-35B-A3B (MoE): 25.7 โ 601 tok/s (23.4ร)
Darwin-36B-Opus (in-house MoE): 25.0 โ 280.8 (11.2ร)
10,000+ tok/s peak aggregate under concurrency
The key: it's reproducible โ model + serving shipped as one container.
docker pull vidraft/qwen35-vkae:601
Don't take our word for it โ run it yourself. The mechanism will be released as a paper.
๐ Leaderboard & demo ๐ https://huggingface.co/spaces/VIDraft/vkae
Articles ๐ https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard reacted to SeaWolf-AI's post with ๐ฅ 9 minutes ago
๐ Adding a GPU without building one
AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere โ how efficiently you use the GPUs you already have.
Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU โ the effect of plugging in one more "virtual GPU."
VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):
Qwen3.5-35B-A3B (MoE): 25.7 โ 601 tok/s (23.4ร)
Darwin-36B-Opus (in-house MoE): 25.0 โ 280.8 (11.2ร)
10,000+ tok/s peak aggregate under concurrency
The key: it's reproducible โ model + serving shipped as one container.
docker pull vidraft/qwen35-vkae:601
Don't take our word for it โ run it yourself. The mechanism will be released as a paper.
๐ Leaderboard & demo ๐ https://huggingface.co/spaces/VIDraft/vkae
Articles ๐ https://huggingface.co/blog/FINAL-Bench/vkae-leaderboardOrganizations
None yet