# Chess BPE Tokenizer

A BPE tokenizer trained on chess moves using rustbpe, with tiktoken for inference.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from HuggingFace and Run Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.β™˜g1β™˜f3.. b.β™Ÿc7β™Ÿc5.. w.β™™d2β™™d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.β™˜g1β™˜f3.. b.β™Ÿc7β™Ÿc5.. w.β™™d2β™™d4.."
```

Or load it directly with tiktoken:

```python
import json

import tiktoken
from huggingface_hub import hf_hub_download

config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))
enc = tiktoken.Encoding(
    name="chess",
    pat_str=config["pattern"],
    mergeable_ranks={k.encode("utf-8", errors="replace"): v for k, v in vocab.items()},
    special_tokens={},
)
```

## Train Your Own

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

## Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on a custom chess notation:

| Move | Meaning |
|------|---------|
| `w.β™˜g1β™˜f3..` | White knight g1 to f3 |
| `b.β™Ÿc7β™Ÿc5..` | Black pawn c7 to c5 |
| `b.β™Ÿc5β™Ÿd4.x.` | Black pawn captures on d4 |
| `w.β™”e1β™”g1β™–h1β™–f1..` | White kingside castle |
| `b.β™›d7β™›d5..+` | Black queen to d5 with check |

## Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| β™” | β™š | King |
| β™• | β™› | Queen |
| β™– | β™œ | Rook |
| β™— | ♝ | Bishop |
| β™˜ | β™ž | Knight |
| β™™ | β™Ÿ | Pawn |

## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on `angeluriot/chess_games` |
| `save(tok, path)` | Save `vocab.json` + `config.json` |
| `upload(tok, repo_id)` | Push to HuggingFace Hub |
| `load_tiktoken(repo_id)` | Load as a tiktoken `Encoding` |

## License

MIT
