# Chess BPE Tokenizer

A BPE tokenizer trained on chess moves using rustbpe, with tiktoken for inference.

## Installation

```bash
pip install rustbpe tiktoken datasets huggingface_hub
```

## Quick Start

### Load from HuggingFace and Run Inference

```python
from chess_tokenizer import load_tiktoken

enc = load_tiktoken("ItsMaxNorm/chess-bpe-tokenizer")

# Encode chess moves
ids = enc.encode("w.β™˜g1β™˜f3.. b.β™Ÿc7β™Ÿc5.. w.β™™d2β™™d4..")
print(ids)  # [token_ids...]

# Decode back
text = enc.decode(ids)
print(text)  # "w.β™˜g1β™˜f3.. b.β™Ÿc7β™Ÿc5.. w.β™™d2β™™d4.."
```

Or load it directly with tiktoken:

```python
import json

import tiktoken
from huggingface_hub import hf_hub_download

config = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "config.json")))
vocab = json.load(open(hf_hub_download("ItsMaxNorm/bpess", "vocab.json")))
enc = tiktoken.Encoding(
    name="chess",
    pat_str=config["pattern"],
    mergeable_ranks={k.encode("utf-8", errors="replace"): v for k, v in vocab.items()},
    special_tokens={},
)
```

## Train Your Own

```python
from chess_tokenizer import train, upload

# Train on chess dataset
tok = train(vocab_size=4096, split="train[0:10000]")

# Upload to HuggingFace
upload(tok, "YOUR_USERNAME/chess-bpe-tokenizer")
```

## Full Pipeline

```bash
python chess_tokenizer.py
```

## Move Format

The tokenizer is trained on a custom chess notation:

| Move | Meaning |
|------|---------|
| `w.β™˜g1β™˜f3..` | White knight g1 to f3 |
| `b.β™Ÿc7β™Ÿc5..` | Black pawn c7 to c5 |
| `b.β™Ÿc5β™Ÿd4.x.` | Black pawn captures on d4 |
| `w.β™”e1β™”g1β™–h1β™–f1..` | White kingside castle |
| `b.β™›d7β™›d5..+` | Black queen to d5 with check |

## Piece Symbols

| White | Black | Piece |
|-------|-------|-------|
| β™” | β™š | King |
| β™• | β™› | Queen |
| β™– | β™œ | Rook |
| β™— | ♝ | Bishop |
| β™˜ | β™ž | Knight |
| β™™ | β™Ÿ | Pawn |

## API

| Function | Description |
|----------|-------------|
| `train(vocab_size, split)` | Train BPE on `angeluriot/chess_games` |
| `save(tok, path)` | Save `vocab.json` + `config.json` |
| `upload(tok, repo_id)` | Push to HuggingFace Hub |
| `load_tiktoken(repo_id)` | Load as a tiktoken `Encoding` |

## License

MIT
