MLX [LMStudio]: Enable Preserved Thinking Mode?
Explain how to enable Preserved Thinking Mode in MLX LMStudio.
The honest answer: Preserved Thinking is not natively supported in MLX-LM or LMStudio at the moment. Here’s why and what you can do about it.
How Preserved Thinking actually works:
- The model generates <think>...</think> blocks in its responses
- On the next turn, the framework passes those thinking blocks back into the context unmodified (instead of stripping them)
- The Jinja chat template controls this via the clear_thinking variable: when it is false, thinking content stays in the message history
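To make the difference concrete, here is a minimal pure-Python sketch of what "clearing" versus "preserving" thinking means for the message history. It is not the actual GLM chat template logic; `prepare_history` and its `clear_thinking` argument are hypothetical names that only mirror the behavior described above.

```python
import re
from typing import Dict, List

Message = Dict[str, str]

# Matches a <think>...</think> block plus any whitespace that follows it.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def prepare_history(messages: List[Message], clear_thinking: bool) -> List[Message]:
    """Return a copy of the history, optionally stripping assistant thinking.

    Hypothetical helper: it imitates what a framework or template does
    before the next turn, it is not taken from any real library.
    """
    prepared = []
    for message in messages:
        content = message["content"]
        if clear_thinking and message["role"] == "assistant":
            # Default behavior in most frameworks: drop the reasoning.
            content = THINK_RE.sub("", content)
        prepared.append({"role": message["role"], "content": content})
    return prepared


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>Simple addition.</think>4"},
    {"role": "user", "content": "And times 3?"},
]

# Thinking cleared: the model only sees its final answer from the last turn.
print(prepare_history(history, clear_thinking=True)[1]["content"])
# Thinking preserved: the <think> block goes back into the context unmodified.
print(prepare_history(history, clear_thinking=False)[1]["content"])
```

With clear_thinking set to False, the assistant turn is passed through untouched, which is the behavior Preserved Thinking depends on.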
Workaround for MLX-LM (manual):
You can simulate it by manually preserving the <think> content in your conversation history. If you're calling mlx_lm programmatically:
# After each response, keep the full output including <think> blocks
# in the messages list when building the next turn
messages.append({
    "role": "assistant",
    "content": full_response_with_think_blocks,  # don't strip <think>...</think>
})
messages.append({
    "role": "user",
    "content": next_user_message,
})
The key is making sure the <think> content reaches the Jinja chat template. If mlx_lm is stripping it before templating, you'd need to patch the template or the generation code.
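One way to check whether the thinking actually survives templating is to render the prompt as a string and look for the earlier reasoning text in it. The sketch below assumes you can supply a `render` callable that turns a message list into the final prompt string; `dropped_thinking` and the two stand-in renderers are hypothetical and only exist to make the example runnable.

```python
import re
from typing import Callable, Dict, List

Message = Dict[str, str]

# Captures the text inside each <think>...</think> block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def dropped_thinking(
    messages: List[Message],
    render: Callable[[List[Message]], str],
) -> List[str]:
    """Return thinking text from assistant turns that is missing from the prompt."""
    prompt = render(messages)
    dropped = []
    for message in messages:
        if message["role"] != "assistant":
            continue
        for inner in THINK_RE.findall(message["content"]):
            text = inner.strip()
            # Compare the inner text only, since a template may re-wrap the tags.
            if text and text not in prompt:
                dropped.append(text)
    return dropped


# Hypothetical stand-in renderers; swap in your real templating call.
def render_keeping(messages: List[Message]) -> str:
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def render_stripping(messages: List[Message]) -> str:
    lines = []
    for m in messages:
        content = re.sub(r"<think>.*?</think>", "", m["content"], flags=re.DOTALL)
        lines.append(f"{m['role']}: {content}")
    return "\n".join(lines)


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>Simple addition.</think>4"},
    {"role": "user", "content": "And times 3?"},
]

print(dropped_thinking(messages, render_keeping))    # [] -> thinking reached the prompt
print(dropped_thinking(messages, render_stripping))  # ['Simple addition.'] -> it was stripped
```

In practice you would replace the stand-in with your own renderer, for example a small wrapper around your tokenizer's apply_chat_template with tokenize=False, assuming your tokenizer object exposes that method. A non-empty result means something between your messages list and the prompt is removing the reasoning.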
If you need real Preserved Thinking today, SGLang is your best bet for local deployment. It's the only self-hosted framework with official support.