Any plan to open source the dataset?

by Gmc2 - opened Jul 24, 2025

Discussion

Gmc2

Jul 24, 2025

Thanks for the great work!
May I know if there is a plan to open-source the datasets for pre-training or SFT? Thanks.

RowitZou

Intern Large Models org Jul 24, 2025

Thank you for your interest in our work!

The pre-training corpus for POLAR is extremely large (approximately 3.6T tokens) and was derived from the InternLM pre-training corpus. Unfortunately, we currently have no plans to release it publicly.

However, I would like to emphasize that POLAR's pre-training data is relatively easy to construct yourself. You can take open-source corpora such as Common Crawl (CC) and publicly available LLMs such as Qwen, then run large-scale inference sampling following our methodology to form positive and negative samples.
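
For readers who want a concrete starting point, here is a minimal sketch of what such sampling could look like. It is not the actual POLAR pipeline. It assumes that two continuations sampled from the same model form a positive pair and that a continuation from a different model serves as the negative, which is one reading of "positive and negative samples" above. The names `make_stub_policy` and `build_triplets` are hypothetical, and the stub policies stand in for real LLM inference calls (for example, a Qwen checkpoint served locally).

```python
import random
from typing import Callable, Dict, List

# A "policy" here is any callable that maps a prompt to a sampled continuation.
# In practice this would wrap an LLM inference call; that part is left out.
Policy = Callable[[str, random.Random], str]


def make_stub_policy(name: str) -> Policy:
    """Hypothetical stand-in for an LLM: returns a tagged pseudo-continuation."""
    def sample(prompt: str, rng: random.Random) -> str:
        return f"[{name}#{rng.randint(0, 9999)}] continuation of: {prompt}"
    return sample


def build_triplets(
    prompts: List[str],
    policies: Dict[str, Policy],
    seed: int = 0,
) -> List[Dict[str, str]]:
    """For each prompt, draw anchor and positive from one policy and the
    negative from a different policy (an assumed pairing scheme)."""
    if len(policies) < 2:
        raise ValueError("need at least two policies to form a negative sample")
    rng = random.Random(seed)
    names = sorted(policies)
    triplets = []
    for prompt in prompts:
        # Pick two distinct policies: one for anchor/positive, one for negative.
        same, other = rng.sample(names, 2)
        triplets.append({
            "prompt": prompt,
            "anchor": policies[same](prompt, rng),
            "positive": policies[same](prompt, rng),
            "negative": policies[other](prompt, rng),
            "anchor_policy": same,
            "negative_policy": other,
        })
    return triplets


if __name__ == "__main__":
    # Prompts would come from a corpus such as Common Crawl; these are placeholders.
    demo_prompts = ["The capital of France is", "To sort a list in Python,"]
    demo_policies = {n: make_stub_policy(n) for n in ("model_a", "model_b", "model_c")}
    for row in build_triplets(demo_prompts, demo_policies):
        print(row)
```

Swapping the stubs for real inference calls and the placeholder prompts for corpus prefixes would be the main work at scale; the pairing logic itself stays small.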

Gmc2

Jul 24, 2025

Thanks for the info.

RowitZou changed discussion status to closed Jul 25, 2025
