Occupational CANINE: HISCO Classification Model (Seq2Seq + Mixer)

Overview

OccCANINE_s2s_mix is the recommended version of OccCANINE. It combines a CANINE encoder with a sequential decoder trained using a mixed loss that blends sequence-level and flat classification objectives. This is the default model loaded by the histocc package and achieves around 96% F1 at the 5-digit HISCO level.
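As a rough illustration of what blending a sequence-level and a flat classification objective can look like, the sketch below forms a weighted sum of two cross-entropy terms. It is an assumption-laden toy, not the histocc training code: the function names, the `alpha` weight, and the choice of plain cross-entropy for both terms are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def sequence_loss(step_probs, target_digits):
    """Mean cross-entropy over decoding steps (one distribution per digit)."""
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, target_digits)]
    return sum(losses) / len(losses)


def mixed_loss(step_probs, target_digits, flat_probs, target_class, alpha=0.5):
    """Weighted sum of a sequence-level term and a flat classification term.

    alpha is a hypothetical mixing weight; alpha=1.0 keeps only the
    sequence term and alpha=0.0 keeps only the flat term.
    """
    seq = sequence_loss(step_probs, target_digits)
    flat = cross_entropy(flat_probs, target_class)
    return alpha * seq + (1.0 - alpha) * flat
```

With `alpha=0.5` both objectives contribute equally; the weighting used to train the released model may differ.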

See more on GitHub: https://github.com/christianvedels/OccCANINE

Read the paper on arXiv: https://arxiv.org/abs/2402.13604

Key Features

  • High Accuracy: Around 96% F1 at the 5-digit HISCO level.
  • Multilingual Support: Trained on 15.8 million description-HISCO code pairs across 13 languages.
  • Sequential Decoding: Outputs full HISCO codes digit-by-digit, naturally respecting the hierarchical structure of the classification system.
  • Mixed Loss Training: Combines sequence-level and flat classification losses, improving both precision and recall.
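To make the hierarchy behind digit-by-digit decoding concrete, the sketch below splits a 5-digit HISCO code into its nested levels: the first digit is the major group, the first two the minor group, the first three the unit group, and all five the occupation. The helper `hisco_levels` is hypothetical and not part of histocc, and the code `83110` is used only as an example string.

```python
def hisco_levels(code: str) -> dict:
    """Split a 5-digit HISCO code into its hierarchical prefixes."""
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 5-digit code, got {code!r}")
    return {
        "major_group": code[:1],   # 1-digit level
        "minor_group": code[:2],   # 2-digit level
        "unit_group": code[:3],    # 3-digit level
        "occupation": code,        # full 5-digit level
    }


levels = hisco_levels("83110")
```

Because each level is a prefix of the next, a decoder that emits one digit at a time commits to the coarser groups before refining to the full code.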

Usage

from histocc import OccCANINE

# OccCANINE_s2s_mix is loaded by default
model = OccCANINE()

result = model.predict("blacksmith", lang="en")

The model is also accessible via the command-line interface:

python predict.py --fn-in path/to/input.csv --col occ1 --fn-out path/to/output.csv --language en
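The command above reads occupational descriptions from the column named by `--col`. A minimal sketch for building such an input file with the standard library follows, assuming one description per row under an `occ1` header; the helper `build_input_csv` is hypothetical and any further columns the script may accept are not covered here.

```python
import csv
import io


def build_input_csv(descriptions, column="occ1"):
    """Return CSV text with a header row and one description per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])          # header must match the --col argument
    for text in descriptions:
        writer.writerow([text])        # fields containing commas are quoted
    return buffer.getvalue()


csv_text = build_input_csv(["blacksmith", "farmer, tenant"])
```

Writing `csv_text` to a file (for example with `open(path, "w", encoding="utf-8", newline="")`) gives a file that can be passed as `--fn-in`.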

See GETTING_STARTED.md for a full installation and usage guide.

Supported Languages

English (en), Danish (da), Swedish (se), Dutch (nl), Catalan (ca), French (fr), Norwegian (no), Icelandic (is), Portuguese (pt), German (ge), Spanish (es), Italian (it), Greek (gr).
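Several of these codes differ from ISO 639-1 (which uses sv, de and el for Swedish, German and Greek), so it can help to check a code against the list above before calling the model. The helper below is a hypothetical convenience, not part of histocc, and whether the package performs its own validation is not assumed here.

```python
# Language codes exactly as listed above (not ISO 639-1 in every case).
SUPPORTED_LANGUAGES = {
    "en": "English", "da": "Danish", "se": "Swedish", "nl": "Dutch",
    "ca": "Catalan", "fr": "French", "no": "Norwegian", "is": "Icelandic",
    "pt": "Portuguese", "ge": "German", "es": "Spanish", "it": "Italian",
    "gr": "Greek",
}


def check_lang(code: str) -> str:
    """Return the normalized code, or raise if it is not in the list."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized
```

For example, `check_lang("GE")` returns `"ge"`, while `check_lang("de")` raises a `ValueError`.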

Contribution and Support

Developed at the University of Southern Denmark by Christian Møller Dahl, Torben Johansen and Christian Vedel.


Model Details

  • Task: Text Classification / Sequence Generation
  • Base Model: CANINE
  • Framework: Transformers / PyTorch
  • Languages: 13 languages
  • License: Apache 2.0
  • Paper: arXiv 2402.13604