| --- |
| license: cc-by-nc-sa-4.0 |
| widget: |
| - text: ACCTGA<mask>TTCTGAGTC |
| tags: |
| - DNA |
| - biology |
| - genomics |
| - segmentation |
| --- |
| # segment-nt-multi-species |
|
|
| SegmentNT-multi-species is a segmentation model that leverages the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic |
| elements in a sequence at single-nucleotide resolution. It is the result of finetuning the [SegmentNT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome |
| as well as the genomes of 5 other species: mouse, chicken, fly, zebrafish and worm. |
|
|
| For finetuning on the multi-species genomes, we curated a dataset restricted to a subset of the annotations used to train **SegmentNT**, mainly because only this subset is |
| available for these species. The annotations therefore cover the 7 main gene elements available from [Ensembl](https://www.ensembl.org/index.html), namely protein-coding gene, 5’UTR, 3’UTR, intron, exon, |
| splice acceptor and splice donor sites. |
|
|
|
|
| **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) |
|
|
| ### Model Sources |
|
|
|
|
| - **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer) |
| - **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models](https://www.biorxiv.org/content/biorxiv/early/2024/03/15/2024.03.14.584712.full.pdf) |
|
|
| ### How to use |
|
|
| Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models: |
| ```bash |
| pip install --upgrade git+https://github.com/huggingface/transformers.git |
| ``` |
|
|
| The snippet below retrieves both logits and embeddings from dummy DNA sequences. |
|
|
|
|
| ⚠️ The maximum sequence length defaults to the training length of 30,000 nucleotides, i.e. 5,001 tokens (including the CLS token). However, SegmentNT has |
| been shown to generalize to sequences of up to 50,000 bp. To run inference on sequences between 30 kbp and 50 kbp, set the `rescaling_factor` |
| argument in the config to `num_dna_tokens_inference / max_num_tokens_nt`, where `num_dna_tokens_inference` is the number of tokens at inference |
| (e.g. 6669 for a sequence of 40,008 base pairs) and `max_num_tokens_nt` is the maximum number of tokens on which the backbone nucleotide-transformer was trained, i.e. `2048`. |
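The arithmetic behind `rescaling_factor` can be sketched in plain Python. This is a minimal illustration, not part of the model's API: the helper names `num_tokens_at_inference` and `compute_rescaling_factor` are hypothetical, and the sketch assumes a sequence length that is a multiple of 6 so that every token is a full 6-mer. How the resulting value is passed to the config is not shown here.

```python
K_MER_SIZE = 6            # the NT tokenizer groups nucleotides into 6-mers
MAX_NUM_TOKENS_NT = 2048  # value quoted in this card for the NT backbone


def num_tokens_at_inference(num_nucleotides: int) -> int:
    """Hypothetical helper: 6-mer token count plus one CLS token."""
    if num_nucleotides % K_MER_SIZE != 0:
        raise ValueError("this sketch only handles lengths that are multiples of 6")
    return num_nucleotides // K_MER_SIZE + 1


def compute_rescaling_factor(num_dna_tokens_inference: int) -> float:
    """Hypothetical helper: num_dna_tokens_inference / max_num_tokens_nt."""
    return num_dna_tokens_inference / MAX_NUM_TOKENS_NT


num_tokens = num_tokens_at_inference(40008)
factor = compute_rescaling_factor(num_tokens)
print(f"tokens: {num_tokens}, rescaling_factor: {factor:.4f}")
```

For the 40,008 bp example this gives 6669 tokens and a factor of roughly 3.26.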
|
|
|
|
| ```python |
| # Load model and tokenizer |
| from transformers import AutoTokenizer, AutoModel |
| import torch |
| |
| tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True) |
| model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True) |
| |
| # Choose the length to which the input sequences are padded. By default, the |
| # model max length is chosen, but feel free to decrease it as the time taken to |
| # obtain the embeddings increases significantly with it. |
| # The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by |
| # 2 to the power of the number of downsampling blocks, i.e. 4. |
| max_length = 12 + 1 |
| |
| assert (max_length - 1) % 4 == 0, ( |
|     "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by " |
|     "2 to the power of the number of downsampling blocks, i.e. 4.") |
| |
| # Create dummy DNA sequences and tokenize them |
| sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"] |
| tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"] |
| |
| # Infer |
| attention_mask = tokens != tokenizer.pad_token_id |
| outs = model( |
|     tokens, |
|     attention_mask=attention_mask, |
|     output_hidden_states=True |
| ) |
| |
| # Obtain the logits over the genomic features |
| logits = outs.logits.detach() |
| # Convert them to probabilities |
| probabilities = torch.nn.functional.softmax(logits, dim=-1) |
| print(f"Probabilities shape: {probabilities.shape}") |
| |
| # Get probabilities associated with intron |
| idx_intron = model.config.features.index("intron") |
| probabilities_intron = probabilities[:,:,idx_intron] |
| print(f"Intron probabilities shape: {probabilities_intron.shape}") |
| ``` |
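The divisibility constraint used in the snippet above can be handled by a small helper that rounds a token count up to the next valid padding length. This is only a sketch: `smallest_valid_max_length` is a hypothetical name, and the divisor of 4 follows from the 2 downsampling blocks described in the Architecture section.

```python
import math

NUM_DOWNSAMPLING_BLOCKS = 2              # from the Architecture section
DIVISOR = 2 ** NUM_DOWNSAMPLING_BLOCKS   # DNA token count must be a multiple of 4


def smallest_valid_max_length(num_dna_tokens: int) -> int:
    """Hypothetical helper: round the DNA token count up to a multiple of 4,
    then add 1 for the prepended CLS token."""
    if num_dna_tokens < 1:
        raise ValueError("expected at least one DNA token")
    padded = math.ceil(num_dna_tokens / DIVISOR) * DIVISOR
    return padded + 1


# The longest dummy sequence above has 33 nucleotides: 5 six-mers plus
# 3 leftover nucleotides, i.e. 8 DNA tokens if the tail is tokenized per base.
print(smallest_valid_max_length(8))
```

Any larger value that keeps `(max_length - 1) % 4 == 0`, such as the `12 + 1` used above, is equally valid; it only adds more padding.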
|
|
|
|
| ## Training data |
|
|
| The **segment-nt-multi-species** model was finetuned on human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out, |
| part of it as a validation set for monitoring training and part as a test set for final evaluation. |
|
|
| ## Training procedure |
|
|
| ### Preprocessing |
|
|
| The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The model inputs then take the form: |
|
|
| ``` |
| <CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA> |
| ``` |
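To make the input format above concrete, here is a simplified sketch of non-overlapping 6-mer splitting. It is not the actual tokenizer: `six_mer_tokens` is a hypothetical function, it assumes that leftover nucleotides at the end of a sequence are emitted one by one, it ignores special handling of `N`, and it returns token strings rather than the integer ids the real tokenizer produces.

```python
from typing import List


def six_mer_tokens(sequence: str, k: int = 6) -> List[str]:
    """Hypothetical illustration of left-to-right, non-overlapping k-mer splitting."""
    n_full = len(sequence) // k
    tokens = ["<CLS>"]  # CLS token is prepended, as in the example above
    # Full 6-mers, read left to right
    tokens += [f"<{sequence[i * k:(i + 1) * k]}>" for i in range(n_full)]
    # Assumption: any remaining tail is tokenized nucleotide by nucleotide
    tokens += [f"<{base}>" for base in sequence[n_full * k:]]
    return tokens


print(" ".join(six_mer_tokens("ACGTGTACGTGCACGGAC")))
```

For real inputs, use the tokenizer loaded with `AutoTokenizer.from_pretrained` as shown in the usage snippet.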
|
|
| ### Training |
|
|
| The model was finetuned on a DGX H100 node with 8 GPUs on a total of 8B tokens for 3 days. |
|
|
|
|
| ### Architecture |
|
|
| The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed |
| the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these |
| blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters |
| to 562M. |
|
|
| ### BibTeX entry and citation info |
|
|
| ```bibtex |
| @article{de2024segmentnt, |
| title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models}, |
| author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others}, |
| journal={bioRxiv}, |
| pages={2024--03}, |
| year={2024}, |
| publisher={Cold Spring Harbor Laboratory} |
| } |
| |
| ``` |