---
library_name: transformers
datasets:
- HuggingFaceH4/MATH-500
- akoksal/LongForm
pipeline_tag: text-generation
---

# MoA-Metric-LM-400M (Convergent)

A compact but capable ≈400M-parameter causal LM that replaces dot-product attention with metric-native attention and augments sequence geometry with BlackHoleRoPE (a learnable, stable RoPE variant). Designed to train and run on modest hardware (CPU-first friendly) while staying fully compatible with 🤗 Transformers.

Tags: Convergent · MoA · Conversational · Research
Datasets: yzhuang/Agentic-Long-Context-Understanding-QA, HuggingFaceH4/MATH-500
License: Apache-2.0

⸻

## Why this model?
- **Distance scores, not dot products.** Heads score with L2, cosine, or diag-Mahalanobis distances. This gives direct control over geometry, often stabilizes training, and can be more sample-efficient.
- **BlackHoleRoPE positional encoding.**
  - Q/K: pure unit-modulus rotation (unitary → numerically stable).
  - V: bounded-energy gating (Penrose-inspired), optionally modulated by a discrepancy signal.
  - Parameters synthesized from a tiny Fourier basis → extrapolable and cache-friendly, with low memory.
- **MoA (Mixture-of-Architectures) block.** A token-wise router softly blends four heads per block:
  1. LocalConv (depthwise token-local conv)
  2. MetricMHAttention (multi-head metric attention)
  3. ChannelMix (MLP)
  4. MetricMQA (multi-query, shared K/V)
- **Triangle-Inequality (TI) regularizer.** Keeps metric heads honest by penalizing violations over random triples.
- **Runs on CPUs.** Implemented to behave well in FP32 on AVX2/AVX-512 machines.
⸻

## Model at a glance

| Property | Value |
|---|---|
| Parameters | ~400M (exact count depends on vocab; see `config.json`) |
| Layers | 12–24 depending on variant (MoA blocks) |
| Hidden size | ≥ 1024 in the 400M variant (head dim divisible by #heads) |
| Attention | Metric-native (L2 / cosine / diag-Mahalanobis), plus MetricMQA |
| Positional | BlackHoleRoPE per head (`rope_global` for MH-Attn, `rope_mqa` for MQA) |
| Router | Token-wise soft mixture across the four heads (+ optional bias gate) |
| FFN | HyperFFN = SwiGLU MLP + SepConv1d + low-rank path (router-mixed) |
| Context | Trained primarily at 512–1024 tokens; config allows up to 2048 |
| Precision | FP32 training (CPU-friendly); FP32/BF16/FP16 inference supported |
| License | Apache-2.0 |

Note on context: training emphasized 512–1024 tokens; BlackHoleRoPE is extrapolable, but throughput and quality beyond training lengths depend on your hardware and data.
⸻

## Intended use & limitations

Intended: compact assistants, long-context reading/QA, math-style step reasoning, and research on distance-based attention and geometric inductive biases.

Not intended: safety-critical use, heavy factual QA at web scale, or domains requiring guaranteed accuracy. Evaluate carefully before deployment.
⸻

## Datasets

- Agentic-Long-Context-Understanding-QA — long-range reading/retrieval questions to exercise context tracking (~256,000 tokens).
- MATH-500 — small curated math prompts for stepwise reasoning (~256,000 tokens).

Training used modest token budgets (hundreds of thousands of tokens). Training logs showed healthy loss descent at both 512 and 1024 sequence lengths on CPU runs; exact metrics will vary with tokenizer, preprocessing, and optimizer settings.
⸻

## Installation

```shell
pip install transformers accelerate sentencepiece
```

⸻

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "reaperdoesntknow/MoA-400M"  # replace with your repo id

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float32, device_map="cpu"
).eval()

prompt = "Read and answer: If 3x + 2 = 17, what is x?\nReasoning:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=256,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tok.eos_token_id,
    )

print(tok.decode(out[0], skip_special_tokens=True))
```

### Pipeline usage

```python
from transformers import pipeline

repo = "reaperdoesntknow/MoA-400M"
pipe = pipeline("text-generation", model=repo, device_map="cpu")
print(
    pipe(
        "Question: Who wrote 'The Selfish Gene'?\nAnswer:",
        max_length=128,
        do_sample=False,
    )[0]["generated_text"]
)
```
⸻

## Architecture details

### Metric attention (MH)

- Scores:
  - L2: `-||q - k||² / sqrt(d)`
  - Cosine: normalized dot product, scaled
  - diag-Mahalanobis: per-head diagonal scaling of dimensions
- Stability: logits scaled by a learnable α; optional radius-based pruning mask for efficiency.
- Value path: post-attention up/down projector (gated) for expressive value mixing.
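As an illustration, the L2 score above can be sketched in a few lines of PyTorch (a minimal sketch with hypothetical names and shapes, not the repo's implementation):

```python
import torch

def l2_attention_logits(q, k, alpha=1.0):
    # Metric score: -alpha * ||q - k||^2 / sqrt(d), for every query/key pair.
    # q: [batch, heads, Tq, d], k: [batch, heads, Tk, d]
    d = q.size(-1)
    dist2 = torch.cdist(q, k, p=2).pow(2)  # squared pairwise L2 distances
    return -alpha * dist2 / (d ** 0.5)

torch.manual_seed(0)
q = torch.randn(1, 2, 4, 8)
logits = l2_attention_logits(q, q)
attn = torch.softmax(logits, dim=-1)
# Each query is at distance zero from itself, so the diagonal dominates.
print(attn[0, 0].argmax(dim=-1))  # tensor([0, 1, 2, 3])
```

Unlike dot-product scores, these logits are bounded above by zero, which is one reason distance scoring can be easier to stabilize.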
### Metric MQA (shared K/V)

- K and V are shared (a single projection) and broadcast; queries remain multi-head. Useful for throughput and memory.
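The shared-key idea can be sketched as follows (illustrative shapes only; the function name is hypothetical): one key tensor is broadcast across all query heads before metric scoring.

```python
import torch

def mqa_l2_logits(q, k_shared):
    # Multi-query metric attention: per-head queries, one shared key set.
    # q: [B, H, T, d]; k_shared: [B, T, d], broadcast across the H heads.
    d = q.size(-1)
    k = k_shared.unsqueeze(1).expand(q.shape).contiguous()  # [B, H, T, d]
    dist2 = torch.cdist(q, k, p=2).pow(2)
    return -dist2 / (d ** 0.5)

torch.manual_seed(0)
q = torch.randn(1, 4, 6, 8)
k = torch.randn(1, 6, 8)
print(mqa_l2_logits(q, k).shape)  # torch.Size([1, 4, 6, 6])
```

Only one K projection is computed and cached, which is where the throughput and memory savings come from.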
### BlackHoleRoPE

- Q/K rotation only (unit modulus) → preserves norms and avoids value blow-ups.
- V receives bounded-energy amplification (`energy_min`..`energy_max`) with optional discrepancy modulation.
- Parameters are synthesized from a small Fourier basis, which reduces cache size and improves length generalization.
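The unit-modulus property is easy to check with a generic rotary-style rotation (this sketch uses the standard RoPE form, not BlackHoleRoPE's Fourier-synthesized parameters):

```python
import torch

def rotate_pairs(x, theta):
    # Rotate each (even, odd) feature pair by its own angle theta.
    # A 2-D rotation is unitary, so vector norms are preserved exactly.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = theta.cos(), theta.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
x = torch.randn(4, 8)      # [positions, head_dim]
theta = torch.randn(4, 4)  # one angle per position and feature pair
y = rotate_pairs(x, theta)
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))  # True
```

Because the rotation never changes magnitudes, only V needs the separate bounded-energy gate described above.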
### Routing & gates

- TokenRouter: per-token weights over {LocalConv, MetricMH, ChannelMix, MetricMQA}.
- Feature gates: per-head multiplicative scales in (0, 2) centered at 1.0.
- Optional router bias adds signed offsets before the softmax.
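A token-wise soft router of this shape might look like the following (a sketch under assumed shapes; `TokenRouter` here is illustrative, not the repo's class):

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    # Per-token soft mixture over n_branches branch outputs.
    def __init__(self, d_model, n_branches=4):
        super().__init__()
        self.proj = nn.Linear(d_model, n_branches)

    def forward(self, x, branch_outputs):
        # x: [B, T, d]; branch_outputs: list of n_branches tensors [B, T, d]
        weights = torch.softmax(self.proj(x), dim=-1)  # [B, T, n_branches]
        stacked = torch.stack(branch_outputs, dim=-1)  # [B, T, d, n_branches]
        return (stacked * weights.unsqueeze(2)).sum(dim=-1)

torch.manual_seed(0)
router = TokenRouter(16)
x = torch.randn(2, 5, 16)
branches = [x, 2 * x, -x, torch.zeros_like(x)]
y = router(x, branches)
print(y.shape)  # torch.Size([2, 5, 16])
```

Because the mixture is per token, different tokens in the same sequence can lean on different heads (e.g. LocalConv for local structure, MetricMH for long-range lookups).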
### Triangle-Inequality regularizer

- A lightweight penalty on random triples discourages degenerate metric geometry.
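In spirit, such a penalty can be sketched like this (illustrative only; shown with plain L2 for clarity, whereas the regularizer matters most for learned distances such as diag-Mahalanobis):

```python
import torch

def ti_penalty(x, n_triples=64):
    # For random triples (a, b, c), penalize max(0, d(a,c) - d(a,b) - d(b,c)).
    # The hinge is zero whenever the distance obeys the triangle inequality.
    n = x.size(0)
    idx = torch.randint(0, n, (3, n_triples))
    a, b, c = x[idx[0]], x[idx[1]], x[idx[2]]
    d = lambda u, v: (u - v).norm(dim=-1)
    return torch.relu(d(a, c) - d(a, b) - d(b, c)).mean()

torch.manual_seed(0)
x = torch.randn(32, 8)
print(ti_penalty(x).item())  # ~0.0: plain L2 always satisfies the inequality
```

A learned per-dimension scaling would be plugged into `d` in place of the plain norm; the hinge then pushes the learned geometry back toward a valid metric.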
⸻

## Training recipe (reference)

- Device: CPU (AVX2/AVX-512 recommended).
- Precision: FP32.
- Optimizer: AdamW or Adam (β₁ = 0.9, β₂ = 0.95–0.999 both work); cosine LR decay or linear warmup.
- Batch/seq: `[batch, seq] = [2–4, 512–1024]`.
- Regularization: modest dropout in the attention/value paths; optional TI penalty.

If you see NaN/Inf during sampling, make sure masks are additive (0 / -inf), clamp logits when a row is fully masked, and pass a `pad_token_id` to `.generate()`.
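The mask recipe in that tip can be sketched as follows (an illustrative helper, not part of the repo):

```python
import torch

def masked_softmax(logits, mask):
    # Additive mask: 0 for positions to keep, -inf for positions to drop.
    scores = logits + mask
    # A fully masked row would be softmax over all -inf -> NaN;
    # clamp such rows to zeros so they yield a uniform distribution instead.
    all_masked = torch.isinf(scores).all(dim=-1, keepdim=True)
    scores = torch.where(all_masked, torch.zeros_like(scores), scores)
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
logits = torch.randn(2, 3)
mask = torch.tensor([[0.0, float("-inf"), 0.0],
                     [float("-inf")] * 3])
p = masked_softmax(logits, mask)
print(torch.isnan(p).any())  # tensor(False)
```

Multiplicative 0/1 masks are the usual source of the NaNs: multiplying logits by 0 leaves them at 0 rather than removing them from the softmax.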
⸻

## Evaluation notes

The model targets behavioral quality per FLOP rather than leaderboard chasing. On held-out long-context QA and small math checks, it shows:

- Robust token-to-token coherence at 512–1024.
- Stable generation on CPU in FP32.
- Competitive loss trends versus dot-product baselines trained under the same compute.

Please share issues and benchmarks via the repo so results can be tracked.
⸻

## How to fine-tune

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset

repo = "reaperdoesntknow/MoA-400M"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

ds = load_dataset("yzhuang/Agentic-Long-Context-Understanding-QA", split="train[:2%]")

def tok_fn(ex):
    x = tok(
        ex["question"] + "\n" + ex["context"] + "\nAnswer:",
        truncation=True,
        max_length=1024,
    )
    x["labels"] = x["input_ids"].copy()
    return x

tds = ds.map(tok_fn, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="./moa400m-finetune",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-4,
    weight_decay=0.0,
    warmup_steps=100,
    logging_steps=10,
    save_steps=200,
    fp16=False,
    bf16=False,
)

trainer = Trainer(model=model, args=args, train_dataset=tds)
trainer.train()
```
⸻

## Known behaviors / tips

- Context > 1024 works, but CPU throughput drops; BlackHoleRoPE helps stability, not throughput.
- Sampling: always pass `pad_token_id` (often `eos_token_id`) to `.generate()`; avoid temperature > 1.2 on small models.
- KV cache: supported; on CPU, prefer smaller beams and greedy or low-temperature sampling.
⸻

## Safety & responsibility

This is a research model. It was trained on public datasets and may produce incorrect or biased content. Do not rely on it for advice or sensitive decisions.

⸻

## Citation

```bibtex
@software{moa_metric_lm_400m,
  title  = {MoA-Metric-LM-400M: Distance-based attention with BlackHoleRoPE},
  author = {reaperdoesntknow},
  year   = {2025},
  url    = {https://huggingface.co/reaperdoesntknow/MoA-400M}
}
```

⸻

## Acknowledgements

Built with 🤗 Transformers and a metric-first rethinking of attention. BlackHoleRoPE draws inspiration from symplectic/rotational encodings and bounded-energy dynamics.