Fix label leakage: temporal split - use first 70% of events as input, predict purchase in last 30%. Remove n_purchases/purchase_rate from features. e4d8561 verified rtferraz committed on May 5
Fix model loading: use from_pretrained() instead of torch.load() for safetensors format 165b138 verified rtferraz committed on May 5
Add 03_ecommerce_finetune.ipynb - next-purchase prediction with JointFusion, LightGBM baseline comparison 857ec9a verified rtferraz committed on May 5
Add e-commerce pre-training report - successful demo, behavioral clusters found, future improvements noted 2b3e3af verified rtferraz committed on May 5
Update 02_ecommerce notebook: add HF login, memory-free cell, subsample option for <64GB RAM machines 2410b7e verified rtferraz committed on May 5
CRITICAL FIX: Switch from ByteLevel to Whitespace pre-tokenizer - fixes 42% UNK rate on domain token sequences a9c4a62 verified rtferraz committed on May 5
Add 02_ecommerce_pretrain.ipynb - REES46 e-commerce pre-training with sequential entropy check, wandb, push to hub d60868a verified rtferraz committed on Apr 30
Add finance pre-training report - honest analysis of results and lessons learned 709a7e2 verified rtferraz committed on Apr 30
Add .gitignore - Python, Jupyter, training artifacts, IDE files 9211898 verified rtferraz committed on Apr 30
Fix notebook: total_mem → total_memory, add hub_model_id push, add wandb logging support 65ecf7e verified rtferraz committed on Apr 30
Add 01_finance_pretrain.ipynb - Phase 3.1 notebook for pre-training on 5M Nigerian financial transactions 2c3ddfa verified rtferraz committed on Apr 30
Phase 3.0: Pipeline validation demo on mindweave/bank-transactions-us - ALL 10 CHECKS PASSED 6e5b80d verified rtferraz committed on Apr 30
Add ADR-002: Dataset selection for Phase 3 demos - research findings, rationale, phased plan 756d197 verified rtferraz committed on Apr 30
Update implementation report: add Phase 2D, update header to v0.4.0 / 139 tests, update cumulative summary and API 7aac458 verified rtferraz committed on Apr 29
Add fine-tuning test suite - 15 tests covering dataset, batching, forward/backward, Trainer smoke, multiclass abab711 verified rtferraz committed on Apr 29
Add finetune.py - finetune_domain_model (HF Trainer Pattern A, auto tabular_features passthrough) 46a6d37 verified rtferraz committed on Apr 29
Phase 2D: Fine-tuning pipeline - DomainFinetuneDataset, finetune_domain_model, 139 total tests passing 256963c verified rtferraz committed on Apr 29
Update README v0.3.0 - add usage example, update roadmap status, add implementation report link f580186 verified rtferraz committed on Apr 29
Add Phase 2A-2C implementation report - technical decisions, architecture summary, test results 6c4ad4d verified rtferraz committed on Apr 29
Add training test suite - 19 tests covering data pipeline, packing, collation, integration, Trainer smoke test 345d9e3 verified rtferraz committed on Apr 29
Add pretrain.py - pretrain_domain_model with HF Trainer, cosine schedule, DataCollatorForLanguageModeling 6ccb9e6 verified rtferraz committed on Apr 29
Add data_pipeline.py - tokenize_user_sequences, pack_sequences, prepare_clm_dataset 1dfd4e2 verified rtferraz committed on Apr 29
Phase 2C: Pre-training pipeline - data pipeline, sequence packing, HF Trainer CLM, 124 total tests passing 28118c7 verified rtferraz committed on Apr 29
Add model test suite - 33 tests covering config, model, PLR, DCNv2, joint fusion, integration ab8a8b6 verified rtferraz committed on Apr 29
Add DCNv2 + JointFusionModel (nuFormer-style Transformer + tabular fusion) e881ea3 verified rtferraz committed on Apr 29
Add DomainTransformerForCausalLM - GPT-style NoPE model with SDPA attention, weight tying, HF Trainer compatible 0dec8e4 verified rtferraz committed on Apr 29
Phase 2B: Model architecture - DomainTransformerForCausalLM (NoPE, GPT-style), PLR embeddings, DCNv2 + JointFusion, 105 passing tests 2f5969e verified rtferraz committed on Apr 29
Add comprehensive test suite - 72 passing tests covering all components 8efa945 verified rtferraz committed on Apr 29
Add domain_tokenizer.py - DomainTokenizerBuilder (core assembler, HF integration) 818a2e9 verified rtferraz committed on Apr 29
Add field_tokenizers.py - Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical tokenizers 511f3aa verified rtferraz committed on Apr 29
Add schema.py - DomainSchema, FieldSpec, FieldType definitions 1a9dad0 verified rtferraz committed on Apr 29
Phase 2A: Core tokenizer library - schema, field tokenizers, composite builder, predefined schemas, 72 passing tests 0c1ca58 verified rtferraz committed on Apr 29
Update README: add ADR reference, update documentation table and repo structure a239d6e verified rtferraz committed on Apr 29
Add ADR-001: Implementation framework decision with detailed roadmap 25a1093 verified rtferraz committed on Apr 29
Update README with Nubank case study and expanded repo structure e30a14d verified rtferraz committed on Apr 29
Add Nubank nuFormer reverse-engineering analysis - full pipeline reconstruction 51149fa verified rtferraz committed on Apr 29
Add comprehensive research report on domain-specific tokenization be86e60 verified rtferraz committed on Apr 29