Prototype Run 4 is likely the best version for dual stream
The code should be in the run 4 directory.
Run 2 is a bit cleaner, but it's missing the mastery queue, which contributed a lot to validity and prevented collapse.
Run 12, full refactor - dual stream + geometric arbitration tower
The dual-stream is going to use two different signal losses, BCE and CE.
The geometric tower is going to use something called GAL (Geometric Accumulated Loss): an accumulated geometric loss computed over a large batch of information.
This is going to use the two towers as experts to learn geometric structure from, consuming a large queue of batches that starts from zero.
I BELIEVE this should be a more mobile and reusable structure than what I've been working with.
Based on the training outputs with InfoNCE, the geometric system learns adjacently through arbitration with the backprop from the BCE loss. By enabling CE and NCE, the geometric arbitration will use both signals in conjunction, with the sole purpose of attracting toward the sphere's centerpoint while retaining the rotation of the structure.
Starting at 0 means the system will accumulate this information over time rather than begin with randomly intrusive, invalid data or orthogonal embedding anchors. The Procrustes alignment will be based on the other two streams; this model will learn from them, then begin to guide them.
With the GAL loss, the GAL backprop will also be present. The entire purpose of this is to preserve simplex information, and exclusively simplex information, from the other embeddings.
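The Procrustes alignment step can be sketched as follows. This is a minimal NumPy illustration, assuming the standard orthogonal-Procrustes solution is used to align one stream's accumulated queue to the other's; the function names are hypothetical, not from the run 12 code:

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal matrix R minimizing ||A @ R - B||_F (Schonemann's solution)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Toy accumulation queue: embeddings gathered from the two expert streams.
rng = np.random.default_rng(0)
geo_queue = rng.normal(size=(256, 16))          # accumulated geo-stream batch
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # an unknown rotation
std_queue = geo_queue @ Q                       # std stream = rotated geo stream

R = procrustes_rotation(geo_queue, std_queue)
# The alignment recovers the rotation: structure is preserved, not content.
assert np.allclose(geo_queue @ R, std_queue, atol=1e-8)
```

The point of using a rotation (rather than an arbitrary linear map) is exactly the "retain the rotation of the structure" goal: Procrustes can align the queues without distorting simplex geometry.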
It may not be enough, but it'll be an interesting experiment.
Run 11 faults
Decoupling the structure from the geometry killed the signal that TAUGHT the geometry the useful classification potential with BCE. The InfoNCE wasn't enough to arbitrate this into a useful measurable state.
GeoLIP Dual-Stream ViT
Geometric Vision Transformer with bidirectional cross-attention. Two parallel streams – one geometric (KSimplex + Cayley-Menger), one standard – communicate through cross-attention at every layer. Neither stream is fused or concatenated. Both produce independent sphere embeddings that cooperate through normalized addition.
Current Architecture (v6)
Input (B, 3, 32, 32)
→ patch_embed (4×4) + pos_embed → (B, 64, 384)
→ geo_proj(384→192), std_proj(384→192)
3× DualStreamBlock:
geo: self_attn → KSimplex(k=4, 11 feats) → cross_attn(q=geo, kv=std) → FFN
std: self_attn → cross_attn(q=std, kv=geo) → FFN
6× CrossBlock (bidirectional, no KSimplex):
geo: self_attn → cross_attn(q=geo, kv=std) → FFN
std: self_attn → cross_attn(q=std, kv=geo) → FFN
Pool both independently → geo_pooled, std_pooled
geo_emb = normalize(geo_output_proj(geo_pooled)) → S^255
std_emb = normalize(output_proj(std_pooled)) → S^255
emb = normalize(geo_emb + std_emb) → S^255
Constellation: 128 anchors × 256-d
Patchwork: 16 compartments × 128-d = 2048
Classifier: patchwork(2048) + geo_emb(256) + std_emb(256) → 10 classes
Geo classifier: geo_emb(256) → 10 classes (independent probe)
Parameters: ~16.9M (geo route 25%, std route 75%)
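The block structure above can be sketched at the shape level. This is a single-head NumPy illustration with random weights, not the real modules: self-attention, FFN, KSimplex, and the output projections are omitted, so the pooled vectors stay at the per-stream width of 192:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attn(q_tok, kv_tok, Wq, Wk, Wv):
    # queries from one stream, keys/values from the other (streams never fused)
    Q, K, V = q_tok @ Wq, kv_tok @ Wk, kv_tok @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 192                                   # per-stream width
geo = rng.normal(size=(64, d))            # 64 patch tokens per stream
std = rng.normal(size=(64, d))
W = [rng.normal(size=(d, d)) * d ** -0.5 for _ in range(6)]

# one bidirectional exchange, residual style
geo = geo + cross_attn(geo, std, *W[:3])
std = std + cross_attn(std, geo, *W[3:])

# pool independently, then cooperate only through normalized addition
def unit(v): return v / np.linalg.norm(v)
geo_emb, std_emb = unit(geo.mean(0)), unit(std.mean(0))
emb = unit(geo_emb + std_emb)             # all three embeddings live on the sphere
assert np.isclose(np.linalg.norm(emb), 1.0)
```

Normalized addition is the only point of contact between the two pooled embeddings, which is what lets the geo classifier probe the geometric stream independently.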
Key Results
| Run | Architecture | Val Acc | Geo Val | Anchors | Notes |
|---|---|---|---|---|---|
| v1 (Run 1) | 2D+4F fusion | 85.2% | – | 64/64 | First prototype |
| v3 (Run 3) | 2D+4F, mastery | 85.9% | – | 64/64 | Warm-start, 20ep |
| v5 (Run 5/12.2M) | 2D+4F, mastery | 87.2% | – | 62/64 | 12.2M from scratch |
| v5-bidir (Run 9) | 2D+4X bidir | 86.3% | 85% | 59/64 | First bidirectional |
| v6 (Run 10-11) | 3D+6X bidir | 86.0%+ | 85%+ | 105/128 | Wide sphere, in progress |
Key finding: in the bidirectional architecture, the geometric stream alone (geo val) matches the full system's validation accuracy. The standard stream's role converges to that of a training-time geometric teacher acting through cross-attention.
Geometric Constants
- Pentachoron CV band: 0.20–0.23 (universal attractor across 17+ pretrained models)
- Binding/separation constant: 0.29154 (complement 0.70846)
- QK lock: 0.500 (universal cross-modal)
- CM validity: 100% across all runs, all epochs, all configurations
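The CM validity check rests on the Cayley-Menger determinant: the squared volume of a k-simplex is positive iff the point set is geometrically realizable. A minimal version is below; the CV statistic in the band above is then the coefficient of variation (std/mean) of sampled pentachoron volumes:

```python
import numpy as np
from math import factorial

def cm_squared_volume(points):
    """Squared volume of the k-simplex on (k+1) points via Cayley-Menger."""
    k = len(points) - 1
    d2 = np.sum((points[:, None] - points[None, :]) ** 2, axis=-1)
    B = np.ones((k + 2, k + 2))     # bordered squared-distance matrix
    B[0, 0] = 0.0
    B[1:, 1:] = d2
    return (-1) ** (k + 1) * np.linalg.det(B) / (2 ** k * factorial(k) ** 2)

# sanity check: a right triangle with legs 1,1 has area 0.5
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert abs(np.sqrt(cm_squared_volume(tri)) - 0.5) < 1e-9

# a pentachoron is the 4-simplex on 5 points; CM-valid = positive squared volume
penta = np.random.default_rng(0).normal(size=(5, 256))
assert cm_squared_volume(penta) > 0
```

Generic points in a high-dimensional embedding space are affinely independent, which is why 100% CM validity is a health floor rather than a hard achievement; the CV band is the stricter signal.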
Loss Stack
| Loss | Weight | Role |
|---|---|---|
| BCE | 1.0 | Classification (label smoothing 0.1) |
| InfoNCE | 0.1 | Instance discrimination – ALWAYS ON |
| Mastery | 1.0 | Hard neg/pos mining (progressive margin 0.1→0.3) |
| Geo BCE | 0.3 | Geo stream classification signal |
| Geo diversity | 0.5 | Prevent intra-class geo collapse |
| CV (dual) | 0.1 | Pentachoron band on fused + geo embeddings |
| Anchor CV | 0.05 | Dedicated constellation health |
| CM | 0.1 | Simplex validity (already 100%) |
| Spread | 0.001 | Anchor dispersion insurance |
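The always-on InfoNCE term is standard instance discrimination over two augmented views, with matching pairs on the diagonal and temperature 0.07 (per the Run 5 config). A self-contained reference version; the one-direction form here is an assumption, the run's code may symmetrize:

```python
import numpy as np

def info_nce(z1, z2, temp=0.07):
    """One-direction InfoNCE: each view-1 embedding must pick its view-2 partner."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temp                      # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # targets are the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(128, 32))
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))   # matched views
shuffled = info_nce(z, np.roll(z, 1, axis=0))                # mismatched views
assert aligned < shuffled
```

"nce_acc" below is then just the fraction of rows whose argmax lands on the diagonal, which is what saturates to 1.0 and triggers the mastery queue.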
Mastery Queue
Adaptive cross-batch hard contrastive cache. Activates after InfoNCE saturates (nce_acc=1.0 for 50 consecutive batches).
- Queue: 4096 initial, adaptive 1024–16384
- Resize triggers: absolute gap > 9% → grow, gap < 3% → shrink, drift > 3% over 5-epoch window → grow/shrink
- Cooldown: 5 epochs between resizes
- Progressive margin: 0.1 → 0.3 over 5000 batches after activation
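The resize rules can be captured in a small controller. This sketch follows the delta-based trigger from the run 10 plan (the ±2048 step, 1024–16384 bounds, and 5-epoch cooldown are from the notes; the class name is hypothetical, and the absolute-gap trigger is omitted):

```python
class MasteryQueueController:
    """Adaptive hard-example queue: grow on rising overfit gap, shrink on falling."""
    def __init__(self, size=4096, lo=1024, hi=16384, step=2048,
                 cooldown=5, delta_thresh=3.0):
        self.size, self.lo, self.hi = size, lo, hi
        self.step, self.cooldown, self.delta_thresh = step, cooldown, delta_thresh
        self.prev_gap, self.cool = None, 0

    def update(self, train_acc, val_acc):
        gap = train_acc - val_acc
        if self.cool > 0:                         # cooldown prevents rubberbanding
            self.cool -= 1
        elif self.prev_gap is not None:
            delta = gap - self.prev_gap
            if delta > self.delta_thresh:         # overfitting growing -> grow:
                self.size = min(self.size + self.step, self.hi)
                self.cool = self.cooldown         # more diverse negatives
            elif delta < -self.delta_thresh:      # overfitting shrinking -> shrink:
                self.size = max(self.size - self.step, self.lo)
                self.cool = self.cooldown         # tighter, sharper signal
        self.prev_gap = gap
        return self.size

ctrl = MasteryQueueController()
assert ctrl.update(90.0, 85.0) == 4096   # first epoch: no delta yet
assert ctrl.update(95.0, 86.0) == 6144   # gap 5.0 -> 9.0, delta +4 -> grow
assert ctrl.update(99.0, 86.0) == 6144   # cooldown holds the size
```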
Checkpoints
All checkpoints are experimental and versioned by run. They are stored as-is from training with no guarantee of cross-version compatibility. The config dict in each checkpoint records the exact architecture parameters used.
checkpoints/
dual_stream_best.pt # v1 baseline (6.3M, 85.2%)
dual_stream_v3_best.pt # v3 mastery warm-start (6.3M, 85.9%)
dual_stream_v3_e100.pt # v3 from-scratch 100ep (6.3M, 85.7%)
dual_stream_v5_best.pt # v5 bidirectional (7.9M, 86.3%)
dual_stream_v6_best.pt # v6 wide bidirectional (16.9M, in progress)
dual_stream_v6_e*.pt # v6 periodic checkpoints
EmbeddingAutograd
Custom autograd function applied to all three sphere embeddings (emb, geo_emb, std_emb):
- Tangential projection (tang=1.0): projects gradient onto tangent plane of S^d, keeping updates on the manifold
- Anchor separation (sep=0.1): gradient correction pushing away from nearest constellation anchor
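The backward-pass transform amounts to two gradient edits. Below is a NumPy sketch for a single embedding's gradient; the real version is presumably a torch.autograd.Function applied batch-wise, and the function name here is illustrative:

```python
import numpy as np

def embedding_grad(emb, grad, anchors, tang=1.0, sep=0.1):
    """emb: unit vector on S^d; grad: upstream gradient; anchors: (A, d) unit rows."""
    # 1) tangential projection: remove the radial component so a gradient-descent
    #    step stays (to first order) on the sphere
    g = grad - tang * np.dot(grad, emb) * emb
    # 2) anchor separation: descending on +sep*nearest lowers cosine similarity
    #    to the closest constellation anchor
    nearest = anchors[np.argmax(anchors @ emb)]
    return g + sep * nearest

rng = np.random.default_rng(0)
d = 256
emb = rng.normal(size=d); emb /= np.linalg.norm(emb)
anchors = rng.normal(size=(128, d))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)
grad = rng.normal(size=d)

g = embedding_grad(emb, grad, anchors, sep=0.0)
assert abs(np.dot(g, emb)) < 1e-10      # purely tangential when sep=0
```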
Research Journal
Run 11 changes, run 10 rerun update
Run 11 will have decoupled InfoNCE from the downstream blocks, so only the geometric structures ever see the losses.
Downstream, likely the entire sector of single-stream blocks, will carry the BCE loss, while upstream progress will be driven by a combination of InfoNCE and indirect attenuation from downstream, which, according to this analysis, may not be required if the geometry is as solid as the output suggests.
This will potentially be the deciding ground between the anchoring system functioning correctly and the anchoring system providing nothing.
I'll work with the results and adapt.
Run 10 update
With cross-attention enabled, the geometric structure is enough to classify the entire thing.
EVERYTHING else is pretty much arbitrary and can be snapped off after.
Unexpected outcome and definitely welcome.
Run 10 plan
ARCHITECTURE v5 v6
DualBlocks (KSimplex): 2 3
CrossBlocks (bidir): 4 6
Total depth: 6 9
Output dim (sphere): 128-d 256-d
Anchors: 64 128
Patchwork compartments: 8×64 = 512 16×128 = 2048
Classifier input: 512+128+128 2048+256+256
= 768 = 2560
MASTERY QUEUE v5 v6
Queue size: fixed 4096 adaptive 1024–16384
Initial: 4096 4096
Resize step: – ±2048
Cooldown: – 5 epochs
Trigger: – overfit gap Δ > 3%
ADAPTIVE QUEUE LOGIC:
Each epoch after mastery activation:
gap = train_acc - val_acc
delta = gap - prev_gap
if delta > +3.0 (overfitting growing):
queue ↑ 2048 (cap 16384)
→ more diverse negatives = regularization
if delta < -3.0 (overfitting shrinking):
queue ↓ 2048 (floor 1024)
→ tighter signal = sharper boundaries
After any resize: 5 epoch cooldown
→ prevents rubberbanding
Console: "Queue ↑ 4096→6144 (gap 8.2→11.5, Δ=+3.3)"
TB: epoch/queue_max, epoch/queue_size
EVERYTHING ELSE PRESERVED:
InfoNCE always on at 0.1
Mastery margin 0.1→0.3 over 5000 batches
Geo classifier 0.3, geo diversity 0.5
Label smoothing 0.1, dual CV 0.1, CM 0.1
AdamW wd=0.01, LR 3e-4, cosine schedule
EmbeddingAutograd tang=1.0 sep=0.1 on all three embeddings
Run 9 output, did not reach the marks yet.
I'm establishing a dynamic mastery batch size increase schedule.
Upon overfitting detection, the model will naturally default to more difficult problem solving. This should apply an additional level of difficulty and complexity and help counter overfitting on this particular task.
Mining hard negatives was a very powerful strategy, but I also believe it needs additional controllers and a schedule to be robust.
With that I've expanded the model to 3 dual-stream blocks, 6 single blocks. This should allow additional depth retention, but with it comes extra computation.
The model is still exceedingly compact, but it's still over-sized relative to what the task needs for the geometry to align.
This is an interesting outcome that I didn't expect; the geometry is not easily represented within a condensed space without pretraining a frozen state.
This makes a difficult journey but not an impossible one.
Run 9 update; dual InfoNCE shared attention
Don't celebrate just yet, the geometry must both survive and provide additional utility.
The conversation is solid and the geometric structure is actually EXCEPTIONALLY accurate without needing both channels.
Run 9 disable InfoNCE with mastery only output shows
87.2, nearly identical to the earlier run AFTER the anchor collapse.
Removing InfoNCE did not provide the necessary behavior, and with that the anchor collapse occurred.
The output did not form the necessary linkages required to reach 90%+ just yet.
Thoughts
So the way I'm seeing this is fairly straightforward. We are collapsing useful complexity into simplicity, and that simplification is happening in a way that isn't useful to prediction. The formulas themselves are destroyed in the process, and the functional system's outcome is not carrying the necessary implications through the natural progressive formula system as required.
With that likelihood, I present the possibility that we are forming useful rocks. Something that the later collapse is compacting and still using, but the spherical shapes are instead turned to clay in the transformer systems over time due to the nature of collapse.
This causes the transformer structure to default to the simplest outcome possible, which is condense and flatten, smash into the shape required, become uniform and fit into the square peg, completely ignore the very nature of the system and the attenuated subsystem - forcing the model and task to conform to the outcome, rather than the outcome to conform to the task and system.
I propose we don't bypass InfoNCE, and instead enable cross-attention throughout the whole structure rather than single-direction attention. This will provide the necessary adjudication between both passages rather than one, and it should allow full cooperation and preservation of the geometric state.
In other words, link both pools, and provide bi-directional geometric attention.
Run 10 hypothesis
Low-hanging fruit... I need to figure out a way to represent the geometric structure as the most likely source of useful data, so the model can be directly represented in a purer loss that cannot simply be bypassed.
My original hypothesis was to use a representative patchwork, which showed high-yield potential but has drawbacks and is a massive burden to calculate.
Run 9 suspected collapse may be a symptom of convergence
This was a very interesting outcome. By disabling the InfoNCE it cut the lifeline to the geometric anchored subset, and with that the model was left to the other losses.
Even with the backprop, the cv, and everything intact the model still chose 1 anchor. The model has likely hollowed out a point where it can make the most use out of the space with dropout, without requiring a full comprehensive representation of the geometric state.
I'm still grasping at strands on a tapestry for that one, but it's very possible the collapse is just bypass: the geometry is difficult to attenuate, and the geometric state, having more rank than a simple passthrough, isn't being represented well enough on the final state. That makes the path of least resistance the data route and simple condensation, rather than the legitimate functional housing required to fully calculate the end-state geometric form.
Hold on a second, the geometric structure is in fact improving, it is not collapsing... Interesting... The geometric accuracy is rising over time.
Run 9 reboot, correct settings
InfoNCE disabled as planned, enabling the mastery engine; after a brief hiccup the engine took over and the system started gaining again.
Very early in the training nearing 80% val.
Critical anchor fault, without the InfoNCE the anchors are left unattended and default to CLS behavior.
However, I will let the training continue.
The model is a master of the domain and has a full picture of the geometric structure, so the geometry is being used how the model sees fit.
This is the choice of this model, with these constraints, with this data, in this setting. It's collapsing the intentions into a singular route of intentions, and the math within that route being lensed outward directly reflects the lensing of this model in conjunction with its measured states.
After the mastery of NCE happens, the model likely has a perfect internal representation of the need and the means to fulfill that need within the constraints.
This does not necessarily mean this is the correct route for the most cohesion, but the experiment continues.
Run 9 continues
Odd hunch... Must follow up
I need to run a full fissure analysis on the weights to ensure the model wasn't corrupted.
I have a very strange feeling that I could very well be getting some bf16 artifacting, and I can easily test for this so I will.
⚠ 4 layers received NO gradient:
geo_output_proj.0.weight
geo_output_proj.0.bias
geo_output_proj.1.weight
geo_output_proj.1.bias
Damn.
I believe I found the problem, there's an exploding block, and projection appears to be gradient dead.
Okay, so it wasn't gradient dead, it was just not being analyzed correctly. It was gradient starved because it was handling weak penta-sampled gradients, which were mean-calculated, so touching the mean detected it.
I will attempt 2 possible directions
- Disable InfoNCE completely after mastery begins, letting only the mastery stage of InfoNCE take center stage.
- Cross-attention InfoNCE rather than single-direction, to enable consensus discussion rather than adversarial competition down the binding chain.
The way I see it is simple. The system is grinding, not flowing. The system should be flowing, the numbers should be clean, and the outcome should be working.
More than likely, the very constraints I'm applying to shape the data are affecting the very flow the data is supposed to take. As they each adjacently shape their own submanifolds between the two structures, the output does not speak cooperation; it speaks competition.
Two views of the same puzzle, one asymmetrically getting more data than the other. Competition isn't required, resonance and cooperation is.
First test will be disabling the InfoNCE and allowing the competition to come to fruition, the second is to enable full cooperation. Those are my two choices, and based on the data either may reach the same goal. Competition and semi-adversarial vs cooperation.
Compounding the run data
I'm now compounding training data across the full run spectrum to identify the most likely candidates for enhancement.
Run 7
Enabling autograd on both sides with CV loss and a heavily reduced InfoNCE loss.
============================================================
CIFAR-10 – Dual-Stream GeoLIP ViT – EXP 3 (FULL CV)
Warm start from:
CV: weight=1.0 (FULL), target=0.22
InfoNCE: weight=0.1 (REDUCED)
Autograd: tang=0.5, sep=0.1
LR: 0.0001, epochs: 20
Device: cuda
============================================================
Train: 50,000 (two views) Val: 10,000
Building model...
⚠ No v1 checkpoint found at , training from scratch
Parameters: 6,321,542
Geo route: 2,552,764 (40.4%)
Std route: 3,768,778 (59.6%)
20 epoch test then a full 100 to compare.
Last was 87.1 in less time and the train didn't overfit until around epoch 95.
Run 5
The full loss spectrum was a bit unstable; let's provide some expected task-outcome assistance without over-regularizing the geometry.
# Architecture (must match v1)
NUM_CLASSES = 10
IMG_SIZE = 32
PATCH_SIZE = 4
EMBED_DIM = 384
STREAM_DIM = 192
FUSED_DIM = 256
N_DUAL_BLOCKS = 2
N_FUSED_BLOCKS = 4
N_HEADS = 8
OUTPUT_DIM = 128
N_ANCHORS = 64
N_COMP = 8
D_COMP = 64
ANCHOR_DROP = 0.10
CV_TARGET = 0.2
# NEW for v2
CV_WEIGHT = 0.5
ENABLE_AUTOGRAD = True
AUTOGRAD_TANG = 1.0
AUTOGRAD_SEP = 0.1
# Training
BATCH = 512
EPOCHS = 100
LR = 1e-3
WARMUP = 5
GRAD_CLIP = 1.0
INFONCE_WEIGHT = 0.5
BCE_WEIGHT = 1.0
CM_WEIGHT = 0.5
INFONCE_TEMP = 0.07
This might contribute, 0.5 cv with autograd, 0.5 infonce with mastery, BCE as primary task.
Run 4 full loss analysis
The CV destabilized, so each validation set showed an entirely different CV response.
The train accuracy saturated early, which is likely due to the model being allowed to drift too much from the anchored intentions.
This isn't necessarily a good or a bad thing, but since the model didn't learn the necessary task it's not useful for this case.
The effective dimensionality of the embeddings is nearly perfect. The most difficult classes are the most unstable, and the CV of the anchors is flawlessly aligned to 0.2, the value that structurally significant components contributing to a system have historically shown.
The faulty classes are not hindering the successful classes, but there is a consistent instability that can be directly analyzed for behavior.
=================================================================
SCAN 1: EMBEDDING HEALTH
=================================================================
Norms: mean=1.000000 std=0.000000
Self-similarity: mean=0.0197 std=0.1353
✓ No embedding collapse
Effective dim: 92.1/128
top-3 SVs explain 14.0%
top-5 SVs explain 21.8%
top-10 SVs explain 36.0%
top-20 SVs explain 45.9%
top-50 SVs explain 70.6%
Pentachoron CV (GeoLIP structural spectrum):
global embedding: CV=0.1149 (✗ outside) [500/500 valid, mean_vol=0.084180]
Per-class:
airplane : CV=0.2421 (✓ IN BAND) [200/200 valid, mean_vol=0.050191]
automobile: CV=0.1793 (✗ outside) [200/200 valid, mean_vol=0.042003]
bird : CV=0.2069 (✓ IN BAND) [200/200 valid, mean_vol=0.056354]
cat : CV=0.1539 (✗ outside) [200/200 valid, mean_vol=0.063555]
deer : CV=0.1997 (✓ IN BAND) [200/200 valid, mean_vol=0.056215]
dog : CV=0.1731 (✗ outside) [200/200 valid, mean_vol=0.056146]
frog : CV=0.1985 (✓ IN BAND) [200/200 valid, mean_vol=0.049812]
horse : CV=0.2266 (✓ IN BAND) [200/200 valid, mean_vol=0.047620]
ship : CV=0.2024 (✓ IN BAND) [200/200 valid, mean_vol=0.044833]
truck : CV=0.2309 (✓ IN BAND) [200/200 valid, mean_vol=0.046396]
Class CV: mean=0.2013 std=0.0274 range=[0.1539, 0.2421]
anchor constellation: CV=0.2002 (✓ IN BAND) [200/200 valid, mean_vol=0.077932]
Patch-level CV (from fused patch projections):
all patches (flat): CV=0.1593 (✗ outside) [500/500 valid, mean_vol=0.080665]
Warm-start E20 From-scratch E100
Val accuracy: 85.9% 83.5%
Eff dim: 54.1/128 92.1/128
Global CV: 0.1922 ✓ 0.1149 ✗
Anchor CV: 0.1303 ✗ 0.2002 ✓
Class CV mean: 0.2973 0.2013
Classes in band: 2/10 7/10
1-NN accuracy: 80.6% 77.4%
Same/diff gap: 0.3361 0.2411
hard_neg (final): 0.525 0.372
hard_pos (final): 0.061 0.023
Run 4 full retrain with full enabled at 1.0
Enabling all the losses at 1.0 strength, InfoNCE mastery curation, cv_loss, autograd control, and the works.
512 batch size instead of 128, lr 1-e3
Lets see how the retrain takes.
Early outputs show the mastery clause hit pretty early and the model is showing no signs of plateau yet.
CV fairly stable and rising to the band after CV drop from smoothness increase.
Run 3 analysis
The warm-started model had almost all classes out of the CV band, and the trained version now has two within band.
Absolutely fascinating.
v1 (baseline) v3 (mastery)
Val accuracy: 85.2% 85.9%
Global CV: 0.1061 ✗ 0.1922 ✓ IN BAND
Patch CV: 0.1768 ✗ 0.2182 ✓ IN BAND
1-NN accuracy: 57.3% 80.6%
Same/diff gap: 0.0911 0.3361
Class CV range: [0.12, 0.15] [0.24, 0.37]
catΓdog cos: 0.358 0.577
Self-similarity: 0.003 0.067
The trajectory of the NCE is the most telling gap. Previously the NCE saturated to mastery; now we have a meaningful trajectory of utility that still uses valid geometry, in a better fashion.
Epoch hard_neg hard_pos gap
1 0.466 -0.216 0.682
5 0.549 -0.028 0.577
10 0.540 0.019 0.521
15 0.531 0.054 0.477
20 0.525 0.061 0.464
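The hard_neg and hard_pos columns above are the mining statistics: the most similar wrong-class neighbor and least similar same-class neighbor in cosine space, with gap = hard_neg - hard_pos. A sketch of how they'd be measured on a batch (assumed definitions; the margin and queue logic are omitted):

```python
import numpy as np

def mining_stats(emb, labels):
    """hard_neg: mean max cross-class cosine; hard_pos: mean min same-class cosine."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, np.nan)                      # exclude self-similarity
    same = labels[:, None] == labels[None, :]
    hard_neg = np.nanmax(np.where(~same, sims, np.nan), axis=1).mean()
    hard_pos = np.nanmin(np.where(same, sims, np.nan), axis=1).mean()
    return hard_neg, hard_pos, hard_neg - hard_pos

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 16)                   # 10 classes x 16 samples
centers = rng.normal(size=(10, 64))
emb = centers[labels] + 0.3 * rng.normal(size=(160, 64))
hn, hp, gap = mining_stats(emb, labels)
# well-clustered synthetic classes: worst positive still beats hardest negative
assert hp > hn
```

On real embeddings mid-training the table shows the opposite ordering (hard_neg above hard_pos), which is exactly why those pairs are worth mining.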
The mastery clause... When a task is completed and the model does not correctly align, reassess the implications of the task by attracting more data to the master... How intriguing.
It worked, as well. So this could very well be an effective debugging tool I could use in the future.
Run 3 attempt 1
It's kind of working, but I don't think it's tuned EXACTLY to the correct objective yet.
The mastery clause is a powerful idea, I've never actually put it to words.
Every time I personally think I've mastered something, I have to test that theory. EVERY SINGLE TIME I know for certain as of now, that I will always be able to learn new information, and I'm never sure where that source of information will come from.
This same concept can be applied to geometric learning, further enhancing and improving the model over time.
This may not be the exact route to take for it, but it brings to light a new form of... Development that I wish to explore.
First tests show, the master can still learn.
Run 3 enhanced; InfoNCE prestige
After the InfoNCE masters the task, the followup processing will enable a 'second guess' clause, which forces the attenuation to multi-batch interpret the necessary implications through a cached structure subdivision.
This with the intention of increasing the NCE mastery's capacity to differentiate, will potentially assist the next attempt.
Every incorrect assessment provides rebounding effect towards the potential goal, adjusting the NCE geometric structure over time.
The initial NCE task is too easy towards what must happen, so the original solution is found too early and codified.
However, that does not necessarily mean the learned version will or will not benefit from this structure being reshaped over time.
Run 2 early recovery unexpected
I once again underestimated the tenacity of the law of averages.
I underestimated the alignment of the geometric structure on the patchwork half, cross-attend did exactly what it's supposed to do.
Interesting outcome... Not entirely unexpected, and not unwelcome.
It's a slow-burn gain, but the current variation is keeping train accuracy from fully overfitting, while still allowing some trickle learning...
Interesting.
Run 2 - early stages expected
This should drop to around 80% or so, give or take train accuracy throughout the run, while the validation accuracy should hold fairly stable.
This follow-up experiment tests how well we can control the solidity of the internal geometry after first-pass training. My insight is that we need to destroy the non-geometric attention in favor of implicit geometric alignment, adjacent to the explicit geometric modeling being slightly nudged by the CV.
This could very well solve the faults.
Run 2 - warm start continuation
Enabling cv loss on only the geometric parameters, only a little nudgey nudge. Nothing too drastic here.
On the opposite side, I'm enabling full destruction autograd to forcefully realign the model.
Highly destructive experiment and this should be interesting.
Run 1 analysis - HIGHLY successful prototype
Moved to the run1 directory.
=================================================================
SCAN 8: ARCHITECTURE SUMMARY
=================================================================
Total params: 6,296,582
Geo stream params: 1,281,340 (20.3%)
Std stream params: 1,261,248 (20.0%)
Fused block params: 3,159,552 (50.2%)
Constellation params: 84,480 (1.3%)
=================================================================
DIAGNOSIS SUMMARY
=================================================================
Val accuracy: 85.2%
Eff dim: 58.3/128
Pentachoron CV: 0.1085 (target band: 0.20-0.23)
Self-similarity: 0.0027
Pooled anchors: 64/64
Patch anchors: 64/64
Per-img p_anch: 15.2
Entropy: 98%
Gini: 0.2067
CM volumes: 200/200 positive
Anchor CV: 0.1389
Class CV range: [0.1374, 0.1621]
Geo feat var: 0.000838
Block 0 CM valid: 100.0%
Block 1 CM valid: 100.0%
Same/diff gap: 0.0911
1-NN accuracy: 57.3%
✓ No major issues. Geometry is healthy.
I'm trying to not jump out of my chair. It's fully validated geometrically and 85% accurate for cifar10 image validation.
Now, I know this is odd to celebrate an 85% cifar10 validation at 6.3m params... but holy moly it worked.
It not only worked, but it worked highly effectively. I have work to do. We must expand.
Run 1:
Well it's working.
I cut most of the distillation-centric losses in favor of direct learning losses to prevent overwhelming noise and incorrect internal arbitration.
Geometry fully intact, patchwork learning effectively, nce fully saturated.
It's not perfect yet, the patchwork matched to overfit, but I have a serious idea for how to guarantee a solidified patchwork.
Still needs the cv gauge, pentachoron sampling CV loss, and a new form of autograd. I'm waiting for it to finish before I decide to modify anything major. It's pretty vanilla so far.
As you can see the simple geometric gauges survived the full run, but that doesn't mean it's geometry just yet. I left it unconstrained.
The ksimplex channeling and simplex curation were sufficient for earlier experiments, but they require assessment on completion.
Preliminary and setup
As the vit experiments from yesterday showed, the vit itself when fused directly with the geometry does not benefit enough to substantiate a direct fusion in the state I was experimenting with.
Instead of continuing down that route, which is a losing battle of weighted arbitration, I've decided to head in the direction of dual-stream.
The whole point of these systems is to test whether the geometry can in fact stand on its own, without an arbitration expert model. Yesterday's experiments show that it can, but direct fusion was quite the balancing act. No matter: the geometry DID survive, which means the battle was won and I have the blueprint. I will attempt to make the geometry apply its formulas without destroying the data, potentially improving performance and speed while introducing increased retention and utility.
However, if it does not improve performance or quality or speed, then it's going to need a legitimate series of detections to determine that exact optimization procedure, so the autograd will need tuning, and I'll need to format a better Adam.
[GEOMETRY] <-> [PATCHWORK]
[GEOMETRY] <-> [PATCHWORK]
[GEOMETRY->PATCHWORK]
[GEOMETRY->PATCHWORK]
[GEOMETRY->PATCHWORK]
[GEOMETRY->PATCHWORK]
Where we teach the geometry and patchwork separately but cross-attended, and then pool them for residuals after each dual-stream block. We ensure the sequence survives, and the geometric values down the chain fit within the hypersphere, while further down we pass through standard transformer fused systems while still simultaneously ensuring the geometry survives.
It'll be a bit of a kerfuffle up front, but it should solve the cross-contamination problem where either the geometry is destroyed or the patchwork is destroyed by overwhelming geometry. This way both can benefit, and the substructure can be better tested for validity and shared-learning accuracy.












