Prototype Run 4 is likely the best version for dual stream

The code should be in the run 4 directory.

Run 2 is a bit cleaner, but it's also missing the mastery queue, which brought a lot to the table for validity and prevented collapse.

Run 12, full refactor - dual stream + geometric arbitration tower

The dual-stream is going to use two different signal losses, BCE and CE.

The geometric tower is going to use something called GAL.

Geometric Accumulated Loss: a geometric loss accumulated over a large queue of batch information.

It will use the two towers as experts to learn geometric structure from, consuming a large queue of batches that starts from 0.

I BELIEVE this should be a more mobile and reusable structure than what I've been working with.

Based on the training outputs with InfoNCE, the geometric system learns adjacently through arbitration with the backprop from BCE loss. By enabling both CE and NCE, the geometric arbitration will use the two signals in conjunction, with the sole purpose of attracting toward the sphere's centerpoint while retaining the rotation of the structure.

Starting at 0 means the system will accumulate this information over time rather than begin with random invalid data or orthogonal embedding anchors. The Procrustes alignment will be based on the other two streams; this model will learn from them, then begin to guide them.
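For reference, the Procrustes alignment of the new tower's embeddings to a teacher stream has a closed-form solution; a minimal sketch (the function name `procrustes_rotation` is illustrative, not the repo's API):

```python
import numpy as np

def procrustes_rotation(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: the orthogonal R minimizing ||source @ R - target||_F.
    The SVD of the cross-covariance gives the closed-form solution."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt
```

Applied once per teacher stream, such rotations would let the accumulating tower compare its queue against each stream in a shared frame without reshaping either stream internally.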

With the GAL loss, the GAL backprop will also be present. The entire purpose of this is to preserve simplex information, and exclusively simplex information, from the other embeddings.
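GAL isn't pinned down in code here yet, so the following is only a minimal sketch of the described idea: accumulate geometry over a queue of batches that starts empty, and penalize the current batch for deviating from the accumulated statistic. The class name and the mean-pairwise-distance statistic are assumptions, not the repo's implementation.

```python
import torch

class GeometricAccumulatedLoss:
    """Sketch of GAL as described: a loss accumulated over a queue of batches
    that starts from zero. The accumulated statistic here is the mean pairwise
    distance of everything seen so far."""
    def __init__(self, max_batches: int = 64):
        self.history = []            # FIFO queue of detached embedding batches
        self.max_batches = max_batches

    def __call__(self, emb: torch.Tensor) -> torch.Tensor:
        loss = emb.new_zeros(())
        if self.history:
            ref = torch.cat(self.history)            # accumulated geometry so far
            d_ref = torch.cdist(ref, ref).mean()     # reference pairwise scale
            d_cur = torch.cdist(emb, emb).mean()     # current batch scale
            loss = (d_cur - d_ref) ** 2              # pull current batch toward it
        self.history.append(emb.detach())
        if len(self.history) > self.max_batches:
            self.history.pop(0)
        return loss
```

Because the queue starts empty, the first batch contributes no loss at all; the constraint only grows teeth as history accumulates, which matches the "start from 0" intent.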

It may not be enough, but it'll be an interesting experiment.

Run 11 faults

Decoupling the structure from the geometry killed the signal that TAUGHT the geometry the useful classification potential with BCE. The InfoNCE wasn't enough to arbitrate this into a useful measurable state.

GeoLIP Dual-Stream ViT

Geometric Vision Transformer with bidirectional cross-attention. Two parallel streams, one geometric (KSimplex + Cayley-Menger) and one standard, communicate through cross-attention at every layer. Neither stream is fused or concatenated. Both produce independent sphere embeddings that cooperate through normalized addition.

Current Architecture (v6)

Input (B, 3, 32, 32)
  → patch_embed (4×4) + pos_embed → (B, 64, 384)
  → geo_proj(384→192), std_proj(384→192)

3× DualStreamBlock:
  geo: self_attn → KSimplex(k=4, 11 feats) → cross_attn(q=geo, kv=std) → FFN
  std: self_attn → cross_attn(q=std, kv=geo) → FFN

6× CrossBlock (bidirectional, no KSimplex):
  geo: self_attn → cross_attn(q=geo, kv=std) → FFN
  std: self_attn → cross_attn(q=std, kv=geo) → FFN

Pool both independently → geo_pooled, std_pooled
  geo_emb = normalize(geo_output_proj(geo_pooled))  → S^255
  std_emb = normalize(output_proj(std_pooled))      → S^255
  emb     = normalize(geo_emb + std_emb)            → S^255

Constellation: 128 anchors × 256-d
Patchwork: 16 compartments × 128-d = 2048
Classifier: patchwork(2048) + geo_emb(256) + std_emb(256) → 10 classes
Geo classifier: geo_emb(256) → 10 classes (independent probe)

Parameters: ~16.9M (geo route 25%, std route 75%)
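A minimal sketch of the bidirectional CrossBlock layout and the normalized-addition cooperation described above (norms, dropout, and the KSimplex module are omitted; class and function names are illustrative, not the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossBlock(nn.Module):
    """One bidirectional layer: each stream self-attends, then queries the
    other stream via cross-attention; the streams are never fused or
    concatenated."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.geo_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.std_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.std_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.geo_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.std_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, geo, std):
        geo = geo + self.geo_self(geo, geo, geo)[0]
        std = std + self.std_self(std, std, std)[0]
        geo = geo + self.geo_cross(geo, std, std)[0]  # q=geo, kv=std
        std = std + self.std_cross(std, geo, geo)[0]  # q=std, kv=geo
        geo = geo + self.geo_ffn(geo)
        std = std + self.std_ffn(std)
        return geo, std

def combine(geo_emb, std_emb):
    """Sphere embeddings cooperate through normalized addition: the sum of two
    unit vectors, renormalized back onto the sphere."""
    return F.normalize(geo_emb + std_emb, dim=-1)
```

Note that normalized addition keeps both embeddings independently meaningful: either one can be "snapped off" and evaluated alone, which is exactly what the geo classifier probe does.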

Key Results

| Run | Architecture | Val Acc | Geo Val | Anchors | Notes |
|-----|--------------|---------|---------|---------|-------|
| v1 (Run 1) | 2D+4F fusion | 85.2% | - | 64/64 | First prototype |
| v3 (Run 3) | 2D+4F, mastery | 85.9% | - | 64/64 | Warm-start, 20ep |
| v5 (Run 5/12.2M) | 2D+4F, mastery | 87.2% | - | 62/64 | 12.2M from scratch |
| v5-bidir (Run 9) | 2D+4X bidir | 86.3% | 85% | 59/64 | First bidirectional |
| v6 (Run 10-11) | 3D+6X bidir | 86.0%+ | 85%+ | 105/128 | Wide sphere, in progress |

Key finding: In the bidirectional architecture, the geometric stream alone (geo val) matches the full system's validation accuracy. The standard stream's role converges to training-time geometric teacher through cross-attention.

Geometric Constants

  • Pentachoron CV band: 0.20–0.23 (universal attractor across 17+ pretrained models)
  • Binding/separation constant: 0.29154 (complement 0.70846)
  • QK lock: 0.500 (universal cross-modal)
  • CM validity: 100% across all runs, all epochs, all configurations
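The pentachoron CV above can be measured by sampling 5-point subsets of the embeddings, computing each 4-simplex volume via the Cayley-Menger determinant, and taking std/mean over the sampled volumes. A sketch (function names are illustrative; the repo's scan script may differ in sampling details):

```python
from math import factorial
import numpy as np

def cayley_menger_volume(pts: np.ndarray) -> float:
    """Volume of the k-simplex spanned by k+1 points, via the Cayley-Menger
    determinant: V^2 = (-1)^(k+1) / (2^k (k!)^2) * det(CM)."""
    k = pts.shape[0] - 1
    d2 = np.square(pts[:, None, :] - pts[None, :, :]).sum(-1)  # pairwise squared distances
    cm = np.ones((k + 2, k + 2))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    coeff = (-1) ** (k + 1) / (2 ** k * factorial(k) ** 2)
    return float(np.sqrt(max(coeff * np.linalg.det(cm), 0.0)))

def pentachoron_cv(emb: np.ndarray, n_samples: int = 500, seed: int = 0) -> float:
    """CV (std/mean) of 4-simplex volumes over random 5-point samples."""
    rng = np.random.default_rng(seed)
    vols = np.array([cayley_menger_volume(emb[rng.choice(len(emb), 5, replace=False)])
                     for _ in range(n_samples)])
    return float(vols.std() / vols.mean())
```

A positive Cayley-Menger volume is also what the "CM validity" counts in the scans check: degenerate (flattened) pentachora produce zero or negative determinant-derived squared volumes.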

Loss Stack

| Loss | Weight | Role |
|------|--------|------|
| BCE | 1.0 | Classification (label smoothing 0.1) |
| InfoNCE | 0.1 | Instance discrimination, ALWAYS ON |
| Mastery | 1.0 | Hard neg/pos mining (progressive margin 0.1→0.3) |
| Geo BCE | 0.3 | Geo stream classification signal |
| Geo diversity | 0.5 | Prevent intra-class geo collapse |
| CV (dual) | 0.1 | Pentachoron band on fused + geo embeddings |
| Anchor CV | 0.05 | Dedicated constellation health |
| CM | 0.1 | Simplex validity (already 100%) |
| Spread | 0.001 | Anchor dispersion insurance |

Mastery Queue

Adaptive cross-batch hard contrastive cache. Activates after InfoNCE saturates (nce_acc=1.0 for 50 consecutive batches).

  • Queue: 4096 initial, adaptive 1024–16384
  • Resize triggers: absolute gap > 9% → grow, gap < 3% → shrink, drift > 3% over 5-epoch window → grow/shrink
  • Cooldown: 5 epochs between resizes
  • Progressive margin: 0.1 → 0.3 over 5000 batches after activation
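A minimal sketch of the cross-batch hard-negative cache with the progressive margin described above (the repo's activation and mining rules may differ; class and method names are illustrative):

```python
import torch
import torch.nn.functional as F

class MasteryQueue:
    """Sketch of the cross-batch contrastive cache: a FIFO queue of past
    embeddings, a hardest-negative margin loss, and a margin that ramps
    0.1 -> 0.3 over 5000 batches after activation."""
    def __init__(self, dim: int, size: int = 4096,
                 margin_start: float = 0.1, margin_end: float = 0.3, ramp: int = 5000):
        self.queue = torch.zeros(0, dim)
        self.size = size
        self.margin_start, self.margin_end, self.ramp = margin_start, margin_end, ramp
        self.step = 0

    def margin(self) -> float:
        t = min(self.step / self.ramp, 1.0)      # linear ramp after activation
        return self.margin_start + t * (self.margin_end - self.margin_start)

    def loss(self, emb: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        """emb, pos: (B, d) unit embeddings of two views of the same instances."""
        self.step += 1
        pos_sim = (emb * pos).sum(-1)                        # positive similarity
        if len(self.queue):
            hard_neg = (emb @ self.queue.T).max(-1).values   # hardest cached negative
        else:
            hard_neg = torch.zeros_like(pos_sim)
        out = F.relu(hard_neg - pos_sim + self.margin()).mean()
        self.queue = torch.cat([emb.detach(), self.queue])[: self.size]
        return out
```

Because the queue spans many past batches, the hardest negative is mined from far more candidates than a single batch provides, which is what makes it a useful follow-up once in-batch InfoNCE has saturated.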

Checkpoints

All checkpoints are experimental and versioned by run. They are stored as-is from training with no guarantee of cross-version compatibility. The config dict in each checkpoint records the exact architecture parameters used.

checkpoints/
  dual_stream_best.pt          # v1 baseline (6.3M, 85.2%)
  dual_stream_v3_best.pt       # v3 mastery warm-start (6.3M, 85.9%)
  dual_stream_v3_e100.pt       # v3 from-scratch 100ep (6.3M, 85.7%)
  dual_stream_v5_best.pt       # v5 bidirectional (7.9M, 86.3%)
  dual_stream_v6_best.pt       # v6 wide bidirectional (16.9M, in progress)
  dual_stream_v6_e*.pt         # v6 periodic checkpoints

EmbeddingAutograd

Custom autograd function applied to all three sphere embeddings (emb, geo_emb, std_emb):

  • Tangential projection (tang=1.0): projects gradient onto tangent plane of S^d, keeping updates on the manifold
  • Anchor separation (sep=0.1): gradient correction pushing away from nearest constellation anchor

Research Journal

Run 11 changes, run 10 rerun update


Run 11 will decouple InfoNCE from the downstream blocks, so only the geometric structures ever see that loss.

Downstream, likely the entire sector of single-stream blocks, will run on BCE loss, while upstream the progress will be driven by InfoNCE combined with indirect attenuation from downstream - which, according to this analysis, may not be required if the geometry is as solid as the output suggests.

This will potentially be the deciding ground between the anchoring system functioning correctly and the anchoring system providing nothing.

I'll work with the results and adapt.

Run 10 update

With cross-attention enabled, the geometric structure is enough to classify the entire thing.

EVERYTHING else is pretty much arbitrary and can be snapped off after.


Unexpected outcome and definitely welcome.

Run 10 plan

ARCHITECTURE                v5              v6
DualBlocks (KSimplex):      2               3
CrossBlocks (bidir):        4               6
Total depth:                6               9
Output dim (sphere):        128-d           256-d
Anchors:                    64              128
Patchwork compartments:     8×64 = 512      16×128 = 2048
Classifier input:           512+128+128     2048+256+256
                            = 768           = 2560

MASTERY QUEUE               v5              v6
Queue size:                 fixed 4096      adaptive 1024–16384
Initial:                    4096            4096
Resize step:                -               ±2048
Cooldown:                   -               5 epochs
Trigger:                    -               overfit gap Δ > 3%

ADAPTIVE QUEUE LOGIC:
  Each epoch after mastery activation:
    gap = train_acc - val_acc
    delta = gap - prev_gap
    
    if delta > +3.0 (overfitting growing):
      queue ↑ 2048 (cap 16384)
      → more diverse negatives = regularization

    if delta < -3.0 (overfitting shrinking):
      queue ↓ 2048 (floor 1024)
      → tighter signal = sharper boundaries

    After any resize: 5 epoch cooldown
    → prevents rubberbanding

    Console: "⚙ Queue ↑ 4096→6144 (gap 8.2→11.5, Δ=+3.3)"
    TB: epoch/queue_max, epoch/queue_size
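The policy above, transcribed into plain Python for clarity (a direct sketch of the pseudocode; the function name and signature are illustrative):

```python
def adapt_queue(queue_size: int, gap: float, prev_gap: float, cooldown: int,
                step: int = 2048, cap: int = 16384, floor: int = 1024,
                threshold: float = 3.0, cooldown_epochs: int = 5):
    """One epoch of the adaptive queue policy: grow on a rising overfit gap,
    shrink on a falling one, with a cooldown to prevent rubberbanding.
    Returns (new_queue_size, new_cooldown)."""
    if cooldown > 0:
        return queue_size, cooldown - 1          # still cooling down, no resize
    delta = gap - prev_gap
    if delta > threshold:                        # overfitting growing
        return min(queue_size + step, cap), cooldown_epochs
    if delta < -threshold:                       # overfitting shrinking
        return max(queue_size - step, floor), cooldown_epochs
    return queue_size, 0
```

The console example above corresponds to `adapt_queue(4096, 11.5, 8.2, 0)`, which grows the queue to 6144 and arms the 5-epoch cooldown.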

EVERYTHING ELSE PRESERVED:
  InfoNCE always on at 0.1
  Mastery margin 0.1→0.3 over 5000 batches
  Geo classifier 0.3, geo diversity 0.5
  Label smoothing 0.1, dual CV 0.1, CM 0.1
  AdamW wd=0.01, LR 3e-4, cosine schedule
  EmbeddingAutograd tang=1.0 sep=0.1 on all three embeddings

Run 9 output, did not reach the marks yet.

I'm establishing a dynamic mastery batch size increase schedule.

Upon overfitting detection, the model will naturally default to more difficult problem solving. This should apply an additional level of difficulty and complexity and help counter overfitting on this particular task.

Mining hard negatives was a very powerful strategy, but I also believe it needs additional controllers and a schedule to be robust.

With that I've expanded the model to 3 dual-stream blocks, 6 single blocks. This should allow additional depth retention, but with it comes extra computation.

The model is still exceedingly compact, but it's still oversized relative to how slowly the task geometry aligns.

This is an interesting outcome that I didn't expect; the geometry is not easily represented within a condensed space without pretraining a frozen state.

This makes a difficult journey but not an impossible one.

Run 9 update; dual InfoNCE shared attention

Don't celebrate just yet, the geometry must both survive and provide additional utility.

The conversation is solid and the geometric structure is actually EXCEPTIONALLY accurate without needing both channels.


Run 9 disable InfoNCE with mastery only output shows

87.2, nearly identical to the earlier run AFTER the anchor collapse.

Removing InfoNCE did not provide the necessary behavior, and with that the anchor collapse occurred.

The output did not form the necessary linkages required to reach 90%+ just yet.

Thoughts

So the way I'm seeing this is fairly straightforward. We are collapsing useful complexity into simplicity, and that simplification is happening in a way that isn't useful to prediction. The formulas themselves are destroyed in the process, and the functional system's outcome is not utilizing the necessary implications through the natural progressive formula system as required.

With that likelihood, I present the possibility that we are forming useful rocks. Something that the later collapse is compacting and still using, but the spherical shapes are instead turned to clay in the transformer systems over time due to the nature of collapse.

This causes the transformer structure to default to the simplest outcome possible, which is condense and flatten, smash into the shape required, become uniform and fit into the square peg, completely ignore the very nature of the system and the attenuated subsystem - forcing the model and task to conform to the outcome, rather than the outcome to conform to the task and system.

I propose, we don't bypass InfoNCE, and that instead we enable cross-attention throughout the whole structure instead of single-direction attention. This will provide the necessary adjudication between both sets of passage rather than one, and it should allow a full cooperation and a preservation of the geometric state.

In other words, link both pools, and provide bi-directional geometric attention.

Run 10 hypothesis

Low hanging fruit... I need to figure out a way to represent the geometric structure as the most likely source of useful data, so the model can be directly driven by a purer loss that cannot be easily bypassed.

My original hypothesis was to use a representative patchwork, which showed high-yield potential but has drawbacks and is a massive burden to calculate.

Run 9 suspected collapse may be a symptom of convergence

This was a very interesting outcome. Disabling the InfoNCE cut the lifeline to the geometrically anchored subset, and with that the model was left to the other losses.

Even with the backprop, the cv, and everything intact the model still chose 1 anchor. The model has likely hollowed out a point where it can make the most use out of the space with dropout, without requiring a full comprehensive representation of the geometric state.

I'm still grasping at strands on a tapestry for that one, but it's very possible the collapse is just bypass: the geometry is difficult to attenuate, and the geometric state, having more rank than a simple passthrough, isn't being represented well enough on the final state. That makes the path of least resistance the data route and simple condensation, rather than the legitimate functional housing required to fully calculate the end-state geometric form.

Hold on a second, the geometric structure is in fact improving, it is not collapsing... Interesting... The geometric accuracy is rising over time.

Run 9 reboot, correct settings


InfoNCE disabled as planned, enabling the mastery engine; after a brief hiccup the engine took over and the system started gaining again.

Very early in the training nearing 80% val.

Critical anchor fault, without the InfoNCE the anchors are left unattended and default to CLS behavior.

However, I will let the training continue.

The model is a master of the domain and has a full picture of the geometric structure, so the geometry is being used how the model sees fit.

This is the choice of this model with these constraints, this data, and this setting. It's collapsing the intentions into a singular route, and the math within that route, lensed outward, directly reflects the lensing of this model in conjunction with its measured states.

After the mastery of NCE happens, the model likely has a perfect internal representation of the need and the means to fulfill that need within the constraints.

This does not necessarily mean this is the correct route for the most cohesion, but the experiment continues.

Run 9 continues


Odd hunch... Must follow up

I need to run a full fissure analysis on the weights to ensure the model wasn't corrupted.

I have a very strange feeling that I could very well be getting some bf16 artifacting, and I can easily test for this so I will.

  ✗ 4 layers received NO gradient:
      geo_output_proj.0.weight
      geo_output_proj.0.bias
      geo_output_proj.1.weight
      geo_output_proj.1.bias

Damn.

I believe I found the problem, there's an exploding block, and projection appears to be gradient dead.

Okay, so it wasn't gradient dead, it was just not being analyzed correctly. It was gradient starved because it was handling weak penta-sampled gradients, which were mean-calculated, so probing the mean detected it.

I will attempt 2 possible directions

  1. Disable the InfoNCE completely after mastery begins, enabling only the mastery of the InfoNCE to take center stage.
  2. Cross-attention InfoNCE rather than single-direction, to enable consensus discussion rather than adversarial competition down the binding chain.

The way I see it is simple. The system is grinding, not flowing. The system should be flowing, the numbers should be clean, and the outcome should be working.

More than likely, the very constraints I'm applying to shape the data are affecting the very flow the data is supposed to take. As the two structures adjacently shape their own submanifolds, the output speaks not of cooperation but of competition.

Two views of the same puzzle, one asymmetrically getting more data than the other. Competition isn't required; resonance and cooperation are.

First test will be disabling the InfoNCE and allowing the competition to come to fruition, the second is to enable full cooperation. Those are my two choices, and based on the data either may reach the same goal. Competition and semi-adversarial vs cooperation.

Compounding the run data

I'm now compiling the training data across the full run spectrum to see the most likely candidates for enhancement.

Run 7

Enabling both sides' autograd with CV loss and heavily reduced InfoNCE loss.

============================================================
CIFAR-10 — Dual-Stream GeoLIP ViT — EXP 3 (FULL CV)
  Warm start from: 
  CV: weight=1.0 (FULL), target=0.22
  InfoNCE: weight=0.1 (REDUCED)
  Autograd: tang=0.5, sep=0.1
  LR: 0.0001, epochs: 20
  Device: cuda
============================================================
  Train: 50,000 (two views)  Val: 10,000

  Building model...
  ⚠ No v1 checkpoint found at , training from scratch
  Parameters: 6,321,542
    Geo route: 2,552,764 (40.4%)
    Std route: 3,768,778 (59.6%)

20 epoch test then a full 100 to compare.

Last was 87.1 in less time and the train didn't overfit until around epoch 95.

Run 5

The full loss spectrum was a bit unstable; let's provide some expected task outcome assistance without overly regularizing the geometry.

# Architecture (must match v1)
NUM_CLASSES = 10
IMG_SIZE = 32
PATCH_SIZE = 4
EMBED_DIM = 384
STREAM_DIM = 192
FUSED_DIM = 256
N_DUAL_BLOCKS = 2
N_FUSED_BLOCKS = 4
N_HEADS = 8
OUTPUT_DIM = 128
N_ANCHORS = 64
N_COMP = 8
D_COMP = 64
ANCHOR_DROP = 0.10
CV_TARGET = 0.2

# NEW for v2
CV_WEIGHT = 0.5
ENABLE_AUTOGRAD = True
AUTOGRAD_TANG = 1.0
AUTOGRAD_SEP = 0.1

# Training
BATCH = 512
EPOCHS = 100
LR = 1e-3
WARMUP = 5
GRAD_CLIP = 1.0
INFONCE_WEIGHT = 0.5
BCE_WEIGHT = 1.0
CM_WEIGHT = 0.5
INFONCE_TEMP = 0.07

This might contribute, 0.5 cv with autograd, 0.5 infonce with mastery, BCE as primary task.

Run 4 full loss analysis

The CV destabilized, so each validation set showed an entirely different CV response.

The train accuracy saturated early, which is likely due to the model being allowed to drift too much from the anchored intentions.

This isn't necessarily a good or a bad thing, but since the model didn't learn the necessary task it's not useful for this case.

The effective dimensionality of the embeddings is nearly perfect. The most difficult classes are the most unstable, and the CV of the anchors is flawlessly aligned to 0.2, the value a structurally significant component contributing to a system has historically shown.

The faulty classes are not hindering the successful classes, but there is a formatted instability that can be directly analyzed for behavior.

=================================================================
SCAN 1: EMBEDDING HEALTH
=================================================================
  Norms: mean=1.000000 std=0.000000
  Self-similarity: mean=0.0197 std=0.1353
  ✓ No embedding collapse

  Effective dim: 92.1/128
    top-3 SVs explain 14.0%
    top-5 SVs explain 21.8%
    top-10 SVs explain 36.0%
    top-20 SVs explain 45.9%
    top-50 SVs explain 70.6%

  Pentachoron CV (GeoLIP structural spectrum):
    global embedding: CV=0.1149 (✗ outside) [500/500 valid, mean_vol=0.084180]
    Per-class:
      airplane  : CV=0.2421 (✓ IN BAND) [200/200 valid, mean_vol=0.050191]
      automobile: CV=0.1793 (✗ outside) [200/200 valid, mean_vol=0.042003]
      bird      : CV=0.2069 (✓ IN BAND) [200/200 valid, mean_vol=0.056354]
      cat       : CV=0.1539 (✗ outside) [200/200 valid, mean_vol=0.063555]
      deer      : CV=0.1997 (✓ IN BAND) [200/200 valid, mean_vol=0.056215]
      dog       : CV=0.1731 (✗ outside) [200/200 valid, mean_vol=0.056146]
      frog      : CV=0.1985 (✓ IN BAND) [200/200 valid, mean_vol=0.049812]
      horse     : CV=0.2266 (✓ IN BAND) [200/200 valid, mean_vol=0.047620]
      ship      : CV=0.2024 (✓ IN BAND) [200/200 valid, mean_vol=0.044833]
      truck     : CV=0.2309 (✓ IN BAND) [200/200 valid, mean_vol=0.046396]
    Class CV: mean=0.2013 std=0.0274 range=[0.1539, 0.2421]
    anchor constellation: CV=0.2002 (✓ IN BAND) [200/200 valid, mean_vol=0.077932]

    Patch-level CV (from fused patch projections):
    all patches (flat): CV=0.1593 (✗ outside) [500/500 valid, mean_vol=0.080665]

                        Warm-start E20    From-scratch E100
Val accuracy:           85.9%             83.5%
Eff dim:                54.1/128          92.1/128
Global CV:              0.1922 ✓          0.1149 ✗
Anchor CV:              0.1303 ✗          0.2002 ✓
Class CV mean:          0.2973            0.2013
Classes in band:        2/10              7/10
1-NN accuracy:          80.6%             77.4%
Same/diff gap:          0.3361            0.2411
hard_neg (final):       0.525             0.372
hard_pos (final):       0.061             0.023
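The "Eff dim" figures in these scans are consistent with a participation-ratio estimate over singular values; a sketch (the actual scan script may use a different estimator):

```python
import numpy as np

def effective_dim(emb: np.ndarray) -> float:
    """Participation-ratio effective dimensionality of a set of embeddings:
    1 / sum(p_i^2), where p_i are the normalized squared singular values of
    the mean-centered embedding matrix. Ranges from 1 (rank-1 collapse) to
    the ambient dimension (perfectly isotropic)."""
    s = np.linalg.svd(emb - emb.mean(0), compute_uv=False)
    p = s ** 2 / (s ** 2).sum()
    return float(1.0 / (p ** 2).sum())
```

Under this definition, the 92.1/128 figure above means the spectrum is close to isotropic, matching the "top-k SVs explain" percentages listed with it.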

Run 4 full retrain with full enabled at 1.0

Enabling all the losses at 1.0 strength, InfoNCE mastery curation, cv_loss, autograd control, and the works.

512 batch size instead of 128, lr 1e-3.

Let's see how the retrain takes.


Early outputs show the mastery clause hit pretty early and the model is showing no signs of plateau yet.

CV is fairly stable and rising back toward the band after the drop caused by the smoothness increase.

Run 3 analysis

The model warm started had almost all classed out of CV band, and the trained version now has two within band.

Absolutely fascinating.

                        v1 (baseline)    v3 (mastery)
Val accuracy:           85.2%            85.9%
Global CV:              0.1061 ✗         0.1922 ✓ IN BAND
Patch CV:               0.1768 ✗         0.2182 ✓ IN BAND
1-NN accuracy:          57.3%            80.6%
Same/diff gap:          0.0911           0.3361
Class CV range:         [0.12, 0.15]     [0.24, 0.37]
cat×dog cos:            0.358            0.577
Self-similarity:        0.003            0.067

The trajectory of the NCE is the most telling gap. Prior to this, the NCE was saturated and mastered; now we have a meaningful trajectory of utility that still exercises valid geometry in a better fashion.

Epoch    hard_neg    hard_pos    gap
  1       0.466      -0.216     0.682
  5       0.549      -0.028     0.577
 10       0.540       0.019     0.521
 15       0.531       0.054     0.477
 20       0.525       0.061     0.464

The mastery clause... When a task is completed and the model does not correctly align, reassess the implications of the task by attracting more data to the master... How intriguing.

It worked, as well. So this could very well be an effective debugging tool I could use in the future.

Run 3 attempt 1

It's kind of working, but I don't think it's tuned EXACTLY to the correct objective yet.


The mastery clause is a powerful idea, I've never actually put it to words.

Every time I personally think I've mastered something, I have to test that theory. Every single time, I know for certain that I will always be able to learn new information, and I'm never sure where that source of information will come from.

This same concept can be applied to geometric learning, further enhancing and improving the model over time.

This may not be the exact route to take for it, but it brings to light a new form of... Development that I wish to explore.

First tests show, the master can still learn.


Run 3 enhanced; InfoNCE prestige

After the InfoNCE masters the task, the followup processing will enable a 'second guess' clause, which forces the attenuation to multi-batch interpret the necessary implications through a cached structure subdivision.

This with the intention of increasing the NCE mastery's capacity to differentiate, will potentially assist the next attempt.

Every incorrect assessment provides rebounding effect towards the potential goal, adjusting the NCE geometric structure over time.

The initial NCE task is too easy towards what must happen, so the original solution is found too early and codified.

However, that does not necessarily mean the learned version will or will not benefit from this structure being reshaped over time.

Run 2 early recovery unexpected

I once again underestimated the tenacity of the law of averages.


I underestimated the alignment of the geometric structure on the patchwork half, cross-attend did exactly what it's supposed to do.

Interesting outcome... Not entirely unexpected, and not unwelcome.


It's a slow-burn gain, but the current variation is keeping train accuracy from fully overfitting, while still allowing some trickle learning...

Interesting.

Run 2 - early stages expected

This should drop to around 80% or so, give or take train accuracy throughout the run, while the validation accuracy should hold fairly stable.


This followup experiment tests how well we can control the solidity of the internal geometry after first-pass training. My insight is that we need to destroy the non-geometric attention in favor of implicit geometric alignment, adjacent to the explicit geometric modeling being slightly nudged by the CV.

This could very well solve the faults.

Run 2 - warm start continuation

Enabling cv loss on only the geometric parameters, only a little nudgey nudge. Nothing too drastic here.

On the opposite side, I'm enabling full destruction autograd to forcefully realign the model.

Highly destructive experiment and this should be interesting.

Run 1 analysis - HIGHLY successful prototype

Moved to the run1 directory.

=================================================================
SCAN 8: ARCHITECTURE SUMMARY
=================================================================
  Total params:        6,296,582
  Geo stream params:   1,281,340 (20.3%)
  Std stream params:   1,261,248 (20.0%)
  Fused block params:  3,159,552 (50.2%)
  Constellation params: 84,480 (1.3%)

=================================================================
DIAGNOSIS SUMMARY
=================================================================
  Val accuracy:     85.2%
  Eff dim:          58.3/128
  Pentachoron CV:   0.1085 (target band: 0.20-0.23)
  Self-similarity:  0.0027
  Pooled anchors:   64/64
  Patch anchors:    64/64
  Per-img p_anch:   15.2
  Entropy:          98%
  Gini:             0.2067
  CM volumes:       200/200 positive
  Anchor CV:        0.1389
  Class CV range:   [0.1374, 0.1621]
  Geo feat var:     0.000838
  Block 0 CM valid: 100.0%
  Block 1 CM valid: 100.0%
  Same/diff gap:    0.0911
  1-NN accuracy:    57.3%

  ✓ No major issues. Geometry is healthy.

I'm trying to not jump out of my chair. It's fully validated geometrically and 85% accurate for cifar10 image validation.

Now, I know this is odd to celebrate an 85% cifar10 validation at 6.3m params... but holy moly it worked.

It not only worked, but it worked highly effectively. I have work to do. We must expand.

Run 1:

Well it's working.

I cut most of the distillation-centric losses in favor of direct learning losses to prevent overwhelming noise and incorrect internal arbitration.


Geometry fully intact, patchwork learning effectively, nce fully saturated.

It's not perfect yet; the patchwork eventually overfit, but I have a serious idea for how to guarantee a solidified patchwork.

Still needs the cv gauge, pentachoron sampling CV loss, and a new form of autograd. I'm waiting for it to finish before I decide to modify anything major. It's pretty vanilla so far.


As you can see the simple geometric gauges survived the full run, but that doesn't mean it's geometry just yet. I left it unconstrained.

The ksimplex channeling and simplex curation were sufficient for earlier experiments, but they require assessment on completion.

Preliminary and setup

As the vit experiments from yesterday showed, the vit itself when fused directly with the geometry does not benefit enough to substantiate a direct fusion in the state I was experimenting with.

Instead of continuing down that route, which is a losing battle of weighted arbitration, I've decided to head in the direction of dual-stream.

The whole point of these systems is to test whether the geometry can in fact stand on its own, without the need of an arbitration expert model. The experiments yesterday show that yes it can, but it was quite the balancing act to directly fuse. That's no matter; the geometry DID survive, which means the battle was won and I have the blueprint. I will attempt to make geometry work in a way that can apply the formulas without destroying the data, potentially improving performance and speed while introducing increased retention and utility.

However, if it does not improve performance or quality or speed, then it's going to need a legitimate series of detections to determine that exact optimization procedure, so the autograd will need tuning, and I'll need to format a better Adam.

[GEOMETRY] <-> [PATCHWORK]
[GEOMETRY] <-> [PATCHWORK]
[GEOMETRY->PATCHWORK]
[GEOMETRY->PATCHWORK]
[GEOMETRY->PATCHWORK]
[GEOMETRY->PATCHWORK]

Where we teach the geometry and patchwork separately but cross-attended, and then pool them for residuals after each dual-stream block. We ensure the sequence survives, and the geometric values down the chain fit within the hypersphere, while further down we pass through standard transformer fused systems while still simultaneously ensuring the geometry survives.

It'll be a bit of a problematic kerfuffle up front, but it should solve the cross-contamination problem that either destroys the geometry or destroys the patchwork through overwhelming geometry. This way they can both benefit, and the substructure can be better tested for validity and shared learning accuracy.
