In probabilistic programming you (deterministically) define variables and formulas. It's just that the variables aren't instances of floats, but represent stochastic variables over floats.
This is similar to libraries for linear algebra where writing A * B * C does not immediately evaluate, but rather builds an expression tree that represents the computation; you need to call something like `eval(A * B * C)` to obtain the actual value, which gives the library room to compute it in the most efficient way.
It's more closely related to symbolic programming and lazy evaluation than to (non-)determinism.
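A minimal sketch of that idea in Python (the class names are made up, not any particular probabilistic-programming library's API): the variables are nodes in an expression tree, and only an explicit evaluation step (here, naive Monte Carlo sampling) turns them into actual floats.

```python
import random

class RV:
    """A node in an expression tree over random variables; nothing is computed yet."""
    def sample(self):
        raise NotImplementedError

    def __add__(self, other):
        return Op(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Op(lambda a, b: a * b, self, other)

class Normal(RV):
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def sample(self):
        return random.gauss(self.mu, self.sigma)

class Const(RV):
    def __init__(self, value):
        self.value = value

    def sample(self):
        return self.value

class Op(RV):
    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def sample(self):
        return self.fn(self.left.sample(), self.right.sample())

# A "deterministic" program over stochastic variables: z is just an expression tree.
x = Normal(0.0, 1.0)
y = Normal(5.0, 2.0)
z = x * Const(3.0) + y

# Only the explicit evaluation produces numbers, analogous to eval(A * B * C).
samples = [z.sample() for _ in range(10_000)]
print(sum(samples) / len(samples))  # ≈ E[3x + y] = 5
```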
To help solve forecasting & planning problems too hard to hold in your head, I’m converting natural-language formulations of constrained optimization problems into (back)solvable mathematical programs, whose candidate solutions are “scenarios” in a multi-dimensional “scenario landscape” that can be pivoted, filtered, or otherwise interrogated by an LLM equipped with analytical tools.
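A toy illustration of what I mean by a “scenario landscape” (all variables, constraints, and numbers below are invented for the example): enumerate candidate solutions of a small constrained program, record each one as a row, and you get a table that can be filtered and pivoted by whatever analytical tool sits on top.

```python
import itertools
import pandas as pd

# Decision variables: units to produce of two hypothetical products.
candidates = itertools.product(range(0, 101, 10), range(0, 101, 10))

rows = []
for a, b in candidates:
    labor = 2 * a + 3 * b        # constraint: labor <= 240
    material = 4 * a + 1 * b     # constraint: material <= 320
    profit = 30 * a + 40 * b     # objective
    rows.append({
        "units_a": a, "units_b": b,
        "labor": labor, "material": material, "profit": profit,
        "feasible": labor <= 240 and material <= 320,
    })

landscape = pd.DataFrame(rows)

# The "landscape" can now be interrogated: filter to feasible scenarios,
# rank them, or pivot the objective over the two decision axes.
feasible = landscape[landscape["feasible"]]
print(feasible.nlargest(3, "profit"))
print(feasible.pivot(index="units_a", columns="units_b", values="profit"))
```

In the real thing an LLM produces the formulation from natural language and a proper solver replaces the grid enumeration; the point is just that the candidate solutions become a queryable landscape rather than a single optimum.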
Very f’ing cool (esp. optimistic about repo-level codebase completion) – but like many other results DeepSeek reports, their preprint leaves me with more questions than answers, unless I’ve misunderstood multiple pieces of it (which of course is possible):
—They report a 9.0× speedup in the forward pass but only 6.0× in the backward pass… Why the heck would the backward speedup be so much smaller? Is it their gating mechanisms needing extra computation in the backward pass? Gradient accumulation or KV-cache updates bottlenecking the speedup? FlashAttention (or at least FlashAttention-2) gives near-equal forward and backward efficiency… They claim NSA is tuned for FA2-style blockwise layouts, so which of their (competing) claims is wrong?
—Does NSA actually learn useful sparsity, or just get lucky with pretraining? How much of the performance gain comes from pretrained sparsity patterns vs. sparsity inherent to the attention? They themselves say “applying sparsity post-hoc forces models to deviate from their pretrained optimization trajectory… As demonstrated by Chen et al. (2024), [sic] top 20% attention can only cover 70% of the total attention scores, rendering structures like retrieval heads in pretrained models vulnerable to pruning during inference” — yet their ablation isn’t strong enough to tell. A stronger ablation would include (1) a Full Attention → NSA transition test to measure whether NSA can be applied post-hoc without degradation, (2) a visualization of learned sparsity patterns over training epochs, and (3) a test where sparsity constraints are randomly assigned to see if NSA actually finds useful structures or just adapts to imposed ones (a rough sketch of (3) follows after these questions).
—Training transformers with sparse attention is historically unstable — early MoEs like Switch-Transformer (which uses expert-gating-like mechanisms much like this one) were famous specifically for their collapse issues. How does NSA prevent mode collapse in early training — or really, how do we know it won’t just collapse under different (i.e. more common) initialization schemes? If their technique doesn’t have an explicit mechanism for counteracting sparse expert underutilization (like the load-balancing loss sketched below), then it’s just as vulnerable to collapse as (e.g.) Switch-Transformer — but worse, since sparsity here isn’t just a gating function, it’s the core of the entire attention mechanism…
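Ablation (3) above is cheap to prototype: impose a random block-sparse mask on ordinary attention and compare it with the learned selection, to separate “sparsity helps at all” from “the learned structure helps.” A rough PyTorch sketch (block size, keep ratio, and shapes are illustrative, not NSA’s):

```python
import torch
import torch.nn.functional as F

def random_block_sparse_attention(q, k, v, block_size=64, keep_ratio=0.2):
    """q, k, v: (batch, heads, seq, dim). Attends over a *random* subset of key
    blocks per query block, instead of a learned/score-based selection."""
    b, h, s, _ = q.shape
    n_blocks = s // block_size
    # Randomly decide which key blocks each query block may attend to.
    keep = torch.rand(b, h, n_blocks, n_blocks, device=q.device) < keep_ratio
    # Always keep the diagonal block so every query attends to something.
    keep |= torch.eye(n_blocks, dtype=torch.bool, device=q.device)
    # Expand the block-level decision to a token-level additive mask.
    token_mask = keep.repeat_interleave(block_size, dim=-2) \
                     .repeat_interleave(block_size, dim=-1)
    attn_mask = torch.zeros(b, h, s, s, device=q.device)
    attn_mask.masked_fill_(~token_mask, float("-inf"))
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

# Illustrative shapes only: 1 sequence, 8 heads, 1024 tokens, 64-dim heads.
q = k = v = torch.randn(1, 8, 1024, 64)
print(random_block_sparse_attention(q, k, v).shape)  # torch.Size([1, 8, 1024, 64])
```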
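And for the last point, this is the kind of explicit mechanism I mean: Switch-Transformer’s load-balancing auxiliary loss, which penalizes routers that funnel most tokens to a few experts. A sketch (coefficient and shapes are illustrative; this is not NSA’s code):

```python
import torch

def load_balancing_loss(router_logits, alpha=0.01):
    """router_logits: (num_tokens, num_experts). Switch-style auxiliary loss
    alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens routed to
    expert i (top-1) and P_i is the mean router probability for expert i."""
    num_tokens, num_experts = router_logits.shape
    probs = torch.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)
    # f_i: fraction of tokens dispatched to expert i.
    f = torch.bincount(top1, minlength=num_experts).float() / num_tokens
    # P_i: mean router probability mass assigned to expert i.
    p = probs.mean(dim=0)
    # Equals alpha when both distributions are uniform; grows as routing
    # collapses onto a few experts.
    return alpha * num_experts * torch.sum(f * p)

print(load_balancing_loss(torch.randn(1024, 8)))
```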
I’ve been adding to a glossary of “Notions of Equality & Sameness” every few months for the last ~8 years (and am looking for collaborators: carson@carsonkahn.com) - surprised this one wasn’t already on my list!
Thanks for sharing -- makes a lot of sense. Indexing browser history is very high on our priority list; it lets us cover the less widely used apps much more easily. (Though it's somewhat limited to things you've seen while the integration is set up, similar to Rewind.)