So you're saying, embeddings are fine, as long as we refrain from making full use of their capabilities? We've hit on a mathematical construct that seems to be able to capture understanding, and you're saying that the biggest models are too big, we need to scale down, only use embeddings for surface-level basic similarities?
I too think embeddings are vastly underutilized, and chat interface is not the be-all, end-all (not to mention, "chat with your program/PDF/documentation" just sounds plain stupid). However, whether current AI tools are replacing or amplifying your intelligence, is entirely down to how you use them.
As for search, yes, that was a huge breakthrough and powerful amplifier. 2+ decades ago. At this point it's computer use 101 - which makes it sad when dealing with programs or websites that are opaque to search, and "ubiquitous local search" is still not here. Embeddings can and hopefully will give us better fuzzy/semantic searching, but if you push this far enough, you'll have to stop and ask - if the search tool is now capable to understand some aspects of my data, why not surface this understanding as a different view into data, instead of just invoking it in the background when user makes a search query?
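To make the "better fuzzy/semantic searching" concrete, here's a minimal sketch of embedding-based search: rank documents by cosine similarity to the query embedding. The `embed` function below is a stand-in that fabricates deterministic vectors just so the snippet runs; any real sentence-embedding model would replace it.

```python
import numpy as np
import zlib

def embed(text):
    # Stand-in for a real embedding model: a deterministic unit vector
    # keyed on the text, so the sketch runs without any model weights.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=16)
    return v / np.linalg.norm(v)

def search(docs, query, k=3):
    # Rank documents by cosine similarity to the query embedding.
    # (Vectors are unit-norm, so the dot product is cosine similarity.)
    q = embed(query)
    ranked = sorted(docs, key=lambda d: float(embed(d) @ q), reverse=True)
    return ranked[:k]
```

With a real model, near-synonyms ("invoice" vs "bill") score close even with zero keyword overlap, which is exactly what keyword search can't do.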
I have found that embeddings + LLM is very successful. I'm going to make the words up so as not to reveal my work publicly, but I had to classify something into 3 categories. I asked a simple LLM to label it; it was 95% accurate. Taking the min distance from the word embeddings to the mean category embeddings was about 96% accurate. When I gave the LLM the embedding prediction, the LLM was 98% accurate.
There were cases an embedding model might not do well on, whereas the LLM could handle them. For example: these were camelCase words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words so as not to reveal private data).
WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding values, but these were supposed to go into 2 different categories.
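For anyone curious, the min-distance-to-centroid baseline described above fits in a few lines. Everything here is illustrative: `embed` is a placeholder for whatever real embedding model you'd use, and the labels are made up.

```python
import numpy as np
import zlib

def embed(text):
    # Placeholder: a deterministic pseudo-embedding keyed on the text.
    # Swap in a real embedding model here.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=32)
    return v / np.linalg.norm(v)

def centroid_classify(labeled, query):
    # Compute the mean embedding per category, then assign the query
    # to the category whose centroid is nearest (Euclidean distance).
    by_cat = {}
    for text, cat in labeled:
        by_cat.setdefault(cat, []).append(embed(text))
    centroids = {c: np.mean(vs, axis=0) for c, vs in by_cat.items()}
    q = embed(query)
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - q))
```

The WoodPecker/WoodStock failure mode follows directly from this: if "Wood" dominates both embeddings, both queries land near the same centroid regardless of category.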
> word Wood dominated the embedding values, but these were supposed to go into 2 different categories
When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.
Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
> Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I can share what I'm able to publicly. The first thing we ever do is develop benchmarks, given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient, particularly in terms of RAM and storage, which matters when factoring in budgetary requirements. We're already dropping in eight H100s (minimum) so it starts to creep up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
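To illustrate the interpretability point (4): a SPLADE-style sparse embedding is a mostly-zero vector over the vocabulary, so each nonzero entry maps straight back to a token. The tokens and weights below are invented for illustration (not from the Fermi models); scoring is just a sparse dot product over shared tokens.

```python
# Invented example of SPLADE-style sparse embeddings as token -> weight
# maps (the dense-vector equivalent is zero everywhere else).
doc_vec = {"reactor": 2.1, "coolant": 1.7, "pump": 1.4, "maintenance": 0.6}
query_vec = {"coolant": 1.9, "pump": 1.2, "leak": 0.8}

# Relevance is a sparse dot product over the tokens both sides activate.
score = sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)

# Interpretability: the top-weighted tokens summarize the document
# at a glance, unlike an opaque dense vector.
top_terms = sorted(doc_vec, key=doc_vec.get, reverse=True)[:3]
```

This is also why sparse embeddings "meet users where they are": the activated tokens behave like an expanded, weighted keyword index, so the results remain explainable to people conditioned on keyword search.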
I should also add that some aspects of this (like pretraining BERT) are fairly compute-intensive. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).
> We've hit on a mathematical construct that seems to be able to capture understanding
I’m admittedly unfamiliar with the space, but having just done some reading that doesn’t look to be true. Can you elaborate please and maybe point to some external support for such a bold claim?
> Can you elaborate please and maybe point to some external support for such a bold claim?
SOTA LLMs?
If you think about what, say, a chair or electricity or love are, or what it means for something to be something, etc., I believe you'll quickly realize that words and concepts don't have well-defined meanings. Rather, we define things in terms of other things, which themselves are defined in terms of other things, and so on. There's no atomic meaning, the meaning is in the relationships between the thought and other thoughts.
And that is exactly what those models capture. They're trained by consuming a large amount of text - but not random text, real text - and they end up positioning tokens as points in high-dimensional space. As you increase the number of dimensions, there's eventually enough of them that any relationship between any two tokens (or groups; grouping concepts out of tokens is just another relationship) can be encoded in the latent space as proximity along some vector.
You end up with a real computational artifact that implements the idea of defining concepts only in terms of other concepts. Now, between LLMs and the ability to identify and apply arbitrary concepts with vector math, I believe that's as close to the idea of "understanding" as we've ever come.
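A toy illustration of "relationships as proximity along some vector", using hand-made 2-D vectors rather than real model output: the same offset encodes the same relationship for different word pairs, so concepts can be applied with plain vector arithmetic.

```python
import numpy as np

# Hand-made 2-D "embeddings" (not real model output) in which the
# second coordinate plays the role of a gender direction.
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, 0.5]),
}

def nearest(target, exclude=()):
    # Closest stored vector to `target`, skipping excluded words.
    return min((w for w in vecs if w not in exclude),
               key=lambda w: np.linalg.norm(vecs[w] - target))

# Applying the man -> woman offset to "king" lands on "queen":
analogy = vecs["king"] - vecs["man"] + vecs["woman"]
result = nearest(analogy, exclude={"king", "man", "woman"})  # "queen"
```

Real embedding spaces have hundreds or thousands of dimensions, which is what lets many such relationships coexist without colliding.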
That does sound a bit like Peircian semiotic so I’m with you so far as the general concept of meaning being a sort of iterative construct.
Where I don’t follow is how a bitmap approximation captures that in a semiotic way. As far as I can tell the semiosis still is all occurring in the human observer of the machine’s output. The mathematics still hasn’t captured the interpretant so far as I can see.
Regardless of my possible incomprehension, I appreciate your elucidation. Thanks!
I feel like embeddings will be more powerful for understanding high-dimensional physics than language, because a chaotic system's predictability is limited by its compressibility. An embedding can capture exactly how compressible the system is, and can therefore extend predictability as far as possible.