Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Hierarchical Navigable Small Worlds (zilliz.com)
73 points by fzliu on July 10, 2023 | hide | past | favorite | 9 comments


Kinda wish they would use some more letters when writing the sample code. This terse style in general is fine but when the purpose of the code is to try to explain how something works, it's adding an extra layer of difficulty decoding what the heck 'nns, 'cv', 'd', 'e' and 'ef' is supposed to mean.

Like this ChatGPT-looking function

    nns = [best]
    visit = set(best)  # set of visited nodes
    candid = [best]  # candidate nodes to insert into nearest neighbors

    # find top-k nearest neighbors
    while candid:
        cv = heappop(candid)

        if nns[-1][0] < cv[0]:
            break

        # loop through all nearest neighbors to the candidate vector
        for e in graph[cv[1]][1]:
            d = np.linalg.norm(graph[e][0] - query)
            if (d, e) not in visit:
                visit.add((d, e))

                # push only "better" vectors into candidate heap
                if d < nns[-1][0] or len(nns) < ef:
                    heappush(candid, (d, e))
                    insort(nns, (d, e))
                    if len(nns) > ef:
                        nns.pop()
code could have been trivially cleaned up and it would have been so much easier to follow. I'm guessing what some of these variables are supposed to mean.

    nearest_neighbors = [best]
    visited = set(best)
    candidates = [best]

    # find top-k nearest neighbors
    while candidates:
        candidate_vector = heappop(candidates)

        if nearest_neighbors[-1][0] < candidate_vector[0]:
            break

        # loop through all nearest neighbors to the candidate vector
        for neighbor in graph[candidate_vector[1]][1]:
            distance = np.linalg.norm(graph[e][0] - query)

            if (distance, neighbor) in visit:
                continue

            visited.add((distance, neighbor))

            # push only "better" vectors into candidate heap
            if distance < nearest_neighbors[-1][0] or len(nearest_neighbors) < epsilon:
               heappush(candidates, (distance, neighbor))
               insort(nns, (distance, neighbor))
               if len(nearest_neighbors) > epsilon:
                   nearest_neighbors.pop()


Good suggestion! Actually the "ef" is not epsilon. It is a parameter of the HNSW index: https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md...


Yeah it wasn't referenced or mentioned anywhere in the article so I had to guess what it was :-/ ... which I guess supports my point.

ef being "epsilon factor" was the least nonsensical thing I could come up with since it looked analogous to how you do like abs(a-b)<epsilon in floating point comparisons.


Hm... adding links in code comment might be better.


Probably a comment above the function explaining the incoming parameters, with a link to the specs if necessary, and then comments in the text and better variable names.

I think in this scenario when the primary audience of the code isn't an interpreter, but people who don't understand HNSWs, and who will probably only glance over the code once or twice, you really can't lay on the verbosity heavily enough.


https://github.com/instant-labs/instant-distance is a compact, fairly readable, pretty fast implementation of the paper in Rust (that I built).

We use it for searching aftermarket domains on Instant Domain Search.


Why have a cookie dialog if my only option is to "accept"? Not much of a dialog there.


probably required by some laws


Actually that’s a common misunderstanding of the ePrivacy directive (often bundled with the GDPR(, if you’re only providing an accept dialog you might as well save the effort and not show a cookie notice at all.

You need to obtain consent before you set any marketing cookies and effective consent can’t be obtained if your only option is “yeah track me”




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: