Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Using Suffix Trees to Detect Homology at Scale (benchling.engineering)
30 points by r4um on Oct 29, 2023 | hide | past | favorite | 11 comments


It is critical to understand that this article fundamentally misuses the word "homology", which in bioinformatics and molecular evolution is understood to mean "sharing a common ancestor". Similarity searches (typically using programs like BLAST, one of the most cited methods in the biomedical literature) look for homologs.

This article is NOT about "homology" detection (which suffix trees are not particularly good for), it is about "identity" detection (which is a related problem, but fundamentally different). It is unfortunate that in the biomedical literature, the word "homology" is often misused in place of "identity" (as in "micro-homology", which is really "micro-identity"), but serious computational biologists try to avoid this misuse of the word "homology".


I think rather this is a case of the mathematical notion of [homology](https://en.wikipedia.org/wiki/Homology_(mathematics)) - which, roughly, is when a collection of objects combine together to fill a hole (or form the boundary of that hole). In the case of the article, two half-helices ‘overlap’ when they are each two halves of the bounadry of the same DNA segment.


No, the benchling folks are just wrong. Homology in biology has a long, long history and ignoring that (and mistaking homology and identity). Folks misusing "homology" to mean identity (in the context of making PCR primers, etc) just confused the issue.


I very much doubt that the widespread misuse of the term "homology" for "identity" by biomedical researchers reflects their deep understanding of mathematical concepts.


There is a rich literature in bioinformatics on sequence homology search, and there are many existing libraries that scale to billions on base pairs. I wonder why they reinvented the wheel


Agree completely. I didn't read the entire article, but suffix trees are not hard to build in near linear time. I'm sure I was taught this in basic algorithms in 1993.


Perhaps they immediately thought of a solution and did not feel bothered to see if there was a better one.


IQtree is pretty good at this, for example

http://www.iqtree.org/


It looks like an example of the "I have a PhD in X, I don't need to learn from CS or other fields to find a good solution". In my day job as a research software engineer I encounter plenty of these cases


Last time I looked there's at least several tens of papers of chemists rediscovering that getting a chemical equation to balance is just finding the null space of the corresponding matrix.

Maybe I'm being too critical, but really I don't understand why papers on the subject exist in the first place. Finding null spaces is solved, you just need the one paragraph pointing out the equivalence between the two problems, there's no real need to investigate much further. I mean there's only 90-ish elements that can reasonably participate in a chemical reaction in the first place so the largest linear equation is pretty small, there's not even any need to go looking for a more efficient calculation.


I recently experienced something similar with people who wanted to find certain functional groups in a molecule with a regex search against the SMILES representation. It took a lot of persuasion to explain them that string search is a terrible idea for subgraph matching




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: