Using Suffix Trees to Detect Homology at Scale

fastaguy88 · on Oct 29, 2023

It is critical to understand that this article fundamentally misuses the word "homology", which in bioinformatics and molecular evolution is understood to mean "sharing a common ancestor". Similarity searches (typically using programs like BLAST, one of the most cited methods in the biomedical literature) look for homologs.

This article is NOT about "homology" detection (which suffix trees are not particularly good for), it is about "identity" detection (which is a related problem, but fundamentally different). It is unfortunate that in the biomedical literature, the word "homology" is often misused in place of "identity" (as in "micro-homology", which is really "micro-identity"), but serious computational biologists try to avoid this misuse of the word "homology".

Gregaros · on Oct 29, 2023

I think rather this is a case of the mathematical notion of [homology](https://en.wikipedia.org/wiki/Homology_(mathematics)) - which, roughly, is when a collection of objects combine together to fill a hole (or form the boundary of that hole). In the case of the article, two half-helices ‘overlap’ when they are each two halves of the bounadry of the same DNA segment.

dekhn · on Oct 30, 2023

No, the benchling folks are just wrong. Homology in biology has a long, long history and ignoring that (and mistaking homology and identity). Folks misusing "homology" to mean identity (in the context of making PCR primers, etc) just confused the issue.

fastaguy88 · on Oct 30, 2023

I very much doubt that the widespread misuse of the term "homology" for "identity" by biomedical researchers reflects their deep understanding of mathematical concepts.

jakobnissen · on Oct 29, 2023

There is a rich literature in bioinformatics on sequence homology search, and there are many existing libraries that scale to billions on base pairs. I wonder why they reinvented the wheel

ultrablack · on Oct 29, 2023

Agree completely. I didn't read the entire article, but suffix trees are not hard to build in near linear time. I'm sure I was taught this in basic algorithms in 1993.

esafak · on Oct 29, 2023

Perhaps they immediately thought of a solution and did not feel bothered to see if there was a better one.

tetris11 · on Oct 29, 2023

IQtree is pretty good at this, for example

http://www.iqtree.org/

bafe · on Oct 29, 2023

It looks like an example of the "I have a PhD in X, I don't need to learn from CS or other fields to find a good solution". In my day job as a research software engineer I encounter plenty of these cases

contravariant · on Oct 29, 2023

Last time I looked there's at least several tens of papers of chemists rediscovering that getting a chemical equation to balance is just finding the null space of the corresponding matrix.

Maybe I'm being too critical, but really I don't understand why papers on the subject exist in the first place. Finding null spaces is solved, you just need the one paragraph pointing out the equivalence between the two problems, there's no real need to investigate much further. I mean there's only 90-ish elements that can reasonably participate in a chemical reaction in the first place so the largest linear equation is pretty small, there's not even any need to go looking for a more efficient calculation.

bafe · on Oct 30, 2023

I recently experienced something similar with people who wanted to find certain functional groups in a molecule with a regex search against the SMILES representation. It took a lot of persuasion to explain them that string search is a terrible idea for subgraph matching