
Wait, if experts only agreed 60% of the time on diagnoses, what is the reliable basis for judging LLM accuracy? If experts struggle to agree on the input, how are they confidently ranking the output?


Not the OP, but the data isn’t randomly selected; it’s usually picked from a dataset with known clinical outcomes. For example, if it’s a set of lung images with potential tumors, the cases come with biopsies that determined whether each one was cancerous or just something like scar tissue.
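
To make that concrete, here's a toy sketch. All the numbers are made up and `match_rate` is just a helper named for illustration. The point is that each reader, human or model, gets scored against the biopsy result, so low agreement between experts doesn't stop you from measuring accuracy:

```python
# Hypothetical labels for 10 cases: 1 = malignant, 0 = benign (e.g. scar tissue).
biopsy   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # confirmed clinical outcome
expert_a = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # invented reads
expert_b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]  # invented reads
model    = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]  # invented predictions


def match_rate(xs, ys):
    """Fraction of cases where two label lists give the same answer."""
    if len(xs) != len(ys):
        raise ValueError("label lists must be the same length")
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Experts compared with each other: this is the inter-rater agreement.
print("expert A vs expert B:", match_rate(expert_a, expert_b))  # 0.6

# Everyone compared with the biopsy: this is accuracy.
print("expert A vs biopsy:  ", match_rate(expert_a, biopsy))    # 0.7
print("expert B vs biopsy:  ", match_rate(expert_b, biopsy))    # 0.7
print("model vs biopsy:     ", match_rate(model, biopsy))       # 0.9
```

In this invented example the two experts agree with each other only 60% of the time, yet all three readers can still be ranked because the reference is the biopsy, not another expert's opinion.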


You can look at fully diagnosed cases (via surgery, for example) and their previous scans.


Perhaps they were from cases that had a confirmed diagnosis.



