I tried using this for a technical talk[1], and it got the number of speakers wrong, which is somewhat surprising to me, as I would have thought diarization tech would just work by now.
I'm gonna give it a try with your video. If I may ask, how many speakers are there in this video? (I'd have to go through all of it otherwise.) From what I can see, we have a teacher who is speaking most of the time and then a few laughs from students in the background.
There are a couple of people interjecting with answers to questions, or asking questions. I'm afraid I don't have a better estimate than that. But in this case, I think lumping the students together as one speaker and the teacher as another would be fine.
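To make the "lump the students together" idea concrete, here is a rough sketch of how it could be done as a post-processing step. It assumes the diarizer output has already been turned into (start, end, speaker) tuples; the function name, the "teacher"/"students" labels, and the sample segments are all made up for illustration and don't come from any particular tool. It also assumes the teacher is simply whoever has the most total talk time.

```python
from collections import defaultdict


def lump_speakers(segments, main_label="teacher", other_label="students"):
    """Collapse diarization output to two speakers.

    segments: list of (start_sec, end_sec, speaker_id) tuples (assumed format).
    The speaker with the most total talk time becomes `main_label`;
    everyone else is merged into `other_label`.
    """
    # Sum up how long each detected speaker talks.
    talk_time = defaultdict(float)
    for start, end, speaker in segments:
        talk_time[speaker] += end - start

    if not talk_time:
        return []

    # Assumption: the dominant speaker is the teacher.
    dominant = max(talk_time, key=talk_time.get)

    return [
        (start, end, main_label if speaker == dominant else other_label)
        for start, end, speaker in segments
    ]


# Hypothetical segments, not real output from the video.
segments = [
    (0.0, 42.5, "SPEAKER_00"),
    (42.5, 44.0, "SPEAKER_01"),
    (44.0, 90.0, "SPEAKER_00"),
    (90.0, 93.0, "SPEAKER_02"),
]
lumped = lump_speakers(segments)
```

This wouldn't fix a wrong speaker count inside the diarizer itself; it only hides over-segmentation of the minor speakers after the fact.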
Whoa! I've been facing the same problems with pyannote+Whisper for diarization+transcription and, coincidentally, was just experimenting with combining NeMo and Whisper. Do you happen to have a repo for this? It would be invaluable.
[1]https://www.youtube.com/watch?v=5lFxURxbyEc&list=PLiayR7yJx8...