The model is fed individual frames from the movie BUT the movie is segmented int...

The model is fed individual frames from the movie BUT the movie is segmented into scenes. These scenes, are held in context for 5-10 scenes, depending on their length. If the video exceeds a specific length or better said a threshold of scenes it creates an index and summary. So yes technically the model looks at individual frames but it's a bit more tooling behind it.