The model is fed individual frames from the movie, BUT the movie is first segmented into scenes. Each scene is held in context for the next 5-10 scenes, depending on how long the scenes are. If the video exceeds a certain length, or better said a threshold number of scenes, it creates an index and a summary. So yes, technically the model looks at individual frames, but there's a bit more tooling behind it.
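
To make the idea concrete, here is a rough sketch of how a rolling scene window with an index fallback could look. This is not the actual tooling: every name (`Scene`, `SceneContext`, `summarize`) and every number (`frame_budget`, `index_threshold`) is a made-up assumption for illustration. Only the 5-10 scene range comes from the description above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Scene:
    scene_id: int
    frames: list  # frame identifiers or decoded images


def summarize(scene):
    # Placeholder (assumption): a real system would presumably ask a model
    # for a text summary instead of just counting frames.
    return f"scene {scene.scene_id}: {len(scene.frames)} frames"


@dataclass
class SceneContext:
    min_scenes: int = 5        # always keep at least this many scenes
    max_scenes: int = 10       # never keep more than this many scenes
    frame_budget: int = 300    # assumed cap on frames held in context
    index_threshold: int = 20  # assumed scene count that triggers indexing
    window: deque = field(default_factory=deque)
    evicted: list = field(default_factory=list)
    index: dict = field(default_factory=dict)
    seen: int = 0

    def _over_limit(self):
        if len(self.window) > self.max_scenes:
            return True
        total = sum(len(s.frames) for s in self.window)
        # Long scenes shrink the window, but never below min_scenes.
        return len(self.window) > self.min_scenes and total > self.frame_budget

    def add(self, scene):
        self.window.append(scene)
        self.seen += 1
        # Drop the oldest scenes until the window fits again.
        while self._over_limit():
            self.evicted.append(self.window.popleft())
        # Only long videos get an index: summarize whatever left the window.
        if self.seen > self.index_threshold:
            for old in self.evicted:
                self.index[old.scene_id] = summarize(old)
            self.evicted.clear()


def run_demo(num_scenes, frames_per_scene):
    ctx = SceneContext()
    for i in range(num_scenes):
        ctx.add(Scene(i, list(range(frames_per_scene))))
    return ctx
```

With these assumed defaults, `run_demo(25, 10)` ends with the 10 most recent scenes in the window and summaries for the 15 older ones in `index`, while `run_demo(8, 100)` keeps only 5 scenes in the window (they are long) and builds no index because the video is still under the threshold.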