Basic object detection algorithms look at a single frame of an image–a rectangular grid of pixels–and output a response of what, and sometimes where, an object is in the image. Show that same algorithm the next image frame in a series of frames from a full motion video, and and it repeats the process without remembering its progress or output in the previous frame.
Adjacent frames in a 30 frame per second video, which represent snapshots of the real world 1/30th of a second apart and typically extremely similar in precise location and field of view, might as well be completely different worlds to that algorithm.
.