Retrocausal’s Post

When you watch a long video but only get to see a few clips, it's easy to miss important details and lose track of the overall sequence of events. Video LLMs suffer from the same problem: by sampling only a handful of frames from a long video, they often miss crucial context and can't accurately process or describe what happens from start to finish.

We've developed a new approach that uses a Bidirectional LSTM to significantly improve the temporal reasoning of video LLMs. By encoding time-aware clip features and aggregating them into a single global representation, we achieve state-of-the-art results on tasks such as dense video captioning, temporal grounding, highlight detection, and action segmentation.

Check out our intro video for a quick overview, and see the arXiv link for full details. A rough sketch of the core idea follows below.

Project Page: https://lnkd.in/d8maWHhM
Full Paper: https://lnkd.in/dEENc5VD

#LeanGPT #ActivityRecognition #KaizenCopilot
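To make the idea concrete, here is a minimal PyTorch sketch of the general pattern the post describes: tagging per-clip features with a timestamp encoding, running them through a bidirectional LSTM, and pooling into one global vector. This is not the authors' code; the module name, dimensions, linear time projection, and mean-pooling aggregation are all our assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalClipAggregator(nn.Module):
    """Hypothetical sketch: encode time-stamped clip features with a
    BiLSTM and pool them into a single global representation."""

    def __init__(self, feat_dim=768, hidden_dim=512):
        super().__init__()
        # Project normalized timestamps (in [0, 1]) into the feature
        # space so each clip feature carries its position in the video.
        self.time_proj = nn.Linear(1, feat_dim)
        self.bilstm = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True,  # read the clip sequence in both directions
        )

    def forward(self, clip_feats, timestamps):
        # clip_feats: (batch, num_clips, feat_dim), e.g. from a vision encoder
        # timestamps: (batch, num_clips), normalized to [0, 1]
        time_enc = self.time_proj(timestamps.unsqueeze(-1))
        x = clip_feats + time_enc            # time-aware clip features
        out, _ = self.bilstm(x)              # (batch, num_clips, 2 * hidden_dim)
        return out.mean(dim=1)               # aggregate into one global vector

# Example: 8 clips sampled from one video
feats = torch.randn(1, 8, 768)
times = torch.linspace(0, 1, 8).unsqueeze(0)
print(TemporalClipAggregator()(feats, times).shape)  # torch.Size([1, 1024])
```

Mean pooling is just one way to collapse the sequence; the paper's actual aggregation and time encoding may differ.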
