Egocentric Video Dataset

100+ hours of 4K egocentric POV video

Check samples on Kaggle

Summary

100+ hours of 4K egocentric POV video for training vision-language-action (VLA) models, imitation learning policies, and embodied AI systems. Continuous uncut footage of real humans performing real-life hand tasks: repair, assembly, sewing, household work, outdoor and bike maintenance, captured at 4K / 30 FPS from a head-mounted camera

Introduction

This dataset contains 100+ hours of 4K first-person (egocentric POV) video of real people performing real manipulation tasks in real environments: apartments, kitchens, balconies, garages, courtyards, workshops, and outdoors. Every recording is continuous and unedited: no cuts, no time-lapse, no music, no filters. Hands, tools, and objects are visible throughout, and each clip covers one meaningful step of a task: preparation, execution, visible result

The dataset is designed as a production-ready training corpus for the new generation of robot learning approaches: vision-language-action (VLA) models, imitation learning, behavior cloning, and embodied AI foundation models. Each video ships with structured task metadata (task name, category, location, lighting, camera mount), making it directly usable as paired language-video input for VLA pretraining

Dataset Features

Scale & Quality

  • 100+ hours of egocentric video
  • 4K resolution at 30 FPS 
  • Continuous, uncut footage – full task arcs 
  • Manually verified – every clip reviewed for hand visibility and task continuity
  • No synthetic frames, no AI generation, no re-encoding – original camera output
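
As a rough sense of scale, the figures above can be turned into frame counts. This is a back-of-the-envelope sketch, not a specification: it treats "100+ hours" as a lower bound of exactly 100 hours and assumes "4K" means 3840 × 2160 (UHD), which the listing does not state explicitly.

```python
def total_frames(hours: float, fps: int = 30) -> int:
    """Number of frames in `hours` of footage recorded at `fps`."""
    return int(hours * 3600 * fps)


# Assumption: "4K" is taken to be UHD (3840 x 2160); the listing only says "4K".
UHD_WIDTH, UHD_HEIGHT = 3840, 2160

frames_at_native_rate = total_frames(100, fps=30)  # lower bound for "100+ hours"
frames_at_2_fps = total_frames(100, fps=2)         # e.g. if a pipeline subsamples to 2 FPS
pixels_per_frame = UHD_WIDTH * UHD_HEIGHT

print(frames_at_native_rate)  # 10800000
print(frames_at_2_fps)        # 720000
print(pixels_per_frame)       # 8294400
```

Many training pipelines subsample and downscale video before use, so the native frame count is an upper limit on what a model would actually see per hour of footage.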

Metadata for Every Video

  • Task name – natural language description (e.g. “sewing a button”, “cleaning bicycle chain”, “assembling a shelf”)
  • Category – repair, assembly, sewing, household, outdoor, electronics, other
  • Location – apartment, kitchen, balcony, courtyard, garage, workshop, park, other
  • Lighting – daylight, artificial, mixed, low-light
  • Camera mount – head, chest, other
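
The metadata fields above map naturally onto a small typed record. The sketch below is illustrative only: the field names (`task_name`, `camera_mount`, and so on) and the validation logic are assumptions, and the delivered files may use a different schema or file format. The allowed values are copied from the list above.

```python
from dataclasses import dataclass

# Allowed values copied from the metadata list above.
CATEGORIES = {"repair", "assembly", "sewing", "household", "outdoor", "electronics", "other"}
LOCATIONS = {"apartment", "kitchen", "balcony", "courtyard", "garage", "workshop", "park", "other"}
LIGHTING = {"daylight", "artificial", "mixed", "low-light"}
CAMERA_MOUNTS = {"head", "chest", "other"}


@dataclass(frozen=True)
class VideoMetadata:
    # Hypothetical field names; the shipped metadata may name them differently.
    task_name: str
    category: str
    location: str
    lighting: str
    camera_mount: str

    def __post_init__(self):
        # The task name is free-form natural language, so only check it is non-empty.
        if not self.task_name.strip():
            raise ValueError("task_name must be a non-empty description")
        checks = [
            ("category", self.category, CATEGORIES),
            ("location", self.location, LOCATIONS),
            ("lighting", self.lighting, LIGHTING),
            ("camera_mount", self.camera_mount, CAMERA_MOUNTS),
        ]
        for field_name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {sorted(allowed)}")


example = VideoMetadata(
    task_name="sewing a button",
    category="sewing",
    location="apartment",
    lighting="daylight",
    camera_mount="head",
)
```

Validating the enumerated fields at load time catches typos in a manifest early, before they silently create an extra category during training.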

Use cases and applications

  • VLA (vision-language-action) model pretraining – paired video and task descriptions, directly aligned with the VLA training paradigm 
  • Imitation learning and behavior cloning – full continuous task arcs capture demonstration sequences 
  • Hand-object interaction (HOI) research – grasping, tool use, precision and dexterous manipulation
  • Embodied AI and physical AI foundation models – large-scale corpus for pretraining general-purpose robot models
  • Humanoid robot manipulation – household and outdoor task generalization beyond tabletop
  • Sim-to-real transfer – real-distribution anchor data to complement simulation training
  • Action recognition at sub-action granularity – reach, grasp, lift, transport, release
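
For the VLA pretraining use case, the pairing of video and task description could be assembled along these lines. This is a minimal sketch under stated assumptions: the record keys (`video_path`, `task_name`, `category`), the file paths, and the category assigned to each example task are made up for illustration and do not describe the actual delivery format.

```python
from typing import Iterable, Optional


def build_pairs(
    records: Iterable[dict],
    categories: Optional[set] = None,
) -> list:
    """Return (video_path, instruction) pairs, optionally filtered by category."""
    pairs = []
    for rec in records:
        if categories is not None and rec["category"] not in categories:
            continue
        # The natural-language task name serves as the language half of the pair.
        pairs.append((rec["video_path"], rec["task_name"]))
    return pairs


# Hypothetical manifest entries; paths and category assignments are illustrative.
records = [
    {"video_path": "clips/0001.mp4", "task_name": "sewing a button", "category": "sewing"},
    {"video_path": "clips/0002.mp4", "task_name": "cleaning bicycle chain", "category": "outdoor"},
    {"video_path": "clips/0003.mp4", "task_name": "assembling a shelf", "category": "assembly"},
]

all_pairs = build_pairs(records)
sewing_only = build_pairs(records, categories={"sewing"})
print(sewing_only)  # [('clips/0001.mp4', 'sewing a button')]
```

Filtering by category before pairing makes it easy to build task-specific subsets, for example assembly-only data for a furniture-assembly policy.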

Why this dataset solves real production challenges

Commercially licensed for production ML training. The leading academic egocentric datasets, Ego4D (3,670 hours), EgoDex (829 hours), EgoExo4D (1,286 hours), and EPIC-Kitchens (100 hours), are research-only. None can be used to train a production robot manipulation policy or a commercial VLA model. This dataset can

Continuous, uncut footage. Each clip is one continuous take covering a full task arc: preparation, execution, result. This is what behavior cloning policies need. Many academic datasets contain short curated clips that miss the temporal structure of complete demonstrations

Real-life hand tasks, not tabletop only. EgoDex covers 194 tabletop tasks. Our dataset covers diverse environments and tasks: apartments, kitchens, balconies, garages, outdoor maintenance, gardening, bike repair, sewing, electronics. This generalization breadth is critical for humanoid robots designed to operate in homes and outdoor environments

Task Coverage

  • Apartment and furniture – small furniture assembly, hardware installation, battery replacements, and disassembly of household electronics like computer mice and keyboards
  • Clothing and fabric repair – hand-sewing tasks including buttons, patches, hems, and zipper repairs
  • Kitchen and household items – cleaning, disassembly, and reassembly of kitchen tools and appliances, plus simple food prep
  • Electronics without complex repair – screen protectors, watch straps, cable organization, and battery replacements in everyday devices
  • Balcony, yard, outdoor – bicycle maintenance, plant care, outdoor cleaning, and basic vehicle and gear setup
  • Care, cleaning, organization – appliance maintenance, container sorting, and workspace setup as continuous task sequences

Sample dataset

A sample version of this dataset is available on Kaggle and HuggingFace. Leave a request in the form below for additional samples or the full version

Have a question?

The primary use cases are training vision-language-action (VLA) models, imitation learning policies, behavior cloning, hand-object interaction research, and embodied AI / physical AI foundation models. Each video is captured from a first-person perspective with paired natural-language task descriptions, making it directly aligned with the VLA training paradigm. The 4K resolution and continuous full-task footage also suit action recognition, long-form video understanding, and sim-to-real transfer research

Ego4D is larger (3,670 hours) but research-only and not designed specifically for manipulation policy training. Meta's own documentation notes that it was built for action recognition and perception, not for teaching a robot what to do next. EgoDex is research-only and limited to tabletop tasks. EPIC-Kitchens is research-only and kitchen-only. Our dataset is commercially licensed, captured at 4K resolution, includes diverse real-life household and outdoor tasks (not tabletop-only), and uses continuous uncut footage that preserves full demonstration arcs. For production manipulation training, these properties matter more than raw hours

VLA (vision-language-action) models are the foundation-model paradigm for robot learning popularized by RT-2 (Google DeepMind, 2023) and π0 (Physical Intelligence, 2024). They take vision input plus a natural-language instruction and output robot actions. Training a VLA requires paired video-and-language data at scale. This dataset provides 100+ hours of egocentric video, each clip paired with a natural-language task description in the metadata, which is the format VLA training expects

Yes. The scale (100+ hours), diversity of tasks and environments, and 4K resolution make it a strong candidate for inclusion in large-scale pretraining corpora for embodied AI and physical AI foundation models. The EgoScale (NVIDIA, 2026) scaling-law results suggest that additional egocentric hours produce predictable downstream task performance gains in trained policies, and this dataset adds commercially licensed hours to that pool

Contact us

Tell us about yourself, and get access to free samples of the dataset 

© 2022 – 2026 Copyright protected