Summary
100+ hours of 4K egocentric POV video for training vision-language-action (VLA) models, imitation learning policies, and embodied AI systems. Continuous, uncut footage of real people performing real-life hand tasks (repair, assembly, sewing, household work, and outdoor and bike maintenance), captured at 4K / 30 FPS from a head-mounted camera.
Introduction
This dataset contains 100+ hours of 4K first-person (egocentric POV) video of real people performing real manipulation tasks in real environments: apartments, kitchens, balconies, garages, courtyards, workshops, and outdoors. Every recording is continuous and unedited: no cuts, no time-lapse, no music, no filters. Hands, tools, and objects are visible throughout, and each clip follows a task from preparation through execution to a visible result.
The dataset is designed as a production-ready training corpus for the new generation of robot learning approaches: vision-language-action (VLA) models, imitation learning, behavior cloning, and embodied AI foundation models. Each video ships with structured task metadata (task name, category, location, lighting, and camera mount), making it directly usable as paired language-video input for VLA pretraining.
Dataset Features
Scale & Quality
- 100+ hours of egocentric video
- 4K resolution at 30 FPS
- Continuous, uncut footage – full task arcs
- Manually verified – every clip reviewed for hand visibility and task continuity
- No synthetic frames, no AI generation, no re-encoding – original camera output
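As a rough sense of scale, the figures above can be turned into a frame count. The sketch below is only arithmetic on the stated numbers; the pixel dimensions are an assumption (UHD 3840x2160), since the page says "4K" without giving exact dimensions, and the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600


def min_frame_count(hours: float, fps: int) -> int:
    """Lower bound on the number of frames in `hours` of footage at `fps`."""
    return int(hours * SECONDS_PER_HOUR * fps)


# Figures from the list above: 100+ hours at 30 FPS.
frames = min_frame_count(100, 30)

# Assumption: "4K" means UHD 3840x2160; the source does not state dimensions.
ASSUMED_WIDTH, ASSUMED_HEIGHT = 3840, 2160
pixels_per_frame = ASSUMED_WIDTH * ASSUMED_HEIGHT

print(f"at least {frames:,} frames, {pixels_per_frame:,} pixels per frame")
```

Because the dataset is described as "100+" hours, the frame count is a lower bound rather than an exact total.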
Metadata for Every Video
- Task name – natural language description (e.g. “sewing a button”, “cleaning bicycle chain”, “assembling a shelf”)
- Category – repair, assembly, sewing, household, outdoor, electronics, other
- Location – apartment, kitchen, balcony, courtyard, garage, workshop, park, other
- Lighting – daylight, artificial, mixed, low-light
- Camera mount – head, chest, other
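The metadata fields above could be modeled as a small validated record. This is a minimal sketch, not the dataset's published schema: the class name, field names, and file-independent layout are hypothetical, and only the allowed values are taken from the lists above.

```python
from dataclasses import dataclass

# Allowed values copied from the metadata lists above.
CATEGORIES = {"repair", "assembly", "sewing", "household", "outdoor", "electronics", "other"}
LOCATIONS = {"apartment", "kitchen", "balcony", "courtyard", "garage", "workshop", "park", "other"}
LIGHTING = {"daylight", "artificial", "mixed", "low-light"}
CAMERA_MOUNTS = {"head", "chest", "other"}


@dataclass(frozen=True)
class ClipMetadata:
    """Hypothetical per-video record; field names are assumptions."""
    task_name: str      # natural-language description, e.g. "sewing a button"
    category: str
    location: str
    lighting: str
    camera_mount: str

    def __post_init__(self) -> None:
        if not self.task_name.strip():
            raise ValueError("task_name must be a non-empty description")
        checks = [
            ("category", self.category, CATEGORIES),
            ("location", self.location, LOCATIONS),
            ("lighting", self.lighting, LIGHTING),
            ("camera_mount", self.camera_mount, CAMERA_MOUNTS),
        ]
        for field_name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} is not one of {sorted(allowed)}")


example = ClipMetadata("sewing a button", "sewing", "apartment", "daylight", "head")
```

Validating at construction time means a malformed record fails when it is loaded instead of surfacing later inside a training job.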
Use cases and applications
- VLA (vision-language-action) model pretraining – paired video and task descriptions, directly aligned with the VLA training paradigm
- Imitation learning and behavior cloning – full continuous task arcs capture demonstration sequences
- Hand-object interaction (HOI) research – grasping, tool use, precision and dexterous manipulation
- Embodied AI and physical AI foundation models – large-scale corpus for pretraining general-purpose robot models
- Humanoid robot manipulation – household and outdoor task generalization beyond tabletop
- Sim-to-real transfer – real-distribution anchor data to complement simulation training
- Action recognition at sub-action granularity – reach, grasp, lift, transport, release
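For the VLA pretraining use case, one simple way to form language-video pairs is to sample frames at a fixed stride and attach the clip's task description to each one. The sketch below assumes this sampling scheme; the names `FrameTextPair` and `iter_frame_text_pairs` and the clip identifier are hypothetical, and no video decoding is performed. It produces only frame-text pairs: the metadata described on this page does not include robot action labels, so any action targets would have to come from a separate step.

```python
from typing import Iterator, NamedTuple


class FrameTextPair(NamedTuple):
    clip_id: str
    frame_index: int
    timestamp_s: float
    instruction: str


def iter_frame_text_pairs(
    clip_id: str,
    task_name: str,
    duration_s: float,
    fps: int = 30,          # the dataset is described as 30 FPS
    stride_s: float = 1.0,  # assumption: sample one frame per second
) -> Iterator[FrameTextPair]:
    """Yield (frame, instruction) pairs at a fixed temporal stride."""
    if duration_s <= 0 or fps <= 0 or stride_s <= 0:
        raise ValueError("duration_s, fps and stride_s must be positive")
    total_frames = int(duration_s * fps)
    step = max(1, round(stride_s * fps))
    for frame_index in range(0, total_frames, step):
        yield FrameTextPair(clip_id, frame_index, frame_index / fps, task_name)


pairs = list(iter_frame_text_pairs("clip_0001", "sewing a button", duration_s=10.0))
```

A fixed stride keeps the example short; a real pipeline might instead sample short frame windows so that a model sees motion and not only single images.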
Why this dataset solves real production challenges
Commercially licensed for production ML training. The leading academic egocentric datasets (Ego4D at 3,670 hours, EgoDex at 829 hours, EgoExo4D at 1,286 hours, and EPIC-Kitchens at 100 hours) are research-only. None can be used to train a production robot manipulation policy or a commercial VLA model. This dataset can.
Continuous, uncut footage. Each clip is one continuous take covering a full task arc: preparation, execution, result. This is what behavior cloning policies need. Many academic datasets contain short curated clips that miss the temporal structure of complete demonstrations.
Real-life hand tasks, not tabletop only. EgoDex covers 194 tabletop tasks. Our dataset covers diverse environments and tasks: apartments, kitchens, balconies, garages, outdoor maintenance, gardening, bike repair, sewing, and electronics. This breadth is critical for generalization in humanoid robots designed to operate in homes and outdoor environments.
Task Coverage
- Apartment and furniture – small furniture assembly, hardware installation, battery replacements, and disassembly of household electronics like computer mice and keyboards
- Clothing and fabric repair – hand-sewing tasks including buttons, patches, hems, and zipper repairs
- Kitchen and household items – cleaning, disassembly, and reassembly of kitchen tools and appliances, plus simple food prep
- Electronics without complex repair – screen protectors, watch straps, cable organization, and battery replacements in everyday devices
- Balcony, yard, outdoor – bicycle maintenance, plant care, outdoor cleaning, and basic vehicle and gear setup
- Care, cleaning, organization – appliance maintenance, container sorting, and workspace setup as continuous task sequences
Sample dataset
A sample version of this dataset is available on Kaggle and Hugging Face. Leave a request in the form below for additional samples or the full version.
Have a question?
The primary use cases are training vision-language-action (VLA) models, imitation learning policies, behavior cloning, hand-object interaction research, and embodied AI / physical AI foundation models. Each video is captured from a first-person perspective with a paired natural-language task description, making it directly aligned with the VLA training paradigm. The 4K resolution and continuous full-task footage also suit action recognition, long-form video understanding, and sim-to-real transfer research.
Ego4D is larger (3,670 hours) but research-only and not designed specifically for manipulation policy training. Meta's own documentation notes that it was built for action recognition and perception, not for teaching a robot what to do next. EgoDex is research-only and limited to tabletop tasks. EPIC-Kitchens is research-only and kitchen-only. Our dataset is commercially licensed, captured at 4K resolution, includes diverse real-life household and outdoor tasks (not tabletop-only), and uses continuous uncut footage that preserves full demonstration arcs. For production manipulation training, these properties matter more than raw hours.
VLA (vision-language-action) models are the foundation-model paradigm for robot learning popularized by RT-2 (Google DeepMind, 2023) and π0 (Physical Intelligence, 2024). They take vision input plus a natural-language instruction and output robot actions. Training a VLA requires paired video-and-language data at scale. This dataset provides 100+ hours of egocentric video, with each clip paired with a natural-language task description in the metadata, which is the format VLA training expects.
Yes. The scale (100+ hours), diversity of tasks and environments, and 4K resolution make it a strong candidate for inclusion in large-scale pretraining corpora for embodied AI and physical AI foundation models. According to the EgoScale (NVIDIA, 2026) scaling-law results, additional egocentric training hours produce predictable gains in the downstream task performance of trained policies, and the hours in this dataset are commercially licensed.
Contact us
Tell us about yourself, and get access to free samples of the dataset