Egocentric Video Dataset

100+ hours of 4K egocentric POV video

Check samples on Kaggle

Summary

100+ hours of 4K egocentric POV video for training vision-language-action (VLA) models, imitation learning policies, and embodied AI systems. Continuous uncut footage of real humans performing real-life hand tasks: repair, assembly, sewing, household work, outdoor and bike maintenance, captured at 4K / 30 FPS from a head-mounted camera

Introduction

This dataset contains 100+ hours of 4K first-person (egocentric POV) video of real people performing real manipulation tasks in real environments: apartments, kitchens, balconies, garages, courtyards, workshops, and outdoors. Every recording is continuous and unedited: no cuts, no time-lapse, no music, no filters. Hands, tools, and objects are visible throughout, and each clip covers one meaningful task step from start to finish: preparation, execution, and visible result.

The dataset is designed as a production-ready training corpus for the new generation of robot learning approaches: vision-language-action (VLA) models, imitation learning, behavior cloning, and embodied AI foundation models. Each video ships with structured task metadata (task name, category, location, lighting, and camera mount), which makes it directly usable as a paired language-video input for VLA pretraining.

Dataset Features

Scale & Quality

100+ hours of egocentric video
4K resolution at 30 FPS
Continuous, uncut footage – full task arcs
Manually verified – every clip reviewed for hand visibility and task continuity
No synthetic frames, no AI generation, no re-encoding – original camera output
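
As a rough sizing aid, the hour and frame-rate figures above imply a minimum frame count. The sketch below assumes exactly 100 hours recorded at a constant 30 FPS; the function name is illustrative and not part of any dataset tooling.

```python
def total_frames(hours: float, fps: int = 30) -> int:
    """Number of frames in `hours` of footage at a constant `fps`."""
    return int(hours * 3600 * fps)


# Lower bound for the "100+ hours" figure at 30 FPS.
frames = total_frames(100)
print(frames)  # 10800000
```

Since the dataset is described as "100+" hours, the real count will be at least this large.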

Metadata for Every Video

Task name – natural language description (e.g. “sewing a button”, “cleaning bicycle chain”, “assembling a shelf”)
Category – repair, assembly, sewing, household, outdoor, electronics, other
Location – apartment, kitchen, balcony, courtyard, garage, workshop, park, other
Lighting – daylight, artificial, mixed, low-light
Camera mount – head, chest, other
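
The metadata fields above can be modelled as a small validated record. This is a minimal sketch: the class name `ClipMetadata`, the field names, and the helper `filter_clips` are assumptions made for illustration, since the page does not specify the file format or key names of the shipped metadata. The category assigned to each example task is also a guess.

```python
from dataclasses import dataclass

# Allowed values copied from the metadata field list above.
ALLOWED = {
    "category": {"repair", "assembly", "sewing", "household",
                 "outdoor", "electronics", "other"},
    "location": {"apartment", "kitchen", "balcony", "courtyard",
                 "garage", "workshop", "park", "other"},
    "lighting": {"daylight", "artificial", "mixed", "low-light"},
    "camera_mount": {"head", "chest", "other"},
}


@dataclass(frozen=True)
class ClipMetadata:
    """Hypothetical per-video record; field names are assumptions."""
    task_name: str
    category: str
    location: str
    lighting: str
    camera_mount: str

    def __post_init__(self) -> None:
        if not self.task_name.strip():
            raise ValueError("task_name must be a non-empty description")
        # Reject any value outside the documented vocabularies.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )


def filter_clips(clips, **criteria):
    """Return clips whose fields equal every keyword in `criteria`."""
    return [c for c in clips
            if all(getattr(c, k) == v for k, v in criteria.items())]


# Illustrative records; category assignments are guesses.
clips = [
    ClipMetadata("sewing a button", "sewing", "apartment", "daylight", "head"),
    ClipMetadata("cleaning bicycle chain", "repair", "courtyard", "daylight", "chest"),
    ClipMetadata("assembling a shelf", "assembly", "apartment", "artificial", "head"),
]
head_apartment = filter_clips(clips, location="apartment", camera_mount="head")
print([c.task_name for c in head_apartment])
```

Validating against the documented vocabularies at load time catches typos in a metadata file before they silently shrink a filtered training subset.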

Use cases and applications

VLA (vision-language-action) model pretraining – paired video and task descriptions, directly aligned with the VLA training paradigm
Imitation learning and behavior cloning – full continuous task arcs capture demonstration sequences
Hand-object interaction (HOI) research – grasping, tool use, precision and dexterous manipulation
Embodied AI and physical AI foundation models – large-scale corpus for pretraining general-purpose robot models
Humanoid robot manipulation – household and outdoor task generalization beyond tabletop
Sim-to-real transfer – real-distribution anchor data to complement simulation training
Action recognition at sub-action granularity – reach, grasp, lift, transport, release

Why this dataset solves real production challenges

Commercially licensed for production ML training. The leading academic egocentric datasets, Ego4D (3,670 hours), EgoDex (829 hours), EgoExo4D (1,286 hours), and EPIC-Kitchens (100 hours), are research-only. None can be used to train a production robot manipulation policy or a commercial VLA model. This dataset can.

Continuous, uncut footage. Each clip is one continuous take covering a full task arc: preparation, execution, result. This is what behavior cloning policies need. Many academic datasets contain short curated clips that miss the temporal structure of complete demonstrations

Real-life hand tasks, not tabletop only. EgoDex covers 194 tabletop tasks. Our dataset covers diverse environments and tasks: apartments, kitchens, balconies, garages, outdoor maintenance, gardening, bike repair, sewing, electronics. This generalization breadth is critical for humanoid robots designed to operate in homes and outdoor environments

Task Coverage

Apartment and furniture – small furniture assembly, hardware installation, battery replacements, and disassembly of household electronics like computer mice and keyboards
Clothing and fabric repair – hand-sewing tasks including buttons, patches, hems, and zipper repairs
Kitchen and household items – cleaning, disassembly, and reassembly of kitchen tools and appliances, plus simple food prep
Electronics without complex repair – screen protectors, watch straps, cable organization, and battery replacements in everyday devices
Balcony, yard, outdoor – bicycle maintenance, plant care, outdoor cleaning, and basic vehicle and gear setup
Care, cleaning, organization – appliance maintenance, container sorting, and workspace setup as continuous task sequences

Sample dataset

A sample version of this dataset is available on Kaggle and HuggingFace. Leave a request in the form below for additional samples or the full version

Have a question?

What is this dataset used for?

The primary use cases are training vision-language-action (VLA) models, imitation learning policies, behavior cloning, hand-object interaction research, and embodied AI / physical AI foundation models. Each video is captured from a first-person perspective with paired natural-language task descriptions, making it directly aligned with the VLA training paradigm. The 4K resolution and continuous full-task footage also suit action recognition, long-form video understanding, and sim-to-real transfer research

How is this dataset different from Ego4D, EgoDex, and EPIC-Kitchens?

Ego4D is larger (3,670 hours) but research-only and not designed specifically for manipulation policy training. Meta's own documentation notes that it was built for action recognition and perception, not for teaching a robot what to do next. EgoDex is research-only and limited to tabletop tasks. EPIC-Kitchens is research-only and kitchen-only. Our dataset is commercially licensed, captured at 4K resolution, includes diverse real-life household and outdoor tasks (not tabletop-only), and uses continuous uncut footage that preserves full demonstration arcs. For production manipulation training, these properties matter more than raw hours.

What is VLA, and why does this dataset suit VLA training?

VLA (vision-language-action) models are the foundation-model paradigm for robot learning popularized by RT-2 (Google DeepMind, 2023) and π0 (Physical Intelligence, 2024). They take vision input plus a natural-language instruction and output robot actions. Training a VLA requires paired video-and-language data at scale. This dataset provides 100+ hours of egocentric video, with each clip paired with a natural-language task description in the metadata, which is the paired video-and-language format that VLA training expects.
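
The pairing described above can be sketched as follows. This is an illustration only: the file path, the dictionary keys, the 2 FPS sampling rate, and both function names are assumptions. The listed metadata does not mention action labels, so the sketch covers only the vision and language side of a VLA sample.

```python
def frame_indices(duration_s: float, source_fps: int = 30,
                  sample_fps: int = 2) -> list[int]:
    """Indices of frames to keep when subsampling a clip to `sample_fps`."""
    if not 0 < sample_fps <= source_fps:
        raise ValueError("sample_fps must be in (0, source_fps]")
    stride = source_fps // sample_fps
    n_frames = int(duration_s * source_fps)
    return list(range(0, n_frames, stride))


def make_pair(video_path: str, task_name: str, duration_s: float) -> dict:
    """Bundle one clip with its task description as the language input."""
    return {
        "video": video_path,        # hypothetical path layout
        "instruction": task_name,   # task name taken from the metadata
        "frames": frame_indices(duration_s),
    }


pair = make_pair("clips/sewing_button.mp4", "sewing a button", duration_s=2.0)
print(pair["instruction"], pair["frames"])
```

Subsampling frames is a common way to keep long continuous takes tractable; the right rate depends on the model and is not specified by the dataset.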

Is the dataset suitable for embodied AI and physical AI foundation models?

Yes. The scale (100+ hours), diversity of tasks and environments, and 4K resolution make it a strong candidate for inclusion in large-scale pretraining corpora for embodied AI and physical AI foundation models. According to the EgoScale (NVIDIA, 2026) scaling-law results, additional commercially licensed egocentric hours produce predictable downstream task performance gains in trained policies.

Contact us

Tell us about yourself, and get access to free samples of the dataset
