LeRobot
Overview
Extend Robotics provides an option to download datasets in the LeRobot format, a standard developed by Hugging Face's LeRobot project. This format is optimized for machine learning workflows and offers a lightweight and modular alternative to traditional dataset formats.
By storing each episode as an individual .parquet file, the dataset benefits from reduced file size and improved manageability. Additionally, LeRobot supports built-in video visualisation for each episode, allowing users to inspect demonstrations without additional rendering or post-processing. This makes the format especially suitable for offline reinforcement learning and imitation learning pipelines.
All LeRobot datasets from Extend Robotics are automatically converted from our internal HDF5 recordings to ensure compatibility with Hugging Face’s training and evaluation tools.
Data Structure
A LeRobot dataset consists of structured data stored in a hierarchical directory format. Below is a breakdown of its core components:
your_dataset/
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       ├── episode_000001.parquet
│       └── ...
├── videos/
│   └── chunk-000/
│       ├── episode_000000.mp4
│       ├── episode_000001.mp4
│       └── ...
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── tasks.jsonl
│   └── episodes.jsonl
└── depth_images/ (optional)
Parquet Episode Files
Each .parquet file contains structured, time-aligned data for a single episode. These files are grouped under the chunk-000 folder to support scalable dataset sharding for large-scale training tasks.
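The chunk layout can be sketched as a mapping from an episode index to its expected file path. This is a minimal illustration, not the exporter's actual logic: the helper name `episode_parquet_path` is hypothetical, and the chunk size of 1000 episodes per folder is an assumption, since this page does not state how many episodes a chunk holds.

```python
from pathlib import Path

# Assumed number of episodes per chunk folder. The real value is not stated
# in this document, so treat it as a placeholder.
ASSUMED_CHUNK_SIZE = 1000


def episode_parquet_path(root, episode_index, chunk_size=ASSUMED_CHUNK_SIZE):
    """Return the expected Parquet path for an episode (hypothetical helper)."""
    if episode_index < 0:
        raise ValueError("episode_index must be non-negative")
    # Integer division selects the chunk folder the episode falls into.
    chunk_index = episode_index // chunk_size
    return (
        Path(root)
        / "data"
        / f"chunk-{chunk_index:03d}"
        / f"episode_{episode_index:06d}.parquet"
    )
```

Under these assumptions, episode 0 resolves to data/chunk-000/episode_000000.parquet, matching the directory tree shown above.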
Videos
Each .mp4 video file is processed with ffmpeg and provides a visualisation of the corresponding episode. Videos are stored in the same chunk-000 structure as the data files for consistency.
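Because the videos mirror the data layout, each episode file can be matched to its video by swapping the top-level folder and the extension. The sketch below assumes the file naming shown in the directory tree; the function name `pair_episodes_with_videos` is hypothetical and not part of any Extend Robotics or LeRobot API.

```python
from pathlib import Path


def pair_episodes_with_videos(root):
    """Map each episode stem to a (parquet_path, video_path_or_None) tuple."""
    root = Path(root)
    pairs = {}
    for parquet_path in sorted(root.glob("data/chunk-*/episode_*.parquet")):
        # Same chunk folder name and episode stem, but under videos/ with .mp4.
        video_path = (
            root / "videos" / parquet_path.parent.name / (parquet_path.stem + ".mp4")
        )
        pairs[parquet_path.stem] = (
            parquet_path,
            video_path if video_path.is_file() else None,
        )
    return pairs
```

A `None` in the second position flags an episode whose video is missing, which can be a quick sanity check after downloading a dataset.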
Metadata
info.json: General dataset metadata, including dataset name, schema version, and number of episodes.
stats.json: Statistical summary of the dataset (e.g., episode lengths, action ranges, joint state statistics). Useful for normalisation and analysis during training.
tasks.jsonl: Descriptions of all tasks contained in the dataset (e.g., task names).
episodes.jsonl: Per-episode metadata, including task association, episode length, and success indicators if available.
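The four metadata files can be read with Python's standard library alone. The following is a minimal sketch that assumes the .json files each hold a single JSON document and the .jsonl files hold one JSON object per line; the helper names are hypothetical, and the keys inside each file are not specified on this page, so inspect the loaded data before relying on any field.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a file containing a single JSON document."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_metadata(root):
    """Load all files under meta/ into one dictionary (hypothetical helper)."""
    meta_dir = Path(root) / "meta"
    return {
        "info": load_json(meta_dir / "info.json"),
        "stats": load_json(meta_dir / "stats.json"),
        "tasks": load_jsonl(meta_dir / "tasks.jsonl"),
        "episodes": load_jsonl(meta_dir / "episodes.jsonl"),
    }
```

Calling `load_metadata("your_dataset")` returns the parsed contents keyed by file, so `len(metadata["episodes"])` gives the number of per-episode records in episodes.jsonl.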