Investigating Robot Learning of Quadrupedal Locomotion on Deformable Terrain

M.Sc. thesis - GPU-accelerated Isaac Sim workspace that couples Position-Based Dynamics (PBD) gravel simulation with a curriculum-driven PPO policy to achieve robust, energy-efficient locomotion across soft, uneven, and granular ground.

My poster presentation at German Robotics Conference 2025

News (March 2025) – A preliminary extension of this framework was accepted as a short paper at the German Robotics Conference 2025 (GRC 2025). Read the extended abstract here. Read the poster here.

This project is the deliverable of my M.Sc. thesis at RWTH Aachen.
It packages an end-to-end pipeline—simulation, reinforcement-learning (RL), evaluation, and visualisation—for training quadruped robots to handle deformable terrain such as sand, gravel, and soft soil.
Built around NVIDIA Isaac Sim and OmniIsaacGymEnvs, the workspace brings together:

  • Position-Based Dynamics (PBD) particles for real-time granular media.
  • Massively parallel Proximal Policy Optimization (PPO).
  • An automatic terrain curriculum that graduates from rigid slopes to particle-filled depressions.
  • Domain randomisation (friction, density, adhesion, external pushes) for sim-to-real transfer.
  • Integrated metrics dashboards and helper scripts for reward-curve replay and inference video capture.

“The adoption of PBD allowed for a more accurate and computationally efficient simulation of granular interactions, facilitating real-time training and testing of RL policies.”


Motivation - Experiments

We kick‑started the project by running the stock Unitree A1 controller on loose sand at a deliberately low command velocity. The robot managed a cautious forward trot, adapting its balance to the yielding surface.
Yet the moment speed or terrain complexity increased (gravel or a sand-gravel mix), it failed to stay upright. These early “misadventures” exposed the raw difficulty of deformable-terrain locomotion and cemented the need for a learned, terrain-aware policy.

Methodology

  • Deformable-Terrain Simulator for locomotion – Spawns ≈200 k PBD particles inside mesh “depressions” in Isaac Sim, refits the BVH on the fly, and resolves two-way robot–terrain contacts.
Top left: particle parameters were tuned via an empirical angle-of-repose test (≈ 30–40° for 20 mm spheres); μ/ρ/adhesion are randomly perturbed every 20 s during Phase 2 to harden sim-to-real transfer. Bottom left: initialisation of particles into the depressed grid. Top right: poses on rigid ground (left) and compliant terrain (right). Bottom right: traversal on PBD gravel with height scans highlighted in red.
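The particle initialisation into the depressed grid can be sketched as a jittered cubic lattice of sphere centres filling the pit. This is a minimal NumPy sketch; the pit size, depth, and jitter amplitude here are illustrative assumptions, not the thesis configuration.

```python
import numpy as np

def init_particle_grid(center, size_xy, depth, radius, jitter=0.2, seed=0):
    """Fill a rectangular depression with a jittered lattice of PBD spheres.

    center : (x, y) of the pit centre in metres
    size_xy: (Lx, Ly) footprint of the pit
    depth  : pit depth; radius : particle radius (0.02 for 20 mm spheres)
    """
    rng = np.random.default_rng(seed)
    spacing = 2.0 * radius                      # touching spheres at rest
    nx = round(size_xy[0] / spacing)
    ny = round(size_xy[1] / spacing)
    nz = round(depth / spacing)
    xs = np.arange(nx) * spacing - size_xy[0] / 2 + center[0]
    ys = np.arange(ny) * spacing - size_xy[1] / 2 + center[1]
    zs = np.arange(nz) * spacing - depth + radius   # from pit floor upward
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)
    # small jitter breaks the perfect lattice so the pile settles naturally
    grid += rng.uniform(-jitter, jitter, grid.shape) * radius
    return grid

# 4 x 4 m pit, 0.16 m deep, 20 mm spheres -> 100 x 100 x 4 candidate positions
positions = init_particle_grid(center=(0.0, 0.0), size_xy=(4.0, 4.0),
                               depth=0.16, radius=0.02)
```

In Isaac Sim the resulting positions would then be handed to the particle-system API as a point instancer; the sketch only covers the geometry of the spawn grid.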
Component Details

| Component | Details |
|---|---|
| State vector (188-D) | Base linear/angular velocity, gravity vector, 12 joint positions + velocities, previous action, 140-cell height grid |
| Action space (12-D) | Joint-angle offsets; torques clipped to ±80 N m |
| Rewards | Velocity tracking, torque/acceleration regularisers, stumble penalty, peak-contact penalty; airtime term disabled in Phase 2 |
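The listed reward terms can be sketched as follows. This is a NumPy sketch of what are batched torch tensors in the actual workspace, and every weight and threshold here is illustrative, not the thesis value.

```python
import numpy as np

def compute_rewards(base_lin_vel, cmd_vel, torques, joint_acc,
                    contact_forces, stumble_mask, use_airtime, airtime):
    """Sketch of the reward terms: tracking, regularisers, penalties.

    Shapes (N = number of parallel envs): velocities (N, 3), torques and
    joint_acc (N, 12), contact_forces (N, 4, 3), stumble_mask/airtime (N, 4).
    """
    # velocity tracking: exponential kernel on the XY tracking error
    track_err = np.sum((cmd_vel[:, :2] - base_lin_vel[:, :2]) ** 2, axis=1)
    r_track = 1.0 * np.exp(-track_err / 0.25)
    # regularisers on actuation effort and joint accelerations
    r_torque = -1e-4 * np.sum(torques ** 2, axis=1)
    r_acc = -2.5e-7 * np.sum(joint_acc ** 2, axis=1)
    # stumble penalty: count of feet flagged with lateral force spikes
    r_stumble = -1.0 * stumble_mask.astype(float).sum(axis=1)
    # peak-contact penalty discourages hard foot impacts
    peak = np.clip(np.linalg.norm(contact_forces, axis=-1) - 100.0, 0.0, None)
    r_contact = -1e-2 * peak.sum(axis=1)
    total = r_track + r_torque + r_acc + r_stumble + r_contact
    if use_airtime:              # airtime term is disabled in Phase 2
        total = total + 0.5 * airtime.sum(axis=1)
    return total
```

With perfect tracking and zero penalties the per-env reward reduces to the tracking kernel's maximum of 1.0, which makes the 80 % curriculum threshold easy to interpret.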
  • Two-Stage RL Curriculum
    • Phase 1: 2000 epochs on rigid terrain; velocity curriculum + airtime / collision / stumble / other rewards.
    • Phase 2: gravel only; dynamic particle material properties randomisation every 20 s to boost policy generalization.
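The Phase-2 material randomisation every 20 s can be sketched as a timed event that resamples μ/ρ/adhesion and writes them back to the particle material. The parameter ranges and class names here are assumptions for illustration.

```python
import numpy as np

RANDOMIZE_INTERVAL_S = 20.0
# illustrative ranges around the nominal gravel material (mu = 0.35, rho = 2000)
RANGES = {
    "friction": (0.2, 0.6),
    "density": (1500.0, 2500.0),   # kg/m^3
    "adhesion": (0.0, 0.1),
}

class MaterialRandomizer:
    """Fires a material update every RANDOMIZE_INTERVAL_S of simulated time."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.next_event = RANDOMIZE_INTERVAL_S

    def step(self, sim_time, apply_fn):
        """Call once per control step; apply_fn receives the new parameters
        (in the real workspace it would write into the PBD material prim)."""
        if sim_time >= self.next_event:
            params = {k: float(self.rng.uniform(*r)) for k, r in RANGES.items()}
            apply_fn(params)
            self.next_event += RANDOMIZE_INTERVAL_S
            return params
        return None
```

Keeping the randomiser outside the physics loop means the same schedule works regardless of the simulation step size.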

Overall structure of our learning framework.
  • Velocity-Aware Command Curriculum – Command ranges auto-scale when average reward > 80 % of max, enabling safe exploration without premature falls.
  • Benchmark Replication – Re‑implemented the “Learning to Walk in Minutes” baseline in both Isaac Gym and Isaac Sim. Average episodic‑reward curves overlap within ±2 %, confirming that migrating to Isaac Sim’s richer GUI incurs no learning penalty.
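The velocity-aware command curriculum above can be sketched as a gate on the command range: it widens only once the mean reward clears 80 % of the maximum achievable reward. The class name, increments, and bounds are illustrative assumptions.

```python
import numpy as np

class CommandCurriculum:
    """Auto-scaling command range, gated on mean reward vs. max reward."""

    def __init__(self, max_reward, v_init=0.5, v_max=2.0, step=0.25, frac=0.8):
        self.max_reward = max_reward   # maximum achievable episode reward
        self.v_range = v_init          # current half-width of the command range
        self.v_max = v_max
        self.step = step
        self.frac = frac               # 0.8 -> widen above 80 % of max

    def update(self, mean_reward):
        # widen only on sustained good tracking, so agents are never pushed
        # to speeds that cause premature falls
        if mean_reward > self.frac * self.max_reward:
            self.v_range = min(self.v_range + self.step, self.v_max)
        return self.v_range

    def sample(self, n, rng=None):
        # commands drawn uniformly from the current symmetric range
        rng = rng or np.random.default_rng()
        return rng.uniform(-self.v_range, self.v_range, size=n)
```

Because the gate compares against the best achievable reward rather than a fixed constant, the same logic transfers between reward formulations without retuning.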

Terrain Curriculum

  • Rigid Section – Mix of slopes (±25 °), stairs (0.3 m × 0.2 m), and 0.2 m random obstacles.
  • Granular Section – Central 4 × 4 m pit filled with 20 mm PBD spheres (ρ = 2000 kg m⁻³, μ = 0.35).
  • Agents graduate to harder terrain when their average episode reward exceeds a threshold and regress otherwise, keeping easier terrains in the training mix to prevent catastrophic forgetting.
Side-by-side: Phase 1 (left) and Phase 2 Terrain Curriculum (right).
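The graduate/regress rule can be sketched as a per-agent level update. The thresholds, level count, and the random reassignment of agents that clear the top level are assumptions modelled on common terrain-curriculum implementations, not confirmed thesis details.

```python
import numpy as np

NUM_LEVELS = 10   # illustrative number of difficulty rows

def update_terrain_levels(levels, ep_rewards, promote_at=0.8, demote_at=0.4,
                          max_reward=1.0, rng=None):
    """Promote/demote each agent's terrain level from its episode reward.

    levels     : int array of current difficulty levels per agent
    ep_rewards : average episode reward per agent
    """
    rng = rng or np.random.default_rng()
    levels = levels.copy()
    up = ep_rewards > promote_at * max_reward
    down = ep_rewards < demote_at * max_reward
    levels[up] += 1
    levels[down] -= 1
    # agents that clear the top level respawn on a random level instead of
    # saturating, so easier terrains stay in the training distribution
    overflow = levels >= NUM_LEVELS
    levels[overflow] = rng.integers(0, NUM_LEVELS, overflow.sum())
    return np.clip(levels, 0, NUM_LEVELS - 1)
```

Clipping at the bottom keeps struggling agents on the easiest terrain rather than removing them from training.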

Results Highlights

One-take inference run: 1 m/s trot through a 6 × 6 m gravel pit.

Key metrics summary

| Metric † | Benchmark (replicated) | Phase 1 (rigid terrain) | Phase 2 (PBD gravel) |
|---|---|---|---|
| Mean power consumption | 309.92 | 179.13 | 199.89 |
| Cost of Transport (CoT) | 2.00 | 0.38 | slightly ↑ vs P1 |
| Mean foot contact force | 18.27 | 30.87 | slightly ↓ vs P1 |
| Base angular vel. (XY) MSE (mean ± SD) | — | 1.7539 ± 4.7669 | 1.6875 ± 1.7753 |
| Joint position MSE (all DOFs) (mean ± SD) | — | 1.3894 ± 0.6691 | 1.3297 ± 0.5849 |

† Mean over all four legs.
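For reference, Cost of Transport is conventionally computed as power normalised by weight times forward speed, CoT = P / (m g v). The sketch below uses an assumed Unitree A1 mass of about 12 kg and a 1 m/s command; the table's values may use a different mass or speed normalisation, so this is the formula, not a reproduction of those numbers.

```python
def cost_of_transport(mean_power_w, mass_kg=12.0, g=9.81, speed_mps=1.0):
    """Dimensionless Cost of Transport: CoT = P / (m * g * v).

    mean_power_w : mean power draw in watts
    mass_kg      : robot mass (assumed ~12 kg for a Unitree A1)
    speed_mps    : mean forward speed of the run
    """
    return mean_power_w / (mass_kg * g * speed_mps)
```

A lower CoT at the same speed directly indicates a more energy-efficient gait, which is why it complements raw power draw in the table.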

Bottom line: Phase 2 tightens orientation and joint-tracking errors (lower means and a much lower standard deviation in angular-XY; lower joint-position MSE) at the cost of a higher power draw on granular terrain than Phase 1. To our knowledge, this is the first Isaac Sim quadruped successfully demonstrated on fully deformable PBD terrain.


Major Limitation

Due to GPU memory/throughput constraints, we were unable to scale the granular-terrain simulations (PBD particle counts and domain size) beyond the presented setup. Consequently, the amount and diversity of deformable-terrain experience collected during training was limited. Additionally, Isaac Sim currently runs the PBD particle pipeline entirely on the CPU, which introduces a significant bottleneck for large-scale granular simulations. This restricts achievable frame rates and limits the practicality of training on more complex deformable terrains without distributed CPU resources.


Ongoing Work

  • Cloud-scale simulation & training — Containerize the workspace and orchestrate Isaac Sim + PPO across multi-GPU cloud platforms to scale PBD particle counts/terrain size and expand experience collection.
  • Terrain-adaptive velocity curriculum — adaptive scaling of command velocity based on real-time terrain difficulty and agent performance, to enable high-speed locomotion training.
  • Privileged student–teacher transfer — privileged-information encoder and adaptation module for estimating environment extrinsics and enabling rapid sim-to-real policy adaptation.
  • SAC + online adaptation to cut sample complexity on CPU-bound particle sims.

References