Lablup to Showcase Sovereign AI Training Infrastructure at NVIDIA GTC 2026 Powered by 500+ B200 GPUs
- Operated a 60+ node B200 cluster for 73 days, cutting average recovery time by 47%
- Shares real-world operational strategies from the Solar Open 100B large-model training project
- Live demos of Backend.AI Continuum and Backend.AI:GO on NVIDIA DGX Spark systems
SEOUL, South Korea, March X, 2026 — Lablup Inc., a leading AI infrastructure company, will share its experience operating a large-scale Sovereign AI training environment powered by 504 NVIDIA B200 GPUs at NVIDIA GTC 2026, held March 16–19 in San Jose, California. During the GTC Theater Session on March 18, CEO Jeongkyu Shin will present insights from 73 days of large-cluster operations, covering Lablup’s fault-tolerant scheduling strategies and fast recovery techniques. The company will also host hands-on demonstrations at its booth, featuring automatic recovery functions in Backend.AI Continuum and Backend.AI:GO running on NVIDIA DGX Spark systems.
Sovereign AI in Action: Building a 100B-Parameter Model
In the session titled “Building Sovereign AI: Scaling 100B+ Model Training on NVIDIA Blackwell Infrastructure,” Shin will walk through how Lablup trained a 100-billion-parameter model on a 60+ node B200 cluster (504 GPUs). This project was part of the Korea Ministry of Science and ICT / NIPA initiative to develop homegrown AI foundation models, with Lablup serving as the infrastructure partner for Upstage’s Solar Open 100B training.
The team built a fault-tolerant scheduling system capable of automatically detecting and recovering from common distributed training failures such as GPU errors and NCCL timeouts. The system reduced average recovery time by 47% compared to previous runs, restarting failed processes in under three seconds. Lablup will also share its troubleshooting experience, including a single NFS driver misconfiguration that caused a nearly tenfold performance degradation, along with its optimizations for MXFP8 training stability and its NCCL tuning across both RoCE and InfiniBand networks on the Blackwell architecture.
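The detect-and-restart loop described above can be sketched in a few lines. This is an illustrative example only, not Lablup's actual implementation; the class, method names, and failure labels are hypothetical, and the real system operates on live GPU workers rather than the placeholder used here.

```python
import time

class TrainingSupervisor:
    """Hypothetical sketch of a fault-tolerant restart loop: classify a
    failure, and if it is recoverable, relaunch the worker within a
    fixed time budget (the press release cites under three seconds)."""

    RECOVERABLE = {"gpu_error", "nccl_timeout"}

    def __init__(self, restart_budget_s=3.0):
        self.restart_budget_s = restart_budget_s
        self.restarts = 0

    def handle_failure(self, error_kind):
        # Non-recoverable faults are escalated instead of retried.
        if error_kind not in self.RECOVERABLE:
            return False
        start = time.monotonic()
        self._restart_worker()
        self.restarts += 1
        # Report whether recovery stayed inside the time budget.
        return (time.monotonic() - start) < self.restart_budget_s

    def _restart_worker(self):
        # Placeholder: a real system would relaunch the failed rank
        # from the last checkpoint and rejoin the collective group.
        pass

supervisor = TrainingSupervisor()
print(supervisor.handle_failure("nccl_timeout"))  # recoverable
print(supervisor.handle_failure("disk_full"))     # not recoverable
```

The key design point is separating failure classification from the restart mechanism, so new failure modes can be made recoverable without touching the relaunch path.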
Hands-On Demos: Backend.AI Continuum and Backend.AI:GO
Visitors to Lablup’s booth (#243) can take part in live demonstrations showcasing Backend.AI Continuum’s model routing and resilience features. For example, when a network cable is unplugged to simulate a fault, inference requests seamlessly reroute through alternate paths in real time. Even if a cloud connection drops, Continuum keeps API calls flowing by switching instantly to local resources, letting attendees experience its fault tolerance firsthand.
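The cloud-to-local failover behavior demonstrated at the booth can be illustrated with a minimal routing sketch. This is not Continuum's API; the class and the endpoint/health-check structure are assumptions made purely to show the idea of falling back through an ordered preference list.

```python
class FailoverRouter:
    """Hypothetical sketch: try endpoints in preference order (cloud
    first, local last) and route to the first one that is healthy."""

    def __init__(self, endpoints):
        # endpoints: ordered list of (name, health_check) pairs,
        # where health_check is a zero-argument callable -> bool.
        self.endpoints = endpoints

    def route(self, request):
        for name, is_healthy in self.endpoints:
            if is_healthy():
                return name, request
        raise RuntimeError("no healthy endpoint available")

# Simulate the demo: the cloud link is down, so requests fall back
# to local resources without the caller doing anything different.
router = FailoverRouter([
    ("cloud", lambda: False),  # unplugged network cable
    ("local", lambda: True),
])
print(router.route("inference-request")[0])  # → local
```

Because the caller only sees the `route` result, the failover is transparent, which is the property the booth demo makes tangible.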
Lablup will also present Backend.AI:GO, a local AI runtime that runs on devices ranging from personal laptops to high-end systems like the NVIDIA DGX Spark with 128GB unified memory. Backend.AI:GO provides a personal, high-performance AI environment that fully utilizes local hardware without depending on remote cloud access.
“Running a 504-GPU B200 cluster continuously for 73 days taught us what truly breaks in large-scale distributed training and how to rebuild it,” said Lablup CEO Jeongkyu Shin. “At GTC 2026, we’re excited to share those lessons and our vision for building sovereign AI infrastructure that enables nations and industries to operate AI on their own terms.”
About Lablup Inc.
Lablup Inc., founded in 2015, builds Backend.AI, a software-defined AI infrastructure platform that orchestrates heterogeneous GPU and NPU clusters at hyperscale with secure multi-tenancy and multi-node workload management. Its container-level GPU virtualization technology further maximizes accelerator utilization across training, inference, and deployment. Headquartered in Seoul with a U.S. subsidiary in San Jose, Lablup manages over 16,000 GPUs across 110+ sites worldwide, and Backend.AI is validated as NVIDIA DGX-Ready Software. Learn more at lablup.com.