Beyond model training, experiencing AI infrastructure

Taeeun KimResearch Intern

LablupInternship

May 29, 2026

Career

Beyond model training, experiencing AI infrastructure

Taeeun KimResearch Intern

LablupInternship

In the summer of 2025, from June to August, I worked as a research intern at Lablup. Rather than building AI models directly, Lablup focuses on building and improving the infrastructure needed for AI research and service operation. Centered around Backend.AI, the company works on areas such as GPU resource management, experiment environment setup, workload execution, and automation tools.

Before the internship, most of my experience with AI had been close to model training and optimization. I was more familiar with choosing models, training them, and improving their performance. The work I took on at Lablup, however, sat one layer outside of that. I analyzed GPU benchmarks, used FastTrack to identify areas for improvement, built a PyCon booth event page, and submitted pull requests to an open-source project. I also worked with internal experiment environments, recorded issues, documented results, and organized findings so the team could reuse them later.

In this post, I would like to summarize what I worked on during the internship and what I learned from dealing not with models themselves, but with the environments that run and operate them.

What I Worked On

My work as a research intern began with getting familiar with the Backend.AI ecosystem. I used WebUI and FastTrack, recorded points where issues appeared in realistic usage scenarios, and, when needed, reported issues or fixed them directly. At the same time, I worked on tasks closer to the research team's work, such as GPU benchmark analysis.

At first, the structure of Backend.AI and FastTrack felt unfamiliar. But as I ran workloads, changed configurations, and compared results, I began to see what matters in AI infrastructure. Model performance is important, but so are the stability of the execution environment, repeatable experiments, interpretable metrics, and tool usability.

The main projects I worked on during the internship fell into three areas:

analyzing GPU benchmark results and improving performance metrics
fixing bugs and usability issues I found while using FastTrack
building a feedback collection page for the PyCon 2025 booth event

In addition to these tasks, documenting the process and results, discussing them with teammates, and making sure they could lead to follow-up work were also important parts of the internship. I started out mostly as a user of the product, but over time I became more involved in reproducing issues, narrowing down causes, suggesting fixes, and implementing changes directly.

Benchmarking GPU Workloads

One of my main projects was GPU benchmark analysis. The goal was to evaluate how different configurations affected benchmark performance and to begin automating parts of the process through FastTrack. This was not just a matter of running benchmarks. I had to compare results across different settings, investigate why some results varied, and check stability through repeated runs.

Benchmarking AI workloads sounds straightforward at first: run a workload, collect numbers, and compare the results. In practice, the results can vary for reasons that are not always obvious. Network latency, run-to-run variance, request patterns, and system-level conditions can all affect the metrics. A single average value is often not enough to explain what is happening.

This led me to examine the consistency of benchmark results across multiple runs and to think more carefully about what should be measured. I experimented with reducing the impact of network latency and checked how stable the results remained when running the same configuration multiple times. In addition to throughput and latency, variance became an important signal. For inference workloads in particular, per-request token throughput variance can reveal instability that aggregate metrics may hide.

I documented these experiments and analyses in Lablup's internal Knowledge Base so the team could refer back to them later. Rather than only collecting results in tables, I also recorded which conditions produced which differences, which results needed further verification, and what should be considered when automating the process later. I then added per-request token throughput variance metrics to the research team's benchmarking script. It was a small contribution, but it made benchmark results easier to interpret beyond a single representative number.

The biggest lesson I took from this work is that benchmarking is not simply about producing faster numbers. What matters is making the measurement process itself trustworthy. For AI infrastructure, reliable benchmarks require repeatability, enough context to explain variance, and automation that reduces inconsistency from manual work.

FastTrack Usability and Open-Source Contribution

While working on GPU benchmarks, I used FastTrack frequently. FastTrack helps users configure experiments and automation tasks more easily, and as I used it in practice, I naturally started paying attention not only to its features but also to its user flow and interface. Following the steps needed for benchmark automation made it easier to notice which inputs were confusing, which states needed to be clearer, and which behaviors felt different from what a user might expect.

When you use a tool directly, small frictions can feel larger than they look. A confusing label, a missing state, or an unclear workflow may seem minor in isolation. But when a user is trying to run an experiment or manage resources, these small points of friction can affect both speed and accuracy. In infrastructure software, usability is not separate from technical quality.

Early in the internship, I reported a small issue related to a label in a FastTrack modal. The issue was discussed and addressed quickly. The change itself was small, but it showed me how a problem found during actual use could turn into a product improvement.

FastTrack interface

As I continued using FastTrack, I found more bugs and usability improvements, and eventually began contributing directly to the project. During the internship, I submitted six pull requests to FastTrack, resolved bug reports, and participated in discussions about improving the user experience. At first, I mainly organized issues and asked questions. Gradually, I became able to read through the codebase, understand the scope of a change, and prepare fixes myself.

Through this process, I learned that developer tools improve through small, concrete iterations. Each issue report, code review, and pull request becomes part of a larger product feedback loop. This loop is especially important in open-source infrastructure projects, where the boundary between user, contributor, and maintainer can be fluid.

Building a Feedback Collection Page for PyCon

As PyCon 2025 approached, I also worked on an event page for Lablup's booth. The page was designed to let participants compare two AI-generated blog outputs with one written by a human and submit their feedback. Rather than being a simple promotional page, it was closer to a small experiment tool: participants needed to read multiple outputs, make a selection, and have their responses stored in a form that could be analyzed later.

I built the page with Next.js, Tailwind CSS, and shadcn/ui. Google Apps Script handled data storage in Google Sheets, and the page was deployed on AWS Amplify. Because the work involved the frontend interface, response submission flow, Google Sheets integration, and deployment setup, even a relatively small application became a useful exercise in building a lightweight data collection workflow for a real event environment.

PyCon booth feedback page

More than 200 people took part in the survey, and over a thousand responses were collected. The experiment process and its results were also presented at an academic conference. From an engineering perspective, this project showed that even a simple application reveals many operational concerns the moment you start considering real deployment: a clear interface, consistently structured data collection, quick deployment, and readiness for traffic spikes on the day of the event all mattered.

This experience also connected back to the broader theme of the internship. Whether the task is benchmarking GPU workloads or collecting human preference data, the system around the result matters. Tooling, deployment, data storage, and user experience all shape the quality of the final outcome.

Open Source and Community Context

Lablup's engineering culture also gave me a closer look at how open-source communities and infrastructure products influence each other. AI infrastructure does not exist separately from the broader cloud-native ecosystem. It builds on shared ideas around containers, orchestration, scheduling, observability, and reproducible operations.

I had the opportunity to attend KubeCon + CloudNativeCon 2025, where Lablup was a sponsor. The conference made the connection between cloud-native infrastructure and AI infrastructure much clearer. That experience also led to my first open-source contribution to Kubernetes later on. I still actively contribute within the SIG Docs community, and I have been exploring cloud-native technologies more deeply by running a small homelab built with Raspberry Pi clusters.

Later, Lablup supported Emory University's CS hackathon by providing GPU resources and access to Backend.AI. In that environment, the importance of onboarding and usability became very clear. Participants had to listen to my short explanation, understand the platform within a limited amount of time, run workloads, and troubleshoot issues.

Emory University hackathon

These experiences helped me see infrastructure platforms as interfaces between complex computing resources and the people who want to use them. Good infrastructure should make powerful systems more accessible. I also came to understand how much Lablup values building a culture that actively encourages open-source contribution and community participation.

Looking Back

My internship at Lablup changed how I think about AI engineering. Before, I mostly approached AI through the lens of model development. Working on benchmarking, FastTrack contributions, event page development, and open-source workflows showed me that the infrastructure around models is just as important as the models themselves.

The biggest takeaway was that AI infrastructure requires measurement, automation, usability, and community feedback together. Reliable systems are not built only by optimizing performance. They also require making results interpretable, tools usable, and workflows repeatable.

For anyone entering the AI field, it is easy to focus only on model architecture and training techniques. Those topics are important, but they are only part of the picture. As AI workloads become larger and more widely deployed, the systems that support them will continue to shape what teams can actually build, test, and operate.