Infrastructure Engineer & Inventory Manager
- About Lablup
We are a company providing machine learning platforms and solutions under the motto 'Make AI Accessible'. Lablup strives to solve, through a standardized platform, the various problems that arise when sharing scientific and engineering research processes, and to automate all stages of research and service. We develop and service Backend.AI to realize this vision. Backend.AI, a hyperscale AI platform, reduces the difficulty of developing and serving AI models of every size, from small-scale models to large language models.
- About Infrastructure Engineer & Inventory Manager at Lablup
Lablup’s Infrastructure Engineer and Inventory Manager is responsible for the efficient management of our high-performance computing and AI infrastructure. This includes comprehensive oversight of our hardware resources, such as Linux servers for AI acceleration and CPU computing, high-speed networking infrastructure, and high-performance storage systems. The role also involves forecasting demand based on our business environment and communicating effectively with various internal departments and external partners.
- Key Responsibilities
- Server and PoC Infrastructure Management: Responsible for the regular maintenance of Linux servers used for AI and CPU computing, as well as the monitoring of new equipment intake and release. Duties also include the installation and maintenance of AI accelerators, and the evaluation and adoption of new technologies as needed.
- High-Speed Network Infrastructure Maintenance: Tasks include the installation and configuration of InfiniBand and RoCE systems, management and performance optimization of RoCE equipment, maintenance of other network connectivity solutions, and ongoing network performance monitoring and troubleshooting.
- High-Performance Storage System Administration: Oversee the management and maintenance of high-performance storage systems utilized by Lablup, such as WekaFS, Lustre, CephFS, and Pure Storage.
- Cloud & Network Infrastructure Management: Maintain AWS VPC security policies and routing configurations, manage Direct Connect between AWS VPC and on-site physical networks, oversee integrated VPN solutions for ACL-based connectivity between multiple subnets, and maintain various network software and routers, including MikroTik and Dell devices. Responsibilities also include the upkeep of firewalls and management of security policies.
- Basic Qualifications
- Experience in managing Linux servers
- Experience with cloud infrastructure (AWS, Azure, GCP, etc.)
- Practical knowledge of network configuration and security
- Experience in automation using Python or shell scripting
- Experience building Docker images
- Experience operating on-premises VM infrastructure such as OpenStack
- Preferred Qualifications
- Understanding of AI/ML workload execution environments
- Experience in developing or implementing VPN solutions
- Experience configuring and managing hybrid traffic between VPC and on-premises environments using solutions such as AWS Direct Connect or Azure ExpressRoute
- Experience in deploying and managing GPU servers
- Experience building and maintaining Docker image farms for GPU workloads
- Hands-on experience with high-performance parallel file systems such as WekaFS, Lustre, or CephFS
- Practical experience with high-performance networking environments, including InfiniBand and RoCE
- Experience operating InfiniBand and RoCE equipment from vendors such as Mellanox and Dell
- Experience managing team-based infrastructure access control using Azure AD
- Benefits and Perks
- Flexible remote work options
- Latest Mac-based equipment
- Support for fitness expenses
- Lunch allowance provided
- Welfare points program
- Annual health checkups for employees and their spouses
- Premium coffee beans and top-quality coffee machines
- Active encouragement of participation in developer communities