How HAMi Tackles Two Major AI Infra Challenges: AI Device Utilization and Heterogeneous Device Management

Presentation · Open Source AI Track (LLM)
  • 李孟轩
    • Architect, 4Paradigm


With AI's growing popularity, Kubernetes has become the de facto AI infrastructure. However, the growing number of clusters running diverse AI devices (e.g., NVIDIA, Intel, Huawei Ascend) presents major challenges. AI devices are expensive, so how can their utilization be improved? How can they be better integrated with Kubernetes clusters? Managing heterogeneous AI devices consistently, supporting flexible scheduling policies, and providing observability all raise difficult questions. The HAMi project was born to address them. This session covers:

  • How K8s manages heterogeneous AI devices (unified scheduling, observability)
  • How to improve device utilization through GPU sharing
  • How to guarantee the QoS of high-priority tasks in GPU-sharing scenarios
  • How to support flexible GPU scheduling strategies (NUMA affinity/anti-affinity, binpack/spread, etc.)
  • Integration with other projects (such as Volcano, scheduler-plugins, etc.)
  • Real-world case studies from production users
  • Remaining challenges and the project roadmap
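To make the GPU-sharing idea above concrete: with HAMi installed, a pod requests a slice of a GPU through extended resource names. The sketch below uses the resource names from HAMi's documentation (`nvidia.com/gpu`, `nvidia.com/gpumem`, `nvidia.com/gpucores`); the pod name, image, and specific limit values are illustrative only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo            # illustrative name
spec:
  containers:
    - name: cuda-worker
      image: nvidia/cuda:12.4.0-base-ubuntu22.04   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1       # number of (virtual) GPUs requested
          nvidia.com/gpumem: 4096 # device memory cap for this container, in MiB
          nvidia.com/gpucores: 30 # percentage of the GPU's compute this container may use
```

Because each container declares only a fraction of the device's memory and cores, several such pods can be scheduled onto the same physical GPU, which is how sharing raises overall utilization.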

HAMi is currently the only CNCF sandbox project that focuses on this area.