Running LLM/GenAI training workloads on cloud-based infrastructure can be fairly involved: there are multiple considerations to factor in around compute, networking, scaling, costs, and more. In this talk at the Scale by the Bay conference, Oleg Avdeëv and Riley Hun went into the details of this topic and discussed how Metaflow enables such workloads at Autodesk. Check it out!
Oh hey, the talk Riley Hun and I gave at SBTB on training GenAI models on Kubernetes (but not only Kubernetes) is on YouTube, check it out! (not sure why their GenAI avatar generator decided to make my eyes so tiny...) https://lnkd.in/gjbQKiMH
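For a rough feel of the kind of workflow the talk covers (this is not code from the talk), here is a minimal Metaflow sketch that offloads a training step to Kubernetes. It assumes Metaflow is installed and already configured against a Kubernetes cluster; the flow name, resource sizes, and placeholder training logic are illustrative only.

```python
# A minimal sketch, assuming Metaflow is installed and a Kubernetes
# cluster is already configured for it. Flow name, resource sizes, and
# the placeholder "training" logic are illustrative, not from the talk.
from metaflow import FlowSpec, step, kubernetes


class TrainGenAIFlow(FlowSpec):

    @step
    def start(self):
        # Hypothetical setting; a real flow would expose Parameters.
        self.base_model = "some-base-model"
        self.next(self.train)

    # Ask Metaflow to schedule this step on a Kubernetes node with a GPU.
    @kubernetes(gpu=1, cpu=8, memory=32000)
    @step
    def train(self):
        # Placeholder for the actual fine-tuning code.
        print(f"Fine-tuning {self.base_model} on a cluster GPU")
        self.next(self.end)

    @step
    def end(self):
        print("Training flow finished")


if __name__ == "__main__":
    TrainGenAIFlow()
```

Running `python train_flow.py run` executes the flow; only the step carrying the `@kubernetes` decorator is shipped to the cluster, while the rest runs wherever the flow is launched from.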