Distributed training is crucial for accelerating training and for handling models that can't fit on a single GPU. PyTorch simplifies this with data parallelism and various model parallelism techniques. When diving into distributed training with PyTorch, you'll frequently encounter PyTorch Elastic (Torch Elastic).

Our new blog, brought to you by Hagay Sharon and Ekin Karabulut, introduces PyTorch Elastic jobs, how they differ from regular distributed jobs, and how to leverage them as a Run:ai user. Check out the full article here --> https://lnkd.in/da9y_NRH

#MLblog #aiblog #newblogalert #aiinfrastructure #AIOps #mlops #AIDevOps #GPUComputing #AIScaling #MachineLearningInfrastructure #runai #ml
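To make that concrete, here is a minimal sketch of a DDP training script that runs under torchrun, the launcher behind Torch Elastic. The model, data, and hyperparameters are placeholders, not anything from the Run:ai blog:

```python
# Minimal DDP training sketch that runs under torchrun / Torch Elastic.
# The Linear model and random batches are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).to(f"cuda:{local_rank}")  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(32, 128, device=f"cuda:{local_rank}")  # placeholder batch
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()          # DDP all-reduces gradients across workers here
        optimizer.step()
        # For elastic/fault-tolerant jobs, periodically save a checkpoint so a
        # restarted worker group can resume instead of starting from scratch.

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nnodes=1:4 --nproc_per_node=8 train.py`, the `1:4` range lets the job keep running as nodes join or drop out, provided the script handles checkpoint save and restore.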
Run:ai’s Post
More Relevant Posts
-
TensorFlow vs. PyTorch: Which Should You Choose? https://zurl.co/LCNS Choosing between TensorFlow and PyTorch hinges on your specific needs and objectives in machine learning. TensorFlow, with its extensive ecosystem, robust performance optimization, and scalability, excels in production environments and large-scale applications. #TensorFlowvsPyTorch #TensorFlow #PyTorch #MachineLearning #NaturalLanguageProcessing #AnalyticsInsight #AnalyticsInsightMagazine
-
Principal AI Engineer at HTCD, AI-First Cloud Security | Knowledge Graphs | LLM Post Training | Handling Large Scale AI Infrastructure
Explain Infini-Attention in layman's terms.

Recently, Google introduced a paper on Infini-Attention, which marks an important milestone toward achieving infinite context. While I don't believe true infinite context will ever be achieved, I think a very long context length, sufficient for most industry use cases, is within reach. Context length refers to the number of tokens a Large Language Model (LLM) can process at any given time. Support for very long context lengths, efficient long-term memory retrieval, and the integration of agents represent what I believe to be the future.

But what is Infini-Attention? Let's understand it in layman's terms and why it's needed. In the vanilla attention mechanism, doubling the context length doesn't just double the memory and compute requirements; it quadruples them, because attention cost scales quadratically with sequence length, which is not feasible.

Imagine you're preparing for an exam and must read very lengthy books beforehand. There may be some parts of the books you learned just yesterday, and you'll be able to answer questions on those excellently. But what about the lessons you revised a week ago? Typically, we try to remember keywords, which is analogous to what is called compressed memory in Infini-Attention. You don't lose the context you've seen earlier; instead, you compress it and use it in subsequent steps.

Now, let's discuss the high-level technical overview of Infini-Attention, as outlined in the paper:

1. The Infini-Transformer operates on a sequence of segments. The method for deciding these segments is not clearly explained in the paper, but it likely involves segmentation during the training loop, with gradient accumulation after each segment.

2. For each segment, the previous global memory block is retrieved and added to the computation for the new segment. This approach aims to achieve an infinite context by always incorporating previous states into new computations.

I am currently working on implementing a CUDA-optimized PyTorch version of the Infini-Transformer, where the global compressed memory can utilize the GPUs' HBM, and local context states can use the SRAM in the streaming multiprocessors.

#llms #generativeai
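To make the compressed-memory idea concrete, here is a simplified, unofficial sketch of how one segment could be processed with a linear-attention-style memory; the feature map, gating (a fixed 0.5 blend instead of the paper's learned gate), and shapes are simplifications of what the paper describes:

```python
# Simplified, unofficial sketch of Infini-Attention's segment-level compressed memory.
import torch
import torch.nn.functional as F

def feature_map(x):
    # Non-negative feature map (ELU + 1), as in linear-attention-style memories
    return F.elu(x) + 1.0

def infini_attention_segment(q, k, v, memory, z):
    """Process one segment of shape (seq, d_head) given the running compressed memory.

    memory: (d_head, d_head) compressed summary of all earlier segments
    z:      (d_head,) running normalization term
    """
    # 1) Retrieve what the memory "remembers" about earlier segments
    sigma_q = feature_map(q)
    mem_out = (sigma_q @ memory) / (sigma_q @ z).clamp(min=1e-6).unsqueeze(-1)

    # 2) Ordinary causal attention within the current segment
    local_out = F.scaled_dot_product_attention(
        q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0), is_causal=True
    ).squeeze(0)

    # 3) Blend long-term and local outputs (the paper learns this gate)
    out = 0.5 * mem_out + 0.5 * local_out

    # 4) Fold the current segment into the compressed memory for later segments
    sigma_k = feature_map(k)
    memory = memory + sigma_k.T @ v
    z = z + sigma_k.sum(dim=0)
    return out, memory, z

# Usage over a long sequence split into segments (d_head = 64 is arbitrary):
# memory, z = torch.zeros(64, 64), torch.zeros(64)
# for q, k, v in segments:
#     out, memory, z = infini_attention_segment(q, k, v, memory, z)
```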
-
PyTorch is a widely-used open-source library for #machine #learning. At Arm, along with our partners, we've been enhancing PyTorch's inference performance over the past few years. In this blog, Ashok Bhat describes how PyTorch inference performance on Arm Neoverse has been improved using Kleidi technology, available in the Arm Compute Library and the KleidiAI library.
-
Data Scientist @Shell India | Kaggle 3 x Expert | Machine Learning | NLP | Data Visualization | Data Analysis
PyTorch is one of the most famous open-source ML and DL frameworks, developed by Meta. 💻 There are a lot of features that make PyTorch a preferable choice over other DL frameworks, which we will talk about in the upcoming posts. 😁 Installing PyTorch is straightforward: visit the official website, select your configuration, and get the exact command to install PyTorch on your machine. #backtobasics #day1 #pytorch #machinelearning #artificialintelligence #naturallanguageprocessing
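For example, once the command from the website has been run (typically a one-line pip or conda install), a quick sanity check confirms the install and GPU visibility:

```python
# Verify the PyTorch install and check whether a CUDA device is visible
import torch

print(torch.__version__)                  # installed PyTorch version
print(torch.cuda.is_available())          # True if a usable GPU build is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU
```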
-
Current state: learning more about PyTorch #IA #DeepLearning #MachineLearning 😋
-
Have been waiting for this for a while. The gains reported in the forum post are quite impressive: up to ~29% forward pass speedup and ~8% E2E speedup on Llama3 8B, and up to ~20% forward pass speedup and ~8% E2E speedup on Llama3 70B.
While async Tensor Parallelism is common among elite private large-scale training codebases, the PyTorch team has put together a public, accessible, and easily readable implementation. Pretty cool work from Yifu Wang, Horace He, Less Wright, Luca Wehrstedt, Tianyu Liu, and Wanchao L. Read more here: https://lnkd.in/eWWyDQJq
[Distributed w/ TorchTitan] Introducing Async Tensor Parallelism in PyTorch
discuss.pytorch.org
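For intuition only, here is a hand-rolled sketch of the idea async tensor parallelism exploits: decomposing a collective (here, an all-gather feeding a matmul) into per-shard steps so that communication for one shard overlaps with computation on another. This is not the PyTorch/TorchTitan implementation, which (per the post) works at the compiler level with torch.compile; the ring schedule below is just to illustrate the overlap:

```python
# Conceptual ring all-gather + matmul with comm/compute overlap (illustration only).
import torch
import torch.distributed as dist

def ring_all_gather_matmul(x_local, weight):
    """Compute all_gather(x_local, dim=0) @ weight, overlapping communication
    with computation by pipelining one shard at a time around a ring."""
    rank = dist.get_rank()
    world = dist.get_world_size()
    src = (rank - 1) % world
    dst = (rank + 1) % world

    outputs = [None] * world
    current = x_local.contiguous()
    for step in range(world):
        owner = (rank - step) % world
        if step < world - 1:
            recv_buf = torch.empty_like(current)
            # Post the next shard's send/recv before computing on the current one,
            # so the communication runs while the matmul executes
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, current, dst),
                dist.P2POp(dist.irecv, recv_buf, src),
            ])
        outputs[owner] = current @ weight  # compute on the shard we already hold
        if step < world - 1:
            for req in reqs:
                req.wait()
            current = recv_buf
    return torch.cat(outputs, dim=0)
```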
-
Senior Research Scientist @ NielsenIQ | Creator of Slackker (PyPi Package) | Former Adjunct faculty @ upGrad
Want to understand how PyTorch works, but the documentation seems a little too complicated? Here's an interesting article I found. Link: https://lnkd.in/dHCuJBdt #pytorch #opensource #deeplearning
-
Technology Leader, Experience in Banking & Financial Services & Manufacturing Solutions - Digital Transformation, Digitization, AI, Data, Automation, Cloud Solutions
Big day for #GenAI with #AWS! Anthropic has launched their latest #Claude3 models which are available for easy use on Amazon #Bedrock. Claude 3 exceeds existing models such as GPT-4 and Gemini Ultra on standardized evaluations such as math problems, programming exercises, and scientific reasoning. What will you build with it? https://lnkd.in/di4rwa_4
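If you want to try it from Python, a minimal invocation looks roughly like the sketch below; the region, model ID, and request/response shapes are my assumptions based on Bedrock's Anthropic Messages format, so check the current docs before relying on them:

```python
# Rough sketch: calling a Claude 3 model on Amazon Bedrock via boto3.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "Summarize what Claude 3 is good at."}]}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID
    body=json.dumps(body),
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```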
-
Let's say you are training an LLM with custom datasets that include billions of tokens. In my use case, sequences of DNA strings are tokenized at the character level with a vocabulary of A, C, G, and T. Because this tokenization is very simplistic, my datasets often explode to tens of gigabytes and billions of tokens, making I/O bottlenecks and memory issues a huge challenge.

Building an LLM on a single machine without distributed training becomes a major problem at this scale of tokens. How can I solve this infrastructure problem? How can you train this model faster and scale it to get the most out of GPU clusters?

My first-hand experience is with two main strategies: vertical scaling (increasing the capacity of a single machine) and horizontal scaling (adding more machines and dividing the data into smaller chunks across them). Given the limits of vertical scaling for my use case, horizontal scaling, specifically data sharding, is the better option.

With horizontal scaling, I divide the dataset into multiple smaller shards and distribute them across the GPU cluster with PyTorch distributed data parallel, as sketched below. This approach synchronizes and distributes the data more efficiently across the cluster, overcoming the I/O and memory limitations of vertical scaling. It's a practical solution for handling massive datasets and scaling model training effectively.

What are your thoughts on large-scale computing? #LLM #genAI #challenges #DataProblems
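As a rough illustration of the sharding setup described above (the dataset class, vocabulary handling, and sizes are made up for the example):

```python
# Sketch: shard a character-level DNA dataset across DDP workers with DistributedSampler.
# Assumes torch.distributed has already been initialized (e.g., by torchrun).
import torch
from torch.utils.data import Dataset, DataLoader, DistributedSampler

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
    def __init__(self, sequences, seq_len=512):
        self.sequences = sequences
        self.seq_len = seq_len

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        ids = [VOCAB[ch] for ch in self.sequences[idx][: self.seq_len]]
        ids += [0] * (self.seq_len - len(ids))  # pad short sequences (illustrative only)
        return torch.tensor(ids, dtype=torch.long)

def build_loader(sequences, batch_size=32):
    dataset = DNADataset(sequences)
    # Each rank reads only its own disjoint slice of the dataset indices
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, num_workers=4)

# In the training loop, call loader.sampler.set_epoch(epoch) every epoch so shuffling
# changes across epochs while shards stay disjoint across ranks.
```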
-
Enthusiast in VLSI, Embedded Systems, and Digital Electronics | "Aspiring Expert in Core Engineering"✨💨
"Thrilled to share that I've earned my Machine Learning on Arm Certificate from edX! Excited to apply these new skills in my professional journey and explore the limitless possibilities of machine learning on Arm architecture. #MachineLearning #AI #edX #ProfessionalDevelopment #TechSkills"