New software lets you run a private AI cluster at home with networked smartphones, tablets, and computers — Exo software runs Llama and other AI models

exo allows you to run an AI cluster at home
(Image credit: Shutterstock)

Big AI developers like OpenAI, Google’s Gemini team, and the folks behind Microsoft Copilot have massive data centers at their disposal for AI workloads. Thanks to the work of a team of developers, new software could let you run your own AI cluster at home using your existing smartphones, tablets, and computers.

The experimental exo software splits up a large language model (LLM) across some or all of the computing devices in your home, so you can run your personal chatbot or other AI project locally. Those devices can include Android phones and tablets, as well as computers running macOS or Linux.

The result is that your various devices work together and appear to the AI model like one powerful GPU. The developer showed off a demo of the software running Llama-3-70B at home using an iPhone 15 Pro Max, an iPad Pro M4, a Galaxy S24 Ultra, an M2 MacBook Pro, an M3 MacBook Pro, and two MSI Nvidia RTX 4090 graphics cards.
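Conceptually, each device takes responsibility for a contiguous slice of the model's layers and hands its activations to the next device in line. The toy Python sketch below illustrates only that idea; the device names, shard boundaries, and stand-in "layers" are invented for illustration and are not exo's internals.

# Toy illustration of pipelined inference across devices (not exo's code).
# Each "device" owns a contiguous slice of the model's layers; activations
# flow from one shard to the next until the full forward pass is complete.

NUM_LAYERS = 8

def make_layer(i):
    # Stand-in for a real transformer layer: it just tags the activation.
    return lambda x: x + [f"layer{i}"]

layers = [make_layer(i) for i in range(NUM_LAYERS)]

# Hypothetical shard boundaries: phone gets layers 0-1, tablet 2-4, desktop 5-7.
shards = {"phone": layers[0:2], "tablet": layers[2:5], "desktop": layers[5:8]}

def run_shard(device, activation):
    for layer in shards[device]:
        activation = layer(activation)
    return activation

activation = ["token_embedding"]  # pretend input
for device in ("phone", "tablet", "desktop"):
    activation = run_shard(device, activation)  # hand off to the next device

print(activation)  # every layer ran, in order, across the three shards

In a real cluster the hand-off happens over the network, and the shards have to be sized to fit each device's memory, which is where the partitioning strategy described below comes in.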

The exo software is compatible with Llama and other popular AI models. It also exposes a ChatGPT-compatible API, so running models on your own hardware is a one-line change in your application. You only need compatible devices running Python 3.12.0 or higher to install and run the software.
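If you already have code written against the ChatGPT API, that one-line change amounts to pointing it at your local cluster instead of OpenAI's servers. Here is a minimal sketch, assuming the endpoint is served at http://localhost:8000 and a model named llama-3-8b is loaded; check exo's README for the exact port and model identifiers your install uses.

# Minimal sketch of calling a ChatGPT-compatible endpoint on the local cluster.
# The port (8000) and model name ("llama-3-8b") are assumptions; adjust them
# to match your own exo setup.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3-8b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "temperature": 0.7,
    },
    timeout=600,  # large models on small devices can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])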

Once installed and running, exo automatically discovers other devices on your network and adds them to the cluster. Devices are treated as equal peers over peer-to-peer connections rather than in a master-worker arrangement. While exo supports various partitioning strategies for distributing work across devices, it defaults to a ring memory-weighted scheme that allocates the workload based on how much memory each device has.
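As a rough illustration of what memory-weighted partitioning means (a sketch of the general idea, not exo's actual algorithm), each device could be handed a contiguous range of the model's layers in proportion to its share of the cluster's total memory, with the last device handing results back to the first to close the ring:

# Sketch of memory-weighted layer allocation (illustrative only, not exo's code).
# Each device receives a contiguous range of layers proportional to its memory.

def partition_layers(device_memory_gb, num_layers):
    total = sum(device_memory_gb.values())
    ranges, start = {}, 0
    devices = list(device_memory_gb.items())
    for i, (name, mem) in enumerate(devices):
        # The last device takes whatever remains so every layer is covered.
        end = num_layers if i == len(devices) - 1 else start + round(num_layers * mem / total)
        ranges[name] = (start, end)
        start = end
    return ranges

# Hypothetical cluster; the memory figures are made up for illustration.
cluster = {"iphone": 8, "ipad": 16, "macbook_m3": 36, "rtx_4090_pc": 24}
print(partition_layers(cluster, num_layers=80))  # Llama-3-70B has 80 layers

In this made-up example the devices with more memory end up hosting proportionally more of the model, so no single machine has to hold the whole 70-billion-parameter network on its own.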

The exo software also supports iOS, but the developers say that code needs more work before it is ready for mainstream use. They have pulled the iOS version for now, but will provide access to those who email the lead developer.

The developers plan further refinements and features, and they run a bounty program to reward contributions of new features and compatibility. As of this writing, open bounties included support for LLaVA, batched requests, a radio networking module, and pipeline parallel inference. In its current state, exo already looks like a cool project to experiment with.

Jeff Butts
Contributing Writer

Jeff Butts has been covering tech news for more than a decade, and his IT experience predates the internet. Yes, he remembers when 9600 baud was “fast.” He especially enjoys covering DIY and Maker topics, along with anything on the bleeding edge of technology.

  • mikeebb
    At what point can I start claiming a household AI as a dependent on my taxes? 🤑

    Much too early. Getting more coffee...
  • Peksha
    Viral AI mining is now on your phone...
  • evdjj3j
    "The developer showed off a demo of the software running Llama-3-70B at home using an iPhone 15 Pro Max, an iPad Pro M4, a Galaxy S24 Ultra, an M2 MacBook Pro, an M3 MacBook Pro, and two MSI Nvidia GTX 4090 graphics cards."

    You could buy a really nice proper AI gpu for the price of all that hardware.
  • Peksha
    evdjj3j said:
    "The developer showed off a demo of the software running Llama-3-70B at home using an iPhone 15 Pro Max, an iPad Pro M4, a Galaxy S24 Ultra, an M2 MacBook Pro, an M3 MacBook Pro, and two MSI Nvidia GTX 4090 graphics cards."

    You could buy a really nice proper AI gpu for the price of all that hardware.
    Amazing achievements, especially since the llama3-70b model runs perfectly on only 24 GB GPU + 64 GB CPU ;)
  • jp7189
    Peksha said:
    Amazing achievements, especially since the llama3-70b model runs perfectly on only 24 GB GPU + 64 GB CPU ;)
I'd like to know what speeds they are getting. I get around 2 tokens per second with llama3-70b 8Q using a 4090 for prompt ingestion and a 9654 with 768GB for inference. If they got, say, 5 tok/sec I would be impressed, or if they did it using an un-quantized model, that would be impressive too.
  • Peksha
    jp7189 said:
I'd like to know what speeds they are getting. I get around 2 tokens per second with llama3-70b 8Q using a 4090 for prompt ingestion and a 9654 with 768GB for inference. If they got, say, 5 tok/sec I would be impressed, or if they did it using an un-quantized model, that would be impressive too.
    ollama run llama3:70b --verbose
    >>> why is the sky blue
    ...
    total duration: 2m2.554287s
    load duration: 12.1295ms
    prompt eval count: 15 token(s)
    prompt eval duration: 2.179578s
    prompt eval rate: 6.88 tokens/s
    eval count: 401 token(s)
    eval duration: 2m0.361818s
    eval rate: 3.33 tokens/s

    7900xtx+7950x3d@64GB