Artificial Analysis

Technology, Information and Internet

Independent analysis of AI models and hosting providers: https://artificialanalysis.ai/

About us

Leading provider of independent analysis of AI models and providers. Understand the AI landscape to choose the best AI technologies for your use-case.

Website
https://artificialanalysis.ai/
Industry
Technology, Information and Internet
Company size
11-50 employees
Type
Privately Held

Employees at Artificial Analysis

Updates

  • View organization page for Artificial Analysis

    6,413 followers

    Thanks for the support Andrew Ng! Completely agree, faster token generation will become increasingly important as a greater proportion of output tokens are consumed by models, such as in multi-step agentic workflows, rather than being read by people.

    Andrew Ng

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of Landing AI

    Shoutout to the team that built https://lnkd.in/g3Y-Zj3W . Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM that focus more on the quality of the outputs. I hope benchmarks like this encourage more providers to work on fast token generation, which is critical for agentic workflows!

    Model & API Providers Analysis | Artificial Analysis

    artificialanalysis.ai


    Initial results from our AI video generation model arena are in! With almost 20k votes, we now have an initial ranking of video generation models.

    🥇 MiniMax's Hailuo is the clear leader with an ELO of 1092 and a win rate of 67%
    🥈 Genmo's Mochi 1 model, released last week, takes the silver and is the leading open-source video generation model
    🥉 Runway, a long-time leader in the video generation model space, takes bronze with Runway Gen 3 Alpha, which has an ELO of 1051 and a win rate of 61%

    The Video Arena provides a comparison of video generation models across a wide variety of prompts. Each model has unique strengths, and so we encourage you to test them based on your specific use case.

    Link below to contribute to the Artificial Analysis Video Arena 👇 After 30 votes you will also be able to see your own personalized ranking of the video generation models - feel free to share yours below in the comments.
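The Elo figures above are aggregated from pairwise human votes. As a rough sketch of how such ratings update after a single head-to-head vote (the K-factor of 32, the starting ratings, and the specific matchup are illustrative assumptions, not the arena's actual parameters):

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one vote. K=32 is an assumed update size."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# A 1092-rated model beating a 1051-rated one gains only a few points,
# since it was already expected to win more often than not.
new_a, new_b = elo_update(1092, 1051, a_won=True)
```

Note that the total rating is conserved: whatever one model gains, the other loses, which is why thousands of votes are needed before the ordering stabilises.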


    Inference optimization techniques can cause different prompts to run at different speeds. For example, speculative decoding uses a smaller draft model to generate speculative tokens for an LLM to verify. One implication of speculative decoding is that ‘simple’ prompts can get even faster speeds than normal/harder prompts! This occurs when a higher proportion of the draft model’s output tokens are accepted as correct by the target model. Below, you can see that for a prompt with simpler output tokens (repeating the Gettysburg Address), we see a much higher output speed than for a more complex prompt.
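The effect described above can be sketched with a toy simulation, where the acceptance probability stands in for prompt "simplicity" (the draft length, the acceptance rates, and the one-token-per-verification-pass accounting are illustrative assumptions, not any provider's actual implementation):

```python
import random

def speculative_decode_step(draft_tokens, accept_prob):
    """Toy model of one speculative-decoding step: the target model verifies
    draft tokens in order and stops at the first rejection, then contributes
    one token of its own. Returns tokens produced by this verification pass."""
    accepted = 0
    for _ in range(draft_tokens):
        if random.random() < accept_prob:
            accepted += 1
        else:
            break
    # The target model's pass always yields at least one token:
    # either the correction, or a bonus token after full acceptance.
    return accepted + 1

def simulate_speedup(draft_tokens, accept_prob, steps=10_000, seed=0):
    """Average tokens produced per target-model pass; 1.0 means no gain."""
    random.seed(seed)
    total = sum(speculative_decode_step(draft_tokens, accept_prob)
                for _ in range(steps))
    return total / steps

# 'Simple' text (high acceptance rate) yields far more tokens per
# target-model pass than 'hard' text, matching the observed speed gap.
easy = simulate_speedup(draft_tokens=4, accept_prob=0.9)
hard = simulate_speedup(draft_tokens=4, accept_prob=0.4)
```

With a 4-token draft, a 90% acceptance rate averages roughly 4 tokens per target-model pass versus under 2 at 40%, which is the mechanism behind the speed difference shown in the benchmark.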


    Cerebras has launched a major upgrade and is now achieving >2,000 output tokens/s on Llama 3.1 70B, >3x their prior speeds. This is a dramatic new world record for language model inference.

    Cerebras Systems' language model inference offering runs on their custom "wafer scale" AI accelerator chips. Cerebras had previously achieved speeds in this range for Llama 3.1 8B and is now delivering these speeds with a much larger model. We have independently benchmarked Cerebras' updated offering and can confirm that we have observed no quality degradation in the latest version of the API.

    We understand that Cerebras is achieving these speeds with a range of optimizations throughout their inference stack, including speculative decoding. Speculative decoding is an inference optimization technique that uses a smaller draft model to generate speculative tokens for an LLM to verify. It does not impact quality when implemented correctly.


    Tesla discloses in their Q3 Earnings Deck that they will have a 50k H100 cluster at Gigafactory Texas by the end of October. Putting this in context, Tesla’s new H100 cluster will be larger than the rumoured sizes of the clusters that have been used to train current frontier language models. Tesla’s cluster would likely be able to complete the original GPT-4 training run (~3 months on ~25k A100s) in less than three weeks.
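A back-of-envelope check of the "less than three weeks" figure, assuming a rough 2.5x per-GPU training speedup of H100 over A100 (an assumed multiplier; real speedups vary by workload and precision, and the cluster sizes are the rumoured figures from the post):

```python
# All figures are rough assumptions from the post, not measured values.
a100_gpus = 25_000          # rumoured original GPT-4 cluster size
a100_months = 3             # rumoured original training duration
h100_gpus = 50_000          # Tesla's disclosed cluster size
h100_vs_a100 = 2.5          # assumed per-GPU training speedup, H100 vs A100

a100_gpu_months = a100_gpus * a100_months          # total compute budget
h100_equivalent = h100_gpus * h100_vs_a100         # in A100-equivalents
months = a100_gpu_months / h100_equivalent
weeks = months * 30 / 7

print(f"~{weeks:.1f} weeks")   # roughly 2.6 weeks under these assumptions
```

Under these assumptions the run compresses from ~13 weeks to under 3, consistent with the claim above.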


    Stability AI released Stable Diffusion 3.5 yesterday. Below are comparisons of how Stable Diffusion has improved in the past year since SDXL in July 2023.

    We have also added Stability AI's Stable Diffusion 3.5 & the Turbo variant to our Image Arena. Our Image Arena crowdsources preferences to understand & compare the quality of image models - currently we have >800k preferences submitted.

    See Stable Diffusion 3.5 in our Image Arena, link below 👇


    Anthropic’s Claude 3.5 Sonnet leapfrogs GPT-4o, takes back the frontier and extends its lead in coding.

    Our independent quality evals of Anthropic's Claude 3.5 Sonnet (Oct 2024) upgrade yesterday confirm a 3 point improvement in Artificial Analysis Quality Index vs. the original release in June. The improvement is reflected across evals, particularly in coding and math capabilities. This makes Claude 3.5 Sonnet (Oct 2024) the top-scoring model that does not require the generation of reasoning tokens before beginning to generate useful output (i.e. excluding OpenAI’s o1 models).

    With no apparent regressions and no changes to pricing or speed, we generally recommend an immediate upgrade from the earlier version of Claude 3.5 Sonnet. Maybe Claude 3.5 Sonnet (Oct 2024) can suggest next time to increment the version number - 3.6?

    See below for further analysis 👇


    Announcing Artificial Analysis Video Arena - the first crowdsourced comparison for Text to Video models.

    Text to Video models are accelerating rapidly and crossing quality thresholds every month. We created Video Arena to compare them using the only source of truth for visual media - human preference!

    Video Arena includes hundreds of videos from the leading video models, including:
    - Runway Gen 3 Alpha
    - Pika 1.5
    - Luma AI's Dream Machine
    - MiniMax / Hailuo AI
    - KLING AI's Kling 1.0
    - Zhipu AI's CogVideoX-5B

    Voting is open now and we’ll be announcing the first leaderboard results within 24 hours. Any predictions? In the meantime, you can see your own ‘personal leaderboard’ of how you’ve ranked the video models after 30 votes.

    Contribute to the Video Arena! 🔗 https://lnkd.in/gXbjAjFE


    Groq has launched their endpoint of OpenAI's new Whisper Large v3 Turbo Speech-to-Text model! 💬

    OpenAI released a new 'Turbo' version of Whisper Large v3 last week, which is nearly 50% smaller than the non-Turbo Large v3 model, reducing the parameter count from 1.55B to 0.8B. The new Turbo model's word error rate is marginally higher than the Large v3 non-Turbo model at 12% vs. 10% in our evaluation. However, it is much faster and less compute-intensive given its smaller size, making it an attractive option for transcription use-cases that are speed-dependent.

    OpenAI has stated they achieved the reduction in size with marginal quality impact by reducing the number of decoder layers from 32 down to 4, and by further post-training - fine-tuning for another 2 epochs on multilingual transcription data.

    Groq with their launch today is allowing everyone to access these speed gains. We are benchmarking a Speed Factor of ~216x real-time, >6X faster than OpenAI's Whisper v2 endpoint and a ~15% gain over Groq's non-Turbo Large v3 endpoint. Well done Groq on the fast launch of an API endpoint for the model and allowing all to access it!

    Link to our Speech to Text analysis below 👇
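For reference, "Speed Factor" here means seconds of audio transcribed per second of wall-clock processing time. A quick sketch with illustrative numbers (the 10-minute recording is an assumed example, not a benchmark input):

```python
def speed_factor(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of processing time.
    A factor of 1.0 means transcription runs exactly at real time."""
    return audio_seconds / processing_seconds

# Assumed example: a 10-minute (600 s) recording transcribed in ~2.8 s
# corresponds to a Speed Factor of ~216x real time.
sf = speed_factor(600, 600 / 216)
```

At a 216x Speed Factor, an hour of audio takes under 17 seconds to transcribe, which is what makes the endpoint attractive for speed-dependent transcription workloads.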


    Groq has set a world record in LLM inference API speed by serving Llama 3.2 1B at >3k tokens/s 🏁

    Meta's Llama 3.2 3B and 1B models are well positioned for two categories of use-cases. Firstly, applications running on edge devices or on-device, where compute resources are limited. Secondly, use-cases which require very fast response times and/or very cheap token generation.

    Groq, with their custom LPU chips, is taking fast and cheap token generation to the extreme by serving the models at >3k tokens per second and pricing at $0.04/1M input/output tokens. To put this in context, this is ~25X faster than GPT-4o's API and ~110X cheaper.

    While the intelligence of these models is not comparable to the much larger frontier models, not all use-cases require frontier intelligence. Consumer apps which require real-time interaction and cheap token generation, live monitoring, and classification are all example use-cases which suit these smaller models.

    Link below for our analysis of how Llama 3.2 3B & 1B compare to other smaller models, and of the providers serving them 👇
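Working backwards from the multipliers above gives the implied reference figures for GPT-4o's API (these are inferred from the post's own ~25X and ~110X numbers, not independently measured values):

```python
# Figures stated in the post.
groq_tokens_per_s = 3_000
groq_price_per_m = 0.04        # USD per 1M input/output tokens

# Implied reference figures, derived from the post's multipliers.
implied_gpt4o_speed = groq_tokens_per_s / 25     # ~120 tokens/s
implied_gpt4o_price = groq_price_per_m * 110     # ~$4.40 per 1M tokens
```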

