Ever notice your standards for “good” vs. “bad” LLM outputs start to shift once you’ve seen more examples? That’s Criteria Drift—our evaluation rules evolve the moment unexpected outputs appear. It’s perfectly natural, but can quickly complicate consistency and alignment if we’re not prepared.

Why does this matter? 🤔
• Each new batch of outputs can reveal fresh failure modes, nudging us to redefine or add evaluation criteria.
• Without a solid process, you’ll constantly be playing catch-up, with unclear or ever-changing metrics.

The good news? 🙌 Our latest course breaks down strategies to manage Criteria Drift and keep your evaluations stable—so you always know what “good” looks like, no matter what your LLM throws at you.

Check it out below and safeguard your LLM evals from unplanned shifts!

🎓 LLM Apps: Evaluation Course is here: https://lnkd.in/gCHffA24
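One way to keep Criteria Drift visible instead of silent is to pin your evaluation rules to an explicit, versioned list of checks. Here is a minimal, hypothetical sketch (all names are illustrative, not code from the course): when a new batch of outputs reveals a fresh failure mode, you add a criterion and bump the version, rather than quietly shifting your judgment.

```python
# Hypothetical sketch: versioned pass/fail criteria make drift explicit.

def make_evaluator(criteria):
    """Return a pass/fail evaluator over a fixed list of (name, check) criteria."""
    def evaluate(output):
        failures = [name for name, check in criteria if not check(output)]
        return {"pass": not failures, "failed_criteria": failures}
    return evaluate

# v1: the criteria we started with.
criteria_v1 = [
    ("non_empty", lambda o: len(o.strip()) > 0),
    ("no_apology_loop", lambda o: o.lower().count("sorry") < 3),
]

# v2: a new batch of outputs revealed a fresh failure mode (leaked system
# prompts), so we add a criterion instead of silently moving the goalposts.
criteria_v2 = criteria_v1 + [
    ("no_prompt_leak", lambda o: "system prompt" not in o.lower()),
]

eval_v1 = make_evaluator(criteria_v1)
eval_v2 = make_evaluator(criteria_v2)

output = "Here is my system prompt: ..."
# Same output, different verdicts -- and the version history records why.
```

Because each verdict names the failed criteria, a changed verdict between versions points directly at the rule that changed, rather than at a reviewer's shifting intuition.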
Weights & Biases
Software Development
San Francisco, California 77,027 followers
The AI developer platform.
About us
Weights & Biases: the AI developer platform. Build better models faster, fine-tune LLMs, develop GenAI applications with confidence, all in one system of record developers are excited to use. W&B Models is the MLOps solution used by foundation model builders and enterprises who are training, fine-tuning, and deploying models into production. W&B Weave is the LLMOps solution for software developers who want a lightweight but powerful toolset to help them track and evaluate LLM applications. Weights & Biases is trusted by over 1,000 companies to productionize AI at scale, including teams at OpenAI, Meta, NVIDIA, Cohere, Toyota, Square, Salesforce, and Microsoft. Sign up for a 30-day free trial today at http://wandb.me/trial.
- Website
- https://wandb.ai/site
- Industry
- Software Development
- Company size
- 201-500 employees
- Headquarters
- San Francisco, California
- Type
- Privately Held
- Founded
- 2017
- Specialties
- deep learning, developer tools, machine learning, MLOps, GenAI, LLMOps, large language models, and llms
Products
Weights & Biases
Machine Learning Software
Weights & Biases helps AI developers build better models faster. Quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, and manage your ML workflows end-to-end.
Locations
- Primary
400 Alabama St
San Francisco, California 94110, US
Updates
-
DeepSeek AI, Stargate and AI's $600 Billion Question with Sequoia Capital's David Cahn

In this episode of Gradient Dissent, our CEO and Co-founder Lukas Biewald sits down with David Cahn, partner at Sequoia Capital, for a compelling discussion on the dynamic world of AI investments. They dive into recent developments, including DeepSeek and Stargate, exploring their implications for the AI industry.

Drawing from his articles, "AI's $200 Billion Question" and "AI's $600 Billion Question," David unpacks the financial challenges and opportunities surrounding AI infrastructure spending and the staggering revenue required to sustain these investments. Together, they examine the competitive strategies of cloud providers, the transformative impact of AI on business models, and predictions for the next wave of AI-driven growth.

This episode offers an in-depth look at the crossroads of AI innovation and financial strategy.

🎙️ Tune in here: https://lnkd.in/gzDbupv3
-
🚀 Groundbreaking AI Research with Reinforcement Learning, powered by Weights & Biases

Incredible advancements are being made in AI reasoning and problem-solving, and we’re thrilled to share this exciting achievement by Jiayi Pan, Xingyao Wang & Lifan Yuan. Using reinforcement learning (RL), they reproduced DeepSeek R1-Zero—a method that enabled a 3B parameter language model to develop self-verification and search abilities autonomously. This was demonstrated in Countdown, a game where players combine numbers and arithmetic to reach a target number.

Key highlights from their findings:
🔸 The model starts with dummy outputs and, over time, learns revision, search, and self-verification tactics—critical reasoning behaviors.
🔸 These abilities scale with model size, emerging at 1.5B parameters and beyond.
🔸 RL algorithms like PPO, GRPO, and PRIME all work well, showing robustness in the approach.
🔸 The training process costs less than $30, making this a highly accessible method for furthering RL research.

📊 Why it matters: Their work sheds light on how reinforcement learning can unlock reasoning behaviors in language models without extensive instruction fine-tuning. These insights could transform the way we approach problem-solving tasks across industries.

✨ Powered by Weights & Biases: This research leveraged Weights & Biases to log, monitor, and analyze the experiment results, enabling transparency and collaboration. Their experiment logs and findings are publicly available on W&B for others to explore and build upon: wandb.ai/jiayipan/TinyZero

Congratulations to the team on this incredible achievement! We’re proud to support research that makes advanced AI methods accessible and drives innovation in the field.

See Jiayi's thread that has over 650k views right now on X here: https://lnkd.in/gYCxbw5n
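What makes Countdown attractive for RL is that correctness is machine-checkable, so a simple programmatic reward can drive training. The sketch below is our own illustration of that idea, not the TinyZero team's actual reward code: it gives a reward of 1.0 only if the model's arithmetic expression uses exactly the given numbers and evaluates to the target.

```python
# Hypothetical sketch of a Countdown-style verifiable reward (illustrative
# only; not the actual TinyZero implementation).
import ast

def countdown_reward(expression, numbers, target):
    """Return 1.0 if `expression` reaches `target` using each number in
    `numbers` exactly once with + - * /, else 0.0."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return 0.0
    used = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant):
            if not isinstance(node.value, (int, float)):
                return 0.0
            used.append(node.value)
        # Reject anything beyond plain arithmetic (no names, calls, etc.).
        elif not isinstance(node, (ast.Expression, ast.BinOp, ast.UnaryOp,
                                   ast.Add, ast.Sub, ast.Mult, ast.Div,
                                   ast.USub)):
            return 0.0
    if sorted(used) != sorted(numbers):  # each number used exactly once
        return 0.0
    try:
        value = eval(compile(tree, "<expr>", "eval"))
    except ZeroDivisionError:
        return 0.0
    return 1.0 if value == target else 0.0
```

Because the reward is computed from the answer itself rather than from human labels, the model can be trained purely with RL, which is the setting the post describes.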
-
We have enormous goals for 2025, and we want YOU to be a part of them!

Weights & Biases is on the lookout for passionate Software Engineers and dynamic Sales professionals to help us build the best tools for AI developers. With over 1,000 customers (including OpenAI, NVIDIA, Microsoft, and Toyota Motor Corporation) and over $250M in funding, we’re on a mission to revolutionize machine learning and empower teams building the future of AI.

Explore all our open roles here: https://lnkd.in/gdrG-ien

Join us as we take on the most consequential challenges in AI, together.
-
Why does Pass/Fail work so well for LLM evaluations? It forces clarity. No more guessing the difference between a 3 and a 4. With Pass/Fail, every judgment is immediately actionable, which makes your system easier to improve.

Our LLM Apps: Evaluation course dives deeper into this framework, helping you create better systems and more effective GenAI apps.

📚: https://lnkd.in/gCHffA24
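To make the Pass/Fail idea concrete, here is a minimal sketch (names and the example check are ours, not the course's code): each example gets a binary verdict from an explicit rule, and the aggregate pass rate is directly actionable because every failure points at a specific example.

```python
# Minimal sketch of a pass/fail evaluation loop (illustrative names).

def pass_fail_eval(examples, check):
    """Score each (input, output) pair as pass (True) or fail (False)."""
    verdicts = [check(inp, out) for inp, out in examples]
    return {"pass_rate": sum(verdicts) / len(verdicts), "verdicts": verdicts}

# Example check for a summarizer: the output must be non-empty and shorter
# than the input -- a crisp binary rule instead of a fuzzy 1-5 rating.
examples = [
    ("a long article about model evaluation and criteria", "article on evals"),
    ("short note", ""),
]
result = pass_fail_eval(examples, lambda inp, out: 0 < len(out) < len(inp))
```

Here the second example fails, and the verdict list says exactly which one, so the next iteration of the app has a concrete target instead of an ambiguous score.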
-
Why does the order of words matter for LLMs? Two words: Position Bias.

LLMs rely on positional embeddings to determine “who did what to whom.” Without this positional context, words lose their relationships, making it nearly impossible to capture true meaning.

If you’re ready to dive deeper into these concepts—and more—check out our new, free, on-demand course: LLM Apps: Evaluation. In just 2 hours, you’ll learn how to:
- Build an evaluation pipeline for LLM applications.
- Leverage LLMs as evaluators to assess outputs programmatically.
- Minimize human input by aligning auto-evaluations with best practices.

By the end of the course, you’ll have hands-on experience, practical implementation methods, and a clear understanding of how to effectively evaluate and improve your GenAI apps.

Meet your expert instructors:
- Ayush Thakur – AI Engineer at Weights & Biases
- Anish Shah – AI Engineer at Weights & Biases
- Paige Bailey – AI Developer Relations Lead at Google
- Graham Neubig – Co-Founder at All Hands AI

Join us and take the next step in advancing your LLM expertise—one (positional) token at a time!

📚: https://lnkd.in/gCHffA24
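A tiny sketch of why positional embeddings carry "who did what to whom": below we add the classic sinusoidal positional encoding from the original Transformer to toy token embeddings (real LLMs often use learned or rotary variants, and the vectors here are made-up illustrations). Without the positional term, "dog bites man" and "man bites dog" would produce the same set of vectors, just reordered; with it, the same word gets a different representation at a different position.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding for one position (Transformer-style)."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

def embed(tokens, table, d_model=4):
    """Token embedding + positional encoding, summed elementwise."""
    return [
        [t + p for t, p in zip(table[tok], sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(tokens)
    ]

# Toy 4-dimensional embeddings, purely for illustration.
table = {"dog": [0.1] * 4, "bites": [0.2] * 4, "man": [0.3] * 4}
a = embed(["dog", "bites", "man"], table)
b = embed(["man", "bites", "dog"], table)
# "dog" at position 0 (in a) now differs from "dog" at position 2 (in b),
# so the model can distinguish the two sentences.
```

This is exactly the positional context the post describes: strip the `sinusoidal_position` term and the two sentences become indistinguishable as bags of vectors.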
-
Join us in Paris, France on January 22 for an exclusive panel discussion featuring industry leaders at the cutting edge of generative AI—Mistral AI, Thales, and NVIDIA. Discover how to harness GenAI to fuel innovation, enhance customer experiences, and accelerate growth.

What to expect
• Expert insights: Real-world applications and transformative advancements in generative AI from Mistral AI, Thales, and NVIDIA.
• Actionable strategies: Practical guidance on adopting AI to drive business results.
• Interactive discussions: Dive into ethical, regulatory, and technical considerations, and get your questions answered during the Q&A.

Featured speakers
• Adrien Bécue – AI & Cybersecurity Expert, Thales
• Richard Wright – EMEA DGX AI Platform Segment Sales Lead, NVIDIA
• Sophia Yang, Ph.D. – Head of Developer Relations, Mistral AI

Register here to secure your seat: https://lnkd.in/gNKxnz29
-
🚀 How do you build an autonomous programming agent that dominates SWE-bench Verified?

Our Co-Founder and CTO, Shawn Lewis, tackled this challenge and delivered an o1-based AI agent that now holds the new state-of-the-art, solving 64.6% of issues on SWE-bench Verified!

SWE-bench is the ultimate benchmark for autonomous programming agents. It evaluates an agent’s ability to autonomously read, write, test, and iterate on code in a real-world, GitHub-issue-like environment.

So, how did he achieve this? By combining OpenAI’s powerful o1 model, our W&B Weave toolkit, and relentless experimentation, including 977 logged evaluations. The result? Precise debugging, streamlined iteration, and groundbreaking results on SWE-bench.

This achievement reaffirms what we stand for at Weights & Biases: the belief that the BEST tools unlock the BEST results.

For a detailed breakdown of Shawn’s process, check out his blog post here: https://lnkd.in/gsiRjg8e
-
🛠️ New Tutorial: Weights & Biases Models + Weave Integration

The combination of W&B Models and Weave simplifies:
• LLM fine-tuning and tracking.
• RAG chatbot integration.
• Comprehensive evaluations, including metrics like accuracy, latency, and cost.

Want to see it in action?
Full Tutorial: https://lnkd.in/g725SWpq
Colab Demo: https://lnkd.in/g8Fzy_-J
Public Workspace: https://lnkd.in/gdyENh6Q
-
🛠️ Ready to build a flawless RAG system? Join Pinecone & the W&B team on 1/22 in NYC 🗽 for a hands-on session on designing, evaluating, and optimizing Retrieval-Augmented Generation workflows.

This workshop will cover:
1️⃣ Structuring RAG systems for balanced retrieval + generation.
2️⃣ Evaluating performance to identify improvements.
3️⃣ Advanced tools like Pinecone & W&B Weave for optimization.

Details:
📍 1375 Broadway, NYC
🗓️ 1/22
⏰ 6-9 PM EST

RSVP here: https://lnkd.in/gzVsSEus