GPT-4o struggles with debugging its own code; it often repeats the same incorrect solution without improvement. For coding tasks, consider other LLMs such as Anthropic's Claude 3 Opus or Claude 3.5 Sonnet, which perform substantially better. Despite being trained for real-time voice conversations, GPT-4o also seems less effective at long-form, multi-turn conversations than older GPT models, and its function calling is actually worse than GPT-4 Turbo's.
-
Founder at The Burgeon Group | Operations Consultant for SMEs | Driving Efficiency & Innovation for Growth | Open to Leadership Roles in Operational Excellence
What an insightful course. My brain is rushing with operational applications for custom GPTs. I just finished the course “Build Your Own GPTs” by Alina Zhang! Check it out: https://lnkd.in/dE4HVsCx #chatbotdevelopment
-
We created a benchmark called ProcBench, where the task is simply to follow instructed procedures. The tasks are relatively simple for humans, but LLMs become increasingly error-prone as the number of steps grows: even top-tier models like o1-preview show significant performance drops as procedure complexity increases. Since these tasks illuminate a critical weakness of current LLMs, it will be fascinating to tackle them, and solving them might be an important step toward AGI. It is intriguing to see whether simply scaling LLMs up can eventually get there, or whether we need a new paradigm and approach. "ProcBench: A Benchmark for Procedural Reasoning in Large Language Models" by Fujisawa et al. https://lnkd.in/gu7Ezx-f
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure (arxiv.org)
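To make the failure mode concrete, here is a hypothetical ProcBench-style task in Python: the model must apply a short list of explicit editing steps in order, and scoring is exact match against the deterministically computed result. The task format and helper names are illustrative, not the paper's actual schema.

```python
# Hypothetical sketch of a ProcBench-style procedural task: apply explicit
# string-editing steps in order. More steps means more chances to drift.
def apply_procedure(s: str, steps: list[tuple[str, str]]) -> str:
    """Ground truth: apply each (op, arg) step in order."""
    for op, arg in steps:
        if op == "append":
            s = s + arg
        elif op == "delete":      # remove every occurrence of arg
            s = s.replace(arg, "")
        elif op == "reverse":
            s = s[::-1]
    return s

def exact_match(model_answer: str, start: str, steps) -> bool:
    """Score a model's final answer against the deterministic target."""
    return model_answer == apply_procedure(start, steps)

# Three steps a human follows easily, each a chance for an LLM to slip:
steps = [("append", "xyz"), ("reverse", ""), ("delete", "y")]
print(apply_procedure("abc", steps))  # -> "zxcba"
```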
-
Spend Analysis. AI in Procurement. Digital Procurement. I specialize in creating advanced Procurement Centre of Excellence setups.
48 hours of using GPT-4o, a few things I was not expecting (caution: I am an average user and do not have access to the desktop app or the early-beta features you saw in the demo):

1) 4o is bad at coding. It fails to remember the constraints you provide; I got better debugging assistance from GPT-4. But don't be naive in your assumptions and expect a fully functional solution from these models. They are good at small code blocks in small batches. You are the engineer and the brain! I am back to GPT-4 for this.

2) Since the introduction of 4o, both 4 and 4o have become quite slow. Output arrives in chunks and often lags. This could be because of heavy traffic and the new multimodal input capabilities, but it is slow at the moment.

I haven't noticed significantly better output yet. Maybe it is just my use cases (largely coding-related), but the incremental improvement isn't noticeable so far. I'll keep stress-testing this for procurement use cases and share my feedback! Supernegotiate
-
Professor, Founder & CEO of Orditus, an AI startup. Developer of Chatlize.ai, RTutor.ai, iDEP & ShinyGO. Topics: AI, Data Science, Bioinformatics
For coding, the real gem is perhaps o1-mini. It outperforms GPT-4o at a lower cost. The downsides are speed and a slightly smaller context window. o1-preview is far too slow and costly.
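As a quick way to try it yourself, here is a minimal sketch using the OpenAI Python SDK. It assumes OPENAI_API_KEY is set in the environment; note that at launch the o1-series models rejected system messages and sampling parameters such as temperature, so the request below sticks to a single user message (check the current docs for model-specific restrictions).

```python
# Minimal sketch: sending a coding task to o1-mini via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",
    messages=[
        {
            "role": "user",  # o1-series models initially rejected system messages
            "content": "Write a Python function that merges two sorted lists "
                       "in O(n) time, with a short docstring.",
        }
    ],
)

print(response.choices[0].message.content)
```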
-
I recently started using the DALL·E integration in GPT for storyboarding our projects, and it's proving to be extremely useful. A single-sentence prompt with a few keywords produced something more than just usable for our proof of concept. Our DP for the job referenced in the image, Jack Leahy, also remarked, 'it's annoying how good that looks'. The biggest struggle I had generating 'realistic' images even a few months ago was learning how to structure my prompts in a way the software would understand; that obstacle seems to be getting smaller and smaller. The really cool thing is that the generated image follows the basic framework for capturing something interesting: backlit, reflections of light, leading lines, atmosphere, etc. The ability to visualize your ideas so early in the creative process is very powerful.
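For anyone who wants the same workflow outside ChatGPT, here is a sketch of generating a single storyboard frame through the DALL·E 3 API via the OpenAI Python SDK. The prompt text is illustrative, echoing the cinematography cues mentioned above; it assumes OPENAI_API_KEY is set.

```python
# Sketch: generating one storyboard frame with the DALL-E 3 images API.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt=(
        "Storyboard frame: backlit figure on a rain-soaked street at dusk, "
        "reflections of neon light, strong leading lines, atmospheric haze, "
        "cinematic widescreen composition"
    ),
    size="1792x1024",  # DALL-E 3 supports 1024x1024, 1792x1024, 1024x1792
    n=1,               # DALL-E 3 generates one image per request
)

print(result.data[0].url)  # hosted URL for the generated frame
```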
-
Claude 3.5 Sonnet destroys GPT-4 at writing code, and it's not even close. Faster. Fewer errors. Less verbose. Higher-quality code. To compare multiple models yourself: https://bit.ly/3WwtFo6. You get GPT-4, GPT-4o, GPT-4o mini, Sonnet, and Gemini in the same place.
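If you'd rather run the comparison from code than through a shared UI, here is a sketch that sends the same coding prompt to Claude 3.5 Sonnet and GPT-4o through their official Python SDKs. It assumes ANTHROPIC_API_KEY and OPENAI_API_KEY are set, and the model IDs may have newer snapshots by the time you run it.

```python
# Sketch: a do-it-yourself side-by-side of the same coding prompt.
import anthropic
from openai import OpenAI

PROMPT = "Implement an LRU cache in Python with O(1) get and put."

claude = anthropic.Anthropic().messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{"role": "user", "content": PROMPT}],
)

gpt = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
)

print("--- Claude 3.5 Sonnet ---")
print(claude.content[0].text)
print("--- GPT-4o ---")
print(gpt.choices[0].message.content)
```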
-
Technologist & Human. Trusted Implementer, Advisor, Consultant, and Coach for Profit, People, and Planet. Interested in sustainability, responsible use of technology in generative ai and renewables.
In terms of which models to use for code conversion, Claude 3 is becoming a favorite. I still use Copilot in context for individual debugging and questions, but for the simple act of converting code, Claude is the most amenable to following instructions. For this task, I'm not noticing any serious hit using Sonnet (medium) vs. Opus (large). GPT-4 is more opinionated about how it chooses to convert the code, and in this particular use case that is not what I want.
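Here is a sketch of what a constrained conversion request might look like through the Anthropic Python SDK. The system prompt pins down the "follow instructions, don't editorialize" behavior described above; the source snippet and wording are illustrative, and ANTHROPIC_API_KEY is assumed to be set.

```python
# Sketch: a tightly constrained code-conversion request to Claude 3 Sonnet.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

source = """
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}
"""

response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    system=(
        "Convert the given JavaScript to Python. Preserve names, structure, "
        "and behavior exactly. Do not refactor, add features, or comment on "
        "the code. Output only the converted code."
    ),
    messages=[{"role": "user", "content": source}],
)

print(response.content[0].text)
```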
-
Finally, some public OCR datasets! This is very helpful for training document models and making character recognition a problem of the past. We've seen the speed that open-source LLMs brought to generative writing and code fixing; it won't be long before we can parse documents even from dead languages! https://lnkd.in/dFnfevYR
Pablo Montalvo (@m_olbap) on X (twitter.com)
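If the datasets land on the Hugging Face Hub, pulling them into a training pipeline is a one-liner with the `datasets` library. The dataset ID below is a placeholder, not one named in the linked post; substitute a real OCR corpus.

```python
# Hypothetical sketch: loading a public OCR dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("some-org/public-ocr-dataset", split="train")  # placeholder ID
sample = ds[0]
print(sample.keys())  # typically an image plus its ground-truth transcription
```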
-
I know: Elon bad. But look at this: "An early version of Grok-2 has been tested on the LMSYS leaderboard under the name 'sus-column-r.' At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo."
Grok-2 Beta Release (x.ai)
-
👨‍💻 Tech Whisperer | 🚀 Exploring the digital frontier, one line of code at a time | 💡 Innovator at heart | 🤖 AI aficionado | #TechLife
Large Language Models in Code Generation: Overcoming Common Bugs and Improving Accuracy *** One effective method to enhance code accuracy in large language models (LLMs) involves introducing a self-critique mechanism. This iterative process allows LLMs to analyze their generated code, identify errors based on a detailed bug taxonomy, and correct them using compiler feedback. Implementing this approach can significantly reduce bugs and increase the passing rate of generated code. *** https://lnkd.in/eVbG7KS7
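Here is a minimal sketch of the self-critique loop described above: generate code, compile it, and feed any error back to the model for revision. The `generate(prompt)` callable is a stand-in for whatever LLM call you use, and `compile()` checks syntax only, so a real pipeline would also run unit tests and apply the bug taxonomy when classifying failures.

```python
# Minimal self-critique loop: compiler feedback drives iterative repair.
import textwrap

def self_critique_loop(task: str, generate, max_rounds: int = 3) -> str:
    prompt = f"Write a Python function for this task:\n{task}"
    code = generate(prompt)
    for _ in range(max_rounds):
        try:
            compile(code, "<generated>", "exec")  # compiler feedback signal
            return code                            # passes the syntax check
        except SyntaxError as err:
            # Hand the compiler's complaint back to the model and retry.
            prompt = textwrap.dedent(f"""
                Your previous code failed to compile.
                Error: {err}
                Code:
                {code}
                Return a corrected version, code only.
            """)
            code = generate(prompt)
    return code  # best effort after max_rounds revisions

if __name__ == "__main__":
    # Dummy model that fixes its mistake on the second try, for illustration.
    answers = iter(["def f(:\n    pass", "def f():\n    pass"])
    print(self_critique_loop("no-op function", lambda _p: next(answers)))
```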