The second part of your article I found most interesting and useful. It brought to my attention (which, as we now know, is so important) the “Grokked Transformers are Implicit Reasoners” research paper that came out in May. The fact that this was almost three months ago, and that your reference is the first I have seen to it, does make me wonder how important this research may be.
Nevertheless, I found it very interesting that “an extended period of training far beyond overfitting” could result in vastly superior performance.
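To make that concrete, here is a minimal sketch of what “training far beyond overfitting” looks like in code. This is not the paper’s actual setup; the tiny modular-addition task, the small MLP, and the hyperparameters are illustrative assumptions. The point is simply that optimization continues long after training accuracy saturates, while validation accuracy is watched for delayed generalization:

```python
# Hypothetical sketch (not the paper's setup): keep optimizing a small model
# on a synthetic a + b (mod P) task long after the training set is fit,
# which is the regime in which grokking (delayed generalization) is reported.
import torch
import torch.nn as nn

P = 97  # modulus for the synthetic task (an illustrative choice)

# Build all (a, b) pairs and split them into train / validation halves.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_x, train_y = pairs[perm[:split]], labels[perm[:split]]
val_x, val_y = pairs[perm[split:]], labels[perm[split:]]

# A small MLP over learned embeddings stands in for the paper's transformer.
class TinyNet(nn.Module):
    def __init__(self, p=P, dim=128):
        super().__init__()
        self.embed = nn.Embedding(p, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, x):
        e = self.embed(x)              # (batch, 2, dim)
        return self.mlp(e.flatten(1))  # (batch, p) logits

model = TinyNet()
# Strong weight decay: grokking is typically reported with heavy regularization.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=-1) == y).float().mean().item()

steps_to_fit = None
for step in range(1, 100_001):           # keep going long after the train set is fit
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        tr, va = accuracy(train_x, train_y), accuracy(val_x, val_y)
        if steps_to_fit is None and tr > 0.999:
            steps_to_fit = step           # the point where early stopping would normally halt
        print(f"step {step}: train {tr:.3f}, val {va:.3f}")
```

The interesting behaviour the paper describes is that validation accuracy can stay near chance for many multiples of `steps_to_fit` and then jump, rather than improving gradually.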
I find that at least some AI researchers have been aware of the advantages of overfitting for many years. But then the question is: why, as far as we know, are the best LLMs not doing this? One reason may be found in this statement from the “Grokked” paper: “…almost perfect accuracy after extended optimization lasting around 50 times the steps taken to fit the training data.”
50 times!
Training an LLM already takes weeks if not months; extending that training 50-fold past the point of overfitting would take years! However, with LLMs now using mixtures of experts, perhaps only a subset of experts might be overfitted in this way. In particular, it might be very beneficial for the math/logic expert to be overfitted, as in the sketch below.
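Here is a rough sketch of that idea. The toy mixture-of-experts layer, the chosen expert index, and the hyperparameters are my own illustrative assumptions, not any particular LLM’s architecture; the sketch only shows the mechanical part of freezing everything except one expert before an extended training run:

```python
# Hypothetical sketch: continue extended, grokking-style training on a single
# expert of a mixture-of-experts layer while freezing the rest of the model.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):
        weights = self.gate(x).softmax(dim=-1)                    # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # (batch, dim)

model = MoELayer()
MATH_EXPERT = 2  # the expert we hypothetically want to push far past overfitting

# Freeze everything, then unfreeze only the chosen expert.
for p in model.parameters():
    p.requires_grad = False
for p in model.experts[MATH_EXPERT].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3, weight_decay=1.0)

# The extended training loop over math/logic-heavy data would go here; only the
# chosen expert's weights are updated, while the gate and other experts stay fixed.
x = torch.randn(32, 64)              # placeholder batch standing in for real data
loss = (model(x) - x).pow(2).mean()  # placeholder objective for illustration only
loss.backward()
opt.step()
```

Note that freezing does not shrink the forward/backward cost per step, so the 50x in steps still has to be paid; the hope would be that a single expert and a narrower data mix keep the extended run affordable.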
“Implicit reasoning”, which is what grokking improves, corresponds to Kahneman’s “System 1” thinking. When humans do System 2 thinking we take System 1 results as input, but to do it correctly we must also take a fresh look at the facts and apply our own reasoning abilities; otherwise biases in System 1 will lead to false conclusions in System 2.
But System 2, as Kahneman points out, takes time and energy, for both humans and LLMs, so both generally default to System 1. Improvements in System 1 thinking will therefore help the overall performance of man and machine.
On the other hand, grokking does not appear to be the secret sauce needed to help LLMs achieve expert human System 2 thinking.
Co-Founder of Altrosyn and Director at CDTECH | Inventor | Manufacturer
6mo
The debate over the significance of LLM benchmarks echoes past discussions in technological advancements, where metrics sometimes overshadow broader goals. Historical precedents show how early focus on specific benchmarks led to both innovation and distortion of priorities. Considering the intersection of research and commercial interests in AI, it's crucial to scrutinize the influence of benchmarks on model selection. However, amidst this scrutiny, how can the AI community strike a balance between benchmark-driven progress and the pursuit of broader AI understanding? If we delve deeper into the implications of benchmark-centric competition on AI development, what strategies can researchers and companies adopt to ensure transparent and unbiased evaluation methodologies?