> These tools can easily be manipulated further to label anyone outside of the white, heteronormative, cisgender conglomerate as a non-person in the eyes of the larger system. All they need is a huge company like Google to “not recognize” a few key factors like whole neighborhoods in redlined areas or a gender marker (as compared to a birth certificate or something).
What a weird rant at the end. Last I checked, there is no gender.google.com, and straight-up erasing people from maps for their race/gender/ethnicity/whatever is not something I have heard of happening.
FWIW, renaming neighborhoods, including on Google Maps, to erase race and ethnicity is something which does happen.
> Research in Philadelphia by sociologist Jackelyn Hwang shows that gentrification not only shifts the demographics of a given area, but leads to divergent definitions of neighborhoods.
> Minority residents were more likely to call a wide area one neighborhood, named “South Philly.” White residents, by contrast, divided the same area into multiple neighborhoods, such as “Graduate Hospital,” “G-Ho,” “So-So,” “South Rittenhouse,” “South Square” and “Southwest Center City,” splitting up areas by their socioeconomic characteristics and crime levels.
> In such cases, the use of different neighborhood definitions served to legitimize one’s presence in a community. Neighborhoods do this by evoking a sense of place for residents, describing a relationship that the place has with someone’s biography, imagination and personal experiences. The names create boundaries between those who are perceived to belong to these communities – and those who do not.
I think he's talking about the potential for abuse as opposed to something that's currently happening. I'm trans and even sites where I don't look at queer content at all seem to figure that out. It's not hard to imagine how that could end up being abused.
This may be a joke, but counting your fingers to lucid dream has been a thing for a lot longer than diffusion models.
That being said, your reality will influence your dreams if you're exposed to some things enough. I used to play minecraft on a really bad PC back in the day, and in my lucid dreams I used to encounter the same slow chunk loading as I saw in the game.
Playing Population One in VR did this to me. Whenever I hopped into a new game, I'd ask the other participants if they'd had particularly vivid dreams since getting VR, and more than half of folks said they had.
Ah, that was one short gravy train even by modern tech company standards. Really wish the space was more competitive and open so it wouldn't just be one company at the top locking their models behind APIs.
API-only model, yet trying to compete only with open models in their benchmark image.
Of course it'd be a complete embarrassment to see how hard it gets trounced by GPT-4o and Claude 3.5, but that's par for the course if you don't want to release model weights, at least in my opinion.
I'd also like to point out that they omit Qwen2.5 14B from the benchmark because it doesn't fit their narrative (MMLU Pro score of 63.7 [0]). This kind of listing-only-models-you-beat feels extremely shady to me.
Yes, I agree: for these small models, being closed source is wasted potential; they can only be used effectively if they are open.
EDIT: HN is rate-limiting me so I will reply here: In my opinion, 1B and 3B models truly shine on edge devices; if you're not running them there, it's not worth the effort, since you can already get much better models dirt cheap using an API.
An open small model means I can experiment with it. I can put it on an edge device and scale to billions of users, or use it with private resources that I can't send externally.
When it's behind an API, it's just a standard margin/speed/cost discussion.
I think what the parent means is that small models are more useful locally on mobile, IoT devices etc. so it defeats the purpose to have to call an API.
Big models take up more VRAM just to have the weights sitting around hot in memory, yes. But running two concurrent inferences on the same hot model doesn't require that you have two full copies of the model in memory. You only need two full copies of the model's "state" (the vector that serves as the output of layer N and the input of layer N+1, and the pool of active low-cardinality matrix-temporaries used to batchwise-compute that vector).
It's just like how spawning two copies of the same program doesn't require that you have two copies of the program's text and data sections sitting in your physical RAM (those get mmap'ed to the same shared physical RAM); it only requires that each process have its own copy of the program's writable globals (the bss section), and its own stack and heap.
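To make that concrete, here's a back-of-the-envelope sketch in Python. The sizes are made-up assumptions (not measurements of any real model); the point is just that the weights are loaded once and shared, so only the per-request state multiplies with concurrency:

    # Rough memory accounting for serving one model to many concurrent requests.
    # All sizes are illustrative assumptions, not measurements of any real model.
    GB = 1024**3

    weights_bytes = 140 * GB          # e.g. a ~70B-param model at fp16 (assumption)
    state_bytes_per_request = 2 * GB  # per-request KV cache + activations (assumption)

    def vram_needed(concurrent_requests: int) -> int:
        """Weights are loaded once and shared; only per-request state multiplies."""
        return weights_bytes + concurrent_requests * state_bytes_per_request

    for n in (1, 8, 64):
        total = vram_needed(n)
        print(f"{n:3d} concurrent requests -> {total / GB:7.1f} GB total, "
              f"{total / n / GB:6.1f} GB per request")

With these assumed numbers, the per-request share of VRAM drops from ~142 GB at one request to a few GB at 64 concurrent requests, which is the economy of scale described below.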
Which means there are economies of scale here. It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users, onto the nodes of their GPU cluster, to optimize OpEx-per-inference-call.
This is the whole reason Inference-aaS providers have high valuations: these economies of scale make Inference-aaS a good business model. The same query, run in some inference cloud rather than on your device, will always achieve a higher-quality result for the same marginal cost [in watts per FLOP, and in wall-clock time]; and/or a same-quality result for a lower marginal cost.
Further, one major difference between CPU processes and model inference on a GPU is that each inference step of a model is always computing an entirely new state; and so compute (which you can think of as "number of compute cores reserved" x "amount of time they're reserved") scales in proportion to the state size. And, in fact, with current Transformer-architecture models, compute scales quadratically with state size.
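A schematic calculation of that scaling trend, with arbitrary assumed dimensions and the constant factors simplified away:

    # Schematic FLOP count for causal self-attention over a growing context.
    # Dimensions are assumptions; constants are simplified. Only the trend matters.
    d_model = 4096   # hidden size (assumption)
    n_layers = 32    # Transformer layers (assumption)

    def attention_flops_at(position: int) -> int:
        """The token at `position` attends to every earlier position, so
        per-token attention work grows linearly with context length."""
        return 2 * n_layers * d_model * position

    def total_attention_flops(seq_len: int) -> int:
        """Summing a linearly growing per-token cost over the whole sequence
        gives roughly quadratic total attention compute."""
        return sum(attention_flops_at(t) for t in range(1, seq_len + 1))

    for n in (1_000, 2_000, 4_000):
        print(f"{n:5d} tokens -> {total_attention_flops(n):.2e} attention FLOPs")
    # Doubling the context roughly quadruples the attention FLOPs.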
For both of these reasons, you want to design models to minimize (1) absolute state-size overhead, and (2) state-size growth in proportion to input size.
The desire to minimize absolute state-size overhead is why you see Inference-as-a-Service providers training such large versions of their models (Llama 3.1 405b, OpenAI's largest models, etc.) The hosted Inference-aaS providers aren't just attempting to make their models "smarter"; they're also attempting to trade off "state size" for "model size." (If you're familiar with information theory: they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.
The particular newly-released model under discussion in this comments section also has much slower state-size (and so compute) growth in proportion to its input size. Which means that there's even more of an economy-of-scale in running nodes with the larger versions of this model; and therefore much less of a reason to care about smaller versions of this model.
> It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users
In a way it also matters to individuals, because it allows them to run more capable models with a limited amount of system RAM. Yes, fetching model parameters from mass storage during inference is going to be dog slow (while NVMe transfer bandwidth is getting up there, it's not yet comparable to RAM), but that only matters if you insist on getting your answer interactively, in real time. With a local model, it's trivial to make LLM inference a batch task. Some LLM inference frameworks can even save checkpoints for a single inference to disk and cleanly resume later.
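To put a rough number on "dog slow": here's a sketch with made-up figures (none of these are benchmarks of real hardware or a real dense model), just to show why this only makes sense as a batch job:

    # Back-of-the-envelope estimate of generation speed when the weights don't
    # fit in RAM and the remainder must be re-read from NVMe for every token.
    # Every figure here is an assumption for illustration, not a benchmark.
    GB = 1024**3

    model_bytes = 200 * GB      # weights on disk (assumption)
    cached_bytes = 64 * GB      # portion that stays cached in RAM (assumption)
    nvme_bytes_per_s = 6 * GB   # sustained NVMe read bandwidth (assumption)

    # Simplification: a dense model touches all of its weights for each token.
    streamed_per_token = model_bytes - cached_bytes
    seconds_per_token = streamed_per_token / nvme_bytes_per_s

    print(f"~{seconds_per_token:.0f} s per generated token")
    print(f"~{500 * seconds_per_token / 3600:.1f} hours for a 500-token answer")
    # Hopeless interactively, but fine as an overnight batch job that can be
    # checkpointed to disk and resumed.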
> they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.
Not sure I follow. CoT, and the state-length growth that comes with it, is a relatively new phenomenon, and I doubt that minimizing the length of the CoT is an explicit goal when training the model.
The only thing probably relevant to this comment is the use of grouped-query attention? That reduces the size of the KV cache by a factor of 4 to 8, depending on your grouping strategy. But I am unsure there is a clear trade-off between model size and grouped-query size, given that, naively, a smaller KV cache also means a smaller model size.
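For what it's worth, here's the KV-cache saving I mean, sketched with arbitrary assumed dimensions (not those of any specific model):

    # Sketch of how grouped-query attention (GQA) shrinks the KV cache.
    # All dimensions are assumptions, not taken from a specific model.
    GB = 1024**3

    n_layers = 32
    n_query_heads = 32
    head_dim = 128
    seq_len = 8192
    batch = 1
    bytes_per_elem = 2   # fp16

    def kv_cache_bytes(n_kv_heads: int) -> int:
        """K and V are cached per layer, per KV head, per position."""
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

    mha = kv_cache_bytes(n_query_heads)       # classic multi-head: one KV head per query head
    gqa = kv_cache_bytes(n_query_heads // 8)  # GQA: 8 query heads share each KV head

    print(f"MHA KV cache: {mha / GB:.2f} GB")
    print(f"GQA KV cache: {gqa / GB:.2f} GB ({mha // gqa}x smaller)")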
What I'm talking about here is the fact that you need a longer + multi-shot prompt to get a dumber model to do the same thing a smarter model will do with a shorter + zero-shot prompt.
Pretend for a moment that Transformers don't actually have context-size limits (a "spherical cow" model of inference.) In this mental model, you can make a small, dumb model arbitrarily smarter — potentially matching the quality of much larger, smarter models — by providing all the information and associations it needs "at runtime."
It's just that the sheer amount of prompting required to get a dumb model to act like a smart model goes up superlinearly vs. the marginal increase in intelligence. And since (for now) the compute costs scale quadratically with the prompt size, you would quickly hit resource limits in trying to do this. To have a 10b model act like a 405b model, you'd either need an inordinate amount of time per inference-step — or, for a more interesting comparison, an amount of parallel GPU hardware (VRAM to hold state, and GPU-core-compute-seconds) that in both dimensions would far exceed the amount required to host inference of the 405b model.
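To put toy numbers on that: the sketch below uses the crude rule of thumb that a forward pass costs roughly 2 x params x tokens, plus a quadratic attention term. The parameter counts, dimensions, and the 200k-token prompt are all hypothetical assumptions, chosen purely to illustrate the comparison:

    # Toy comparison: a small model coaxed with an enormous prompt vs. a large
    # model answering from a short prompt. All numbers are assumptions.
    def forward_flops(params: float, tokens: int, d_model: int, n_layers: int) -> float:
        matmul = 2 * params * tokens                       # weight multiplications
        attention = 2 * n_layers * d_model * tokens ** 2   # quadratic-in-context term
        return matmul + attention

    # Hypothetical: a 10b model given a 200k-token many-shot prompt...
    small = forward_flops(params=10e9, tokens=200_000, d_model=4096, n_layers=32)
    # ...vs. a 405b model answering the same question from a 500-token prompt.
    large = forward_flops(params=405e9, tokens=500, d_model=16384, n_layers=126)

    print(f"10b model, 200k-token prompt: {small:.2e} FLOPs")
    print(f"405b model, 500-token prompt: {large:.2e} FLOPs "
          f"(~{small / large:.0f}x cheaper per answer)")

Under these assumptions, the big model with the short prompt comes out tens of times cheaper per answer, which is the whole point.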
(This superlinear relationship still holds with context-size limits in place; you just can only do the "make the dumb model smarter with a good prompt" experiment on roughly same-order-of-magnitude-sized models [e.g. 3b vs 7b] — as a 3b really couldn't "act as" anything above 7b, without a prompt that far exceeds its context-size limit — and so, in practice, you can't calculate enough of the ramp at once to fit a curve to it.)
The obvious corollary to this is that by increasing model size (in a way that keeps more useful training around, retains intelligence, etc.), you decrease the required resource consumption to compute at a fixed level of intelligence, and this decrease scales superlinearly.
This dynamic explains everything current Inference-as-a-Service providers do.
It explains why they are all seeking to develop their own increasingly-large models — they want, as much as possible, to get their models to achieve better results with less prompting, in fewer inference steps, and in proportionately cheaper inference steps — as these all increase their economies of scale, by decreasing the compute and memory requirements per concurrent inference call.
And it explains why they charge users for queries by the input/output token, not by the compute-second. To them, "intelligent responses" are the value they provide; while "(prompt size + output size) x (number of inference steps)" is the overhead cost of providing that value, that they want to minimize. A per-token pricing structure does several things:
• most obviously, as with any well-thought-out SaaS business model, it pushes the overhead costs onto the customer, so that customers are always paying for their own costs.
• it therefore disincentivizes users from sending prompts that are any longer than necessary (i.e. it incentivizes attempting to "pare down" your prompt until it's working just well enough)
• and it incentivizes users to choose their smarter models, despite the higher costs per token, as these models will achieve the same result with a shorter prompt; will require fewer retries (= wasted tokens) to give a good result; can "say more" in fewer tokens by focusing in on the spirit of the question rather than rambling; and require fewer CoT-like "thinking out loud" steps to arrive at correct conclusions.
• it also incentivizes the company to put effort into R&D work to minimize per-token overhead, to increase profitability per token. (Just like e.g. Amazon is incentivized to optimize the per-request overhead of S3, to increase the profitability per call.)
• and, most cynically, it locks in their customers, by getting them to rely on building AI agents that send minimal prompts and expect useful + accurate + succinct output; where you can only achieve that with these huge models, which in turn can only run on the huge vertically-scaled cluster nodes these Inference-aaS providers run. The people who've built working products on top of these Inference-aaS providers can't meaningfully threaten to switch away to "commodity" hosted open-source-model Inference-aaS providers (e.g. RunPod/Vast/etc.) — as nobody but the few largest players can host models of this size.
(Fun tangent: why was it not an existential mistake for Meta to open-source Llama 3.1 405b? Because nobody but their direct major competitors in the Inference-aaS space have compute shaped the right way to run that kind of model at scale; and those few companies all have their own huge models they're already invested in, so they don't even care!)
I like Qwen2-VL 7B because it outputs shorter captions with less fluff. But if you need to do anything advanced that relies on reasoning and instruction following, the model completely falls flat on its face.
For example, I have a couple way-too-wordy captions made with another captioner, which I'd like to cut down to the essentials while correcting any mistakes. Qwen2 is completely ignoring images with this approach, and decides to only focus on the given caption, which makes it unable to even remotely fix issues in said caption.
I am really hoping Pixtral will be better for instruction following. But I haven't been able to run it because they didn't prioritize transformers support, which in turn has hindered the release of any quantized versions to make it fit on consumer hardware.
>Qwen2-VL is the latest addition to the vision-language models in the Qwen series, building upon the capabilities of Qwen-VL. Compared to its predecessor, Qwen2-VL offers:
>State-of-the-Art Image Understanding
>Extended Video Comprehension
Besides, it'd have been pretty silly for them to mention it on their slides if it wasn't.
Curiously, I had the exact same problem when I was in Britain, at Heathrow Airport. They would not announce which gate flights leave from until ~20 minutes before boarding.
Considering there's no 'crush risk' in this scenario, what even is the point of it? In the end I just used one of the myriad online sites that list flight data to find out which gate I needed to head to 1.5 hours before everyone else, and got to enjoy some peace and quiet.