Word2Vec was a pivotal paper published a decade ago by researchers at Google. They showed that by attempting to predict a word from its neighbors (or the neighbors from the word), the resulting model acquired compelling semantic capabilities. The main element of this model is an n x m matrix, where n is the vocabulary size and m is the dimension of the vector encoding each word. A typical value for m is 300. Rather than having an identifier for each of the roughly 200k words in the English language (in addition to proper names and numbers, which are also words), each word can be represented as a coordinate in 300-dimensional space. This is known as an embedding, and it is a primordial type of language model (and a component of most language models since).

The implication is that if two words are close to each other in this 300-dimensional space, they are similar. Hence, "dog" and "cat" would presumably be closer to each other than "airplane" is to either of them. This closeness is typically measured with cosine similarity, the metric used by vector databases like Pinecone. Furthermore, each of the 300 axes can be thought of as having a meaning (although it might not be monosemantic). So, for example, "dog" would score high on an axis intended to represent "animalness", whereas "plane" would score low.

It's important to understand that this sort of distance calculation wouldn't be possible if each word were identified with a sequential ID, since that's equivalent to having a single dimension. Higher dimensionality allows many words to cluster together around central concepts, like "animals", "computers", "politics", and so on.

The shocking result is that this structure is conducive to performing a sort of arithmetic with words. The canonical example is "king" - "man" + "woman" = "queen": performing this arithmetic operation on the vectors for the terms in question yields a new vector within a short distance of "queen". Proximity, as opposed to identity, is an important characteristic. The relationship between "Paris" => "France" and "Rome" => "Italy" will be similar but not identical, given that it was learned from statistical properties of a text corpus. So the inexactness is a feature.

To learn, I implemented word2vec from scratch (following Olga Chernytska). The code is here: https://lnkd.in/gV5J4Ugx . Run word2vec.py and play with the results using Inference.ipynb. Take a look at this interactive visualization and try to interpret what is different about the two green clusters: https://lnkd.in/guRHnwNY

Note: "king" - "man" + "woman" doesn't always work as expected. Each time embeddings are trained, the dimensions acquire different meanings due to random initialization, and for the analogy to work the embeddings have to learn the right properties (e.g. male vs. female). I suspect a larger training set would improve results. After several runs, I ended up with these n-best results:

king: 0.782
queen: 0.535
woman: 0.516
monarch: 0.515
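For readers who want to see the "predict the neighbors from the word" idea in code, here is a minimal skip-gram sketch. This is not the repo's actual code; the vocabulary size, learning rate, and example word IDs are illustrative assumptions. It just shows the n x m embedding matrix and one training step that scores every vocabulary word as a candidate context word.

import torch
import torch.nn as nn

VOCAB_SIZE = 20000   # n: number of words in the vocabulary (illustrative)
EMBED_DIM = 300      # m: dimensions per word vector

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # the n x m matrix
        self.output = nn.Linear(embed_dim, vocab_size)          # scores over the vocabulary

    def forward(self, center_word_ids):
        # Look up the center word's vector, then score every word
        # in the vocabulary as a possible context word.
        return self.output(self.embeddings(center_word_ids))

model = SkipGram(VOCAB_SIZE, EMBED_DIM)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One hypothetical training step on (center, context) word-ID pairs.
center = torch.tensor([5, 42])    # IDs of center words
context = torch.tensor([17, 3])   # IDs of their observed neighbors
loss = loss_fn(model(center), context)
loss.backward()
optimizer.step()

After training, only the embedding matrix is kept; the output layer is discarded.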
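And here is a sketch of the cosine-similarity and "king" - "man" + "woman" arithmetic described above. The toy random vectors stand in for real trained embeddings (with random vectors the neighbors are meaningless); in practice you would load the 300-d vectors produced by the training run.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_neighbors(query_vec, embeddings, k=5, exclude=()):
    # Rank the whole vocabulary by cosine similarity to the query vector.
    scores = {
        word: cosine_similarity(query_vec, vec)
        for word, vec in embeddings.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

if __name__ == "__main__":
    # Toy stand-in: in practice these vectors come from the trained model.
    rng = np.random.default_rng(0)
    vocab = ["king", "queen", "man", "woman", "dog", "cat", "airplane"]
    embeddings = {w: rng.standard_normal(300) for w in vocab}

    print(cosine_similarity(embeddings["dog"], embeddings["cat"]))

    # Word arithmetic: with real embeddings the top neighbor should be near "queen".
    query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
    print(nearest_neighbors(query, embeddings, exclude={"man", "woman"}))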
Attention is all you need ☺️
Ivan Bercovich, is this consistent with any language set?
Closer and closer to psychology
How do the results compare against Euclidean distance?
Just coded the major parts of word2vec as part of a Stanford NLP course I’m taking now. Really love the concept. I wonder if proximity in vector space could map to priming of associated memories in humans
Matthew Rybak, finally got around to putting our conversation in writing.