§ Everything you know about word2vec is wrong

The classic explanation of word2vec, in skip-gram, with negative sampling, in the paper and countless blog posts on the internet is as follows:
while(1) {
   1. vf = vector of focus word
   2. vc = vector of context word
   3. train such that (vc . vf = 1)
   4. for(0 <= i < negative samples):
           vneg = vector of word *not* in context
           train such that (vf . vneg = 0)
Indeed, if I google "word2vec skipgram", the results I get are: the list goes on. However, every single one of these implementations is wrong. The original word2vec C implementation does not do what's explained above, and is drastically different. Most serious users of word embeddings, who use embeddings generated from word2vec do one of the following things:
  1. They invoke the original C implementation directly.
  2. They invoke the gensim implementation, which is transliterated from theC source to the extent that the variables names are the same.
Indeed, the gensim implementation is the only one that I know of which is faithful to the C implementation.

§ The C implementation

The C implementation in fact maintains two vectors for each word, one where it appears as a focus word, and one where it appears as a context word. (Is this sounding familiar? Indeed, it appears that GloVe actually took this idea from word2vec, which has never mentioned this fact!) The setup is incredibly well done in the C code:
  • An array called syn0 holds the vector embedding of a word when it occursas a focus word. This is random initialized.
  for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] =
       (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;

  • Another array called syn1neg holds the vector of a word when it occursas a context word. This is zero initialized.
for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
  syn1neg[a * layer1_size + b] = 0;
  • During training (skip-gram, negative sampling, though other cases arealso similar), we first pick a focus word. This is held constant throughoutthe positive and negative sample training. The gradients of the focus vectorare accumulated in a buffer, and are applied to the focus wordafter it has been affected by both positive and negative samples.
if (negative > 0) for (d = 0; d < negative + 1; d++) {
  // if we are performing negative sampling, in the 1st iteration,
  // pick a word from the context and set the dot product target to 1
  if (d == 0) {
    target = word;
    label = 1;
  } else {
    // for all other iterations, pick a word randomly and set the dot
    //product target to 0
    next_random = next_random * (unsigned long long)25214903917 + 11;
    target = table[(next_random >> 16) % table_size];
    if (target == 0) target = next_random % (vocab_size - 1) + 1;
    if (target == word) continue;
    label = 0;
  l2 = target * layer1_size;
  f = 0;

  // find dot product of original vector with negative sample vector
  // store in f
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

  // set g = sigmoid(f) (roughly, the actual formula is slightly more complex)
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;

  // 1. update the vector syn1neg,
  // 2. DO NOT UPDATE syn0
  // 3. STORE THE syn0 gradient in a temporary buffer neu1e
  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
// Finally, after all samples, update syn1 from neu1e
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];

§ Why random and zero initialization?

Once again, since none of this actually explained in the original papers or on the web, I can only hypothesize. My hypothesis is that since the negative samples come from all over the text and are not really weighed by frequency, you can wind up picking any word, and more often than not, a word whose vector has not been trained much at all. If this vector actually had a value, then it could move the actually important focus word randomly. The solution is to set all negative samples to zero, so that only vectors that have occurred somewhat frequently will affect the representation of another vector. It's quite ingenious, really, and until this, I'd never really thought of how important initialization strategies really are.

§ Why I'm writing this

I spent two months of my life trying to reproduce word2vec, following the paper exactly, reading countless articles, and simply not succeeding. I was unable to reach the same scores that word2vec did, and it was not for lack of trying. I could not have imagined that the paper would have literally fabricated an algorithm that doesn't work, while the implementation does something completely different. Eventually, I decided to read the sources, and spent three whole days convinced I was reading the code wrong since literally everything on the internet told me otherwise. I don't understand why the original paper and the internet contain zero explanations of the actual mechanism behind word2vec, so I decided to put it up myself. This also explains GloVe's radical choice of having a separate vector for the negative context --- they were just doing what word2vec does, but they told people about it :). Is this academic dishonesty? I don't know the answer, and that's a heavy question. But I'm frankly incredibly pissed, and this is probably the last time I take a machine learning paper's explanation of the algorithm seriously again --- from next time, I read the source first.