§ Word2Vec C code implements gradient descent really weirdly

I'll be posting snippets of the original source code, along with a link to the Github sources. We are interested in exploring the skip-gram implementation of Word2Vec, with negative sampling, without hierarchical softmax. I assume basic familiarity with word embeddings and the skip-gram model.

§ Construction of the sigmoid lookup table

// https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L708

expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
for (i = 0; i < EXP_TABLE_SIZE; i++) {
  expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) *
                    MAX_EXP);  // Precompute the exp() table
  expTable[i] =
      expTable[i] / (expTable[i] + 1);  // Precompute f(x) = x / (x + 1)
}
Here, the code constructs a lookup table which maps [0...EXP_TABLE_SIZE-1] to [sigmoid(-MAX_EXP)...sigmoid(MAX_EXP)]. The index i first gets mapped to (i / EXP_TABLE_SIZE) * 2 - 1, which sends 0 to -1 and EXP_TABLE_SIZE to 1. This is then rescaled by MAX_EXP.

§ Layer initialization

  • syn0 is a global variable, initialized with random weights in the range of[-0.5...0.5]. It has dimensions VOCAB x HIDDEN. This layer holds thehidden representations of word vectors.
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L341
a = posix_memalign((void **)&syn0, 128,
               (long long)vocab_size * layer1_size * sizeof(real));
...

// https://github.com/imsky/word2vec/blob/master/word2vec.c#L355
for (a = 0; a < vocab_size; a++)
        for (b = 0; b < layer1_size; b++) {
            next_random = next_random * (unsigned long long)25214903917 + 11;
            syn0[a * layer1_size + b] =
                (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
        }
  • syn1neg is a global variable that is zero-initialized. It has dimensionsVOCAB x HIDDEN. This layer also holds hidden representations of word vectors,when they are used as a negative sample.
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L350
a = posix_memalign((void **)&syn1neg, 128,
                   (long long)vocab_size * layer1_size * sizeof(real));
...
for (a = 0; a < vocab_size; a++)
    for (b = 0; b < layer1_size; b++) syn1neg[a * layer1_size + b] = 0;
  • neu1e is a temporary per-thread buffer (Remember that the word2vec C codeuse CPU threads for parallelism) which is zero initialized. It has dimensions1 x HIDDEN.
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L370
real *neu1e = (real *)calloc(layer1_size, sizeof(real));

§ Backpropogation

Throughout word2vec, no 2D arrays are used. Indexing of the form arr[word][ix] is manually written as arr[word * layer1_size + ix]. So, I will call word * layer1_size as the "base address", and ix as the "offset of the array index expression henceforth. Here, l1 is the base address of the word at the center of window (the focus word). l2 is the base address of either the word that is negative sampled from the corpus, or the word that is a positive sample from within the context window. label tells us whether the sample is a positive or a negative sample. label = 1 for positive samples, and label = 0 for negative samples.
// zero initialize neu1e
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L419
for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
...
// loop through each negative sample
// https://github.com/imsky/word2vec/blob/master/word2vec.c#L508
if (negative > 0)  for (d = 0; d < negative + 1; d++) {
  ...
  // https://github.com/imsky/word2vec/blob/master/word2vec.c#L521
  // take the dot product: f=  syn0[focus] . syn1neg[context]
  for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];

  // compute: g = (label - sigmoid(2f - 1)) * alpha
  // g is computed using lookups into a lookup table and clamping for
  // efficiency.
  if (f > MAX_EXP) g = (label - 1) * alpha;
  else if (f < -MAX_EXP) g = (label - 0) * alpha;
  else
  g = (label - expTable[(int)((f + MAX_EXP) *
                              (EXP_TABLE_SIZE /
                               MAX_EXP / 2))]) * alpha;
  // Now that we have computed the gradient:
  // `g = (label - output) * learningrate`,
  // we need to perform backprop. This is where the code gets weird.

  for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
  for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
  } // end loop through negative samples
// Learn weights input -> hidden
for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
  • We have two vectors for each word, one called syn0[l1 + _] andthe other syn1neg[l2 + _]. The syn1neg word embedding is used whenevera word is used a negative sample, and is not used anywhere else. Also,the syn1neg vector is zero initialized, while the syn0 vectors arerandomly initialized.
  • The values we backprop with g * syn1neg[l2 + _], g * syn0[l1 + _] arenot the correct gradients of the error term! The derivative of a sigmoidis dsigmoid(x)/dx = sigmoid(x) [1 - sigmoid(x)]. The [1 - sigmoid(x)]is nowhere to be seen, let alone the fact that we are usingsigmoid(2x - 1) and not regular sigmoid. Very weird.
  • We hold the value of syn0 constant throughout all the negative samples,which was not mentioned in any tutorial I've read.
The paper does not mentioned these implementation details, and neither does any blog post that I've read. I don't understand what's going on, and I plan on updating this section when I understand this better.