§ Self attention? not really

The code is taken from The Annotated Transformer, which explains the paper "Attention Is All You Need". On skimming the code, one sees this delightful snippet:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
where the line:
x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
seems to imply that we are, indeed, performing self-attention, with the same value x serving as the query, the key, and the value. However, reading the code of the attention module (or the paper) leads one to realise:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
where we notice:
# 1) Do all the linear projections in batch from d_model => h x d_k
query, key, value = \
    [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
     for l, x in zip(self.linears, (query, key, value))]

# 2) Apply attention on all the projected vectors in batch.
x, self.attn = attention(query, key, value, mask=mask,
                         dropout=self.dropout)
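where we see that query, key, and value are each passed through their own linear layer before being used. Hence, an input of $(x, x, x)$ is transformed to $(q', k', v') = (Qx, Kx, Vx)$, where $Q, K, V$ are arbitrary (learned, and in general distinct) matrices. Next, when we pass these into attention, the output we get is:
$\text{softmax}\left(\frac{q' k'^T}{\sqrt{d_k}}\right) v' = \text{softmax}\left(\frac{(Q x) (K x)^T}{\sqrt{d_k}}\right) (V x) = \text{softmax}\left(\frac{Q x x^T K^T}{\sqrt{d_k}}\right) V x$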
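As a quick numerical check of this point, here is a minimal PyTorch sketch. It is not taken from The Annotated Transformer: the names `proj_q`, `proj_k`, `proj_v` and `attention_scores` are hypothetical stand-ins for the first three entries of `self.linears` and for the score computation, and `bias=False` is a simplifying assumption so that each projection is a pure matrix multiply. Note that PyTorch stores tokens as rows, so `nn.Linear` computes `x @ W.T` rather than the `Qx` written above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 4
x = torch.randn(1, 3, d_model)  # (batch, tokens, d_model)

# Hypothetical stand-ins for self.linears[0..2]; bias=False is a
# simplification so that each projection is a pure matrix multiply.
proj_q = nn.Linear(d_model, d_model, bias=False)
proj_k = nn.Linear(d_model, d_model, bias=False)
proj_v = nn.Linear(d_model, d_model, bias=False)


def attention_scores(q, k):
    """Scaled dot-product scores, before the softmax."""
    return torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))


with torch.no_grad():
    # The same x goes in three times, but comes out as three different tensors.
    q, k, v = proj_q(x), proj_k(x), proj_v(x)
    raw_scores = attention_scores(x, x)   # scores if x were used unprojected
    proj_scores = attention_scores(q, k)  # scores after the projections
    out = torch.matmul(torch.softmax(proj_scores, dim=-1), v)

# x x^T is symmetric; (x W_q^T)(x W_k^T)^T is generally not.
raw_is_symmetric = torch.allclose(raw_scores, raw_scores.transpose(-2, -1))
proj_is_symmetric = torch.allclose(proj_scores, proj_scores.transpose(-2, -1))
print(raw_is_symmetric, proj_is_symmetric)
```

The unprojected scores form a symmetric matrix, since token i attends to token j exactly as much as j attends to i before the softmax. With independently initialised projections one would not expect that symmetry to survive, which is one concrete way to see that the query and key roles are genuinely different even though both come from the same x.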
the code below is the same thing, spelled out:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)