Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/10/30 01:32:19 UTC

[GitHub] [incubator-tvm] altanh commented on pull request #6794: [TOPI] Add embedding op and gradient

altanh commented on pull request #6794:
URL: https://github.com/apache/incubator-tvm/pull/6794#issuecomment-719116287


   re: API review
   - mxnet: https://mxnet.apache.org/versions/1.6/api/r/docs/api/mx.symbol.Embedding.html
   - PyTorch: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
   - TF/Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
   
   ### Naming
   All of the above APIs call it `Embedding`, so we may want to rename `embed` to `embedding` (although grammatically I do feel like "embed" is more correct).
   
   ### Arguments
   I don't think we need to pass in the vocabulary size or embedding dimension like these examples do, since we can infer both from the weight/table matrix (I imagine they use them for bookkeeping during training, which is a separate matter). Likewise, we can ignore anything related to weight initialization.
   
   PyTorch has the following additional arguments:
   - `padding_idx: int`: index into the embedding table that always receives a zero gradient, generally used for padding
   - `scale_grad_by_freq: boolean`: "scale gradients by the inverse of frequency of the words in the mini-batch." I believe this means the gradient update for index `j` will be divided by `sum(indices == j)` (the count of `j` in the input indices); see the sketch after this list.
   - `sparse: boolean`: "gradient w.r.t. weight matrix will be a sparse tensor."
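   
   To make the `padding_idx` and `scale_grad_by_freq` semantics concrete, here is a rough numpy sketch of the dense weight gradient as I understand it (the function and variable names are just for illustration, not a proposed TOPI signature):
   
   ```python
   import numpy as np
   
   def embedding_grad(out_grad, indices, num_rows, padding_idx=None,
                      scale_grad_by_freq=False):
       """Dense gradient of embedding w.r.t. the weight/table matrix.
   
       out_grad: (..., dim) gradient flowing into the embedding output
       indices:  (...,)     integer indices used in the forward pass
       num_rows: number of rows in the embedding table
       """
       dim = out_grad.shape[-1]
       flat_idx = indices.reshape(-1)
       flat_grad = out_grad.reshape(-1, dim)
   
       grad = np.zeros((num_rows, dim), dtype=out_grad.dtype)
       # scatter-add: each output gradient accumulates into its index's row
       np.add.at(grad, flat_idx, flat_grad)
   
       if scale_grad_by_freq:
           # divide row j by sum(indices == j), i.e. its frequency in the batch
           counts = np.bincount(flat_idx, minlength=num_rows).reshape(-1, 1)
           grad = grad / np.maximum(counts, 1)
   
       if padding_idx is not None:
           # padding_idx always gets a zero gradient
           grad[padding_idx] = 0
       return grad
   ```
   
   If that reading is right, `sparse=True` would produce the same values but stored as a row-sparse tensor containing only the rows that actually appear in `indices`.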
   
   mxnet has:
   - `sparse_grad: boolean`: gradient for the weight is row-sparse (probably the same as PyTorch's `sparse` above?)
   
   TF/keras has:
   - `mask_zero: boolean`: "whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1)." I don't fully understand this, but it seems similar to PyTorch's `padding_idx`, except that it relies on TF-specific masking support in downstream layers (see the short snippet after this list). If they are effectively equivalent, I prefer PyTorch's approach.
   - `input_length: int`: "Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed)." Again, this sounds like some kind of TF/Keras design quirk.
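   
   For reference, a minimal Keras snippet showing what `mask_zero` appears to do: index 0 still gets an embedding vector, but the layer emits a boolean mask that mask-aware downstream layers use to ignore those positions (the shapes and values here are arbitrary):
   
   ```python
   import numpy as np
   import tensorflow as tf
   
   layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)
   x = np.array([[1, 2, 0, 0],
                 [3, 5, 7, 0]])      # 0 marks padded positions
   out = layer(x)                    # shape (2, 4, 4)
   mask = layer.compute_mask(x)      # [[True, True, False, False],
                                     #  [True, True, True,  False]]
   ```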
   
   In my opinion, we should aim for PyTorch's API over TF/Keras's, but perhaps others can give more insight. We are also thinking about adding sparse gradient support, so it may be best to add a sparse `attr` now but raise an error until it is actually implemented.
   
   ### Shapes
   PyTorch and mxnet support arbitrary input shape. In particular, if our embedding dimension is `dim`, the shape relation is `(d1,...,dn) -> (d1,...,dn, dim)`. 
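   
   For example, in PyTorch the index tensor can have any shape and `dim` is simply appended (shapes here are arbitrary):
   
   ```python
   import torch
   
   emb = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)
   idx = torch.randint(0, 100, (2, 3, 5))   # arbitrary index shape (d1, d2, d3)
   out = emb(idx)
   print(out.shape)                         # torch.Size([2, 3, 5, 8])
   ```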
   
   TF/Keras is more restrictive: it only supports `(batch_size, input_length) -> (batch_size, input_length, dim)`, which is just the PyTorch/mxnet rule specialized to 2-D inputs.
   
   This PR currently proposes `(flat_length,) -> (flat_length, dim)`. Note that we can easily support the PyTorch and mxnet approach by flattening the indices and then "reshaping the first n dimensions": `(d1,...,dn) -> (d1 * ... * dn) -> (d1 * ... * dn, dim) -> (d1,...,dn,dim)`. I imagine this should be easy to implement but I'm not too familiar with TOPI.
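   
   A quick numpy sketch of this flatten-then-reshape lowering, just to illustrate the shape manipulation (this is not TOPI code):
   
   ```python
   import numpy as np
   
   weight = np.random.randn(100, 8)               # (vocab_size, dim) table
   indices = np.random.randint(0, 100, (2, 3, 5)) # (d1, d2, d3)
   
   flat = indices.reshape(-1)                     # (d1 * d2 * d3,)
   gathered = np.take(weight, flat, axis=0)       # (d1 * d2 * d3, dim)
   out = gathered.reshape(indices.shape + (8,))   # (d1, d2, d3, dim)
   
   # equivalent to direct advanced indexing on the table
   assert np.array_equal(out, weight[indices])
   ```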

