- Fix predict mode in PositionalEncoding with d_feature. (PiperOrigin-RevId: 378013716, Copybara-Service)
- Compare decoding time of the sparse and baseline Transformer. (PiperOrigin-RevId: 376289561, Copybara-Service)
- Experiments with addition learning with Transformer and Reformer. (PiperOrigin-RevId: 376249972, Copybara-Service)
- Monkey-patch the mask of Terraformer and re-add the Terraformer tests. This is a (hopefully very temporary) fix for incorrect padding in predict mode for the merged encoder and decoder used in Reformer2. Essentially, a lot of the code assumes that the decoder doesn't require a mask, as it just uses a causal mask (if any is needed). Unfortunately, this is not the case for merged encoder/decoder attention in predict mode. Normally, for causal attention, padding tokens are simply moved to the end of the sequence, which makes them irrelevant for computing attention weights. However, in merged attention in Reformer2, at each step after the first one the new tokens are concatenated to the end of the sequence, so padding tokens sit in the middle of the sequence and are attended to. The same problem exists for LocallyConvDense and the SRU unit (used in FeedForward), so I've patched them too. Pure LSH attention is not patched at this moment; mixed LSH attention will work up to "std_length". I fixed all of this by monkey patching, i.e. changing the underlying class/function, which let me bypass the stack and many layers of abstraction to get it working (see the sketch after this list). It's not really a long-term solution. (PiperOrigin-RevId: 376019425, Copybara-Service)
- No need for a validation split if eval_holdout_size has been specified. (PiperOrigin-RevId: 375127276, Copybara-Service)
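
Since the monkey-patching fix above is described only in prose, here is a minimal, hypothetical Python sketch of the pattern it refers to: replacing a method on an existing class so the corrected mask is applied everywhere the class is used, without touching the surrounding layer stack. The class name `EncoderDecoderAttention`, the method `_compute_mask`, and the mask logic below are illustrative assumptions, not the actual Trax identifiers.

```python
import numpy as np


class EncoderDecoderAttention:
    """Stand-in for the merged encoder/decoder attention layer in Reformer2 (hypothetical)."""

    def _compute_mask(self, token_ids, pad_id=0):
        # Original behaviour: rely on a plain causal mask and assume padding
        # tokens have been pushed to the end of the sequence.
        length = token_ids.shape[-1]
        return np.tril(np.ones((length, length), dtype=bool))


def _patched_compute_mask(self, token_ids, pad_id=0):
    # Patched behaviour: combine the causal mask with an explicit padding
    # mask, so padding tokens that end up in the middle of the sequence
    # during incremental decoding are never attended to.
    length = token_ids.shape[-1]
    causal = np.tril(np.ones((length, length), dtype=bool))
    not_pad = token_ids != pad_id             # shape: (length,)
    return causal & not_pad[np.newaxis, :]    # mask out padding columns


# The monkey patch: swap the method on the class itself, so every existing
# and future instance picks up the corrected mask computation.
EncoderDecoderAttention._compute_mask = _patched_compute_mask

# Example: token id 0 is a padding token sitting in the middle of the sequence;
# with the patch applied, its column is masked out of the attention weights.
attn = EncoderDecoderAttention()
print(attn._compute_mask(np.array([5, 7, 0, 9])))
```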