easy_vision.python.core.transformer¶
easy_vision.python.core.transformer.attention_layer¶
Implementation of multiheaded attention and self-attention layers.
class easy_vision.python.core.transformer.attention_layer.Attention(hidden_size, num_heads, attention_dropout, train)[source]¶
Bases: tensorflow.python.layers.base.Layer
Multi-headed attention layer.
call(x, y, bias, cache=None)[source]¶
Apply attention mechanism to x and y (a sketch of the computation follows this entry).
Parameters: - x – a tensor with shape [batch_size, length_x, hidden_size]
- y – a tensor with shape [batch_size, length_y, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache – (Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items {"k": tensor with shape [batch_size, i, key_channels], "v": tensor with shape [batch_size, i, value_channels]}, where i is the current decoded length.
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
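Once the inputs have been projected to queries, keys and values and split into heads (see split_heads and combine_heads below), the core of call is scaled dot-product attention with an additive bias. The sketch below is illustrative only, not the library source; during prediction the cached "k"/"v" tensors would typically be concatenated onto the keys and values along the length axis before this step.

    import tensorflow as tf

    def scaled_dot_product_attention_sketch(q, k, v, bias):
        # q: [batch, num_heads, length_q, depth]; k, v: [batch, num_heads, length_k, depth]
        # bias broadcasts against [batch, num_heads, length_q, length_k]
        depth = tf.cast(tf.shape(q)[-1], tf.float32)
        logits = tf.matmul(q, k, transpose_b=True) / tf.sqrt(depth)  # query-key scores
        logits += bias                      # -1e9 at masked positions, 0 elsewhere
        weights = tf.nn.softmax(logits)     # attention distribution over keys
        return tf.matmul(weights, v)        # [batch, num_heads, length_q, depth]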
combine_heads(x)[source]¶
Combine tensor that has been split.
Parameters: x – A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
Returns: A tensor with shape [batch_size, length, hidden_size]
split_heads(x)[source]¶
Split x into different heads, and transpose the resulting value.
The tensor is transposed to ensure the inner dimensions hold the correct values during the matrix multiplication.
Parameters: x – A tensor with shape [batch_size, length, hidden_size]
Returns: A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
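Both helpers amount to a reshape plus a transpose. The sketch below assumes the hidden dimension is statically known and divisible by num_heads; it is illustrative, not the module source.

    import tensorflow as tf

    def split_heads_sketch(x, num_heads):
        # [batch, length, hidden] -> [batch, num_heads, length, hidden // num_heads]
        batch, length = tf.shape(x)[0], tf.shape(x)[1]
        hidden = x.shape.as_list()[-1]
        depth = hidden // num_heads
        x = tf.reshape(x, [batch, length, num_heads, depth])
        return tf.transpose(x, [0, 2, 1, 3])

    def combine_heads_sketch(x):
        # Inverse of split_heads_sketch:
        # [batch, num_heads, length, depth] -> [batch, length, num_heads * depth]
        x = tf.transpose(x, [0, 2, 1, 3])            # [batch, length, heads, depth]
        batch, length = tf.shape(x)[0], tf.shape(x)[1]
        num_heads, depth = x.shape.as_list()[2], x.shape.as_list()[3]
        return tf.reshape(x, [batch, length, num_heads * depth])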
class easy_vision.python.core.transformer.attention_layer.SelfAttention(hidden_size, num_heads, attention_dropout, train)[source]¶
Bases: easy_vision.python.core.transformer.attention_layer.Attention
Multi-headed self-attention layer.
call(x, bias, cache=None)[source]¶
Apply self-attention mechanism to x (x attends to itself).
Parameters: - x – a tensor with shape [batch_size, length_x, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache – (Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items {"k": tensor with shape [batch_size, i, key_channels], "v": tensor with shape [batch_size, i, value_channels]}, where i is the current decoded length.
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
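For incremental decoding, the cache must follow the documented {"k", "v"} layout. One way it could be initialized before the first step (i = 0) is sketched below; the sizes, and the assumption that key_channels/value_channels equal hidden_size, are illustrative rather than taken from the library.

    import tensorflow as tf

    batch_size, hidden_size = 4, 512   # illustrative values

    # "k"/"v" hold the keys and values of the i tokens decoded so far;
    # before decoding starts they can be empty along the length axis.
    cache = {
        "k": tf.zeros([batch_size, 0, hidden_size]),
        "v": tf.zeros([batch_size, 0, hidden_size]),
    }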
easy_vision.python.core.transformer.beam_search¶
Beam search to find the translated sequence with the highest probability.
Source implementation from Tensor2Tensor: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/beam_search.py
class easy_vision.python.core.transformer.beam_search.SequenceBeamSearch(symbols_to_logits_fn, vocab_size, batch_size, beam_size, alpha, max_decode_length, eos_id)[source]¶
Bases: object
Implementation of beam search loop.
__init__(symbols_to_logits_fn, vocab_size, batch_size, beam_size, alpha, max_decode_length, eos_id)[source]¶
Parameters: - symbols_to_logits_fn – a decoding function that calculates logits of the next tokens
- vocab_size – size of vocabulary dict
- batch_size – batch size
- beam_size – beam search width
- alpha – length penalty for beam search
- max_decode_length – max decode steps
- eos_id – end of sequence id
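The alpha argument controls length normalization of candidate scores. In the Tensor2Tensor implementation this module references, the penalty typically takes the GNMT form sketched below, so larger alpha favors longer sequences; this is illustrative context rather than a guarantee about this module's exact formula.

    def length_normalization_sketch(alpha, length):
        # GNMT-style length penalty used by Tensor2Tensor-derived beam search;
        # finished-sequence log-probabilities are divided by this factor.
        return ((5.0 + float(length)) / 6.0) ** alpha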
easy_vision.python.core.transformer.beam_search.sequence_beam_search(symbols_to_logits_fn, initial_ids, initial_cache, vocab_size, beam_size, alpha, max_decode_length, eos_id)[source]¶
Search for sequence of subtoken ids with the largest probability (a usage sketch follows this entry).
Parameters: - symbols_to_logits_fn – A function that takes in ids, index, and cache as arguments. The passed-in arguments will have shape:
ids -> [batch_size * beam_size, index]
index -> [] (scalar)
cache -> nested dictionary of tensors [batch_size * beam_size, …]
The function must return logits and new cache:
logits -> [batch_size * beam_size, vocab_size]
new cache -> same shape/structure as the input cache
- initial_ids – Starting ids for each batch item. int32 tensor with shape [batch_size]
- initial_cache – dict containing starting decoder variables information
- vocab_size – int, size of the vocabulary
- beam_size – int number of beams
- alpha – float defining the strength of length normalization
- max_decode_length – maximum length of the decoded sequence
- eos_id – int id of eos token, used to determine when a sequence has finished
Returns: Top decoded sequences [batch_size, beam_size, max_decode_length], sequence scores [batch_size, beam_size]
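A hedged usage sketch: the call below follows the documented signature, while the stub symbols_to_logits_fn, the empty initial_cache, and the numeric values (vocabulary size, beam width, alpha, eos_id, decode length) are placeholders for illustration, not defaults of this module.

    import tensorflow as tf
    from easy_vision.python.core.transformer.beam_search import sequence_beam_search

    batch_size, vocab_size = 4, 32000   # illustrative values

    def symbols_to_logits_fn(ids, index, cache):
        # ids:   int32 [batch_size * beam_size, index] tokens decoded so far
        # index: [] scalar decode step
        # cache: nested dict of [batch_size * beam_size, ...] tensors
        # A real decoder would compute next-token logits here; this stub
        # returns uniform logits so the sketch stays self-contained.
        logits = tf.zeros([tf.shape(ids)[0], vocab_size])
        return logits, cache

    decoded_ids, scores = sequence_beam_search(
        symbols_to_logits_fn=symbols_to_logits_fn,
        initial_ids=tf.zeros([batch_size], dtype=tf.int32),
        initial_cache={},        # would normally hold the decoder state tensors
        vocab_size=vocab_size,
        beam_size=4,             # illustrative beam width
        alpha=0.6,               # illustrative length-penalty strength
        max_decode_length=50,
        eos_id=1)                # illustrative end-of-sequence id
    # decoded_ids: [batch_size, beam_size, max_decode_length]
    # scores:      [batch_size, beam_size]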
easy_vision.python.core.transformer.common¶
easy_vision.python.core.transformer.ffn_layer¶
Implementation of the fully connected feed-forward network (a generic sketch follows).
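No API details are listed for this module, but the standard Transformer position-wise feed-forward block is two dense layers with a ReLU in between; the sketch below is a generic illustration under that assumption, not this module's actual class or signature.

    import tensorflow as tf

    def feed_forward_sketch(x, filter_size, hidden_size, relu_dropout, train):
        # Position-wise feed-forward block: dense -> ReLU -> (dropout) -> dense.
        # Argument names are illustrative, not this module's API.
        h = tf.layers.dense(x, filter_size, activation=tf.nn.relu, name="filter_layer")
        if train:
            h = tf.nn.dropout(h, keep_prob=1.0 - relu_dropout)
        return tf.layers.dense(h, hidden_size, name="output_layer")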
easy_vision.python.core.transformer.transformer_utils¶
Transformer model helper methods.
easy_vision.python.core.transformer.transformer_utils.get_decoder_self_attention_bias(length)[source]¶
Calculate bias for decoder that maintains model’s autoregressive property.
Creates a tensor that masks out locations that correspond to illegal connections, so prediction at position i cannot draw information from future positions.
Parameters: length – int length of sequences in batch.
Returns: float tensor of shape [1, 1, length, length]
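A minimal sketch of how such a bias can be built (illustrative, not the module source): positions j > i receive a large negative value so the softmax assigns them effectively zero weight.

    import tensorflow as tf

    def decoder_self_attention_bias_sketch(length, neg_inf=-1e9):
        # Lower-triangular "valid" matrix: position i may attend to j <= i only.
        valid = tf.linalg.band_part(tf.ones([length, length]), -1, 0)
        bias = neg_inf * (1.0 - valid)          # 0 where allowed, -1e9 where masked
        return tf.reshape(bias, [1, 1, length, length])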
easy_vision.python.core.transformer.transformer_utils.get_padding(sequence_length, dtype=tf.float32)[source]¶
Return a float tensor indicating which positions are padding, given the sequence lengths.
Parameters: - sequence_length – input sequence length with shape [batch_size]
- dtype – type of the output
Returns: float tensor containing values 0 or 1 for each sequence position (0 -> non-padding, 1 -> padding)
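An equivalent computation can be sketched with tf.sequence_mask; the explicit max_length argument is an assumption of this sketch, since the documented function takes only sequence_length.

    import tensorflow as tf

    def get_padding_sketch(sequence_length, max_length, dtype=tf.float32):
        # 1 marks padded positions, 0 marks real tokens (assumes a float dtype).
        non_padding = tf.sequence_mask(sequence_length, max_length, dtype=dtype)
        return 1.0 - non_padding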
easy_vision.python.core.transformer.transformer_utils.get_padding_bias(sequence_length, res_rank=4)[source]¶
Calculate an attention bias tensor from the padding positions implied by sequence_length.
The bias tensor is added to the pre-softmax multi-headed attention logits, which have shape [batch_size, num_heads, length, length]. The bias is zero at non-padding locations and -1e9 (effectively negative infinity) at padding locations.
Parameters: - sequence_length – input sequence length with shape [batch_size]
- res_rank – int indicating the rank of the returned attention bias.
Returns: Attention bias tensor of shape [batch_size, 1, 1, length] if res_rank = 4 (for Transformer), or [batch_size, 1, length] if res_rank = 3 (for ConvS2S)
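A sketch of the shape handling (illustrative only; as with get_padding, the explicit max_length argument is an assumption of the sketch):

    import tensorflow as tf

    def get_padding_bias_sketch(sequence_length, max_length, res_rank=4):
        # Padding positions get -1e9 so they vanish after the softmax.
        padding = 1.0 - tf.sequence_mask(sequence_length, max_length, dtype=tf.float32)
        bias = padding * -1e9                                   # [batch_size, length]
        if res_rank == 4:
            return tf.expand_dims(tf.expand_dims(bias, 1), 1)   # [batch_size, 1, 1, length]
        return tf.expand_dims(bias, 1)                          # [batch_size, 1, length]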
easy_vision.python.core.transformer.transformer_utils.get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=10000.0)[source]¶
Return positional encoding.
Calculates the position encoding as a mix of sine and cosine functions with geometrically increasing wavelengths. Defined and formulated in Attention Is All You Need, section 3.5.
Parameters: - length – Sequence length.
- hidden_size – Size of the hidden (encoding) dimension.
- min_timescale – Minimum scale that will be applied at each position
- max_timescale – Maximum scale that will be applied at each position
Returns: Tensor with shape [length, hidden_size]
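A sketch of how this encoding is commonly computed, following the formula in section 3.5 of Attention Is All You Need; it is illustrative rather than this module's exact source, and assumes an even hidden_size.

    import numpy as np
    import tensorflow as tf

    def position_encoding_sketch(length, hidden_size,
                                 min_timescale=1.0, max_timescale=1.0e4):
        # Geometrically spaced timescales, half used for sin and half for cos.
        position = tf.cast(tf.range(length), tf.float32)
        num_timescales = hidden_size // 2
        log_increment = (np.log(max_timescale / min_timescale) /
                         max(num_timescales - 1, 1))
        inv_timescales = min_timescale * tf.exp(
            tf.cast(tf.range(num_timescales), tf.float32) * -log_increment)
        scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
        # Result shape: [length, hidden_size].
        return tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)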