# API - Reinforcement Learning

We provide two reinforcement learning libraries:

• RL-tutorial for professional users with low-level APIs.

• RLzoo for simple usage with high-level APIs.

| Function | Description |
| --- | --- |
| `discount_episode_rewards`([rewards, gamma, mode]) | Take a 1D float array of rewards and compute the discounted rewards for an episode. |
| `cross_entropy_reward_loss`(logits, actions, …) | Calculate the loss for a policy gradient network. |
| `log_weight`(probs, weights[, name]) | Log weight. |
| `choice_action_by_probs`([probs, action_list]) | Choose and return an action according to the given action probability distribution. |

## Reward functions

`tensorlayer.rein.discount_episode_rewards`(rewards=None, gamma=0.99, mode=0)

Take a 1D float array of rewards and compute the discounted rewards for an episode. In the default mode (mode == 0), a non-zero reward is treated as the end of an episode.

Parameters
• rewards (list) – List of rewards

• gamma (float) – Discount factor

• mode (int) –

Mode for computing the discounted rewards.
• If mode == 0, reset the discounting process whenever a non-zero reward is encountered (as in the Ping-pong game).

• If mode == 1, do not reset the discounting process.

Returns

The discounted rewards.

Return type

list of float

Examples

```
>>> import numpy as np
>>> import tensorlayer as tl
>>> rewards = np.asarray([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1])
>>> gamma = 0.9
>>> # mode=0 (default): discounting restarts at every non-zero reward
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma)
>>> print(discount_rewards)
[ 0.72899997  0.81        0.89999998  1.          0.72899997  0.81
  0.89999998  1.          0.72899997  0.81        0.89999998  1.        ]
>>> # mode=1: discounting continues across non-zero rewards
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma, mode=1)
>>> print(discount_rewards)
[ 1.52110755  1.69011939  1.87791049  2.08656716  1.20729685  1.34144104
  1.49048996  1.65610003  0.72899997  0.81        0.89999998  1.        ]
```
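The two modes described above can be sketched in plain Python. This is an illustrative re-implementation written for this page, not the library's source; the function name `discount_rewards_sketch` is hypothetical, and it assumes the discounted value at each step is the current reward plus `gamma` times the value of the next step.

```python
def discount_rewards_sketch(rewards, gamma=0.99, mode=0):
    """Illustrative pure-Python version of the two modes (hypothetical helper)."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future rewards.
    for t in reversed(range(len(rewards))):
        if mode == 0 and rewards[t] != 0:
            running = 0.0  # mode 0: a non-zero reward starts a fresh episode
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted


rewards = [0, 0, 0, 1, 0, 0, 0, 1]
reset = discount_rewards_sketch(rewards, gamma=0.9, mode=0)
no_reset = discount_rewards_sketch(rewards, gamma=0.9, mode=1)
print([round(r, 4) for r in reset])
print([round(r, 4) for r in no_reset])
```

With mode 0 each block of four steps is discounted independently, while with mode 1 the earlier steps also receive credit for the later reward.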

## Cost functions

### Weighted Cross Entropy

`tensorlayer.rein.cross_entropy_reward_loss`(logits, actions, rewards, name=None)

Calculate the loss for a policy gradient network.

Parameters
• logits (tensor) – The network outputs before softmax. This function applies softmax internally.

• actions (tensor or placeholder) – The agent actions.

• rewards (tensor or placeholder) – The rewards.

Returns

The loss, as a TensorFlow tensor.

Return type

Tensor

Examples

```
>>> import tensorflow as tf
>>> import tensorlayer as tl
>>> from tensorlayer.layers import InputLayer, DenseLayer
>>> # D (input size), H (hidden units), learning_rate and decay_rate are user-defined
>>> states_batch_pl = tf.placeholder(tf.float32, shape=[None, D])
>>> network = InputLayer(states_batch_pl, name='input')
>>> network = DenseLayer(network, n_units=H, act=tf.nn.relu, name='relu1')
>>> network = DenseLayer(network, n_units=3, name='out')
>>> logits = network.outputs  # unnormalized scores for the 3 actions
>>> sampling_prob = tf.nn.softmax(logits)  # action probabilities used for sampling
>>> actions_batch_pl = tf.placeholder(tf.int32, shape=[None])
>>> discount_rewards_batch_pl = tf.placeholder(tf.float32, shape=[None])
>>> loss = tl.rein.cross_entropy_reward_loss(logits, actions_batch_pl, discount_rewards_batch_pl)
>>> train_op = tf.train.RMSPropOptimizer(learning_rate, decay_rate).minimize(loss)
```
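To make the idea of a reward-weighted cross entropy concrete, here is a minimal standard-library sketch. It assumes the loss is the softmax cross entropy of each chosen action, multiplied by that step's reward and summed over the batch; the reduction the library actually uses is not stated on this page, and the helper names `softmax` and `reward_weighted_cross_entropy` are hypothetical.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_cross_entropy(logits_batch, actions, rewards):
    """Sum of -log softmax(logits)[action] * reward (assumed form of the loss)."""
    loss = 0.0
    for logits, action, reward in zip(logits_batch, actions, rewards):
        prob_of_action = softmax(logits)[action]
        loss += -math.log(prob_of_action) * reward
    return loss


logits_batch = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
actions = [0, 2]        # index of the action taken at each step
rewards = [1.0, -0.5]   # discounted reward assigned to each step
loss = reward_weighted_cross_entropy(logits_batch, actions, rewards)
print(loss)
```

Minimizing this quantity raises the probability of actions paired with positive rewards and lowers it for actions paired with negative rewards.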

### Log weight

`tensorlayer.rein.log_weight`(probs, weights, name='log_weight')

Apply the log-weighted expression to `probs` and `weights`.

Parameters
• probs (tensor) – The probabilities. A raw network output should usually be scaled to [0, 1] via softmax first.

• weights (tensor) – The weights.

Returns

The tensor after applying the log-weighted expression.

Return type

Tensor
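The page does not spell out the log-weighted expression. The sketch below assumes it is the mean of `log(probs) * weights` over all entries, which is one common form in policy gradient code; treat both that formula and the helper name `log_weight_sketch` as assumptions rather than the library's definition.

```python
import math


def log_weight_sketch(probs, weights):
    """Mean of log(p) * w over all entries (assumed form of the expression)."""
    terms = [math.log(p) * w for p, w in zip(probs, weights)]
    return sum(terms) / len(terms)


probs = [0.7, 0.2, 0.9]      # probabilities, already scaled to [0, 1]
weights = [1.0, -1.0, 0.5]   # e.g. rewards or advantages
value = log_weight_sketch(probs, weights)
print(value)
```

Because `log(p)` is undefined at 0, probabilities passed to such an expression should be strictly positive, which a softmax output guarantees.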

## Sampling functions

`tensorlayer.rein.choice_action_by_probs`(probs=(0.5, 0.5), action_list=None)

Choose and return an action according to the given action probability distribution.

Parameters
• probs (list of float) – The probability distribution over all actions.

• action_list (None or a list of int or others) – A list of actions, given as integers, strings or other values. If None, the returned action is an integer between 0 and len(probs)-1.

Returns

The chosen action.

Return type

float, int or str

Examples

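```
>>> import tensorlayer as tl
>>> # Sampling is random, so the printed actions vary between runs.
>>> for _ in range(5):
...     a = tl.rein.choice_action_by_probs([0.2, 0.4, 0.4])
...     print(a)
0
1
1
2
1
>>> for _ in range(3):
...     a = tl.rein.choice_action_by_probs([0.5, 0.5], ['a', 'b'])
...     print(a)
a
b
b
```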
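The sampling behaviour described above can be approximated with the standard library's `random.choices`. This is a sketch under stated assumptions, not the library's implementation: the helper name `choose_action_sketch`, the `rng` parameter and the length check are all introduced here for illustration.

```python
import random


def choose_action_sketch(probs, action_list=None, rng=random):
    """Sample one action according to probs (hypothetical helper)."""
    if action_list is None:
        # Default actions are the integers 0 .. len(probs)-1.
        action_list = list(range(len(probs)))
    if len(action_list) != len(probs):
        raise ValueError("action_list and probs must have the same length")
    # random.choices returns a list of k samples; take the single sample.
    return rng.choices(action_list, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded generator so repeated runs sample identically
samples = [choose_action_sketch([0.2, 0.4, 0.4], rng=rng) for _ in range(1000)]
named = choose_action_sketch([0.0, 1.0], ['a', 'b'], rng=rng)
print(samples[:5], named)
```

Passing a seeded `random.Random` instance keeps the sketch reproducible in tests, while the default falls back to the module-level generator.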