API - Reinforcement Learning¶
Reinforcement Learning.
discount_episode_rewards([rewards, gamma, mode]) | Take a 1D float array of rewards and compute the discounted rewards for an episode.
cross_entropy_reward_loss(logits, actions, …) | Calculate the loss for a Policy Gradient Network.
log_weight(probs, weights[, name]) | Compute the log-weighted expression of probabilities and weights.
choice_action_by_probs([probs, action_list]) | Choose and return an action given an action probability distribution.
Reward functions¶
tensorlayer.rein.discount_episode_rewards(rewards=None, gamma=0.99, mode=0)[source]¶
Take a 1D float array of rewards and compute the discounted rewards for an episode. When a non-zero reward is encountered, it is treated as the end of an episode.
Parameters: - rewards (list) – A list of rewards.
- gamma (float) – The discount factor.
- mode (int) – The mode for computing the discounted rewards.
- If mode == 0, reset the discounting when a non-zero reward is encountered (e.g. the Ping-Pong game).
- If mode == 1, do not reset the discounting.
Returns: The discounted rewards.
Return type: list of float
Examples
>>> rewards = np.asarray([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1])
>>> gamma = 0.9
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma)
>>> print(discount_rewards)
... [ 0.72899997  0.81        0.89999998  1.          0.72899997  0.81
...   0.89999998  1.          0.72899997  0.81        0.89999998  1.        ]
>>> discount_rewards = tl.rein.discount_episode_rewards(rewards, gamma, mode=1)
>>> print(discount_rewards)
... [ 1.52110755  1.69011939  1.87791049  2.08656716  1.20729685  1.34144104
...   1.49048996  1.65610003  0.72899997  0.81        0.89999998  1.        ]
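For reference, the discounting rule in both modes can be reproduced with a short NumPy loop. This is a minimal sketch of the assumed semantics (a backwards running sum, reset at episode boundaries in mode 0), not the library's actual implementation:

import numpy as np

def discount_sketch(rewards, gamma=0.99, mode=0):
    """Sketch of episode-reward discounting (assumed semantics)."""
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    # Walk backwards, accumulating running = running * gamma + r.
    for t in reversed(range(len(rewards))):
        if mode == 0 and rewards[t] != 0:
            running = 0.0  # a non-zero reward marks an episode boundary
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

With rewards = [0, 0, 0, 1, ...] and gamma = 0.9, this reproduces the [0.729, 0.81, 0.9, 1.0, ...] sequence shown above.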
Cost functions¶
Weighted Cross Entropy¶
tensorlayer.rein.cross_entropy_reward_loss(logits, actions, rewards, name=None)[source]¶
Calculate the loss for a Policy Gradient Network.
Parameters: - logits (tensor) – The network outputs before softmax; this function applies softmax internally.
- actions (tensor or placeholder) – The agent actions.
- rewards (tensor or placeholder) – The rewards.
Returns: The TensorFlow loss function.
Return type: Tensor
Examples
>>> states_batch_pl = tf.placeholder(tf.float32, shape=[None, D])
>>> network = InputLayer(states_batch_pl, name='input')
>>> network = DenseLayer(network, n_units=H, act=tf.nn.relu, name='relu1')
>>> network = DenseLayer(network, n_units=3, name='out')
>>> logits = network.outputs
>>> sampling_prob = tf.nn.softmax(logits)
>>> actions_batch_pl = tf.placeholder(tf.int32, shape=[None])
>>> discount_rewards_batch_pl = tf.placeholder(tf.float32, shape=[None])
>>> loss = tl.rein.cross_entropy_reward_loss(logits, actions_batch_pl, discount_rewards_batch_pl)
>>> train_op = tf.train.RMSPropOptimizer(learning_rate, decay_rate).minimize(loss)
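Conceptually, the loss is a per-step softmax cross entropy between the policy logits and the actions taken, weighted by the discounted rewards. A minimal TF1-style sketch under that assumption (not necessarily the library's exact reduction):

import tensorflow as tf  # TensorFlow 1.x, as in the example above

def cross_entropy_reward_loss_sketch(logits, actions, rewards, name=None):
    """Sketch of a reward-weighted cross-entropy loss (assumed semantics)."""
    # Per-step cross entropy between policy logits and the sampled actions.
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits, name=name)
    # Weight each step by its discounted reward and sum over the batch.
    return tf.reduce_sum(tf.multiply(ce, rewards))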
Log weight¶
tensorlayer.rein.log_weight(probs, weights, name='log_weight')[source]¶
Compute the log-weighted expression of the given probabilities and weights.
Parameters: - probs (tensor) – The probabilities; if these are raw network outputs, scale them to [0, 1] via softmax first.
- weights (tensor) – The weights.
Returns: The tensor after applying the log-weighted expression.
Return type: Tensor
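A plausible reading of this expression is the REINFORCE-style reward-weighted log probability; the following sketch assumes a mean reduction, which this page does not confirm:

import tensorflow as tf  # TensorFlow 1.x

def log_weight_sketch(probs, weights, name='log_weight'):
    """Sketch of a log-weighted expression (assumed semantics)."""
    with tf.name_scope(name):
        # Weight the log probabilities, as in E[log pi(a|s) * R].
        return tf.reduce_mean(tf.log(probs) * weights)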
Sampling functions¶
tensorlayer.rein.choice_action_by_probs(probs=(0.5, 0.5), action_list=None)[source]¶
Choose and return an action given the action probability distribution.
Parameters: - probs (list of float) – The probability distribution over all actions.
- action_list (None, or a list of int, str or other types) – The list of actions to choose from. If None, an integer between 0 and len(probs)-1 is returned.
Returns: The chosen action.
Return type: int or str (the element type of action_list)
Examples
>>> for _ in range(5):
>>>     a = choice_action_by_probs([0.2, 0.4, 0.4])
>>>     print(a)
... 0
... 1
... 1
... 2
... 1
>>> for _ in range(3):
>>>     a = choice_action_by_probs([0.5, 0.5], ['a', 'b'])
>>>     print(a)
... a
... b
... b
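The sampling behavior shown above can be reproduced with NumPy; this is a minimal sketch under that assumption, not the library's actual code:

import numpy as np

def choice_action_by_probs_sketch(probs=(0.5, 0.5), action_list=None):
    """Sketch of probability-weighted action sampling (assumed semantics)."""
    if action_list is None:
        # Default action set: integers 0 .. len(probs)-1.
        action_list = np.arange(len(probs))
    # Draw one action according to the given probability distribution.
    return np.random.choice(action_list, p=probs)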