In this project, I tried to model the motor control unit in a virtual environment. The environment used is OpenSim, a biomechanical physics engine for musculoskeletal simulations. The musculoskeletal model has 18 controllable muscles and 9 degrees of freedom.
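For reference, here is a minimal sketch of how such an environment can be set up, assuming the osim-rl Python package and its L2RunEnv class (an 18-muscle, 9-degree-of-freedom running model); the exact class name and call signatures are assumptions, not something stated in this post.

```python
import numpy as np
from osim.env import L2RunEnv  # assumed: osim-rl's 18-muscle, 9-DoF model

# Create the environment; visualize=True opens the OpenSim viewer.
env = L2RunEnv(visualize=True)
observation = env.reset()

# The action is a vector of 18 muscle excitations, each in [0, 1].
random_action = np.random.uniform(0.0, 1.0, size=18)
observation, reward, done, info = env.step(random_action)
```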
Every action we take begins in the brain. We learn to coordinate our muscles using our brain and take actions such as standing, walking, and jumping for granted. This process by which humans, or more broadly animals, coordinate and activate their muscles using their brain is referred to as motor control. In this post we see that artificial neural networks can be used to mimic the way our brain controls the human body.

Q-learning cannot be applied straightforwardly to a continuous action space. Finding the greedy policy in a continuous space requires optimizing over the action at every timestep, which is impractical in large, non-trivial action spaces. DDPG addresses this with an actor-critic approach, which represents the policy function independently of the value function. The actor takes the current state of the environment as input and outputs an action. The critic produces a temporal-difference error signal based on the state and the resulting reward. The critic's output is used to update both the actor and the critic.
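To make the actor-critic structure concrete, here is a minimal sketch of the two networks in PyTorch. The hidden-layer sizes and the sigmoid output (keeping muscle excitations in [0, 1]) are illustrative assumptions, not the exact architecture used in this project.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a deterministic action (muscle excitations in [0, 1])."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid(),  # excitations lie in [0, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Estimates Q(s, a) for the action proposed by the actor."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```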
The actor and critic are modelled as neural networks: at each timestep the actor chooses an action from the continuous action space given the current state, while the critic is trained to minimize the temporal-difference (TD) error. However, training neural networks this way assumes that the input samples are independent and identically distributed, whereas the samples generated by interacting with the environment are sequential and highly correlated. To address this, DDPG uses a finite-sized buffer of past transitions called a replay buffer: each interaction is stored in the buffer, training minibatches are sampled from it, and once the buffer is full the oldest samples are discarded. The input of the actor network is the current state, and the output is an action from the continuous action space (here, a vector of excitations for the 18 muscles). The critic outputs the estimated Q-value of the current state and the action chosen by the actor. The actor is updated using the deterministic policy gradient theorem, and the critic is updated from the gradients of the TD error.
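Below is a minimal sketch of the replay buffer and one DDPG update step, again in PyTorch. The buffer capacity, batch size, and the target networks used to form the bootstrapped critic target follow the standard DDPG recipe and are assumptions rather than details taken from this project.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Finite-sized store of (state, action, reward, next_state, done) transitions;
    the oldest entries are dropped automatically once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        to_tensor = lambda x: torch.as_tensor(np.array(x), dtype=torch.float32)
        return (to_tensor(states), to_tensor(actions), to_tensor(rewards),
                to_tensor(next_states), to_tensor(dones))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, buffer, batch_size=64, gamma=0.99):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # Critic: minimize the TD error against a bootstrapped target
    # computed with the (slowly updated) target networks.
    with torch.no_grad():
        target_q = rewards.unsqueeze(-1) + gamma * (1 - dones.unsqueeze(-1)) * \
                   target_critic(next_states, target_actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend the critic's Q estimate
    # of the actions the current policy would take.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```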
Some actions that were successfully taught to the model include standing on both legs (with the legs joined and in a split position), standing on one leg, and crouching. Videos of these actions are shown above.