
Procedia Manufacturing 11 (2017) 329 - 337

27th International Conference on Flexible Automation and Intelligent Manufacturing, FAIM2017,

27-30 June 2017, Modena, Italy

Reinforcement Learning for Manipulators Without Direct Obstacle Perception in Physically Constrained Environments

Marie Ossenkopf a,*, Philipp Ennen b, Rene Vossen b and Sabina Jeschke b

a Distributed Systems Group, Universität Kassel, Wilhelmshöher Allee 73, 34121 Kassel, Germany
b Institute of Information Management in Mechanical Engineering, RWTH Aachen University, Dennewartstr. 25, 52068 Aachen, Germany

Abstract

Reinforcement Learning algorithms have the downside of potentially dangerous exploration of unknown states, which makes them largely unsuitable for use on serial manipulators in an industrial setting. In this paper, we employ a policy search algorithm and provide two extensions that aim to make learning more applicable to robots in industrial environments without the need for complex sensors. Both build upon the use of Dynamic Movement Primitives (DMPs) as policy representation. Rather than explicitly modeling the skills of the robot, we describe actions the robot should not attempt. First, we implement potential fields into the DMPs to keep planned movements inside the robot's workspace. Second, we monitor and evaluate the deviation in the DMPs to recognize and learn from collisions. Both extensions are evaluated in a simulation.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of the scientific committee of the 27th International Conference on Flexible Automation and Intelligent Manufacturing.

Keywords: Dynamic Movement Primitives; Relative Entropy Policy Search; Intelligent Manufacturing; Potential Field; Assembly Robot

* Corresponding author. Tel.: +49 561 804-6280; fax: +49 561 804-6277. E-mail address: marie.ossenkopf@vs.uni-kassel.de


doi:10.1016/j.promfg.2017.07.115

1. Introduction

Adapting a robotic assembly system to a new task requires manual setup, which represents a major bottleneck for the automated production of small batch sizes. This paper aims at an alternative future approach to standard programming procedures such as RobotStudio: reducing manual commissioning time through a self-learning manufacturing system [1,2].

Serial manipulators for assembly or pick-and-place tasks in industrial environments (henceforth called robots) pose a high physical danger to themselves and their environment. This danger emerges from potential collisions of the robot with obstacles in the environment or with itself. Without knowledge or sensing of the obstacles, it is not possible to avoid a collision in the first place. As a self-learning robot requires trials for learning a new task, its learning time increases whenever it tries movements that cannot lead to a suitable solution, for instance movements that exceed its mechanical constraints with respect to maximum joint angles, velocities, and accelerations. Therefore, learning algorithms for use on industrial serial manipulators need to be adapted to address these problems.

We present two enhancements for the Relative Entropy Policy Search (REPS) algorithm [3] that enable a robot to: (1) detect collisions during the learning process, (2) react to collisions, (3) learn from collisions, and (4) avoid planning movements outside the maximum joint angles. The enhancements utilize Dynamic Movement Primitives (DMPs), an established policy representation for serial manipulators [4-6]. DMPs map a position and a velocity in state space onto an acceleration by modeling every dimension as a spring-damper system. To be independent of the kinematic model, we use the dimensions of the robot's joint angle space instead of world coordinates. In particular, the two enhancements are: (1) We integrate potential fields into the DMPs to lower the probability of exceeding the maximum joint angles. (2) We monitor the deviation between the planned trajectory and the current position to detect collisions and the exceedance of mechanical constraints. This enables us to interrupt a colliding movement and to use the deviation as an evaluation of the policy.

The new features work on any serial robot with angular encoders. There is no need for additional sensors, an elaborate vision system, or modeling of the environment. The approach is independent of knowledge of the kinematic and dynamic model of the robot, so no exact model has to be determined and the algorithm is unaffected by model errors. Obstacle avoidance can be learned without knowledge of the obstacles. Hence, these additional properties reduce the requirements imposed on the assembly system and the commissioning time. Though we test the algorithm on a well-known and controlled robot, the approach is explicitly aimed at reconfigurable robotic systems, whose kinematic model is not fully known at commissioning time and which therefore cannot be programmed in TCP space. We tested the algorithm in a simulation of the ABB IRB120 robot. As evaluation task we used a simple reaching task with obstacles. Our results show that the exceedance of maximum joint angles can be significantly reduced, that collisions can be detected instantaneously, and that the algorithm learns to avoid experienced collisions.

1.1. Overview

The remainder of this paper is organized as follows. Section 2 summarizes related work on making a learning process safe on real robots. Section 3 gives a brief explanation of DMPs and the learning algorithm used. In Section 4 we describe our algorithm setup. Sections 5 and 6 outline the two extensions in detail, which are then evaluated in the scenario of Section 7. Finally, the paper is concluded in Section 8.

http://new.abb.com/products/robotics/industrial-robots/irb-120

2. Related Work

Peters et al. [3,7] developed policy search algorithms that aim to be applicable on hardware robotic systems. Their approach includes limiting the distribution of trajectories as well as limiting the change generated by the update step. By demonstrating a policy near a possible solution for imitation learning, they avoid exploration in dangerous regions of the action space. However, their application examples mostly lack obstacles or easily damaged tools. Therefore, their approach cannot be freely transferred to an industrial assembly task.

There are several proposals on how to avoid obstacles in the work space by additional force terms and potential fields in the DMPs. Hoffmann et al. [8] added a term to the DMP that changes the direction of the velocity vector according to the angle between the current velocity vector and the vector from the tool center point (TCP) to the obstacle position. Park et al. [9] introduced the dynamic potential field to avoid obstacles. It consists of a repulsive force that is proportional to the distance to the obstacle and the rate at which that distance decreases, as well as a trend adjustment according to the current velocity vector. They also extended the approach to avoid link collisions. Rai et al. [10] extended this approach to spatially extended obstacles. Though these are impressive applications, they all rely on knowledge of the inverse kinematics, as the DMPs plan the movement in world coordinates of the TCP. To build potential fields around objects in the work space, the DMPs of the joints need to be interconnected, since only all joints together determine where the robot is located in the work space. Additionally, the robot can collide with every part of its body, not only the TCP. Hence, modeling all joint angle sets that lead to a collision with a given external object, together with a convenient force function pointing away from the obstacle, is a tremendous task that could hardly be automated.

An easy way to let the algorithm respect the mechanical constraints of the robot is to punish their violation through the evaluation strategy. In some existing work this is done by placing the minimum of the spatial reward function at the obstacle position. The policy is then learned in a simulation where the obstacle does not exist, and the reward structure leads to a solution that does not touch the obstacle in real life [11]. This approach requires samples in which the robot moves through the simulated obstacle. That makes it unsuitable for learning on a real robot without previous simulation (and hence modeling), since a real robot cannot move through an obstacle.

Secondly, the obstacle's position has to be described sufficiently well by the reward function to limit the solution space in the necessary but minimal way. This makes the approach rather inapplicable to tasks in complex environments, as the positions of obstacles are sometimes not fully known or even unpredictable. This problem could be counteracted by a complex vision system that detects all existing obstacles and formulates the reward function based on observations. In contrast, we concentrate on the motor skills and therefore solve the problem for a practically blind robot.

3. Background

Policy Search is one approach to Reinforcement Learning that tries to tackle the curse of dimensionality. This term refers to the exponential growth of states in high-dimensional state spaces, which also occurs in many robotics problems [12]. The policy π(a|s) links an observed state s to a subsequent action a. Its structure is defined by the representation, which should be chosen depending on the given task. Together with the exploration strategy, the policy results in a certain behavior of the agent during the task. This behavior sums up to an executed trajectory (in case of a robotic movement) that fulfills the given task to some extent. The evaluation strategy defines a reward for the state-action pair according to the fulfillment of the given task; sometimes it also concerns the subsequent state and action. This reward is then used by the update strategy to calculate new parameters for the policy in order to maximize the expected reward for the next trials.

3.1. Dynamic Movement Primitives

DMPs were introduced by Ijspeert in 2002 [5]. They form a policy representation in which each policy handles one dimension of the current position φ and velocity φ̇ as input state and gives the corresponding acceleration φ̈ as output action.

φ̈ = τ²α(β(g − φ) − φ̇) + τ²f_t    (1)

The resulting policy resembles a mass in a spring-damper system that is pushed by a force. The acceleration φ̈ of the mass is proportional to the distance to the goal position g − φ (deflection of the spring), the velocity φ̇ of the mass, and the external force f_t on the mass. The parameters α and β in equation (1) correspond to how stiff the spring and the damper are set. τ determines how fast the system responds to external and internal forces, which correlates to the inertia of the mass. The force term f_t consists of N weighted Gaussian basis functions ψ_n, whose means are distributed equidistantly over the time of a trial. The height of the weights w_n determines the eventual movement.

The force function f_t is designed to converge to zero for t → ∞, which leads to the position φ in state space converging to g. This is achieved via the canonical system x, which decays with α_x over t, and makes the DMP internally stable.

f_t(x) = (Σ_n ψ_n(x) w_n / Σ_n ψ_n(x)) · x,    ψ_n(x) = exp(−(x − c_n)² / (2σ_n²)),    ẋ = −α_x x    (2)

The equations (2) are written for a one-dimensional system and therefore apply to each degree of freedom separately. φ can either be an element of the position and orientation of the TCP or a joint angle [9]. If the state of the TCP is chosen, an inverse kinematics is necessary to obtain motor commands from the DMPs. This is therefore not suitable for a system without a known physical model, or for scenarios where modeling effort or model errors should be avoided.
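As an illustration, the one-dimensional DMP of equations (1) and (2) can be integrated with simple Euler steps. The following Python sketch assumes fixed basis widths and centers placed equidistantly in trial time (mapped into phase space through the canonical system); all parameter values are illustrative, not the ones used in the paper:

```python
import numpy as np

def rollout_dmp(w, g, phi0=0.0, tau=1.0, alpha=25.0, beta=6.25,
                alpha_x=4.0, dt=0.01, T=1.0):
    """Euler integration of a one-dimensional DMP, equations (1) and (2)."""
    N = len(w)
    # basis centers: equidistant in trial time, mapped into phase space
    c = np.exp(-alpha_x * np.linspace(0.0, T, N))
    sigma2 = np.full(N, 0.05)                      # basis widths (assumed)
    phi, dphi, x = phi0, 0.0, 1.0
    path = [phi]
    for _ in range(int(T / dt)):
        psi = np.exp(-(x - c) ** 2 / (2.0 * sigma2))
        f = (psi @ w) / (psi.sum() + 1e-10) * x    # forcing term f_t, eq. (2)
        ddphi = tau**2 * alpha * (beta * (g - phi) - dphi) + tau**2 * f  # eq. (1)
        dphi += ddphi * dt
        phi += dphi * dt
        x += -alpha_x * x * dt                     # canonical system decay
        path.append(phi)
    return np.array(path)
```

With all weights at zero the trajectory converges to the goal g, reflecting the internal stability noted above; nonzero weights shape the transient without affecting convergence, since the forcing term decays with x.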

3.2. Relative Entropy Policy Search

A problem of many policy search algorithms is that they update the policy in a way that moves the resulting state distribution far away from already sampled data [3]. This makes them very data inefficient, so they need to draw more time-consuming real-life samples. Peters et al. used the idea of bounding the relative entropy between the old and the new distribution to form the Relative Entropy Policy Search (REPS) algorithm. Assuming that the parameter distribution determines the state distribution, the algorithm uses the Kullback-Leibler divergence (KL) to measure the relative entropy between the old parameter distribution q(θ) and the probability distribution over parameters θ implied by a new policy π(θ):

ε ≥ KL(π(θ) ‖ q(θ)) = ∫ π(θ) log(π(θ)/q(θ)) dθ    (3)

1 = ∫ π(θ) dθ

Equation (3) is formulated as I-divergence [13] to ensure that the new policy only encourages trajectories in already explored areas. In addition, the old parameter distribution is estimated from the observed trajectories rather than the planned ones, so only realizable parameter sets are further encouraged. This prevents the new trajectory distribution from lying outside the old one and prevents the robot from moving into potentially dangerous marginal areas. Together with the maximization of the expected reward, this builds a constrained optimization problem, which can be solved using the dual function [14]. The policy π(θ) is then obtained by a weighted maximum likelihood estimate, as used in the M-step of Expectation Maximization algorithms. This approach also allows for reusing samples to be more data efficient.
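A minimal sketch of one such episodic REPS update: minimize the dual function of the constrained problem over the temperature η (here by a simple grid search rather than a proper optimizer, an implementation shortcut), weight each sample by exp(R/η), and fit a Gaussian by weighted maximum likelihood. Names and the grid bounds are assumptions:

```python
import numpy as np

def reps_update(thetas, rewards, epsilon=1.0):
    """One REPS-style update with KL bound epsilon (equation (3)).

    thetas  : (n, d) parameter vectors sampled from the old policy q(theta)
    rewards : (n,) returns of the corresponding rollouts
    """
    R = rewards - rewards.max()            # shift for numerical stability

    def dual(eta):                         # dual function of the KL-constrained problem
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    etas = np.logspace(-3, 3, 200)         # crude grid search over the dual variable
    eta = etas[np.argmin([dual(e) for e in etas])]

    w = np.exp(R / eta)                    # sample weights
    w /= w.sum()
    mu = w @ thetas                        # weighted ML mean
    diff = thetas - mu
    cov = (w[:, None] * diff).T @ diff     # weighted ML covariance
    return mu, cov
```

Because the weights come from already executed samples, the new Gaussian stays inside the explored region, which is exactly the safety property the KL bound is meant to provide.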

4. Control Policy

This section explains our exact policy setup to allow a better understanding of the subsequently presented extensions. Our action policy consists of a stochastic upper-level policy π_w(θ) and a deterministic lower-level policy π_s(τ_ref), which is formed by Dynamic Movement Primitives. The parameters w, which the algorithm learns, are the means and variances of Gaussian distributions over the parameters θ. These set the goal position g and the weights w_n of the applied force function (see equations (1) and (2)). The remaining parameters τ, α, β were set empirically to allow stable and fast system dynamics, mainly depending on the step size dt, which is 0.01 s in our case. The lower-level policy proposes a trajectory τ_ref, which consists of reference joint angles and velocities, φ_ref and φ̇_ref, for each joint and time step. We add no exploration noise to this trajectory, since the stochasticity of the upper-level policy already serves as exploration in parameter space.

The actual control policy is an extended PD controller that selects the joint acceleration to apply, depending on the difference between the current joint angles and velocities and the reference trajectory. It features an additional term that takes future changes into account by comparing the velocity planned for the current time step with the one planned for the next. This results in the following policy formulation.

a_t = K_P(φ_ref,t − φ_t) + K_D(φ̇_ref,t − φ̇_t) + (φ̇_ref,t+1 − φ̇_ref,t)/dt    (4)

The PD gains are not learned by the algorithm but set so that trajectories with the given dynamics are followed best. In our case, regarding the actuation bound of 84 rad/s² and typical deviations of 0.05 rad and 0.2 rad/s, gains of K_P = 400 and K_D = 100 are well suited. These gains follow the reference closely enough to allow an early diagnosis of a collision by monitoring the deviation, while not applying excessive forces or oscillations to the robot.

5. Potential Field against Angular Exceedance

The first extension is a potential field that we implemented into the DMPs to reduce the exceedance of maximum joint angles. For inspiration, we examined the human arm: in human motor control, angular restrictions of serial joints are handled by making it harder to move the joint near the maximum angle. This can be realized by introducing an additional force function into the DMP that pushes the trajectory away from the maximum angles. Since the angular limits only depend on the technical specification of the used robot, the potential field can be implemented without any further knowledge of the kinematic model.

The potential field needs to scale with the amplitude of force used by the action policy to prevent the algorithm from outrunning it with bigger parameters. Therefore, equation (5) is used as the potential field introduced into the DMP (see also Figure 1):

f_p(φ) =
  0                                                        : |φ| < φ_t
  −w̄ · tan((π/2) · (|φ| − φ_t)/(φ_max − φ_t)) · sign(φ)   : φ_t ≤ |φ| ≤ φ_max
  −10 · max_n|w_n| · sign(φ)                               : |φ| > φ_max    (5)

where w̄ denotes the mean of the learned weight magnitudes |w_n|.

φ_max denotes the angular limit, which lies at 0.9π in our case, shown in Figure 1 by the red lines. φ_t is the threshold angle at which the potential field takes effect; for our scenario, we chose π/2. The part of f_p located inside the angular limits scales with the mean weight of the learned force function, so the algorithm can still find a solution near the edge by choosing one weight sufficiently high. The part of f_p located outside the maximum angles is scaled by the maximum weight, so no parameter set is able to move the trajectory deeper into this region. Figure 2a shows the reference trajectories for three joints (green, blue and purple)

with stepwise scaled-up weighting factors w_n under the influence of the potential field. The scaling of w_n corresponds to initial variances of 0.05 to 0.29 in steps of 0.02. The red lines mark the maximum joint angles. With these potential fields, the violation of the maximum joint angles can be decreased significantly. Trajectories that exceed the angular bounds move back into the feasible area within a couple of time steps, as shown in Figure 2b, so the planned movement can be followed further.
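Because equation (5) survives only partially in the source layout, the following Python sketch encodes one plausible reading of the described behavior: zero force below the threshold φ_t, a tangent-shaped repulsive force scaled by the mean weight between φ_t and φ_max, and a strong constant push-back scaled by the maximum weight beyond the limit. The tangent shaping constant (2.2, keeping the force finite at the limit) is an assumption:

```python
import numpy as np

def potential_force(phi, w, phi_t=np.pi / 2, phi_max=0.9 * np.pi):
    """Joint-limit potential field, one possible reading of equation (5).

    phi : current joint angle of one joint
    w   : weights w_n of the learned force function for that joint
    """
    w_mean = np.mean(np.abs(w))
    w_max = np.max(np.abs(w))
    if abs(phi) < phi_t:
        return 0.0                                    # field inactive
    if abs(phi) <= phi_max:
        # 0 at the threshold, growing steeply toward the angular limit;
        # scaled by the mean weight, so a large weight can still override it
        frac = (abs(phi) - phi_t) / (phi_max - phi_t)
        return -w_mean * np.tan(frac * np.pi / 2.2) * np.sign(phi)
    # outside the limit: scaled by the maximum weight, so no parameter
    # set can push the trajectory deeper into the forbidden region
    return -10.0 * w_max * np.sign(phi)
```

Scaling the inner band by the mean weight and the outer region by the maximum weight mirrors the paper's design: learnable near the edge, inescapable beyond it.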

The occurrence of reference trajectories that exceed the maximum angles with increasing initial variance for a mean of zero is shown in Figure 3a. With a mean of zero, the initial variance is still limited by the maximum velocity and acceleration (see Figure 3b). If the mean of the initial policy is not zero, e.g., due to learning the first policy from imitation, the now closer angular limits become the decisive element, and the potential field strongly increases the feasible variance in this case. This makes it possible to set the initial variance higher in order to find a sufficient policy.

Fig. 1. Example course of the potential field force

Fig. 2. (a) Trajectories with stepwise scaled-up parameters; the direction of rising parameters w_n is denoted by the arrows; (b) Initial trajectory distribution with the same variance, right with potential field, left without potential field.


Fig. 3. (a) Percentage of proposed trajectories exceeding the maximum angles over the parameter variance with a mean of zero; (b) Percentage of proposed trajectories exceeding the maximum velocity over the parameter variance with a mean of zero.

6. Monitoring Trajectory Deviation

For the second extension, we use the property of our control policy to follow a preplanned reference trajectory. The deviation between planned and executed trajectory gives us a hint about the exceedance of motoric abilities and collisions. We use this to reinforce feasible trajectory plans and detect collisions during the trial.

6.1. Punishing Exceedance of Motoric Abilities and Collisions

The natural way of experiencing one's own limits is by hitting them and thereby missing the set objective. The deviation between planned and executed trajectory is induced either by exceedance of motoric abilities or by a collision, and is proportional to the size of the occurring violation. It is therefore a good measure for evaluating the feasibility of the tested policy. We thus assign negative credit proportional to the sum of the deviation over all time steps in the evaluation at the end of a trial. A collision leads to a strong punishment, because the control policy stops following the planned trajectory in this case; hence the summed deviation increases drastically until the end of the trial.

6.2. Detecting Collisions

In human motor control, a collision is detected by sensing the force applied to the moving body by an obstacle. This force could be measured directly with force sensors, but it is also visible in the deviation between the planned and the executed movement, which appears directly in our control policy. The difference between φ_t, φ̇_t and φ_ref,t, φ̇_ref,t is checked after every transition to recognize a collision through the emerging deviation. A deviation counts as a detected collision if it rises above the threshold ε_d.

|φ_t − φ_ref,t| + |φ̇_t − φ̇_ref,t| > ε_d    ∀t: φ_t ≠ φ_max, φ̇_t ≠ φ̇_max    (6)

In case of a detected collision, the control policy is adapted to achieve an emergency stop.

a_t = −K_D φ̇_t    (7)

To ignore deviations caused by exceedance of maximum joint angles and velocities, only the deviation of joints that are not currently moving at their maximum velocity or angle is considered (see the condition of inequation (6)). This is not possible for the acceleration: the applied acceleration jumps up in case of a collision, because the control policy sets it proportional to the deviation. If the reference trajectory exceeds the maximum acceleration, this is erroneously detected as a collision.

The problem of a false positive detection is that the operating parameter set will not be considered again due to the associated low reward. If this happens near a possible solution, it will seriously distract the algorithm from finding that solution. Nevertheless, a policy with high acceleration should not lie anywhere near the solution of an assembly task, so detecting a high acceleration as a collision does not damage the learning process or the robot. The height of ε_d also determines how hard the robot can touch an obstacle without being noticed by the collision detection.
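Putting inequation (6) and equation (7) together, a per-time-step collision check could look as follows. This is a sketch under the assumption that joint states are given as arrays and that saturated joints are masked out as described above; names are illustrative:

```python
import numpy as np

def check_collision(phi, dphi, phi_ref, dphi_ref,
                    phi_max, dphi_max, eps_d):
    """Deviation-based collision check of inequation (6).

    Only joints not saturated at their maximum angle or velocity are
    considered; a collision is flagged once the summed position and
    velocity deviation of any such joint exceeds the threshold eps_d.
    """
    active = (np.abs(phi) < phi_max) & (np.abs(dphi) < dphi_max)
    dev = np.abs(phi - phi_ref) + np.abs(dphi - dphi_ref)
    return bool(np.any(dev[active] > eps_d))

def emergency_stop(dphi, K_D=100.0):
    """Braking action of equation (7) applied once a collision is detected."""
    return -K_D * dphi
```

The emergency stop simply replaces the reference-tracking control law with pure velocity damping, which brings every joint to rest regardless of the (now invalid) planned trajectory.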

7. Test Scenario

As test scenario, we designed the task illustrated in Figure 4a. The robot starts in the shown position and is rewarded for reaching for the green star with its TCP. The hatched area represents a vertical wall that seriously hinders the robot in its movements.

Our modified learning algorithm is implemented using the MATLAB toolbox for policy search algorithms developed at the Institute for Intelligent Autonomous Systems (IAS) at TU Darmstadt. The simulation of the ABB IRB120 six-axis robot is implemented with the Robotics Toolbox of Peter Corke [15].

The control policy operates on a model of the kinematics and dynamics of the robot during a trial and takes the physical abilities of the robot as well as occurring collisions into account. To model the noisiness of the transition, noise with a variance of 1 rad/s is added to the applied action. We found that the threshold on the deviation for collision detection could be set such that every occurring collision was detected within one time step. This leads to a non-negligible number of false positives in which the start acceleration exceeds the robot's capabilities. As already discussed, this can even be seen as a benefit.

Fig. 4 (a) Test scenario task seen from above with robot in initial position; (b) Development of collision frequency over the progress of learning

Figure 4b shows the number of occurring collisions in 100 trials per iteration of the learning algorithm. This is repeated for different weightings of the deviation penalty relative to the task reward. The most interesting finding is that, without any penalty on collisions, the algorithm comes to the conclusion that a collision is a necessary part of the solution. The relative height of the penalty does not strongly influence the speed of the collision decline, but it does affect the stability of the update step. The computation of one run with 20 iterations × 100 trials × 500 time steps took around an hour on an i7 with 8 GB RAM.

8. Conclusion

The aim of this work is to extend an existing policy representation in order to achieve a learning process that considers the mechanical constraints of a serial manipulator. DMPs were chosen as policy representation and Relative Entropy Policy Search as learning algorithm. We proposed two extensions that successfully reduce the exceedance of angular limits in the reference trajectories and enable collision detection without further sensors. This brings us closer to the goal of autonomous commissioning, though unsolved problems remain regarding the exceedance of the maximum velocity and the suitability for real-life industrial tasks. The results motivate several aspects for future research. In this work, the robot's behavior was simulated and reduced to a simple goal-reaching task; future research should validate the developed algorithm on a physical robot with complex real-world tasks. Further research is also necessary to improve compliance with velocity limits. Regarding the desired autonomy of the learning process, additional work needs to address resolving a collision without human intervention.

References

[1] P. Ennen, S. Reuter, R. Vossen, S. Jeschke, Automated Production Ramp-up Through Self-Learning Systems, Procedia CIRP 51 (2016) 57-62.

[2] C. Baroglio, A. Giordana, R. Piola, M. Kaiser, M. Nuttin, Learning controllers for industrial robots, Machine Learning 23 (2-3) (1996) 221-249.

[3] J. Peters, K. Mülling, Y. Altun, Relative Entropy Policy Search, in: Proceedings of the National Conference on Artificial Intelligence, Atlanta, 2010.

[4] S. Schaal, Dynamic movement primitives-a framework for motor control in humans and humanoid robotics, in: Adaptive Motion of Animals and Machines, Springer, 2006, pp. 261-280.

[5] A.J. Ijspeert, J. Nakanishi, S. Schaal, Movement imitation with nonlinear dynamical systems in humanoid robots, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2002, pp. 1398-1403.

[6] M.P. Deisenroth, G. Neumann, J. Peters, A Survey on Policy Search for Robotics, Foundations and Trends in Robotics 2 (1-2) (2013) 1-142.

[7] C. Daniel, G. Neumann, J. Peters, Hierarchical Relative Entropy Policy Search, in: Proceedings of the International Conference of Artificial Intelligence and Statistics, La Palma, 2012, pp. 273-281.

[8] H. Hoffmann, P. Pastor, D.-H. Park, S. Schaal, Biologically-inspired dynamical systems for movement generation: automatic real-time goal adaptation and obstacle avoidance, in: Proceedings of IEEE International Conference on Robotics and Automation, 2009, pp. 2587-2592.

[9] D.-H. Park, H. Hoffmann, P. Pastor, S. Schaal, Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields, in: 8th IEEE-RAS International Conference on Humanoid Robots, 2008, pp. 91-98.

[10] A. Rai, F. Meier, A. Ijspeert, S. Schaal, Learning coupling terms for obstacle avoidance, in: IEEE-RAS International Conference on Humanoid Robots, 2014, pp. 512-518.

[11] M.P. Deisenroth, C.E. Rasmussen, D. Fox, Learning to control a low-cost manipulator using data-efficient reinforcement learning, in: Proceedings of the International Conference on Robotics: Science and Systems, Los Angeles, 2011.

[12] E. Keogh, A. Mueen, Curse of dimensionality, in: Encyclopedia of Machine Learning, Springer, 2011, pp. 257-258.

[13] I. Csiszar, I-divergence geometry of probability distributions and minimization problems, The Annals of Probability (1975) 146-158.

[14] D.P. Bertsekas, Constrained optimization and Lagrange multiplier methods, Academic press, 2014.

[15] P. Corke, Robotics, vision and control: fundamental algorithms in MATLAB, Springer, 2011.