Abstract
Movement generation that is consistent with observed or demonstrated behaviour is an efficient way to seed movement planning in complex, high-dimensional movement systems such as humanoid robots. We present a method for learning potential-based policies from constrained motion data. In contrast to previous approaches to direct policy learning, our method can combine observations from a variety of contexts in which different constraints are in force, to learn the underlying unconstrained policy in the form of its potential function. This allows us to generalise and predict behaviour where novel constraints apply. We demonstrate our approach on systems of varying complexity, including kinematic data from the ASIMO humanoid robot with 22 degrees of freedom.
Notes
1For a review on DPL, please see (Billard et al. 2007) and references therein.
2It should be noted that, as with all DPL approaches, the choice of state-space is problem specific (Schaal et al. 2003) and, when used for imitation learning, depends on the correspondence between demonstrator and imitator. For example, if we wish to learn the policy a human demonstrator uses to wash a window, and transfer that behaviour to an imitator robot, an appropriate choice of x would be the Cartesian coordinates of the hand, which would correspond to the end-effector coordinates of the robot. Transfer of behaviour across non-isomorphic state spaces, for example if the demonstrator and imitator have different embodiments, is also possible by defining an appropriate state-action metric (Alissandrakis et al. 2007).
3A† denotes the (unweighted) Moore–Penrose pseudoinverse of the matrix A.
4It should be noted that, in general, the orientation of the constraint plane onto which the policy is projected may vary with both state position and time.
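The projection mentioned in notes 3 and 4 can be sketched in NumPy. This is a minimal illustration, not the paper's exact formulation: the constraint matrix A, the unconstrained policy output pi_x, and the standard null-space projector u = (I − A†A)·pi(x) are assumptions made for the example.

```python
import numpy as np

def project_policy(A, pi_x):
    """Project an unconstrained policy output pi(x) onto the null space
    of a constraint matrix A: u = (I - A^+ A) pi(x), where A^+ is the
    Moore-Penrose pseudoinverse (np.linalg.pinv)."""
    n = pi_x.shape[0]
    N = np.eye(n) - np.linalg.pinv(A) @ A  # null-space projector
    return N @ pi_x

# Example: a single row constraint removes the policy component
# along that row's direction, leaving the other components intact.
A = np.array([[1.0, 0.0, 0.0]])        # constrain motion along the first axis
pi_x = np.array([0.5, -0.2, 0.3])      # assumed unconstrained policy output
u = project_policy(A, pi_x)            # observed (constrained) action
```

Here `A @ u` is zero by construction, mirroring the fact that only the constraint-consistent component of the policy is observable; if A changes with state or time (note 4), the projector must be recomputed at every step.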
5It should be noted that these trajectories are not outliers in the sense of containing corrupt data, and could in fact be used for further training of the model. For example, one could take a hierarchical approach, where groups of strongly connected trajectories are aligned first to form models consisting of groups of trajectories with good alignment. We can then recursively repeat the process, aligning these larger (but more weakly connected) groups until all of the data has been included.
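The hierarchical scheme described in note 5 can be sketched as greedy agglomerative merging. The pairwise connection scores, the average-linkage rule, and the stopping threshold below are illustrative assumptions, not the paper's actual alignment criterion.

```python
import numpy as np

def hierarchical_grouping(scores, threshold=0.0):
    """Repeatedly merge the pair of trajectory groups with the strongest
    connection until no pair exceeds `threshold`. `scores[i][j]` is an
    assumed pairwise alignment quality between trajectories i and j."""
    groups = [[i] for i in range(len(scores))]

    def link(a, b):
        # Average connection strength between two groups of trajectories.
        return np.mean([scores[i][j] for i in a for j in b])

    while len(groups) > 1:
        pairs = [(link(groups[a], groups[b]), a, b)
                 for a in range(len(groups))
                 for b in range(a + 1, len(groups))]
        best, a, b = max(pairs)
        if best < threshold:
            break  # remaining groups are too weakly connected to merge
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups
```

With three trajectories where the first two align well and the third only weakly, the sketch first aligns the strongly connected pair and then stops, leaving the weakly connected trajectory as its own group until a lower threshold admits it.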
6Since the goal of the experiments was to validate the proposed approach, we used policies known in closed form as a ground truth. In the follow-up paper we apply our method to human motion capture data.
7A detailed explanation of the error measures used can be found in Appendix B.
8Please note that we also discard the outliers for evaluating the error statistics—we can hardly expect to observe good performance in regions where the learnt model f(x) has seen no data.
93 DOFs per hand × 2 hands.