Reinforcement Learning: Learning vs. Shaping

December 29, 2012 · Machine Learning

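I used to know only about learning and had no idea what shaping was. I looked it up: shaping is usually rendered in Chinese as 行为塑造 ("behavior shaping"). Unlike blind trial-and-error search, reward shaping steers the agent toward promising actions as early as possible in the initial phase of learning.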

For the algorithm itself, see Eric Wiewiora's latest article on Potential-Based Shaping.



Reward shaping is a technique inspired by animal training where supplemental rewards are provided to make a problem easier to learn. There is usually an obvious natural reward for any problem. For games, this is usually a win or loss. For financial problems, the reward is usually profit. Reward shaping augments the natural reward signal by adding additional rewards for making progress toward a good solution.
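To make the idea of "adding additional rewards for making progress" concrete, here is a minimal Python sketch of the standard potential-based form of shaping, where the extra reward for a transition from s to s' is gamma * Phi(s') - Phi(s). Everything specific in it is an assumption made for illustration and does not come from the quoted text: the 1-D corridor, the goal position `GOAL`, the discount `GAMMA`, and the choice of potential `potential(state)` as the negative distance to the goal.

```python
GOAL = 5     # hypothetical goal position on a 1-D corridor with states 0..5
GAMMA = 0.9  # hypothetical discount factor


def potential(state):
    """Assumed potential function: negative distance to the goal."""
    return -abs(GOAL - state)


def natural_reward(next_state):
    """Natural reward of the toy task: 1 on reaching the goal, else 0."""
    return 1.0 if next_state == GOAL else 0.0


def shaped_reward(state, next_state, gamma=GAMMA):
    """Natural reward plus the potential-based shaping term."""
    shaping = gamma * potential(next_state) - potential(state)
    return natural_reward(next_state) + shaping


def discounted_shaping_sum(trajectory, gamma=GAMMA):
    """Discounted sum of the shaping terms alone along a list of states."""
    total = 0.0
    for t in range(len(trajectory) - 1):
        s, s_next = trajectory[t], trajectory[t + 1]
        total += gamma ** t * (gamma * potential(s_next) - potential(s))
    return total


if __name__ == "__main__":
    print("step toward goal:", shaped_reward(2, 3))
    print("step away from goal:", shaped_reward(2, 1))
    print("shaping sum on a wandering path:", discounted_shaping_sum([0, 1, 0, 1, 2]))
```

Under these assumed values, a step toward the goal earns a larger shaped reward than a step away from it, even though the natural reward is zero in both cases. The discounted shaping terms along a trajectory of T steps telescope to gamma^T * Phi(s_T) - Phi(s_0), so they depend only on the start and end states; this is the property usually cited as the reason potential-based shaping does not change which policy is optimal.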


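In 1938 the psychologist Skinner [42] first proposed the concept of "shaping": an animal can be guided through a series of simple behaviors that together constitute a relatively complex one. By repeatedly rewarding (with food) behaviors that come ever closer to the desired behavior (for example, rewarding a pigeon for moving toward a chosen spot when the desired behavior is for it to reach that spot), a pigeon can be guided to peck at the chosen location, and a horse can learn to keep stamping the ground until the number of stamps matches the number shown to it, as if the horse could count [43]. On this basis, several researchers [44-50] proposed that reinforcement learning could be accelerated by guiding the agent's behavior. Laud [44] proposed using a shaping reward to encourage the agent to select particular actions, so that learning efficiency is not reduced by selecting actions at random.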


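In Mitchell's view [2], "all machine learning problems come down to search problems." But search problems cannot escape the combinatorial explosion of the state space. How to speed up reinforcement learning, improve its efficiency, and extend it to larger and more complex problem domains has always been a central topic in reinforcement learning research. In particular, most applications today stress being "online," "real-time," and "adaptive," so that the system can adapt quickly to changes in the problem environment. All of this makes research on methods and techniques for accelerating reinforcement learning more urgent.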


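Randlov [48] argues that to speed up the learning process itself, we should stop insisting that "the agent knows nothing (tabula rasa) and can only discover the optimal policy autonomously through trial and error," and instead accept that "the agent may know, or be given a hint, whether the action (behavior) it chooses is good or not." In other words, accept shaping: even though learning is then no longer fully autonomous, the benefit of faster learning is worth it. Indeed, the gains from shaped reinforcement learning are substantial [45-48]. As Maclin [51] also notes, "a slight hint from outside can lead to a significant improvement in an autonomous agent's learning." In his paper [51], Maclin proposes a Q-learning agent that can accept advice, and its engineering implementation works very well. Strictly speaking, though, this kind of learning should be classified as supervised learning.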



Insight Learning and Shaping



Thorpe defined insight in his 1964 book “Learning and Instinct in Animals” as: “The sudden production of a new adaptive response not arrived at by trial behavior or as the solution of a problem by the sudden adaptive reorganization of experience.”

Shaping is a powerful animal training method, described by B.F. Skinner, in which a novel behavior (target behavior) or a behavioral sequence is created. It works through successive steps of learning, where one desired behavior at a time is rewarded and learnt. This results in a shaped behavior or in a shaped behavioral sequence.

Theoretical Background

Complex behavior, or sequences of actions that lead to particular goals, can be found in many animals. However, it is not possible from just observing such complex behavior to know whether it is a result of genetic programming or of different kinds of learning phenomena. Because the exact same behavior, for example tool use, can arise through different mechanisms, we want to emphasize that one cannot judge any experiment or observation by the end result alone. Insight learning and shaping refer to two learning processes giving rise to complex behavior. It is theoretically unclear how insight learning and shaping are different, and it is also problematic empirically how to distinguish between the two. Strong claims are common in the literature in both directions. Seemingly intelligent behavior can apparently arise in many ways. What an animal does can be easily observed, but to decide through what mechanism an animal reaches its decision is a task of much greater difficulty. Key to understanding is knowledge of an animal’s previous experiences, its background. Herein lies a fundamental problem: how can one separate insight learning from shaping?

The question of what causes intelligent behavior is of great interest for both scholars and the public. It is also an important question that can shed light upon the evolution of mental capacities in both nonhuman animals and humans.

