Monday, 12 December 2011 18:00

When Reinforcement Fails

The world is a complicated place. Reality is dense with patterns, but these patterns are often subtle and inconsistent. We think we understand how things work — X always causes Y — but then Z happens. It’s very confusing.

Needless to say, such complexity poses a big problem for biology. How should animals learn from such unpredictable situations? What’s the best way to cope with contingency? We don’t need perfection, but we do require an efficient mental mechanism that allows us to maximize utility most of the time.

Enter reinforcement learning, a theoretical framework that helps explain how the rewards and punishments of life get translated into effective behavior. It doesn’t matter if it’s monkeys responding to squirts of juice or rats jonesing for pellets or humans plying the stock market: The algorithms of reinforcement learning neatly describe our decisions. The persuasive power of reinforcement is why we give kindergartners gold stars and professionals a monetary bonus: Nothing influences outcomes like a bit of positive feedback. Furthermore, neuroscientists have identified several mechanisms in the cortex that seem to obey these computational principles. It’s an incredibly elegant link between the software of mind and the hardware of brain.
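For readers who want to see the idea in miniature, here is a rough sketch of one textbook form of reinforcement learning, the delta rule, in which an estimate of an action's value is nudged toward each new reward. The function names and the learning rate of 0.5 are assumptions chosen for illustration; nothing here is taken from the studies mentioned above.

```python
def update_value(value, reward, learning_rate):
    """Delta-rule update: move the estimate a fraction of the way toward the reward."""
    return value + learning_rate * (reward - value)


def run_learner(rewards, learning_rate, initial_value=0.0):
    """Apply the update to a sequence of rewards and return the estimate after each one.

    `rewards` is assumed to be a list of numbers (1 for a reward, 0 for none).
    """
    value = initial_value
    history = []
    for reward in rewards:
        value = update_value(value, reward, learning_rate)
        history.append(value)
    return history


# Hypothetical sequence: two rewards followed by one failure.
estimates = run_learner([1, 1, 0], learning_rate=0.5)
```

With this learning rate, the estimate climbs from 0 to 0.5 and then 0.75 after the two rewards, and drops to 0.375 after the single failure. The larger the learning rate, the more one outcome moves the estimate.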

However, one of the longstanding limitations of much reinforcement learning research is the lack of naturalistic context, as scientists have been forced to rely on abstract games in the lab. We don’t observe rats in the wild — we track them in Plexiglas cages. We don’t watch monkeys swing through the forest — we give them sweet treats preceded by lights and bells. This makes the data easier to comprehend, but it also makes it unclear how these same mechanisms might operate in a more complicated environment. Does reinforcement learning always work? Or do the same habits that make us look so smart in the lab sometimes backfire in the real world? Is there such a thing as too much feedback?

To answer these questions, Tal Neiman and Yonatan Loewenstein at the Hebrew University of Jerusalem turned to professional basketball. More specifically, they looked at 200,000 three-point shots taken by 291 leading players in the NBA between 2007 and 2009. (They also looked at 15,000 attempted shots by 41 leading players in the WNBA during the 2008 and 2009 regular seasons.) The scientists were particularly interested in how makes and misses influenced subsequent behavior. After all, by the time players arrive in the NBA, they’ve executed hundreds of thousands of shots and played in countless games. Perhaps all that experience reduces the impact of reinforcement, making athletes less vulnerable to the unpredictable bounces of the ball. A make doesn’t make them too excited and a miss isn’t too discouraging.

But that’s not what the scientists found. Instead, they discovered that professional athletes were exquisitely sensitive to reinforcement, so that a successful three-pointer made players significantly more likely to attempt another distant shot. In fact, after a player made three three-point shots in a row (they were now “in the zone”), they were nearly 20 percent more likely to take another three-point shot. Their past success, the positive reinforcement of the made basket, altered the way they played the game.

In many situations, such reinforcement learning is an essential strategy, allowing people to optimize behavior to fit a constantly changing situation. However, the Israeli scientists discovered that it was a terrible approach in basketball, as learning and performance are “anticorrelated.” In other words, players who have just made a three-point shot are much more likely to take another one, but much less likely to make it:

What is the effect of the change in behaviour on players’ performance? Intuitively, increasing the frequency of attempting a 3pt after made 3pts and decreasing it after missed 3pts makes sense if a made/missed 3pts predicted a higher/lower 3pt percentage on the next 3pt attempt. Surprisingly, our data show that the opposite is true. The 3pt percentage immediately after a made 3pt was 6% lower than after a missed 3pt. Moreover, the difference between 3pt percentages following a streak of made 3pts and a streak of missed 3pts increased with the length of the streak. These results indicate that the outcomes of consecutive 3pts are anticorrelated.

This anticorrelation works in both directions, as players who missed a previous three-pointer were more likely to score on their next attempt. A brick was a blessing in disguise.
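The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual analysis: the function name is hypothetical, the input is assumed to be one player's consecutive three-point attempts in order, and the sample sequence is invented rather than real NBA data. It also ignores details a real analysis would have to handle, such as game boundaries.

```python
def conditional_percentages(outcomes):
    """Return (pct_after_make, pct_after_miss) for a sequence of shot outcomes.

    `outcomes` is assumed to be a list of booleans, one per three-point
    attempt in chronological order, where True means the shot was made.
    Each value is the fraction of made shots among attempts that
    immediately followed a make (or a miss), or None if there were none.
    """
    after_make = []
    after_miss = []
    # Pair every attempt with the one that came right before it.
    for previous, current in zip(outcomes, outcomes[1:]):
        if previous:
            after_make.append(current)
        else:
            after_miss.append(current)

    def pct(shots):
        return sum(shots) / len(shots) if shots else None

    return pct(after_make), pct(after_miss)


# Made-up sequence for illustration only (True = made, False = missed).
sample = [True, False, True, True, False, False, True]
pct_after_make, pct_after_miss = conditional_percentages(sample)
```

In this invented sequence the player hits one in three attempts after a make and two in three after a miss, the same direction of effect the study reports, though the numbers themselves mean nothing.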

What’s the larger lesson? It turns out that professional athletes over-generalize from their most recent actions and outcomes. They modify their behavior based on the result of a single shot, even though the success of the shot was shaped by unpredictable forces (a butterfly flapping its wings in Tokyo, etc.) and depended on situational details that are unlikely to be repeated. (Perhaps the defender was momentarily distracted, or failed to run around the screen.) As the scientists note, “The behavior of basketball players shows the limitations of learning from reinforcement, especially in a complex environment such as a basketball game.”

This problem, of course, isn’t confined to athletes. Investors modify behavior based on recent market performance, even though the market is mostly a random walk. Gamblers won’t leave a casino if they’re on a hot streak. Pundits who make an accurate prediction are convinced they’ve now solved the world. Military generals are always preparing for the last war. Although people can’t help but learn from the reinforcement signals of the world — that’s just the way the mind is designed — we need to remember that these signals come with stark limitations, especially when they emerge from a complex situation. Sometimes, the best thing we can do is not learn from what just happened.

Photo: Flickr/riebschlager

