
Scalar Reward - Paperclips!

· AI

So I started slowly (glacially slowly) working through the fascinating machine learning material in this curriculum: https://www.agisafetyfundamentals.com/ai-alignment-curriculum

I've just finished listening to Hado van Hasselt’s lecture (1 of 13) in his Intro to Reinforcement Learning course, where he makes the point that ‘any goal can be formalised as the outcome of maximising a cumulative reward’. This is known as the Reward Hypothesis.
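Roughly speaking (and this is purely my own toy sketch, not code from the lecture), the environment hands the agent a single number at every step, and the agent's entire 'purpose' is to maximise the discounted sum of those numbers:

```python
# My own toy sketch of the Reward Hypothesis, not anything from the course:
# at each step the environment emits one scalar reward, and the agent's goal
# is simply to maximise the discounted sum of those scalars.

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Whatever the 'goal' really is - win the game, make paperclips, cure cancer -
# the hypothesis says it can be encoded so that maximising this one number achieves it.
print(discounted_return([0.0, 0.0, 1.0, -0.5]))  # a single scalar, nothing more
```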

So it seems that at the moment there is no better way to incentivise an AI system than to use a simple scalar feedback signal.

Urghh!

I'll assume that as I work through the course, better ideas for motivating an AI will be revealed. I'm probably being naive here, or haven't learned anywhere near enough about the reward function yet. But just in case, let me state with the arrogance of a newbie that this absolutely cannot be our final solution. Let me explain why.

I've spent nearly twenty years working in a government cubicle job, and I can say with absolute confidence that simple metrics like this do not work. They are inevitably subverted and distorted. In fact, you can see this if you know anything about public sector services in the UK.

For example, the National Health Service in Britain incentivises Accident & Emergency staff with a metric that counts the number of sick, lame and wounded 'seen' within a certain time period, perhaps two hours. People wander into A&E with injuries, illness, or a slightly sore throat. The waiting rooms are so packed with malingerers and people genuinely in need that it's almost impossible to hit this two-hour window. So hospitals get around it: either a doctor is sent into the waiting area to swiftly sweep his gaze over the poor souls waiting, and that counts as 'being seen', or those approaching the two-hour mark are moved into a side room, and again that counts as 'being seen'.

Another example. In order to give politicians some basis for claiming success, the UK Home Office pressures police forces to maximise the number of arrests and detections. So officers will arrest a cannabis smoker, give him a ticket on the street, then de-arrest him. And that counts as an arrest. Similarly, serious incidents or calls from the public, like a burglary taking place, are graded as 'I' (immediate), to be attended by a police unit within fifteen minutes. However, if a unit can't get there fast enough, which is usually the case, the control centre quietly re-grades the call as 'S' (slow).

Since Britain moved to this 'video game' model of performance indicators in the 1970s, public services have never been so poor. The system is even worse than having no metric at all, because workers learn that they do not have to actually be good at their jobs, which would require effort and commitment; they simply have to know how to game the metric.

Relying on a single scalar is really the same problem as mistaking the map for the territory.

I really hope (and assume this must be obvious?) that such a simplistic reward function as a single scalar metric doesn’t get accidentally baked in. At the very least there would have to be some kind of multivariable function, one that heavily biases the agent towards seeking clarification and expressing doubt before it acts.
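Purely as a hypothetical sketch of my own (the terms and weights below are invented for illustration, not taken from the course or any real system), I imagine something more like this:

```python
# Purely hypothetical sketch: a multi-signal 'reward' that favours asking
# questions and expressing doubt before acting. The terms and weights are
# invented for illustration only.

def shaped_reward(task_score, asked_clarification, stated_uncertainty, acted):
    reward = task_score                      # the usual scalar task signal
    if asked_clarification and not acted:
        reward += 2.0                        # strongly favour asking first
    if stated_uncertainty:
        reward += 0.5                        # reward honesty about doubt
    if acted and not asked_clarification:
        reward -= 1.0                        # penalise acting without checking
    return reward

# An agent that barges ahead scores worse than one that pauses to ask:
print(shaped_reward(1.0, asked_clarification=False, stated_uncertainty=False, acted=True))   # 0.0
print(shaped_reward(0.0, asked_clarification=True, stated_uncertainty=True, acted=False))    # 2.5
```

Of course, this still collapses everything back into a single number at the end, which rather illustrates how hard the problem is to escape.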

Such a simple metric, a mere number, cannot be our ultimate solution for motivating an AI, not if we want to avoid building a paperclip maximiser.

And obviously, spitting out a simple scalar metric is not ‘understanding’, which is what we actually need if we are to avoid that fate.

Digressing, I was thinking about what first questions we might pose to a super-intelligent AGI. It really wouldn't be 'How do we create peace?' or 'Please fix the climate' or 'Are you conscious?'

We would surely ask it: 'You've read the entire Internet, so please tell me which parts of our scientific theories are wrong and which are true.'

There could be interesting and important recursive possibilities here too: for example we might ask it how to retrospectively solve the alignment problem, or for a theory of consciousness.