
Reinforcement Learning


  • This topic is locked
9 replies to this topic

#1 Ozarka

    GMC Member

  • New Member
  • 12 posts
  • Version:Unknown

Posted 04 March 2012 - 12:37 AM

So, I'm interested in machine learning, and I've made a simple neural network that uses real-time evolution to improve the intelligence of the species as a whole. However, I want to learn how to use reinforcement learning in the AI, because it allows a single object to learn rather than just producing more intelligent offspring. I'm not asking for code, but for a basic outline of how it works. Please don't just give me a link to a Wikipedia page or something of that nature; I'm looking for a fairly simple and straightforward explanation. I hope this is the right forum for this type of question, but it may not be, as I'm fairly inexperienced with the GMC. Please note that I'm not a skilled programmer, but I do have some experience.

Also, if there is another method that is easier and more effective at training a single object, I wouldn't be against you posting how it works, as long as it accomplishes the same goal.

Please help keep this topic alive until my question is answered.

Edited by Ozarka, 04 March 2012 - 12:41 AM.


#2 IceMetalPunk

    InfiniteIMPerfection

  • Retired Staff
  • 9322 posts
  • Version:Unknown

Posted 06 March 2012 - 10:36 PM

I've tried to understand the intricacies of neural nets before, but I can never quite seem to grasp them. As I understand it, neural nets simply process data in layers, and the weights connecting those layers are adjusted during training. If that is correct, this should apply to single objects as well.

Basically, you'd have certain properties of the object that are controlled by variables and thus subject to change. For example, if you're trying to make it learn how to fly straight, things like wing synchronicity, body angle, flap speed, and flap strength can all be modified over time.

From there, you just apply basic evolutionary principles: as it tries to fly, these properties are all "mutated", i.e. changed via a Gaussian random number generator. These mutations, or more accurately "corrections", take place at regular intervals during flight. You'd keep a running average of each of the values. During a correction, the accuracy of the flight is compared to the accuracy during the last correction. If it's better or the same, you average the mutated values together with the running average, and that's your new value. If it's worse, you don't average the new values with the old; instead, you discard the old values and just use the mutations.

Note: in the above example, the running average is the value you'd be using to determine the behavior. So, for example, if my running average for some value is 7, during the next correction the following process would occur:

Mutate by choosing a Gaussian Random Number centered around 7.
If current flight is more accurate than during previous correction, then
   Average 7 and the mutated number together and store as the new property value.
Else if current flight is less accurate than during previous correction, then
   Store mutated number as the new property value, discarding the previous average of 7.
Either way, record the accuracy of the flight for use in the next correction.

It's basically evolution, only rather than mating two individuals to produce a third, you're averaging two corrections to get a third. It works the same way, though.
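
A rough sketch of one correction step in Python (random.gauss stands in for a Gaussian RNG; how "accuracy" is measured is up to you, it's just passed in here):

import random

value = 7.0          # running average of one evolving property
prev_accuracy = 0.0  # flight accuracy recorded at the previous correction

def correction(current_accuracy, deviation=1.0):
    # one correction: mutate around the running average, then keep or blend
    global value, prev_accuracy
    mutated = random.gauss(value, deviation)   # Gaussian mutation centred on 'value'
    if current_accuracy >= prev_accuracy:
        value = (value + mutated) / 2          # better or equal: average the mutation in
    else:
        value = mutated                        # worse: discard the old average, keep the mutation
    prev_accuracy = current_accuracy           # remember this flight's accuracy for next time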

-IMP

*EDIT* Oh, and in case you were wondering, Yourself made a nice Gaussian RNG script which you can find here: http://www.gmlscripts.com/script/gauss . "Mean" is the "center" value, and "deviation" determines, in essence, the width of the bell curve (larger deviation means larger mutations on average).

Edited by IceMetalPunk, 06 March 2012 - 10:39 PM.


#3 tangibleLime

    Lunatic

  • Retired Staff
  • 2520 posts
  • Version:GM:HTML5

Posted 07 March 2012 - 04:05 AM

Reinforcement Learning (often abbreviated RL) is much too complicated a subject for someone to just provide a "straightforward explanation". There are many (MANY) theories, algorithms, and other processes that come together to make reinforcement learning possible.

EDIT: Check out this book by one of my old professors: http://www.amazon.co...31093202&sr=8-1
I don't personally own it and I haven't read it - I stuck with the classic Russell and Norvig "Artificial Intelligence: A Modern Approach", which I also recommend.

#4 Kegsay

    GMC Member

  • New Member
  • 5 posts

Posted 10 March 2012 - 12:42 PM

What IceMetalPunk is describing is more akin to genetic algorithms (modelling natural selection) than to reinforcement learning as a general field.

Reinforcement learning is the process of a machine learning from the actions it has taken in the past. At a minimum, it requires a feedback loop in which the output of the system is returned as an input. That is to say:
INPUT --> [Processing] --> OUTPUT
  ^                           |
  \___________________________/

This usually requires the input and output to be correlated in some way. For example, imagine you have a robot that wishes to find the best path through a maze. It can utilise reinforcement learning to do this. The input to the robot will be the current position (state) the robot is in. The processing will involve choosing a path to take. The output would be the finished movement. This can then be fed back in as the input state, and so on. The robot needs to learn the best path, meaning it needs to remember. This means the maze needs to be represented in some form (e.g. a 2d array of wall/open space/hasExplored would do). The robot then needs to draw on this information in the processing to decide on the best path.
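
In loop form, the feedback looks something like this (a toy sketch: a one-dimensional corridor and a random choice stand in for the real maze and the real processing step):

import random

EXIT = 5
state = 0                                # input: the robot's current position
while state != EXIT:
    move = random.choice([-1, 1])        # processing: choose a step (here, at random)
    state = max(0, state + move)         # output: the new position, fed back in as the next input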

That's the idea behind reinforcement learning. Genetic algorithms can be used for reinforcement learning. Let's say there are 100 robots placed in the same spot. Each robot has a preference, a small weighting, for a given direction. Maybe Robot A likes to go East, maybe Robot B likes to be on a corner tile, etc. Now, dump them all in the maze and let them find the exit. Clearly, the ones which find the exit first are the best (this is the 'fitness' metric: the desired attribute for the robot). You could then take the top 10 robots and let them breed, combining some of their weights and introducing a small mutation. E.g. Robot A likes to go East 30% of the time and Robot B likes to go South 20% of the time, so the child will have some proportion of both, plus a mutation (e.g. it goes North 1% of the time). Then rinse and repeat until a robot with the best weightings for that maze has been found.
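
A rough sketch of that breeding step (the particular weights and the 1% mutation rate are just illustrative numbers):

import random

# each robot is a set of direction preferences that sum to 1
robot_a = {"N": 0.2, "E": 0.3, "S": 0.25, "W": 0.25}
robot_b = {"N": 0.3, "E": 0.2, "S": 0.3, "W": 0.2}

def breed(a, b, mutation=0.01):
    # average the parents' weights, nudge each by a small random mutation, renormalise
    child = {}
    for d in a:
        child[d] = max(0.0, (a[d] + b[d]) / 2 + random.uniform(-mutation, mutation))
    total = sum(child.values())
    return {d: w / total for d, w in child.items()}

child = breed(robot_a, robot_b)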

Reward-based learning involves each state of the maze having a certain reward. E.g. finding the exit gives a reward of 100, while every other square has a reward of 1. As the robot moves through the maze, you change the reward of the square you're moving to as a function of the previous reward (or as a function of the true exit). For example, the square NEXT TO the exit may have a high reward of, say, 95 (the reward is a function of the true exit). Alternatively, if the square you're moving to is not the reward square, you reduce the reward of the previous square (a function of the previous reward). This technique will eventually find a locally optimal solution.
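
One way to read that update, as a sketch (the grid size, exit position, and 0.95 decay factor are assumptions made purely for illustration): after each move, the square the robot just left inherits a discounted copy of the reward of the square it arrived at, so the exit's reward gradually propagates backwards along visited paths.

WIDTH, HEIGHT = 8, 8
EXIT_X, EXIT_Y = 7, 7

# every square starts with a reward of 1, the exit with 100
reward = [[1.0 for _ in range(WIDTH)] for _ in range(HEIGHT)]
reward[EXIT_Y][EXIT_X] = 100.0

def update_reward(prev, cur, decay=0.95):
    # the square we just left takes a discounted copy of the square we reached,
    # if that is better than what it already had
    px, py = prev
    cx, cy = cur
    reward[py][px] = max(reward[py][px], decay * reward[cy][cx])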

There are also many variations of these techniques, which can have some neat effects. For example, what if all the robots worked together?

Edited by Kegsay, 10 March 2012 - 12:44 PM.


#5 Ozarka

    GMC Member

  • New Member
  • 12 posts
  • Version:Unknown

Posted 11 March 2012 - 12:18 AM

Thanks for the replies, everyone. They've been a great help. I will try to utilize what you have told me and hopefully it will turn out well.

Can we now extend this topic to how you create the memories to base decisions on? I assume you could feed the past inputs back in as new inputs each iteration (although it would be VERY CPU intensive), and as has been mentioned, in some cases 2d arrays can be used. However, is there an efficient way to implement memory that isn't omniscient?

Also, my main issue is how to evolve the weights of the neural network in real time within the same object.

Again, this is not limited to Reinforcement Learning. If you know of a better method, feel free to describe it.

#6 Ozarka

    GMC Member

  • New Member
  • 12 posts
  • Version:Unknown

Posted 17 March 2012 - 09:00 PM

*Bump

#7 IceMetalPunk

    InfiniteIMPerfection

  • Retired Staff
  • 9322 posts
  • Version:Unknown

Posted 18 March 2012 - 05:36 AM

Well, I have pretty much no experience in machine learning, but quite a bit of experience in genetic algorithms, so I suppose that accounts for my bias :lol: . But still, it seems to me that reinforcement learning would work with an evolutionary foundation even within a single instance (i.e. one robot instead of your suggested many). You can still apply the rewards, and then you'd simply use that as the fitness metric for each correction loop. For example, if the last loop you earned 20 reward points and this loop you've earned 50 reward points, that's a sign of a fit mutation that should be averaged into the property value. If you only earned 10 reward points this loop, then you'd discard the new value and re-mutate. Re-mutate should so be a word, by the way :P .

Now, the real problem is that you want it to evolve in a way that "isn't omniscient". In order to learn in any way, it needs some way of gauging how well its last evolutionary step performed, and to do that it needs to know something about its surroundings. Even in the reward-based system, rewards are contingent on the distance from the goal, which means you need to know where the goal is. Humans are the same way: a baby learns to walk because it knows what walking is and what it's doing, so it can figure out where it's going wrong. A maze-solving bot would need to know things like "have I been here before? How many times? Is that wall an exit?" and... well, I think that's it. The exit is one of those pre-programmed things, the equivalent of someone recognizing a door in a wall. Visited tiles can simply be marked in an array as the robot passes them, adding 1 each time. That way it can have a preference for moving towards or away from more heavily visited cells, a preference which also evolves with each evolutionary iteration ("correction step").
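
A tiny sketch of that visit-count memory (the grid size and the avoidance weight are placeholders; the weight would itself be one of the evolving properties):

WIDTH, HEIGHT = 8, 8
visits = [[0 for _ in range(WIDTH)] for _ in range(HEIGHT)]

def record_visit(x, y):
    visits[y][x] += 1                  # add 1 every time the robot passes over the tile

def tile_score(x, y, avoidance=0.5):
    # lower score for heavily revisited tiles; 'avoidance' is an evolving weight
    return -avoidance * visits[y][x]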

-IMP

Edited by IceMetalPunk, 18 March 2012 - 05:41 AM.


#8 Ozarka

    GMC Member

  • New Member
  • 12 posts
  • Version:Unknown

Posted 24 March 2012 - 06:13 AM


When I say that I don't want it to be omniscient, I mean that it doesn't know everything about the room it's in. For example, in a maze simulation, it wouldn't know the positions of every wall instance, only the ones it has seen. I do see what you mean, though, and it makes sense. Of course, correcting every step wouldn't work, but, as you say, fitness over a period of time could. I will most likely try to use what you said.

However, I would still like to hear any ideas about memory of past experiences. In the maze game this would be easier, but in a less straightforward simulation it would be more difficult. You could, in theory, reintroduce all past inputs every step, but it would be very slow and it's really not an option without an enormous supercomputer. If anyone has any suggestions for a solution, feel free to share them.

#9 slayer 64

    GMC Member

  • GMC Member
  • 3813 posts
  • Version:GM8.1

Posted 29 March 2012 - 12:24 AM

i made this a long time ago. i thought searching through a maze would be too hard, so this is a platformer. the bear remembers what's underneath him and executes an action. the actions can be move left, move right, or jump, or a combination of them. if the bear doesn't have anything in its memory, he chooses some random actions. the bear knows he's successful if he doesn't fall off the map; if he falls off the map he removes the last actions he made from his memory. if you watch the bear for a minute or two he usually learns how to get to the right side of the map. i've never seen him get the flag though =/
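
roughly, that memory could be sketched like this in Python (not the actual .gmk code; the ground labels and action list are placeholders):

import random

ACTIONS = ["left", "right", "jump", "left+jump", "right+jump"]
memory = {}   # maps what's underneath the bear -> the action that worked there

def choose_action(ground):
    # reuse the remembered action for this ground type, or try a random one
    if ground not in memory:
        memory[ground] = random.choice(ACTIONS)
    return memory[ground]

def fell_off_map(ground):
    # falling off means the last remembered action was bad, so forget it
    memory.pop(ground, None)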

you should put the room speed on 9999

http://www.host-a.ne...4/AI learns.gmk

Edited by slayer 64, 29 March 2012 - 12:25 AM.


#10 Ozarka

    GMC Member

  • New Member
  • 12 posts
  • Version:Unknown

Posted 12 April 2012 - 03:07 AM

Interesting simulation. I want to make a machine-learning-controlled platformer AI soon, and I might. I have a decent idea of how neural networks work, and I'm continuing to improve.

As for the genetic algorithm approach within a single instance, I understand what you mean and I think that it would probably work well.

And as for omniscience, I don't mind using it to train the neural network; I just don't want the neural network to use it as input.



