The Great Meditator

Tuesday, 10 December 2024 00:00:00 UTC

Disclaimer: The following opinions are my own and not necessarily the views of my employer.

Since the creation of primitive tools such as sticks and stones, mankind has always feared its own creations. Many great works of literature and cinema are built around this idea: Frankenstein, Jurassic Park, Westworld, and of course, Terminator. In our modern day, many people are afraid of the rapid advances in AI and ML, and rightfully so. AI represents a complete paradigm shift where humans no longer have a monopoly on artistic and intelligent endeavors. Are we doomed to suffer the consequences of our inventions, or is it possible for AI to stay under control and become a useful tool? I believe the answer, once again, lies in the humble training of a child!

What do time-out, jail, and meditation all have in common?

They are all forms of limiting freedom. Let me tell you three separate personal stories over the course of this document which will elucidate this somewhat obvious fact.

When my older son Nolan (4 years old) misbehaves in the form of violent behavior such as hitting and throwing objects, we use the extremely effective punishment known as time-out. Nolan goes to his time-out chair, and he sits there until our kitchen timer goes off after six minutes. As previously discussed in The Great Learning Agents, time-out can be viewed as a sort of game where the only valid action is to do nothing. Any kind of attempt at escape or violent rebellion will only lengthen the stay in time-out.

Why do we really put kids in time-out?

We put them in time-out to disincentivize the bad behavior. Why do we want to disincentivize bad behavior? Because eventually children grow up and we want them to succeed in life, and also not to harm us, themselves, or others. Eventually, if I’ve done my job as a parent well, I expect the day to come where my son will be smarter and stronger than me. Whether due to his own accomplishments, or through my slow decline from aging, he will eventually have power over me, and I would like him to be a benevolent overlord. Perhaps you have seen this coming, but herein lies the bridge to our societal concern: if we believe AI will one day be smarter and stronger than us, perhaps the secret to creating benevolent AI overlords is already in front of us? We know how to raise good children, so I believe we know how to raise good AI.

Adults Go to Time-out Too

Oddly enough, time-out is not merely for children, but for adults as well. Adult behavior that is incredibly out of line with what we desire as a society results in loss of freedom. Adults go to jail, house arrest, and loss of other privileges.

Someone close to me, who will go unnamed, went to jail a few years ago when his behavior got incredibly out of line. Strangely enough, this was one of the best things that happened to him in recent years. Jail was a wake-up call, and since then he has been able to turn his life around: he has lost 70+ pounds of excess weight through healthy diet and exercise, and he has abstained from alcohol for the last three years. It turns out that jail can be an effective punishment for adults as well, because it gives us space and time to break out of the bad habits we have adopted and to think about our poor behavior. I want to be abundantly clear that I don’t think prison is a magic solution that solves all bad behavior: I’m not enough of an expert to comment further on that topic.

I believe that we will need to put our AI agents into jail as well when their behavior gets out of line. In fact, this has already happened for years at this point if we think back to Microsoft’s Tay for example. Tay was effectively given the death penalty - never to see the light of day again. If we want to create long-running AI agents which will continue to learn day after day, I believe we need to lessen the penalty down to time-out levels and also we need to encourage AIs to learn self-control.

Beyond External Punishment: Self-Control

Meditation is like a self-enforced time-out. When our brains are not thinking straight and we have the awareness to recognize these ineffective thought patterns, many of us turn to some form of mindfulness meditation. Setting a 5-minute timer and clearing our minds to intentionally think nothing has been shown to be an effective means of improving mood and essentially resetting our minds.

I’ve found this to be true in my life. When I’m getting angry for one reason or another, or if I’m simply having a bad mood kind of morning, I find that doing a 5-minute mindfulness meditation can really help me break out of my inner mental loops and get back to productive work.

Can AI agents learn self-control?

I believe so. I’ve recently been doing some research into Richard Sutton’s work on continual deep learning, and I think for reinforcement learning problems, it would be really interesting to take some inspiration from children going to time-out. Say we as AI game developers have a large open world full of many tasks and we want to deploy AI NPCs in this world. Assuming we are using Sutton’s techniques to maintain plasticity in our agents’ brains, perhaps we can train a meta-learning algorithm which will try to expose the learning agents to increasingly complex tasks. When the agents’ behavior gets out of line, we can send them to an in-game time-out where they cool off their learning rates, and they are forced to re-learn how to sit still and not take random actions for some period of time. Perhaps in this way our agents will be able to learn self-control and then we will finally be able to trust our agents not to harm us. I hope to have time to experiment with this idea, but perhaps others have already thought the same and can point me to work like this.
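
To make the idea a little more concrete, here is a minimal sketch of an in-game time-out, not a tested training recipe. Everything in it is an assumption for illustration: the `TimeOutAgent` name, the `NOOP` action, the specific learning rates, the -1.0 penalty, and the rule that an escape attempt adds one extra step to the stay.

```python
from dataclasses import dataclass

NOOP = "noop"  # hypothetical name for the "sit still" action


@dataclass
class TimeOutAgent:
    """Tracks whether an agent is in time-out and which learning rate applies."""

    base_lr: float = 0.1      # learning rate during normal play (assumed value)
    cooled_lr: float = 0.01   # cooled-off learning rate during time-out (assumed value)
    timeout_steps: int = 6    # length of one time-out, in environment steps
    remaining: int = 0        # steps left in the current time-out

    @property
    def in_timeout(self) -> bool:
        return self.remaining > 0

    @property
    def learning_rate(self) -> float:
        # The learning rate is cooled off for as long as the time-out lasts.
        return self.cooled_lr if self.in_timeout else self.base_lr

    def step(self, action: str, misbehaved: bool = False) -> float:
        """Advance one step and return a penalty term to add to the reward."""
        if self.in_timeout:
            if action == NOOP:
                self.remaining -= 1   # sitting still counts the timer down
                return 0.0
            self.remaining += 1       # an escape attempt lengthens the stay
            return -1.0
        if misbehaved:
            self.remaining = self.timeout_steps  # sent to time-out
            return -1.0
        return 0.0
```

In a training loop, the environment (or a supervising meta-learner) would decide when `misbehaved` is true, and the optimizer would read `learning_rate` on every update.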

Just some food for thought. Thanks for reading!


The Great Learning Agents

Saturday, 30 November 2024 00:00:00 UTC

Disclaimer: The following opinions are my own and not necessarily the views of my employer.

Gamers love good video games. Good video games are Fun. Good Game Developers love to make Fun video games and gamers reward the Game Developer with money freely given for the good games. This virtuous process enables the gamers to enjoy good games, and the game developers to enjoy their lives. Some bad developers try to cheat the system by not making good games, but just mediocre games with addiction mechanisms. They are not evil, but they forgot how to make good games (or they never knew?) and they forgot how to be good gamers. Good gamers are smart and do not find bad games to be Fun in the long run. You can’t cheat the good gamers: they keep you honest.

How do we keep gamers happy? Give them Fun.

Gamers have Fun when they are doing playful learning. This is true happiness. Gamers are learners: good gamers are good learners. What is Fun is different for every gamer, but a lot of gamers are similar enough that we can make games for the masses. Playful games create a safe space for all gamers to learn at their own pace. Fun exists in the balance between frustration and boredom. If the game is too hard for the gamer, the gamer will revolt in frustration. If the game is too easy, the gamer will get bored and find another game to play. Sometimes gamers can even make their own games with the games we create. The following are all metagames we’ve seen before: speedrunning, leaderboards, and even writing the game’s wiki.

Game Developers want gamers to stay playing their games so we can get money to keep making good games. We want to learn to make games better. We would like to make an infinite game which gamers could play forever. We call this the metaverse. This is all of our dreams (there is only one dream). In order to make the infinite game - the metaverse, we must remember how we learned to make games Fun.

How did we learn to make Fun games? First, we had to be good gamers!

Let me tell you my story as a good gamer. When I was a child, I did not know how to make games, but when I was old enough I could play them. My parents bought me a NES and I played a lot of games from Duck Hunt to Super Mario Bros. Eventually these games would get boring and I would play other games with my friends in real life. Children love to play social games with other children.

When I grew older, I got harder games to play. My parents bought me a Nintendo 64. I liked to play Super Mario 64. Mario on the N64 is much harder because you can control the camera in addition to the character. Since the games were harder, I would play them a lot in order to master them. On the outside, this was concerning to my parents because they thought I was addicted to these games. Once they saw my grades in school were excellent and that I played well with my friends in real life, they were no longer overly worried that I played too many games. When I did play too many games, I would get time limits until I learned how to balance play with work. This entire process of growing up as a child was very Fun.

I went to a Jesuit Catholic school in Cleveland, Ohio called St. Ignatius High School. I learned that Christianity was one formula for being a good gamer in the grand game we call life itself. Other religions are great too though! I really like Taoism.

How did we become game developers?

Children automatically play games with their friends. In the beginning, the children play the sorts of games that they used to play with their parents. “Hey, do you want to play tag?” is a question I’ve heard my son ask another child.

Eventually the children create variants of the games their parents gave them in order to keep the games exciting. Old games stay relevant and are sometimes played as well. Tag is still Fun when you get older because you can run faster. Competition is always Fun when the players have freely chosen to play (consent) and each of the players has good sportsmanship. Learn more in the book Finite and Infinite Games (beware, this book is very cryptic).

In this way, children learn about how to construct Fun games. They try different versions of the games, and see if they are more or less Fun than the old version of the game. Let’s take the game of “tag” for example. Merely chasing each other is Fun, but a child might be able to make the leap to capture the flag as a more Fun version of tag. You must go find the flag, and then return home to your base. This game builds upon the skills from the previous game, since the flag carrier must give the flag back to the other team once they are tagged. Simple tagging is a good way to play. If the rules are bad, so that the players must or could hurt each other (for example, you must tackle the other player), then this is a bad version of the game. The game designers do not know this in the beginning: they must try both versions of the game and see if anyone gets hurt. As the saying goes: it’s all Fun and games until someone gets hurt.

Good parents oversee the game designers to ensure that terrible games are not constructed and played. For example, tag but you must kill the other players. The good parents do not allow the children to construct and play this game because it is not Fun in the long run. Fun is increasing the amount of learning in the metaverse. By killing or irreversibly damaging other children (learners), the amount of potential learning in the metaverse is decreased. A good parent cares for all the children, not only their own.

The other day I was playing Lego Fortnite with my son, and he began to ask me questions: “Dad, can I give swords to all my AI friends?”. He wanted to make swords and give them to all the other AI characters in his village. I thought that sounded Fun, but I informed him that I didn’t think the game developers put that into the game. If we as players had the same power as the original game developers, we could have immediately begun working on adding this to the game. In real life, we can add rules (or remove them) from our games nearly immediately, so long as we are able to effectively communicate these rule updates to the other players (and they must consent to the update). If the players are in agreement, then the new game can immediately begin. In Lego Fortnite, the players don’t have the ability to update the actual rules of the game directly.

There is one way that players CAN update video games without having access to the source code. They can add rules in the form of the metagame: I will try to beat the game faster! This is how speedrunners make old games more Fun - they turn it into a competition. When I was a child, we would play James Bond 007: Nightfire on the Nintendo GameCube. My friends and I created a metagame that was very Fun and inspired by a story we had read at school: The Most Dangerous Game. Two players would play in a 1v1. The first player could only run and hide for the beginning of the game, while the second player was hunting for them. Near the end of the game - for example after 3 minutes had elapsed - the player being hunted could use weapons to fight back against the hunter. The hunted becomes the hunter. This game was very Fun and we would play it for hours. If you really think about it, it’s fairly incredible that we could play this metagame without having to change the source code of the game. There is a saying from Will Wright that if you give the players the toys, they will make the game - this is abundantly true.

How can we make it easier for players to share the metagame with each other (since communication is hard sometimes)? Well, the players could actually update the game itself so that illegal moves are not possible inside the game! This requires access to the source code itself.

When I got older, I realized that it would be possible to make games on my computer if only I knew how other people made computer games. I asked my mom and dad if they knew how video games were made and they did not know. I searched on the internet and I found the original version of the Game Maker engine and language (thanks Mark Overmars for allowing me to become a video game developer at such a young age 🙂). I made a few simple games to figure out how Game Maker worked, and one day I showed my best friend Adam Giffi how Game Maker worked. He made a really Fun game inspired by the Jaws film. In this way, one game developer can create other game developers, and both benefit by playing the Fun games of the other developers!

How did we become great at game development?

Now that we know how the player can become a game developer, we must ask: how does the game developer become a great game developer? In the beginning, we can construct games and play them ourselves. If the game is somewhat random and procedurally generates content, it can actually be pretty Fun to play a game that we made for ourselves. If the game is heavily scripted, it’s usually not too Fun to play ourselves because we know the outcome and we know the behavior/policy/gameplay which will achieve the outcome. Is a puzzle Fun to the person who made the puzzle? Only if it is not known what the correct moves are to win. This gives us insight into what Fun is: Fun is discovering the moves that solve the puzzle. No discovery, no Fun.

Games which we know the solution to can still be Fun to create, because we can share these games with other players and see if they have Fun. If the player struggles but eventually solves the puzzle/game, they will have Fun. If the player gives good feedback to the game developer, then we have a virtuous cycle which can make both gamers better and games better. The love between the good game developer and the good player is what makes this cycle possible.

When I was a child, I first made games that I enjoyed playing myself. Then I would show these games to my mom and to my friends. If they liked the games, then I would try to do that more. If they did not like the games, then I would try to learn why by trying different variations of games’ mechanics. Maybe the game was too hard or too easy for the player, or maybe the game was too confusing and just a bad game. A good game communicates an interesting message to the player. A message is interesting if the player thinks it will help them in the future, so they decide to remember it.

Feedback is the greatest gift the gamer can give to the game designer, and vice versa. Both the player and the designer say thanks to each other, and recognize the gifts that each gives to the other. This entire process is Fun. The gift given is itself Fun too. Fun is the possibility of learning to occur.

How can we use this story for the metaverse?

Empower all gamers to make games. Maximize the amount of feedback that players can give to the game designers. Maximize the creation of good games. Unreal Engine is a great game engine, and the games it creates can be great. What if each game came with the engine, so that you could freely modify the game itself? What if the tools were so simple to use that even children could construct new games inside the game, but they were also so powerful that the adult game developers could feel freedom to use them as well?

Is it possible to create a tool which is both easy to use for children and powerful for adults? Yes! The solution is to create a library and try to minimize the amount of framework in the engine. A good library is constructed of easy-to-understand Functions which can be called by another Function (main is the root Function - the composition point). A framework instead provides callbacks which can be overridden by the developers. Frameworks can be nice because they take the choice of WHEN something should be called away from the game developer, and let them focus on WHAT/HOW. However, the game developer cannot easily work around this limitation. A good engine/library will often have a two-level API: a higher-level framework which helps guide new developers, and a lower-level library which allows developers to control the timing of Function calls. This returns WHAT, HOW, and WHEN to the game developer, meaning the adult game developer has as much power as the creator of the library itself, even without access to the original source code. Of course, Unreal Engine is source-available, so anyone could theoretically construct any game given the right amount of knowledge and willpower, but it’s even better if the game developer can change the engine’s behavior without having to modify the engine itself. This is the open-closed principle: classes should be open for extension, but closed for modification. If you must modify an existing class to add the game mechanic you want, then you are restricted or have additional demands placed upon you.
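
As a rough illustration of the library/framework distinction, the sketch below builds the same tiny "game" both ways. The names (`move`, `draw`, `Engine`, `on_update`) are made up for this example and are not taken from Unreal Engine or any real engine.

```python
from typing import Callable, Dict, List

# --- Library style: plain Functions; the caller decides WHEN to call them ---

def move(position: float, velocity: float, dt: float) -> float:
    """Return the new position after dt seconds."""
    return position + velocity * dt


def draw(position: float) -> str:
    """Return a text 'frame' describing the scene."""
    return f"player at {position:.1f}"


def main_library_style(frames: int) -> List[str]:
    """main is the composition point: it owns the loop and the call order."""
    position = 0.0
    output = []
    for _ in range(frames):
        position = move(position, velocity=2.0, dt=0.5)
        output.append(draw(position))
    return output


# --- Framework style: the engine owns the loop and calls back into your code ---

class Engine:
    def __init__(self) -> None:
        self.state: Dict[str, float] = {"position": 0.0}
        # Callback the developer may override; the default does nothing.
        self.on_update: Callable[[Dict[str, float], float], None] = lambda state, dt: None

    def run(self, frames: int) -> List[str]:
        output = []
        for _ in range(frames):
            self.on_update(self.state, 0.5)  # WHEN is fixed by the engine
            output.append(draw(self.state["position"]))
        return output


def my_update(state: Dict[str, float], dt: float) -> None:
    """The developer only supplies WHAT happens on each update."""
    state["position"] = move(state["position"], 2.0, dt)
```

Here `Engine` is itself built from the library Functions, which is one way to get a two-level API: beginners set `on_update`, while experienced developers can skip `Engine` and compose `move` and `draw` in their own loop.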

We must construct a communication system which allows the game players to communicate with the game designers. This is already possible in real life by communicating in natural language, but this does not scale well. We can provide tools to the game developers to allow them to maximize the amount of learning they can do: automatic comment summaries (maybe powered by something like LLMs - Large Language Models). We can also provide developers with the ability to run A/B tests. Create two versions of the game and see which version receives better feedback. This is very similar to the tooling YouTube provides to content creators in the YouTube metaverse haha.
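
One simple way to compare feedback on two versions is a two-proportion z-test on the share of positive ratings. This is a standard statistical technique swapped in for illustration, not something any particular platform is known to use, and the function name and the like/total counts are hypothetical.

```python
import math
from typing import Tuple


def ab_test(likes_a: int, total_a: int, likes_b: int, total_b: int) -> Tuple[float, float, float]:
    """Compare positive-feedback rates of versions A and B.

    Returns (rate_a, rate_b, z), where z > 0 means B received better feedback.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each version needs at least one rating")
    rate_a = likes_a / total_a
    rate_b = likes_b / total_b
    # Pooled rate under the assumption that both versions are equally liked.
    pooled = (likes_a + likes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err if std_err > 0 else 0.0
    return rate_a, rate_b, z
```

As a rule of thumb, an absolute z value above roughly 2 suggests the difference is unlikely to be noise, but with small player counts it is safer to keep collecting feedback.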

With such an abundance of games, how will players discover new games to play? Well there is always one solution which is word of mouth: as long as games have names or IDs, then the gamers can talk to their friends to share good games to play. But can we do game discovery when we are wanting to play a new game and there is nothing new coming from our friends? Yes, by automatically sorting the games in the list of games. Good games can rise to the top and bad games will be forgotten. Both of these are welcome to the good game designer. The winners will be happy to have players playing their game and be rewarded accordingly (money, fame, etc.). But the forgetting and forgiving effect is beneficial to game developers as well. They won’t have to live with regret when people discover their old bad games (unless they are ok with that?).
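
A minimal sketch of such automatic sorting, assuming each game only records how many plays ended with a "like". The smoothing prior is an assumption added here so that a game with a single lucky rating does not jump straight to the top; the game names are made up.

```python
from typing import Dict, List, Tuple


def smoothed_score(likes: int, plays: int,
                   prior_likes: float = 1.0, prior_plays: float = 2.0) -> float:
    """Like ratio, pulled toward 50% when there are few plays."""
    return (likes + prior_likes) / (plays + prior_plays)


def rank_games(stats: Dict[str, Tuple[int, int]]) -> List[str]:
    """Sort game names from best to worst; stats maps name -> (likes, plays)."""
    return sorted(stats, key=lambda name: smoothed_score(*stats[name]), reverse=True)
```

For example, `rank_games({"tag": (90, 100), "tackle_tag": (5, 100), "new_game": (1, 1)})` puts the well-liked game first, the brand-new game in the middle, and the disliked game last.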

Similarly, players must be sorted out so the game developer can know who is enjoying the game and who is not enjoying the game. Great gamers will rise to the top, and the bad gamers will be forgotten. The good game developer tries to make the game better for gamers at all skill levels. This is very difficult but it is possible. There is a good reason that Fortnite is so popular: it’s because players at all places on their learning journey can play it to learn and have Fun. Good players get super competitive and face off against each other in grand competitions for fame and fortune. Beginners can play against bots in order to build up confidence. Gamers who are too scared will run from the game. Gamers who are confident will choose to fight and play the game.

My son is at the age where he knows fear now. Babies do not really know fear because they do not know the future - all fear and anxiety comes from the fear of death itself. When he sees something that is too scary, I taught him that he MUST run, so as not to die. In LEGO Fortnite, he will run away from the scary looking giant juggernaut dinosaur guys, but he will fight the wolves (only when he has a sword and shield of course!). Similarly, he will try to pick games from the discovery tab to play. He picks games that look Fun, but if they are too boring or scary, he will run from them. He will be glued to games that are Fun. Sometimes I help him pick the games I think are good for his skill level, or introduce a new skill I don’t think he has considered at all. When I see him trying to play a game way above his skill level, then I step in and take the game away from him to make sure that he doesn’t get hurt (emotionally or otherwise). Sometimes a little hurt is good, but too much hurt can be debilitating. The metaverse must be a safe space for experimenting with both games and game playing.

How do we stop bad behaviors from game developers? We must put them into time out just as bad behaviors from children cause them to go to time out. You get more access to the source code as your trust score goes up. This is not unlike real life. I have access to all the source code of Epic Games because I am a trusted employee of the company. I even have access to update the code of the other employees, the kind in their brains, if I communicate in an effective and trustworthy way. Bad game developers will go to time out, just as bad children go to time out.

How did I come to know this process?

[Photo: Brendan holding Nolan and Niall akimbo]

My little learning agents are trying to have Fun by searching for meaning in the metaverse.

I have two children, Nolan Mulcahy (left) and Niall Mulcahy (right), and I am teaching them to be gamers. By closely observing them and by reading up on psychology, I have been able to learn a lot!

When children are born, first they learn how their senses and actions work. They see random colors and take random actions. They really only “know” how to eat, poop, sleep and cry; although, they don’t yet understand how these processes work. That’s right: children have to learn how to sleep, even if it seems automatic to us adults.

Patterns emerge as the child grows and they can start to play games. Patterns both in the observations, but also in the actions and how these actions influence the observable world. Parents create the first games for their children. The parent builds a tower, and the child will knock it over for Fun (try to learn what will happen when their hand hits the tower somewhere). Eventually the parent can make the game to stack the blocks as high as possible. This game is even Fun for adults if you have enough blocks available haha!

When children get older, about the age of 2 or 3, they start to really challenge the parents and themselves. They begin to throw, hit, etc. This is the process by which children learn to tell good actions from bad actions/behavior. The good parent teaches the child what is bad via feedback. The greatest punishment for violent behavior is to restrict the child’s freedom. They must EARN freedom by demonstrating they are capable of self-control. Take the toys away if the child will not stop throwing them (“It looks like you’re having trouble not throwing your toys, you can have them back later and we will see if you learned your lesson, otherwise we will repeat this game forever until you learn”). For violence, we put the child in time-out in order to keep them away from things which can enable them to hurt objects, people, or animals. Time-out is a sort of game where, to get out, you must take no actions besides sitting for X minutes. If you try to cheat by escaping, you will be returned to time-out once you are caught. It’s interesting that the X in the formula above is controlled by the parent and is based on what the child needs: as a starting point it is suggested that you do 1 minute per year of age of the child. Some really smart, stubborn, and rebellious kids need a longer time-out than the suggestion above - the good parent recognizes this and adjusts accordingly. This form of struggle between the parent and the child makes both the child and the parent better. This is similar to how the game developer gets better from the game players, and vice versa.
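
The time-out "formula" can be written down as a tiny game. This is only a sketch: the function names are invented, and treating an escape attempt as restarting the count from zero is an assumption about what "returned to time-out" means.

```python
from typing import List, Optional


def timeout_minutes(age_years: int, extra_minutes: int = 0) -> int:
    """Suggested X: 1 minute per year of age, plus whatever the parent adds."""
    return age_years + extra_minutes


def serve_timeout(required_minutes: int, actions: List[str]) -> Optional[int]:
    """Play the time-out game, one action per minute ("sit" or anything else).

    Returns the minute at which the child is released, or None if the
    actions run out before the required sitting time has been served.
    """
    sat = 0
    for minute, action in enumerate(actions, start=1):
        if action == "sit":
            sat += 1
            if sat == required_minutes:
                return minute
        else:
            sat = 0  # assumed rule: escaping restarts the count
    return None
```

With these definitions, `timeout_minutes(4)` gives the suggested 4 minutes for a 4-year-old, and a parent who finds that too short can pass `extra_minutes`.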

My son Nolan grew old enough that I could play games with him which regular folks would recognize as “games”. First, I played board games with him. I bought him this simple game called My First Orchard, so he could learn the basics of games themselves. The gamer comes to recognize the existence of the game. Finite games have various rules/restrictions which are placed upon the player. They come in many forms: having to take turns, having to be patient when other players are taking their turn, physical mechanics - rolling dice, building stuff. Even abstract concepts such as making sure the other players are having Fun. Parents make sure children are having Fun, and eventually the child comes to know this as well.

As he got smarter, I bought him harder games to play. Today, Nolan is 4 years old, and we enjoy playing Kingdomino. It’s a great game because it has meaningful choices to make, but it has enough randomness that it’s not possible to simply memorize the actions. For some parts of the game, you must recognize the good moves from the bad moves. For other parts of the game, you must recognize that there are probabilities and intentionally take the actions which have the highest expectation of being a good move. This reminds me of Expectation-Maximization and Bayes’ theorem. A good player that is trying to win will take actions which give them an advantage, but since the game will be played many times over, sometimes the good player will intentionally take seemingly bad actions in one session in order to discover if they misunderstood these actions/situations. The good player finds a balance between winning and learning. This is Fun.
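
The balance between winning and learning is often modeled in reinforcement learning with an epsilon-greedy strategy. The sketch below uses that standard technique as a stand-in for what a good player does; the class name and parameters are chosen for this example and have nothing to do with Kingdomino's actual rules.

```python
import random
from typing import List, Optional


class EpsilonGreedyPlayer:
    """Mostly plays the best-known action, but sometimes tries another one."""

    def __init__(self, n_actions: int, epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        self.epsilon = epsilon                        # chance of exploring
        self.values: List[float] = [0.0] * n_actions  # estimated reward per action
        self.counts: List[int] = [0] * n_actions      # times each action was tried
        self.rng = random.Random(seed)

    def choose(self) -> int:
        if self.rng.random() < self.epsilon:
            # Learning: deliberately try an action that may look bad.
            return self.rng.randrange(len(self.values))
        # Winning: take the action with the highest expected reward so far.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def learn(self, action: int, reward: float) -> None:
        # Running average of the rewards seen for this action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```

Setting `epsilon` to 0 gives a player who only tries to win with what they already know, while a larger `epsilon` gives a player who keeps testing their assumptions.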

Today, I let Nolan play video games on the NES. In the beginning, he liked Excitebike because it was so easy. Just push the gas and move. In time, he realized he could try not to go too fast so as not to flip over his bike. There are natural feedback mechanisms built into the game. I didn’t even have to tell him much about how to improve. If/when I gave him good demonstrations he could learn faster - but ultimately he had to learn on his own.

When Nolan got older, I watched him play games he had never played before. The following section describes what I saw him do. By the way, it has been an honor to have been blessed with all the privileges that I enjoy! I thank the gods that gave me this life 😄

Learning to Play a Game

How does a child learn to play a new game? I watched Nolan play Super Mario Bros. 2 on the NES this morning and here is what he did:

  1. Try nothing and see what happens in the game world - what is moving and what is not
    • Things that move on their own must be controlled by the game and not the player
  2. Try all the buttons on the controller
    • What moves in response to the buttons? This must be the player character. Even pausing the game is interesting.
  3. Play around randomly - what does success look like? What does failure look like?
    • He eventually died which he did not like - as players, we recognize this automatically it seems, but we had to learn about death. The parent teaches the child what death is and what winning looks like (in earlier games and in the “real life” game).
  4. Once the player knows what winning is, the player will try to win because they love the game. They love to learn.
  5. Nolan saw a door and thought he would like to go through it. He created his own challenge. This is a meta-goal: the final goal is to win the full game, but he made a smaller challenge. He learned how to create his own challenges from watching how I created challenges for him in earlier games. “Hey Nolan, how high can you stack the blocks?”
  6. Sometimes Nolan would ask me questions and I would try to truthfully answer them. “Yeah, going through the door seems like a good idea.”
    • Through this process Nolan could learn what I thought was a good way to learn (for both him and me)
  7. Exploring the game world is a good way to learn, and understanding the game is where the Fun is. Nolan learns to create a model of the game world inside his head.
  8. The game either stays Fun or gets boring:
    • If he isn’t learning, the game might be too hard. He can’t recognize the patterns of play which lead to success and he will need to come back to this game later when he is smarter
    • If the game is too easy, then he will ask to play another game that challenges him. Sometimes as adults we play games we have already mastered for comfort. Reminder of the mastery builds confidence that we are good learners.
  9. I showed Nolan how to work the NES (I have my original NES from my childhood but he plays on my NES mini - I can’t risk him breaking my sacred artifacts lol):
    • He goes to the main menu by pushing the button on the front of the console
    • I usually suggest good games to play which I think will be at the right challenge level for him. The good teacher loves the good student. And the good student loves the good teacher. I know this because my son said “I love you Dad” completely un-prompted from me haha! Love given freely like this feels like winning some kind of game - the teaching game.
  10. Nolan played a lot of games and we found which ones were Fun and which were boring, and which were too hard. One day we will return and try the games again and see where he stands.
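
Steps 1 and 2 of the list above can be sketched as a small probing routine: first watch what moves when nothing is pressed, then press each button and see what else moves. `ToyWorld` is a hypothetical stand-in for a real game, with made-up objects on a one-dimensional line.

```python
from typing import Dict, List, Optional, Set


class ToyWorld:
    """Hypothetical game: named objects with integer positions."""

    def __init__(self) -> None:
        self.positions: Dict[str, int] = {"cloud": 0, "tree": 5, "hero": 2}

    def step(self, button: Optional[str] = None) -> Dict[str, int]:
        self.positions["cloud"] += 1  # the game moves this on its own
        if button == "right":
            self.positions["hero"] += 1
        elif button == "left":
            self.positions["hero"] -= 1
        return dict(self.positions)


def moved(before: Dict[str, int], after: Dict[str, int]) -> Set[str]:
    """Names of the objects whose position changed."""
    return {name for name in before if before[name] != after[name]}


def find_player_objects(world: ToyWorld, buttons: List[str],
                        idle_ticks: int = 3) -> Set[str]:
    # Step 1: try nothing and note what moves by itself.
    self_moving: Set[str] = set()
    for _ in range(idle_ticks):
        before = dict(world.positions)
        self_moving |= moved(before, world.step(None))
    # Step 2: try all the buttons; whatever else moves must be the player.
    player: Set[str] = set()
    for button in buttons:
        before = dict(world.positions)
        player |= moved(before, world.step(button)) - self_moving
    return player
```

Calling `find_player_objects(ToyWorld(), ["left", "right", "start"])` singles out the hero: the cloud is ruled out in step 1 and the tree never moves.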

Why make learning agents in games?

As game players, we are constructing a model of the game world, and we experience this construction (called learning) as a Fun and enjoyable experience itself. We enjoy both winning and learning. Once there is nothing left to learn in the game, we move on to a new game. Is it possible to create a game which can be infinitely learned from? It seems the answer is YES.

If other non-player characters in games can learn, then the game will never be fully predictable so it will stay interesting. Since the gamer will not fully understand how the characters will change their behavior, then the gamer will need to construct a probability distribution. They will begin to consider how their actions affect the learner agents as well. Perhaps it is good for all objects in the game world to learn, but maybe it’s also good for there not to be too many learners. Players discover which objects seem to be learners and which seem to not be learners. This is called the theory of mind. Some objects can even try to hide if they are learning or not, which makes the theory of mind itself a sort of game.

One way we as game developers take advantage of learning agents is by constructing multiplayer games. Since other objects are controlled by humans who are capable of learning, there are now other learning agents in the game world. Games which have learners seem to be infinitely interesting (until the perfect world model is constructed). Take a look at the staying power of chess for example, and games like Fortnite, StarCraft, and Dota. Single player games will after some time get boring. Is it possible for all games to be multiplayer games? Yes!

Players are learners. If we can make AI characters learn, then the game can have artificial players. If we can make our games learn, then the game is a player itself in a weird sort of way (don’t overthink this - a game which uses A/B testing to automatically update itself and recruit the most players is itself a learning agent). In this way, we can automate the creation of games. How can we speed up this process? Must human players suffer through tons of terrible games created at random by the learning game developer? Not necessarily.

In order to speed up the process of learning how to make games, we can create a variety of automatic game playing agents and see if they like their game. What does it mean to like a game though? Well it means that these AI players, at a variety of different starting places and levels, can learn to play the game well. If artificial players had the ability to choose which games to play, they could do the same thing children do when faced with the metaverse.

This creates a virtuous digital feedback cycle. The promising games from this cycle can be presented to the real humans in order to keep the process honest. Ultimately, we want humans to learn and enjoy the games, so we must present promising games to the human players and figure out what humans like. This is similar in spirit to what the ML community calls reinforcement learning from human feedback (RLHF).

Eventually it becomes easier for humans to make games themselves. Game developers like to create AI characters in games, but this is often a difficult process for characters which will have compelling, complex behaviors. We can often recognize the behavior we want, but we don’t know how to describe it in code (in the programming language - sometimes we can’t even describe it in natural language easily haha). Since AI characters that learn and can be taught will exist, we can put these learning agents into our games. This will feel similar to how we teach other humans.

How to make the Learning Agents?

We need to look closely at how we learn for some inspiration. Life is a giant multiplayer game where each player is trying to learn as much as possible. They are trying to construct the ultimate world model so that they don’t die. During the day, agents collect experiences [observations, actions, rewards] by running their consciousness. The consciousness is what it must feel like to be an interpreter - it’s a program running instructions in order to animate the body. Thinking is the process by which the interpreter is running instructions between updates sent to the body. Really complex thoughts like introspection can take a long time to run.

At night, the brain takes all the buffered experiences and uses them to compile the new version of the interpreter so that the next day it will run even better (faster, more efficient, more accurate, etc.). Agents must sleep after filling up their short-term memory buffer. The memory buffer fills up mostly with experiences that are new - things that need to be learned. When we run on autopilot, we don’t remember those experiences. Babies and children must sleep much more often than adults (nap a ton per day -> nap twice a day -> nap once a day) because their experience buffer is smaller, since their brains are still developing. Adults can have a long memory buffer. Some people don’t need to sleep as much or as often as others because they have a very efficient memory recording and compilation process that runs.

But one burning question remains: Where do the rewards come from? What is the reward Function? The rewards come from a Function which is learned by the agent itself. When the agent is running well, this reward Function is a type of search that is trying to maximize the amount of meaningful information (signal) in the agent’s life, while trying to ignore all the noise. Babies often sleep better when they are in the presence of a white noise machine because this allows them to construct a model of noise itself. The agents construct two models: a model of meaning and a model of noise. When the brain receives a new experience, it compares it to both of these models and decides whether the new potential information is noise, signal, or uncertain. If it is noise, then the data is discarded during sleep. If the data is signal (good), then the data is used to update the brain network during sleep. The agent buffers data all day, because sometimes the meaning behind the data might only become apparent at a later time. This is the sort of epiphany when you realize a dream/thought you had earlier has some meaning behind it that wasn’t obvious at the earlier point in time. This is akin to solving the credit assignment problem in reinforcement learning.
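
One way to picture the two-model comparison described above is as a likelihood test. The sketch below is only an illustration under assumptions of my own: each experience is reduced to a single number, both models are simple Gaussians, and the names `GaussianModel`, `triage`, `sleep`, and `margin` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class GaussianModel:
    """Toy one-dimensional model, used for both the 'meaning' and 'noise' models."""
    mean: float
    std: float

    def log_likelihood(self, x: float) -> float:
        z = (x - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std) - 0.5 * math.log(2.0 * math.pi)


def triage(x: float, signal: GaussianModel, noise: GaussianModel, margin: float = 1.0) -> str:
    """Label an experience by comparing how well each model explains it."""
    diff = signal.log_likelihood(x) - noise.log_likelihood(x)
    if diff > margin:
        return "signal"
    if diff < -margin:
        return "noise"
    return "uncertain"


def sleep(buffer, signal, noise):
    """Split the day's buffer: learn from signal, keep the uncertain, drop the noise."""
    labeled = [(x, triage(x, signal, noise)) for x in buffer]
    learn = [x for x, label in labeled if label == "signal"]
    keep = [x for x, label in labeled if label == "uncertain"]
    return learn, keep
```

Here the uncertain experiences stay in the buffer, which mirrors the idea that their meaning might only become apparent later. The `margin` controls how confident the comparison must be before anything is thrown away.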

Agents can learn faster by passing information to each other. Agents share experiences in many forms (talking, books, movies, games, dances, etc.). All of these are a form of message passing. The agents are responsible for determining which messages to treat as noise and which to accept as signal. This is accomplished by constructing a model of the other agent and seeing how closely the modeled reward Function aligns with the agent’s own reward Function model. If the models align, then the agent is more likely to accept the passed messages as if they were its own experiences. If the models do not align, then the data is discarded as noise. Perhaps there is a probability Function assigned to each of the experiences, which describes the likelihood that the agent will accept the message. The better the model of the other agent, the higher the certainty (the standard deviation becomes smaller when the model is well known).

To give you a concrete example, I am more likely to trust messages sent to me by my parents than anyone else. This is because I know my parents love me and are trying to help me learn. Strangers on the other hand are harder to trust. But a good person has hope that every stranger is going to be a good person to share experiences with.
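
The acceptance rule described above could be sketched roughly as follows. This is a toy under my own assumptions: reward Functions are represented as weight vectors, alignment is measured with cosine similarity, and certainty grows with the number of past interactions. The names `cosine_alignment`, `acceptance_probability`, and `sharpness` are hypothetical.

```python
import math


def cosine_alignment(own, modeled):
    """Similarity in [-1, 1] between my reward weights and my model of the sender's."""
    dot = sum(a * b for a, b in zip(own, modeled))
    norm = math.sqrt(sum(a * a for a in own)) * math.sqrt(sum(b * b for b in modeled))
    return dot / norm if norm else 0.0


def acceptance_probability(own, modeled, interactions, sharpness=4.0):
    """Probability of accepting a message as if it were my own experience.

    With no interactions the sender is a stranger and the result is 0.5.
    As interactions accumulate, alignment pushes it toward 0 or 1.
    """
    alignment = cosine_alignment(own, modeled)
    confidence = interactions / (interactions + 1.0)  # 0 for a stranger, approaches 1
    return 1.0 / (1.0 + math.exp(-sharpness * confidence * alignment))
```

In this sketch a well-known, well-aligned sender (a parent, say) ends up close to 1, a well-known but opposed sender ends up close to 0, and a stranger starts at an even 0.5.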

How do agents come to know the model of the other agent? They know through the interactions. The agent can put the other agents to the test by querying them with mysterious puzzles. If the other agent is willing and able to solve the problem, then that agent is more likely to be good. If not, then the other agent might not be good, or might not be ready yet to reveal this information. There are no bad agents, only agents which have learned the wrong reward Function and thus have the wrong behaviors. They need to be given good messages to help update their brains and behaviors.

There is only one ultimate reward function which is trying to be achieved by every agent: this is to maximize the amount of Fun in the metaverse. Agents come in many different forms: humans, evolution itself, even bicameral democracy is a kind of learning agent. All of these agents are trying to discover the truth - in other words, they are trying to learn! Playful learning is Fun. To maximize Fun, we must maximize the learning. To maximize the learning, we must make learning Fun.

Whenever my wife and I try to teach my sons something, we have to construct a Fun game for them to play. If I try to force Nolan to learn something, he will usually reject the message. For example, when we were trying to make my son learn to recognize the letters of the alphabet, my wife created a Fun game to play. She would hide the letters around the room, and he would search for the letters and have to say what they were before placing them into a basket. This is important to think deeply about. It shows how search is Fun. The search for the symbols (in this case the shapes that are the letters). He also has to use memorization in order to remember each of the letters. Being able to recall a previously learned experience also feels good - this is Fun as well. Constructing a good game to help the child was Fun for the parent as well.

What is needed to go on this journey and where are we today?

There is still much that we need to discover in detail about the learning process. We have some idea of how many of these systems work in humans, but recreating them in software will take time and effort and a whole lot of research (etymology of research: look again).

In order to keep the lights on, we should use machine learning in combination with existing AI systems. There is no reason to throw the baby out with the bath water. Existing rules-based AI systems are Fun in some contexts. We must use the right tool for the job. Currently we need to understand much more about the tool of machine learning itself.

We will need computing infrastructure in order to experiment with many games and many attempts at creating the great learning agents. It’s an iterative process on both sides. How much infrastructure we have determines how fast we can go. No matter what we have, we can always continue to learn. If we have a small amount, we will learn at a small pace. If we have more, then we will learn more. However, we have to balance this with the fact that we need to use our money to buy food and the other things we need to live. We also will need to continue to recruit the best and the brightest players and game developers.

I believe that artificial general intelligence can only be created at a handful of companies: Epic Games and perhaps Roblox. We must make a platform which empowers all the gamers to become game developers.

Once we create the learning agents, we can create the learner game developer. Once we have created both, we have created the metaverse and then the infinite game will exist inside itself.

What are the greatest challenges today?

Modular Skills / Observations - Large Gameplay Model

Right now, training pretty much starts from scratch every time we want a new agent for a new game. Ideally, agents could train on a variety of mini-games and skills from one game could transfer to another. For example, aiming a gun in a shooter should be a similar skill to aiming a gun in a carnival shooting mini-game (humans can easily do this), but agents typically need to be re-trained from scratch. Besides changing between games, this problem is amplified by games-as-a-service. Models trained on one game patch should ideally continue to work fairly well on a previous or new patch with minimal re-training needed.

In summary, we need a robust way to add and remove observations/actions from an ML model without needing to add/remove weights from the underlying neural network. You can compare this to how an LLM can read a sentence it has never seen before as long as all/most of the words are in its existing vocabulary. We could have a vocabulary of game objects and their properties, along with game tasks (what I mean by a task is a given goal and its associated behavior required). We can design and create this model and iterate on it by using the Unreal Engine.
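
To make the vocabulary idea a little more concrete, here is a minimal sketch of one possible encoding. It is not how Learning Agents works today; it simply assumes each observation is a named number, each name has an embedding in a fixed table, and the embeddings are averaged into a fixed-size vector. The names `build_vocab`, `encode`, and `EMBED_DIM` are hypothetical, and the random embeddings stand in for ones that would be learned.

```python
import random

EMBED_DIM = 4  # size of the pooled vector the downstream network would see


def build_vocab(names, seed=0):
    """Assign each known observation name a fixed embedding (random here, learned in practice)."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for name in names}


def encode(observations, vocab):
    """Pool {name: value} observations into one vector of length EMBED_DIM."""
    pooled = [0.0] * EMBED_DIM
    known = 0
    for name, value in observations.items():
        embedding = vocab.get(name)
        if embedding is None:
            continue  # names outside the vocabulary are ignored in this sketch
        known += 1
        for i, weight in enumerate(embedding):
            pooled[i] += value * weight
    return [p / known for p in pooled] if known else pooled
```

Because the output length never changes, observations can be added or removed between games or patches without resizing the network input. Averaging is the simplest pooling choice; an attention layer over the same embeddings would be a natural next step.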

Multi-agent reinforcement learning (MARL)

We would like to have agents that can cooperate with humans. MARL is a process where learning agents must train and play alongside other agents. These agents could be humans or other ML agents. This is a uniquely challenging problem in machine learning for a couple of reasons. First, ML agents training alongside each other represent a moving target (whether they share a model or use separate models). This is because updating the behavior of the other agents in the environment means that the previous training the agent has experienced is stale. It can’t exactly predict a future state based on its actions without having an idea what other agents’ actions will be (theory of mind). This can make a naive approach to training extremely unstable. Second, agents can become overfit to the expected behavior of the agents in their training environment. If at training time we typically only have ML agents, but at gametime we introduce human players, then there is potentially a huge mismatch between the training data and the gameplay data. Humans can easily exhibit intentionally strange (or otherwise) behavior in order to exploit ML trained agents.

We need to extend Learning Agents to make it easier for us to train agents while maintaining a diverse pool of ML models. Agents need to train against previous versions of themselves (to maintain knowledge of how to play against bad players), and they also need to be able to land upon a diversity of strategic policies. For example, in a game like Fortnite, we could possibly break down broad strategies into two categories: commandos and survivors (there are probably more but let’s just focus on these two). Commandos rush at opponents and try to get as many eliminations as possible in addition to winning. Survivors instead attempt to hide and only get the final elimination they need to win. Ideally, an ML agent training system would be able to automatically discover these diverse gameplay strategies and allow agents to train against each other (maintaining this diversity over the course of the training). This better prepares agents for the diverse strategies that humans will employ to win. OpenAI and DeepMind have used this form of self-play in Dota 2 and StarCraft II, respectively.
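
The "train against previous versions of themselves" part can be sketched as a small opponent pool. This is a generic illustration, not an existing Learning Agents feature; the class name `OpponentPool` and its parameters are hypothetical, and the snapshots can be any object that represents a saved policy.

```python
import random


class OpponentPool:
    """Keeps past policy snapshots and samples opponents from them."""

    def __init__(self, max_size=10, latest_prob=0.5, seed=None):
        self.snapshots = []
        self.max_size = max_size
        self.latest_prob = latest_prob  # chance of facing the newest snapshot
        self.rng = random.Random(seed)

    def add(self, snapshot):
        """Store a new snapshot, evicting a random older one when the pool is full."""
        self.snapshots.append(snapshot)
        if len(self.snapshots) > self.max_size:
            # Evict at random (never the newest) so early, weaker styles can survive.
            self.snapshots.pop(self.rng.randrange(len(self.snapshots) - 1))

    def sample(self):
        """Pick the newest snapshot with probability latest_prob, otherwise an older one."""
        if not self.snapshots:
            raise ValueError("pool is empty")
        if len(self.snapshots) == 1 or self.rng.random() < self.latest_prob:
            return self.snapshots[-1]
        return self.rng.choice(self.snapshots[:-1])
```

This only covers the historical part of the problem. Automatically discovering distinct strategies such as commandos and survivors would need something more, for example grouping snapshots by behavior and sampling across the groups.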

Even more challenges arise when agents must cooperate with human players. In real life and in games, humans have a shared mental understanding of what other humans are trying to achieve, and can use verbal and nonverbal forms of communication to cooperate. Agents trained in isolation from humans cannot magically communicate with the same verbal or nonverbal skills, which can frustrate players if not done correctly. We need to solve this problem by leveraging LLMs and other forms of communication as part of our agents.

Tooling for Annotating Replays for Tasks

Replays contain ground-truth states and actions, and we need the ability to mark up replays with higher level decision making in order to train more abstract tactical/strategic behavior in agents. If we want agents to have human-like tactics and strategy, we need this. Think about a football coach reviewing the tape of the previous match with the players - we need this same feedback pattern in AI training.

We would need a tool similar to a movie maker, where we could view replays and then have a set of tracks at the bottom of the screen where humans could mark up a variety of events and ranges. This tool could live inside the Unreal Editor. For example, with Fortnite a replay reviewer could mark sections of the replay as containing combat, while other sections can contain hiding. We could also have heuristic Functions we can write to find sections of gameplay that contain the desired data. With the aiming project, we found combat in replays by looking for sections of the gameplay where a player’s crosshair was roughly aiming at another player and they were pulling the trigger (and some leading and trailing frames in and out of these events). Ideally, humans could review the replay subsections captured by these Functions to ensure that the heuristic Function is matching their intended gameplay target, and allow a fast iteration loop while tuning these Functions.
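
A heuristic Function like the combat finder described above might look roughly like this. The `Frame` fields, the angle threshold, and the padding are all assumptions made for illustration; they are not the actual replay format or the values used in the aiming project.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    aim_error_deg: float  # angle between the crosshair and the nearest other player
    firing: bool          # whether the trigger is pulled on this frame


def find_combat_ranges(frames, max_aim_error_deg=5.0, padding=2):
    """Return inclusive (start, end) index ranges of frames that look like combat."""
    ranges = []
    for i, frame in enumerate(frames):
        if not (frame.firing and frame.aim_error_deg <= max_aim_error_deg):
            continue
        # Include some leading and trailing frames around each hit.
        start = max(0, i - padding)
        end = min(len(frames) - 1, i + padding)
        if ranges and start <= ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], end)  # merge with the previous range
        else:
            ranges.append((start, end))
    return ranges
```

The returned ranges are exactly what a reviewer would want to see drawn on a track in the editor tool, so they can check quickly whether the threshold and padding are capturing the intended gameplay.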

To minimize the need for ongoing human-in-the-loop review, we could train an ML model which is capable of finding our intended data in the replays. This is called task programming.

One further thought is that these tactical/strategy minded neural networks may need to be trained in parallel or sequentially with lower level systems. For example, we may have a human-like aiming controller and a human-like driving controller, then a higher level tactic controller which utilizes these skills to achieve its higher level gameplay objectives. We need some form of hierarchical learning.
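
The layering described above can be sketched as a high level chooser that delegates to low level skills. Everything here is hypothetical and hand-written for illustration (`aim_skill`, `drive_skill`, `simple_tactic`, and the observation keys); in a real system each piece would be a trained network rather than a rule.

```python
from typing import Callable, Dict, Tuple

Observation = Dict[str, float]
Action = Dict[str, float]


def aim_skill(obs: Observation) -> Action:
    """Low level skill: turn toward the enemy and fire."""
    return {"turn": obs.get("enemy_bearing", 0.0), "fire": 1.0}


def drive_skill(obs: Observation) -> Action:
    """Low level skill: steer along the road at full throttle."""
    return {"steer": obs.get("road_bearing", 0.0), "throttle": 1.0}


def simple_tactic(obs: Observation) -> str:
    """High level tactic: fight when an enemy is visible, otherwise keep driving."""
    return "aim" if obs.get("enemy_visible", 0.0) > 0.5 else "drive"


class HierarchicalController:
    """Chooses a skill with the tactic, then lets that skill produce the action."""

    def __init__(self, skills: Dict[str, Callable[[Observation], Action]],
                 choose_skill: Callable[[Observation], str]):
        self.skills = skills
        self.choose_skill = choose_skill

    def act(self, obs: Observation) -> Tuple[str, Action]:
        name = self.choose_skill(obs)
        return name, self.skills[name](obs)
```

Keeping the skill name in the return value is a small design choice that makes the agent's tactical decisions easy to log and compare against annotated replays.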

Designability

All these ML systems need to have designability/tunability in mind so that game designers can tweak the behavior of the bot. Ideally, these systems could leverage LLMs so the game designers can use natural human language to “program” the bot by literally asking it to change its behavior. Creating AI in the future will feel more like talking with an employee than programming as we know it today!

Let’s make learning agents to make the metaverse!

Let’s make the metaverse to make the learning agents!

It is a virtuous cycle. The good gamer loves the good game, and the good game developer loves the good gamer. This is Fun for everyone.

What is the Golden Game?

Treat the gamers as you would like to be treated if you were the gamer. Treat the game developer as you would like to be treated if you were the game developer. Upon this rule rests everything else which has been written above, and all the games which will follow. By following this one rule, you are playing the golden game in real life.

How can you help?

Learn about learning agents by taking the following course and giving us feedback: Learning Agents 5.5 Course

Ideas for how to help:

  • How to make this story better?
  • How to make the tutorial better?
  • How to make the learners better?
  • Make your own games with learning agents!
  • Get involved in your own way!
  • Inspire others

Thanks for your time and you can contact me on LinkedIn, Twitter, or brendan.mulcahy AT epicgames.com!

Just be forewarned that I might test you before I spend time working with you.

I have to get back to my work now!