Thursday, August 26, 2021

Running Q-Learning on your palm (Part1)

Japanese summary: This app lets you learn the basic idea and mechanism of reinforcement learning (especially Q-Learning), a field of artificial intelligence technology, by running a simple example. Reinforcement learning is used in AlphaGo (the game of Go), autonomous driving, autonomously controlled robots, and so on, and is attracting a great deal of attention. Here, before tackling such practical applications, you can get a feel for the essence of the technique on a familiar smartphone. Enjoy reinforcement learning on the smartphone in your palm! (This app was developed with MIT App Inventor.)

This app won the October 2021 "MIT APP INVENTOR OF THE MONTH" award.

Abstract
This app explains the basic idea and mechanism of reinforcement learning (especially Q-Learning), a field of artificial intelligence technology, using a simple example. Reinforcement learning is used in AlphaGo, autonomous driving, autonomously controlled robots, and so on, and is attracting a lot of attention. Here, you can get familiar with the essence of the technology on your own smartphone before confronting such practical-level problems. As an example, a robot is trained to go straight down a corridor and successfully acquire the gem placed along the way. For this purpose, I use the reinforcement learning method called Q-Learning. The results of training can be confirmed through an animation of the robot's movement. I have developed a smartphone app that realizes all of this using MIT App Inventor. Q-Learning runs on your palm!

# This application for Android is published below:
Source code: here

# This app can also be applied to robots with more complex behaviors than in the example below. In that case, it is enough to add definitions of the new actions and rewards. For details, please see Part 2 here.

# For the case where the robot moves on a 2D grid, please see Part 3 here.

Overview of the Q-Learning app
The configuration of the developed application is shown in Fig. 1. You can see a robot and a green gem in the corridor at the bottom. The goal is to train the robot to pick up the gem at the right place. (The robot's possible actions are explained later.) To do this, first press the "init" button, then train with the "train" button. Each press of "train" performs 100 learning steps, where one step corresponds to the robot moving one block along the corridor or picking up (taking) the gem.
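To make this setup concrete, here is a minimal Python sketch of the kind of environment the app simulates. The corridor length, the state encoding, the termination rule, and all names (CorridorEnv, N_CELLS, and so on) are my own assumptions for illustration; the actual app is built from App Inventor blocks and may differ in these details. The reward values anticipate Fig. 4 below.

```python
import random

N_CELLS = 8                      # assumed corridor length
ACTIONS = ("Forward", "Take")    # the two actions available to the robot

class CorridorEnv:
    """A minimal sketch of the corridor task shown in Fig. 1 (assumed details)."""

    def reset(self):
        # Start a new play: robot at the left end, gem somewhere ahead of it.
        self.robot = 0
        self.gem = random.randint(1, N_CELLS - 2)
        return (self.robot, self.gem)

    def step(self, action):
        """Apply one action; return (next_state, reward, done, success)."""
        if action == "Take":
            if self.robot == self.gem:
                return (self.robot, self.gem), +5, True, True   # gem acquired
            return (self.robot, self.gem), -1, True, False      # "Take" at the wrong place: failure
        # "Forward": move one block to the right
        self.robot += 1
        done = self.robot >= N_CELLS - 1                        # right end reached without the gem
        return (self.robot, self.gem), -1, done, False
```

Each press of "train" would then correspond to roughly 100 calls of step(), with the Q-table (introduced below) updated after every call.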

At each step, the action receives a reward. Actions that lead to success are rewarded highly. A series of the robot's actions, starting from the left end and ending in final success or failure, is called an episode. Although the position of the gem changes with each play (game), the average sum of rewards for successful episodes can be calculated theoretically in this example. Repeat "train" until the observed value approaches this theoretical one. If the theoretical value cannot be calculated, stop the iteration when the value appears to have converged to a certain high level.
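As a purely hypothetical illustration of what such a theoretical value looks like (the actual geometry is defined by the app and Fig. 2, not reproduced here): suppose the gem can appear, uniformly at random, at any of M positions, k blocks from the start.

```latex
% Each "Forward" costs -1 and the final successful "Take" earns +5 (see Fig. 4),
% so a successful episode with the gem k blocks away has reward sum
R(k) = 5 - k, \qquad
\overline{R} = \frac{1}{M}\sum_{k=1}^{M}\bigl(5 - k\bigr) = 5 - \frac{M+1}{2}
% e.g. M = 6 possible gem positions gives an average of 5 - 3.5 = 1.5.
```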

The data in the large yellow area (the Q-table) on the screen indicates which action is preferable (more valuable) in each situation along the way. As I will explain later, with repeated training this Q-table approaches the correct values.

After training, press the "anim for test" button to watch the robot's behavior as an animation. A major feature of this kind of reinforcement learning is that the robot's behavior is determined entirely by the contents of the Q-table. Therefore, if the Q-table is inaccurate, the robot often fails as follows:

On the other hand, if the contents of the Q-table are correct, gem acquisition will always succeed, as shown below:

Challenges imposed on robots and possible actions
Let's look at this example in a little more detail. The actions that the robot can take, the conditions for obtaining the gem, and the conditions for completing the task (play) are shown in Fig. 2. Figure 3, on the other hand, illustrates whether "Take" (the action of picking up the gem) or "Forward" (moving one block to the right) leads to success in a given situation. To get the gem, the robot must perform a "Take" action at the same place as the gem. Obviously, in this example, the robot should select "Forward" (2) instead of selecting "Take" (1). If it then selects "Take" (3), it succeeds.




It would be easy for a human to write a program that solves this problem (challenge). But here, instead, the robot itself must discover the solution through learning. Please keep this point in mind!

How to choose an action
As already mentioned above, I introduced rewards to decide whether the "Take" action or the "Forward" action should be chosen in a particular situation. As shown in Figure 4, the reward is +5 for a "Take" action that succeeds in acquiring the gem, and -1 for all other actions. It is therefore advantageous to choose the action that leads to the larger sum of rewards.
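Written as code, the reward scheme of Figure 4 is just a few lines. This is a sketch with my own names, not the app's internal implementation:

```python
def reward(robot_pos, gem_pos, action):
    """Reward scheme of Figure 4: +5 for taking the gem at its cell, -1 otherwise."""
    if action == "Take" and robot_pos == gem_pos:
        return 5
    return -1

# Sum of rewards for a successful play with the gem two blocks ahead:
# Forward (-1), Forward (-1), Take at the gem (+5)  ->  3
print(reward(0, 2, "Forward") + reward(1, 2, "Forward") + reward(2, 2, "Take"))  # 3
```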

Update Q-table
To achieve this, I use a Q-table that represents the value (in other words, the worth) of both actions in the current state, as shown in Figure 5. In the figure, in the untrained state (train = 0 steps), the action value of "Take" is higher than that of "Forward", so the "Take" action is selected. But this is wrong, because the Q-table is still inaccurate. On the other hand, in the fully trained state (train = 800 steps), the Q-table has been updated to reasonable values, and the "Forward" action is correctly taken. Roughly speaking, the Q-table is an estimate of the sum of future rewards.
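A toy illustration of the situation Figure 5 describes, with made-up numbers: before training the Q-table happens to rate "Take" higher (which leads to failure), while after training "Forward" wins.

```python
# Q values for one state (robot two blocks before the gem); the numbers are invented.
q_untrained = {"Forward": -0.3, "Take": 0.2}   # train = 0 steps: "Take" looks better (wrong)
q_trained   = {"Forward":  2.1, "Take": -1.0}  # train = 800 steps: "Forward" wins (correct)

def best_action(q):
    """Pick the action with the larger Q value, as the test animation does."""
    return max(q, key=q.get)

print(best_action(q_untrained))  # -> "Take"    (fails)
print(best_action(q_trained))    # -> "Forward" (succeeds)
```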

The Q-table update is based on a famous learning rule called Q-Learning as shown below:
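In the standard textbook notation of reference [1], this rule can be written as

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

where α is the learning rate and γ is the discount factor, two of the hyperparameters that appear later. A minimal Python sketch of the same update applied to a dictionary-based Q-table follows; the function name and default values are mine, not the app's:

```python
def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.9):
    """One Q-Learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    if done:
        target = reward                      # no next state to look ahead to
    else:
        target = reward + gamma * max(q_table[next_state].values())
    q_table[state][action] += alpha * (target - q_table[state][action])

# Tiny usage example with two toy states:
q_table = {"s0": {"Forward": 0.0, "Take": 0.0},
           "s1": {"Forward": 0.0, "Take": 0.0}}
q_update(q_table, "s0", "Forward", reward=-1, next_state="s1", done=False)
print(q_table["s0"]["Forward"])   # -0.1: the estimate moved a little toward the target
```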


This update formula means that the value of Q for the action taken in the current state is moved toward the received reward plus the highest Q value attainable in the next state. It is known that such updates converge to the optimal Q values, provided that a sufficient number of episodes covering all states are attempted. If you need a more detailed explanation, please see the references ([1][2][3][4][5]). The relationship between this Q value and the Q-table in this application is illustrated below:


In fact, the figure below shows that the contents of the Q-table have almost reached their optimal values after 1100 steps of training.


Expand the app a little
Finally, here is an example of how to extend this app. To increase its versatility, the app has three hyperparameters (α, γ, ε). For example, how much are the convergence speed and stability affected if the value of the learning rate α is changed?
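One way to explore this question outside the app is to rerun the same toy problem in plain Python with several values of α and record the reward sum of every episode; those curves are exactly what a line graph displays. The sketch below reuses the corridor setup assumed earlier (again, the corridor length, ε, γ, and all names are my assumptions, not the app's internals):

```python
import random

N_CELLS, ACTIONS = 8, ("Forward", "Take")

def run_training(alpha, gamma=0.9, epsilon=0.1, episodes=300, seed=0):
    """Train on the toy corridor task; return the reward sum of each episode."""
    rng = random.Random(seed)
    q = {}                                            # Q-table, created lazily per state
    history = []
    for _ in range(episodes):
        robot, gem = 0, rng.randint(1, N_CELLS - 2)
        total, done = 0, False
        while not done:
            s = (robot, gem)
            q.setdefault(s, {a: 0.0 for a in ACTIONS})
            # epsilon-greedy choice: mostly exploit the Q-table, sometimes explore
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(q[s], key=q[s].get)
            # dynamics and rewards (+5 for a successful "Take", -1 otherwise)
            if a == "Take":
                reward, done = (5, True) if robot == gem else (-1, True)
            else:
                robot += 1
                reward, done = -1, robot >= N_CELLS - 1
            total += reward
            s2 = (robot, gem)
            q.setdefault(s2, {b: 0.0 for b in ACTIONS})
            target = reward if done else reward + gamma * max(q[s2].values())
            q[s][a] += alpha * (target - q[s][a])
        history.append(total)
    return history

# Compare convergence speed and stability for a few learning rates
for alpha in (0.05, 0.2, 0.8):
    rewards = run_training(alpha)
    print(alpha, sum(rewards[-50:]) / 50)   # mean reward sum over the last 50 episodes
```

Typically, a large α learns quickly but fluctuates, while a small α is slower but smoother; plotting each history as a line graph makes this visible.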

To confirm this within the app, you need to display the learning progress as a line graph instead of the slider shown above. This is easy to achieve: I used ChartMaker, an extension created by Kate & Emily (reference [7]). The result is shown in Fig. 6.

Enjoy reinforcement learning (Q-Learning) with this smartphone app! 

Even deeper expansion
The above example is meant to familiarize you with the basic idea of Q-Learning. Here the number of states is so small that the entire contents of the Q-table can be stored and updated directly. However, consider a case where the robot's range of movement is not a one-dimensional corridor but a wide plane, or where obstacles block some of the paths. In such cases, the number of states becomes enormous and the problem can no longer be handled by basic (tabular) Q-Learning. The Q-table then needs to be approximated by another method, and one promising method is to use neural networks. The training itself would need to be done on a PC, but the trained model could be brought to a smartphone to run the test animation. I would like to discuss such advanced expansions in another article.
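As a small preview of that direction, here is a sketch of the idea: replace the table lookup Q(s, a) with a tiny neural network that maps a feature vector describing the state to one Q value per action, and adjust its weights with a semi-gradient version of the same update rule. Everything below (network size, feature encoding, names) is my own illustration and is not part of this app:

```python
import numpy as np

N_ACTIONS, N_FEATURES, N_HIDDEN = 2, 4, 16
rng = np.random.default_rng(0)

# One-hidden-layer network: state features -> one Q estimate per action.
W1 = rng.normal(0, 0.1, (N_HIDDEN, N_FEATURES)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, (N_ACTIONS, N_HIDDEN));  b2 = np.zeros(N_ACTIONS)

def q_values(x):
    h = np.maximum(0.0, W1 @ x + b1)          # ReLU hidden layer
    return W2 @ h + b2, h

def q_learning_step(x, a, reward, x_next, done, alpha=0.01, gamma=0.9):
    """Semi-gradient Q-Learning: nudge Q(x, a) toward r + gamma * max_a' Q(x_next, a')."""
    global W1, b1, W2, b2
    q, h = q_values(x)
    q_next, _ = q_values(x_next)
    target = reward if done else reward + gamma * np.max(q_next)
    delta = target - q[a]                     # TD error for the action actually taken
    # Gradient step on the squared error 0.5 * delta**2, through the chosen output only.
    dh = delta * W2[a] * (h > 0)
    W2[a] += alpha * delta * h
    b2[a] += alpha * delta
    W1 += alpha * np.outer(dh, x)
    b1 += alpha * dh

# Example call with made-up feature vectors for a state and its successor:
q_learning_step(np.array([1.0, 0.0, 0.5, 0.0]), a=0, reward=-1,
                x_next=np.array([0.0, 1.0, 0.5, 0.0]), done=False)
print(q_values(np.array([1.0, 0.0, 0.5, 0.0]))[0])   # the two Q estimates after one update
```

A full Deep Q-Network adds ideas such as experience replay and a target network on top of this basic update (see [2]).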

Acknowledgments
This app is my original work. However, I referred to the explanation of Q-Learning and the example Python program in reference [6]. I would like to thank Dr. Makoto Ito, the author of that article.

References
[1] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction, second edition", The MIT Press, 2018.
[2] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau, "An Introduction to Deep Reinforcement Learning", Now Publishers, 2019.
[3] Etsuji Nakai, "Reinforcement Learning for Software Engineers", Gijyutsu-Hyoron-Sha, 2020. (in Japanese)
[4] Azuma Ohuchi, Masahito Yamamoto, Hidenori Kawamura, "Theory and application of multi-agent systems - computing paradigm for complex systems engineering", Corona-sha, 2002. (in Japanese)
[5] Tomah Sogabe, "Introduction to reinforcement learning algorithm", Ohmsha, 2019. (in Japanese)
[6] Makoto Ito, "Learn Reinforcement Learning with Python", Nikkei Software 2021.07, Nikkei BP, 2021, pp.24-39. (in Japanese)
[7] Kate Manning and Emily Kager, ChartMaker extension, https://github.com/MillsCS215AppInventorProj/chartmaker

