
Running Q-Learning on your palm (Part 1)

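Summary (translated from Japanese): This app lets you learn the basic ideas and mechanisms of reinforcement learning (especially Q-Learning), a field of artificial intelligence technology, by running simple examples. Reinforcement learning is used in AlphaGo (Go), autonomous driving, autonomously controlled robots, and more, and is attracting attention. Here, before tackling such practical-level applications, you can get a feel for the essence of the technology using a familiar smartphone. Enjoy reinforcement learning on the smartphone in your palm! (This app was developed using MIT App Inventor.)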

This app won the "2021 October's MIT APP INVENTOR OF THE MONTH" award. 

This app explains the basic idea and mechanism of reinforcement learning (especially Q-Learning), a field of artificial intelligence, using simple examples. Reinforcement learning is used in AlphaGo, autonomous driving, autonomously controlled robots, and so on, and is attracting a lot of attention. Before tackling such practical applications, you can get familiar with the essence of the technology on your own smartphone. As an example, a robot is trained to walk straight down a corridor and pick up a gem placed along the way. For this purpose, I use the reinforcement learning method called Q-Learning. The results of training can be checked with an animation of the robot's movement. I developed a smartphone app that does all this using MIT App Inventor. Q-Learning runs on your palm!

# This application for Android is published below:
Source code: here

# This app can also be applied to robots with more complex behaviors than the examples below. In that case, you only need to add definitions for the new actions and rewards. For details, please see Part 2 here.

# For the case where the robot moves on a 2D grid, please see Part 3 here

Overview of the Q-Learning app
The configuration of the app is shown in Fig. 1. At the bottom you can see a robot and a green gem in the corridor. The goal is to train the robot to pick up the gem at the right place. (The robot's behavior is explained later.) To do that, first press the "init" button, then train with the "train" button. Each press of "train" runs 100 learning steps. One step corresponds to the robot moving one block along the corridor or picking up (taking) the gem.

Each action is rewarded at every step. Actions that lead to success receive a high reward. A series of actions by the robot, starting from the left end and ending in success or failure, is called an episode. Although the position of the gem changes with each play (game), the average sum of rewards over successful episodes can be calculated theoretically in this example. Repeat "train" until the average approaches this theoretical value. If the theoretical value cannot be calculated, stop iterating when the average appears to have converged to a sufficiently high value.

The data in the large yellow area (Q-table) on the screen indicates which action is preferable (worth it) depending on the situation along the way. As I will explain later, with repeated training, this Q-table will approach the correct values.

After training, press the "anim for test" button to watch the robot's behavior as an animation. A major feature of this kind of reinforcement learning is that the robot's behavior is determined by the contents of the Q-table. Therefore, if the Q-table is inaccurate, the robot often fails, as follows:

On the other hand, if the contents of the Q-table are correct, gem acquisition will always succeed, as shown below:

Challenges imposed on robots and possible actions
Let's look at this example in a little more detail. The actions that the robot can take, the conditions for obtaining the gem, and the conditions for completing the task (play) are shown in Fig. 2. Figure 3, in turn, illustrates whether "Take" (picking up the gem) or "Forward" (moving to the right) leads to success in a given situation. To get the gem, the robot must perform a "Take" action in the same place as the gem. In this example, the robot should clearly select "Forward" (2) rather than "Take" (1). After that, selecting "Take" in (3) succeeds.

It is easy for humans to write a program to solve this problem (challenge). But here, instead, let the robot itself discover the solution through learning. Please note this point!

How to choose an action
As mentioned above, I introduced rewards to decide between the "Take" action and the "Forward" action in a particular situation. As shown in Figure 4, the reward is +5 for a "Take" action that succeeds in acquiring the gem, and -1 for every other action. It is therefore advantageous to choose the action that leads to the larger sum of rewards.
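The reward scheme can be made concrete with a few lines of Python. This is only a sketch of the arithmetic, not the app's actual blocks: the function names, the gem sitting in cell 2, the robot starting in cell 0, and the episode ending as soon as "Take" is performed are all assumptions made for illustration.

```python
def reward(action, robot_pos, gem_pos):
    """+5 for a Take performed on the gem's cell, -1 for every other action."""
    return 5 if action == "Take" and robot_pos == gem_pos else -1


def episode_return(actions, gem_pos, start_pos=0):
    """Sum the rewards of a list of actions (assumed to end at the first Take)."""
    pos, total = start_pos, 0
    for action in actions:
        total += reward(action, pos, gem_pos)
        if action == "Take":
            break          # assumption: a Take always finishes the episode
        pos += 1           # Forward moves the robot one cell to the right
    return total


# Hypothetical layout: gem in cell 2, robot starts in cell 0.
patient = episode_return(["Forward", "Forward", "Take"], gem_pos=2)
hasty = episode_return(["Take"], gem_pos=2)
print(patient, hasty)
```

Under these assumptions, walking to the gem first costs two steps of -1 but still ends with a larger sum than taking immediately, which is the comparison the robot has to learn.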

Update Q-table
To achieve this, I use a Q-table that represents the value (in other words, the worth) of both actions in the current state, as shown in Figure 5. In the figure, in the untrained state (train = 0 steps), the action value of "Take" is higher than that of "Forward", so the "Take" action is selected. This choice is wrong, because the Q-table is still inaccurate. In the fully trained state (train = 800 steps), on the other hand, the Q-table has been updated to reasonable values, and the "Forward" action is correctly taken. Roughly speaking, each Q-table entry is an estimate of the sum of rewards that follows from taking that action.
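Choosing an action from a Q-table row amounts to picking the larger of two numbers. The sketch below uses two rows quoted elsewhere on this page in the order [Take, Forward]: [-0.71757, -1.01168] from an insufficiently trained table and [5.0, -1.9] from a trained one. The function and variable names are hypothetical.

```python
ACTIONS = ("Take", "Forward")  # same order as the quoted Q-table rows


def greedy_action(q_row):
    """Return the name of the action with the larger estimated value."""
    best_index = max(range(len(ACTIONS)), key=lambda i: q_row[i])
    return ACTIONS[best_index]


inaccurate_row = [-0.71757, -1.01168]  # robot is NOT yet at the gem
trained_row = [5.0, -1.9]              # robot is standing on the gem

print(greedy_action(inaccurate_row))
print(greedy_action(trained_row))
```

Both rows select "Take", but only the second choice is right: in the first case the robot is not at the gem, so the selection rule is working as designed while the table it reads from is still wrong.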

The Q-table update is based on a famous learning rule called Q-Learning as shown below:
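In the standard textbook notation (see reference [1]), the rule can be written as below. Here s and a are the current state and action, r is the reward received, s' is the next state, and α and γ are the learning rate and discount rate, two of the app's hyperparameters. This is the general form of Q-Learning; the exact symbols used in the app's own figure may differ.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```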

This update formula means that the value of Q for the action taken in the current state is moved toward the reward received plus the (discounted) highest value of Q available in the next state. Such updates are known to converge to the optimal Q values, provided that every state and action is tried a sufficient number of times. For a more detailed explanation, please see one of the references ([1][2][3][4][5]). The relationship between this value of Q and the Q-table in this application is illustrated below:

In fact, the figure below shows that the contents of the Q-table are almost optimal after 1100 steps of training.
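For readers who want to watch a Q-table converge outside the app, here is a small stand-alone Python simulation in the same spirit. It is a sketch under stated assumptions, not a port of the app: the corridor length of 5, the state defined as (robot position, gem position), the values α = 0.2 and γ = 0.9, the rule that an episode ends on any "Take" or when the robot walks past the right end, and all function names are my own choices. Training uses ε = 1.0 (purely random actions) to keep the sketch robust; Q-Learning still learns the values of the greedy policy from such data.

```python
import random

TAKE, FORWARD = 0, 1        # column order of a Q-table row (assumed)
CORRIDOR_LENGTH = 5         # hypothetical corridor size
ALPHA, GAMMA = 0.2, 0.9     # hypothetical learning rate and discount rate


def step(robot, gem, action):
    """Apply one action; return (next_robot, reward, done, success)."""
    if action == TAKE:
        success = robot == gem
        return robot, (5 if success else -1), True, success
    robot += 1
    # Walking past the right end finishes the episode as a failure.
    return robot, -1, robot >= CORRIDOR_LENGTH, False


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: random with probability epsilon, otherwise the best action."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    return TAKE if q_row[TAKE] >= q_row[FORWARD] else FORWARD


def train(episodes, epsilon, seed=0):
    rng = random.Random(seed)
    # One row [Q(Take), Q(Forward)] per (robot position, gem position) state.
    q = {(r, g): [0.0, 0.0]
         for r in range(CORRIDOR_LENGTH) for g in range(CORRIDOR_LENGTH)}
    for _ in range(episodes):
        robot, gem, done = 0, rng.randrange(CORRIDOR_LENGTH), False
        while not done:
            action = choose_action(q[(robot, gem)], epsilon, rng)
            nxt, reward, done, _ = step(robot, gem, action)
            best_next = 0.0 if done else max(q[(nxt, gem)])
            target = reward + GAMMA * best_next          # Q-Learning target
            q[(robot, gem)][action] += ALPHA * (target - q[(robot, gem)][action])
            robot = nxt
    return q


def run_greedy(q, gem):
    """Play one episode using only the trained table; True if the gem is taken."""
    rng = random.Random(0)  # not consulted when epsilon is 0
    robot, done, success = 0, False, False
    while not done:
        action = choose_action(q[(robot, gem)], 0.0, rng)
        robot, _, done, success = step(robot, gem, action)
    return success


q_table = train(episodes=20000, epsilon=1.0)
results = [run_greedy(q_table, gem) for gem in range(CORRIDOR_LENGTH)]
print(results)
print(q_table[(0, 0)])
```

With these settings the greedy test run should pick up the gem wherever it is placed. The row for a state in which the robot stands on the gem should drift toward roughly [5.0, -1.9], the same pair of values quoted for the trained app elsewhere on this page.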

Expand the app a little
Finally, here is an example of how to extend this app. The app has three hyperparameters (α, γ, ε) to make it more versatile. For example, how much are the convergence speed and stability affected when the learning rate α is changed?

To check this, the training progress needs to be displayed as a line graph rather than with the slider shown above. This is easy to achieve. I used ChartMaker (an extension created by Kate & Emily), listed as reference [7]. The result is shown in Fig. 6.
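A rough feel for the role of α can also be had without the app. The sketch below is a deliberately simplified setting, not the app's experiment: a single Q value is repeatedly updated toward a fixed target of 5 (the reward for a successful "Take"), and the number of updates needed to get within 0.01 of the target is counted. The function name and the tolerance are assumptions.

```python
def updates_until_close(alpha, target=5.0, tolerance=0.01):
    """Count updates Q <- Q + alpha * (target - Q) until Q is within tolerance."""
    q, count = 0.0, 0
    while abs(target - q) >= tolerance:
        q += alpha * (target - q)   # move a fraction alpha of the remaining gap
        count += 1
    return count


steps = {alpha: updates_until_close(alpha) for alpha in (0.1, 0.5, 0.9)}
print(steps)
```

A larger α closes the gap in fewer updates. In real training the target itself changes from step to step, so a large α also makes Q follow every fluctuation more closely, which is one reason a line graph of the rewards is useful for comparing settings.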

Enjoy reinforcement learning (Q-Learning) with this smartphone app! 

 Even deeper expansion
The example above is meant to familiarize you with the basic idea of Q-Learning. In this example, the number of states is so small that the entire Q-table can be stored and updated. However, consider a case where the robot moves not along a one-dimensional corridor but across a wide plane, or where obstacles block some places. In such cases, the number of states becomes enormous and a plain Q-table is no longer practical. The Q-table then needs to be approximated by another method, and one promising method is to use neural networks. The training itself will need to be done on a PC, but it will be possible to bring the trained model to a smartphone and run the test animation there. I would like to discuss such advanced extensions in another article.

This app is my original work. However, I referred to the explanation of Q-Learning and the example Python program in reference [6]. I would like to thank Dr. Makoto Ito, the author of that article.

[1] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction, second edition", The MIT Press, 2018.
[2] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau, "An Introduction to Deep Reinforcement Learning", Now Publishers, 2019.
[3] Etsuji Nakai, "Reinforcement Learning for Software Engineers", Gijyutsu-Hyoron-Sha, 2020. (in Japanese)
[4] Azuma Ohuchi, Masahito Yamamoto, Hidenori Kawamura, "Theory and application of multi-agent systems - computing paradigm for complex systems engineering", Corona-sha, 2002. (in Japanese)
[5] Tomah Sogabe, "Introduction to reinforcement learning algorithm", Ohmsha, 2019. (in Japanese)
[6] Makoto Ito, “Learn Reinforcement Learning with Python”, Nikkei Software 2021.07, Nikkei BP, 2021, pp.24-39 (in Japanese)
[7] Kate Manning and Emily Kager,



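【what is this】A recent issue of Nikkei Software carried an explanatory article, "Learn Reinforcement Learning with Python". It was carefully written and easy to follow, and I was able to run the provided Python program in full. Rather than stopping there, however, I rebuilt it on my own on a different platform (an Android smartphone) and in a different programming environment (MIT App Inventor) to deepen my understanding.

■ Explanatory article: Learning "reinforcement learning" with Python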



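■ Building the above reinforcement learning program independently as a smartphone app
 In fact, I have hardly read the contents of the Python code mentioned above. Nevertheless, I was able to create a smartphone app with equivalent functionality, as shown in Fig. 3. This is simply because the explanation in the article was excellent. I developed the app using MIT App Inventor.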

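Here the Q-table entry [action value of taking the gem, action value of moving forward to the right] is
= [-0.71757, -1.01168]
and the robot, although not yet at the gem's position, performed the "take the gem" action because it was rated as having the higher action value (that is, -0.71 > -1.01).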

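Following the Q-table entry [action value of taking the gem, action value of moving forward to the right]
= [5.0, -1.9]
the robot finally performs the "take the gem" action with confidence (that is, 5.0 > -1.9), and it succeeds.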

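■ The significance of "reinforcement learning" as a smartphone app
 Python programming is, of course, a fine choice. But if you are developing a smartphone app, MIT App Inventor lets you do it very efficiently. The article above uses Python dictionary variables and numpy arrays to implement the Q-table, and the same can be done in App Inventor. Also, although it is not as feature-rich as Python's matplotlib, App Inventor can draw a line graph (the history of rewards), as shown in Fig. 3.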



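■ Complexity of the MIT App Inventor program
 As mentioned above, I created in MIT App Inventor a program roughly equivalent to about 970 lines of Python. I cannot describe in detail here how complex it is, but I show the whole program (block diagram) for reference. Please forgive the fact that it is too small for the contents to be visible.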




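U18 IT Dream Contest (hosted by Kanagawa Institute of Technology)
MIT App Inventor Summer Appathon (hosted by MIT, USA)

(1) U18 IT Dream Contest (hosted by Kanagawa Institute of Technology)

(2) MIT App Inventor Summer Appathon (hosted by MIT, USA)
 This contest requires entrants to submit an idea on one of three themes (City of the Future, Improving Academics, Community Computational Action), with a written description and a video, together with a smartphone app that implements it. Submissions have already closed. Voting for the "People's Choice Award", which is open to anyone, is currently under way. Let's find out what ideas the entrants have and how they realized them, and cheer them on! (Awards chosen by the judges will also be announced later.)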



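【what is this】This is a sequel to the previous post, "Reinforcement learning with Scratch (2)". I examine the Level 3 example from the book by Makoto Ito [1]. The previous Level 2 example dealt with "episodes" running from a start state to a terminal state, whereas this example is a game with no terminal state in which you compete for points (rewards). One key point is the introduction of a "discount rate" in the calculation of the predicted reward (= action value Q).

■ Level 3 example: Ghost flight-training game

 The action of pressing button n moves the current state to state n. For example, if the current state (from) is state 1 and you press button 2, the state moves to state 2 as the next state (to). At that time, you receive the value of the reward table entry (state 1, state 2), namely 1, as the reward. All diagonal elements of the reward table are set to -1, so pressing the same button twice in a row always yields a reward of -1.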


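■ Strategy of the "reinforcement learning" player


Observing the calculation of the action value Q(state, action)
 Let us observe the results of calculating Q(state, action) with the discounted learning rule described above. As mentioned above, Q(state, action) can be regarded as Q(from-state, to-state). The rewards are assumed to be those of the reward table shown in Fig. 1.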


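  • In state 1, the maximum Q is for button 2, so the state moves to state 2.
  • Next, in state 2, the maximum Q is for button 4, so the state moves to state 4.
  • Then, in state 4, the maximum Q is for button 1, so the state moves to state 1.
Button 1 → Button 2 → Button 4 → Button 1 → . . . 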
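How a discounted Q-table produces such a cycle can be sketched in Python. Everything numeric here is hypothetical except two facts taken from the text: the reward for (state 1, state 2) is 1 and the diagonal is -1. The rest of the 4-state reward table, the discount rate of 0.9, and the function names are invented for illustration, and Q is computed by sweeping the discounted backup over every entry rather than by the sampled updates used during play.

```python
GAMMA = 0.9  # hypothetical discount rate

# Hypothetical reward table R[from][to]; index 0 is unused so states run 1..4.
# Only R[1][2] = 1 and the -1 diagonal come from the text.
R = [
    None,
    [None, -1, 1, 0, 0],
    [None, 0, -1, 0, 2],
    [None, 0, 0, -1, 0],
    [None, 3, 0, 0, -1],
]
STATES = range(1, 5)


def solve_q(sweeps=200):
    """Repeat Q(from, to) = R[from][to] + GAMMA * max Q(to, .) until it settles."""
    q = {(s, b): 0.0 for s in STATES for b in STATES}
    for _ in range(sweeps):
        q = {(s, b): R[s][b] + GAMMA * max(q[(b, nb)] for nb in STATES)
             for s in STATES for b in STATES}
    return q


def greedy_path(q, start, length):
    """Follow the button with the largest Q from each state."""
    path, state = [start], start
    for _ in range(length):
        state = max(STATES, key=lambda b: q[(state, b)])
        path.append(state)
    return path


q = solve_q()
print(greedy_path(q, start=1, length=3))
```

With this particular table the greedy path from state 1 is 1, 2, 4, 1, the same cycle as above. The discount rate is what keeps the Q values finite even though the game never ends.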

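■ Impressions
[1] Makoto Ito, "Let's Learn AI with Scratch: Experiencing Reinforcement Learning through Game Programming", Nikkei BP, 1st edition, August 11, 2020. (in Japanese)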


A scene with a slide rule

Sometimes I write nostalgic articles about the old days. The following is from July 2014, seven years ago, when I went to the United States to present my paper at the MIT App Inventor Summit 2014.

Perhaps today's students have never heard of or touched a "slide rule", but it was once widely used in the world of technical computing. For example, it was actually used in some US national projects, so it's no wonder that some traces of it remain in the United States. The picture below shows the elevator entrance of the hotel near MIT (Massachusetts Institute of Technology) where I was staying. I was surprised to find a huge slide rule hung above the elevator door.

A huge slide rule at the entrance of the elevator

Enlarged view

The slide rule was discontinued 40 years ago with the advent of electronic calculators, but I still own one. It was probably produced over 50 years ago. The photo below shows the calculation of "2 x Pi = 6.28" using this slide rule. In addition to multiplication and division, the slide rule can also be used to calculate trigonometric functions, logarithms, square roots, cube roots, etc. by exchanging the middle part of the rule.

"2 x Pi = 6.28" on my slide rule

Most of the slide rules made in Japan, including the one shown in the figure above, were made of bamboo. As a result, their accuracy deteriorated little with changes in temperature and humidity, and they were highly regarded worldwide. Times have moved on. In the new era the physical slide rule has disappeared, but as you can see in the picture below, there is "slide rule software" for Android smartphones! It is useful for understanding the principle of a slide rule.

"2 x Pi = 6.28" in the slide rule app



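 【what is this】This is a sequel to the previous post, "Reinforcement learning with Scratch". I examine the Level 2 example from the book by Makoto Ito [1]. The previous Level 1 example involved only "actions and rewards", but this time it includes a full-fledged reinforcement learning mechanism that handles "states, actions, and rewards". I observe how the predicted reward (= action value Q) for the actions in each state is calculated.

■ Level 2 example: Collecting diamonds on the Moon game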


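■ Strategy of the "reinforcement learning" player


Observing the calculation of Q(state, action)
 Below, I observe how the calculation of Q(state, action) proceeds. I start from the bottom part of Fig. 1. First, Fig. 2a shows how the action value Q in state 2 changes as the episodes proceed. At the 50-episode mark, Q(state 2, left) = 0.25 and Q(state 2, right) = 0.86. The actual reward probabilities in state 2 are left = 0.20 and right = 0.90, so the calculated Q values are quite close to the correct ones. It makes you feel: wonderful, this really works!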
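The way Q approaches the reward probability can be reproduced with a short Python sketch. It assumes that each action in state 2 pays a reward of 1 with the stated probability and 0 otherwise, and that Q is updated as a running average with step size 1/n. The book's actual update rule is not shown here, so this rule, the number of trials, and the function name are assumptions for illustration only.

```python
import random


def estimate_q(reward_probability, trials, rng):
    """Running average of 0/1 rewards: Q <- Q + (r - Q) / n."""
    q = 0.0
    for n in range(1, trials + 1):
        r = 1.0 if rng.random() < reward_probability else 0.0
        q += (r - q) / n    # nudge Q toward the latest reward
    return q


rng = random.Random(0)
q_left = estimate_q(0.20, trials=2000, rng=rng)    # true probability 0.20
q_right = estimate_q(0.90, trials=2000, rng=rng)   # true probability 0.90
print(round(q_left, 2), round(q_right, 2))
```

With only 50 trials the estimates wander noticeably, much like the 0.25 and 0.86 reported above; with more trials they settle close to 0.20 and 0.90.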






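■ Impressions


[1] Makoto Ito, "Let's Learn AI with Scratch: Experiencing Reinforcement Learning through Game Programming", Nikkei BP, 1st edition, August 11, 2020. (in Japanese)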