sparse-dense by FoYo: Q-Learning


Friday, November 5, 2021

Neural Networks on a Smartphone (with ml5.js / TensorFlow.js), Part 2

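[What is this] I am continuing my focus on running neural networks on a smartphone. In a previous article [1], I examined and implemented a basic method for performing Q-Learning (reinforcement learning) with a neural network. This time I learned of a more advanced technique, "Experience Replay + Target Networks", built it with TensorFlow.js (in a JavaScript environment), and trained it on a PC until it produced reasonable Q-values. I then stored the trained model on a smartphone and, as a smartphone app, solved a 2D-grid path-finding task using Q-Learning.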

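■ A Q-Learning smartphone app using Experience Replay + Target Networks
I will show the results of Q-Learning with this method first. The task is the 2D-grid path-finding problem covered in reference [1]. As shown in Fig. 3, the robot is trained to reach the gem (the green sphere) by the shortest route while avoiding walls and obstacles. Fig. 3 shows the learned result running in the smartphone app. In this example, you can see that the robot has learned a route that avoids the obstacle (the black square) and reaches the gem.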

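■ Overview of Experience Replay + Target Networks
Reinforcement learning is generally said not to be supervised learning. In reinforcement learning it is difficult to set fixed, accurate labels (targets) in the first place; after all, those labels are the very solution being sought. There is, however, a way to have a neural network perform "supervised learning" against labels that change from moment to moment. That is the method I examined in reference [1].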

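This method has two drawbacks, however. First, it can learn only one pair at a time: "a certain action taken in a certain state". In other words, it cannot use mini-batch processing (processing many inputs at once), which a neural network needs in order to show its real performance. The second is a more fundamental problem: the learning becomes strongly tied to particular "state and action" pairs, so training over diverse inputs (states) is likely to converge poorly or to oscillate.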

"Experience Replay + Target Networks" was introduced to solve these problems. I learned the outline by reading references [2] and [3]. The pseudocode shown in Fig. 1 is quoted from reference [2]. Following it, I was able to build my own prototype in JavaScript (using TensorFlow.js).

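Roughly speaking, in Fig. 1, Experience Replay accumulates "observations about states and actions" in the replay memory D so that supervised learning with mini-batches becomes possible. The Target Network, in turn, is a neural network for prediction that is prepared separately from the network being trained, in order to remove the "coupling between state and action" described above. In other words, two separate networks are used. The Target Network is periodically overwritten with the weights of the training network and thus refreshed. This creates a time lag between the training network and the prediction network, and that lag is thought to stabilize the estimation of the Q-values being sought.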
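The replay memory D and the periodic refresh of the Target Network can be sketched in plain JavaScript as follows. This is a minimal sketch, not the implementation used in this article: the names ReplayMemory, push, sample, and shouldSyncTarget, the fixed capacity, and sampling with replacement are all assumptions made for illustration.

```javascript
// Minimal replay memory sketch (names are illustrative, not from the article).
class ReplayMemory {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.next = 0; // index that will be overwritten once the memory is full
  }

  // Store one observed transition, e.g. {state, action, reward, nextState, done}.
  push(transition) {
    if (this.items.length < this.capacity) {
      this.items.push(transition);
    } else {
      this.items[this.next] = transition; // overwrite the oldest entry
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Draw a mini-batch uniformly at random (with replacement, for simplicity).
  sample(batchSize, rng = Math.random) {
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      batch.push(this.items[Math.floor(rng() * this.items.length)]);
    }
    return batch;
  }
}

// The target network is refreshed only every syncEvery training steps.
function shouldSyncTarget(step, syncEvery) {
  return step > 0 && step % syncEvery === 0;
}
```

Sampling from the whole memory, rather than training on the latest transition only, is what makes mini-batch training possible.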

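■ Why I used TensorFlow.js (JavaScript) rather than TensorFlow (Python)
To minimize the difference (error) between the output values and the labels in this method, it is convenient to use TensorFlow's training function (fit). Doing this in Python would of course be fine, but this time I used TensorFlow.js on JavaScript because it fits well with smartphone apps.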

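The trained model, however, must be placed on a web server on the smartphone (or on an external one). Several web servers for smartphones are publicly available and easy to use; Fig. 2 shows one example. In this example, the trained model's two files (a .json file holding the network topology and so on, and a .bin file holding the weights) are placed on the smartphone's web server. The smartphone app can then load this model (from a JavaScript program) and use it for prediction.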

■ Points to note when implementing Experience Replay + Target Networks in TensorFlow.js

(1) TensorFlow.js functions are asynchronous

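TensorFlow.js functions such as fit are asynchronous, so they must be used inside an asynchronous function (one declared with async). When calling the training function fit and similar functions, you need to use await to wait for them to finish. Otherwise other parts of the code may run at unexpected moments, so take care.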

(2) How to save and load a model

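As described above, the weights of the training network train_net must be loaded into the Target Network periodically. TensorFlow.js provides a very convenient save/load mechanism for this: as shown below, you can use the localstorage scheme, which saves the model in the browser's local storage.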

// Save the train_net model in order to update the target_net model

await train_net.save('localstorage://2dgrid_model');

// Load the train_net model into target_net

target_net = await tf.loadLayersModel('localstorage://2dgrid_model');

Separately, if you want to take the trained train_net model out for use elsewhere, you can save and load it with the downloads scheme. An example follows.

// Download the trained model

await train_net.save('downloads://2dgrid_model');

// Place the trained model in the model folder and load it from there

train_net = await tf.loadLayersModel('./model/2dgrid_model.json');

If the loaded model is only used for prediction, this is sufficient. To train it further, however, you need to compile it again with the same loss and optimizer settings, as follows.

train_net.compile({loss: 'meanSquaredError', optimizer: 'sgd'});

(3) Format of the input data passed to the training function fit

In TensorFlow for Python, the training data and label data passed to fit are typically NumPy arrays, whereas in TensorFlow.js ordinary arrays are converted with tensor2d before being passed.
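As a hedged illustration of what those ordinary arrays might look like for one mini-batch, the sketch below builds the input rows and label rows in plain JavaScript. Everything here is an assumption for illustration: the function name buildTrainingArrays, the transition fields, the encoding of a state as three numbers, and the stand-in functions qTrain and qTarget, which take the place of predictions by the training network and the Target Network.

```javascript
// Build the arrays for one fit() call from a replay mini-batch.
// qTrain and qTarget are hypothetical stand-ins for predictions by the
// training network and the target network; each maps a state to 4 Q-values.
function buildTrainingArrays(batch, qTrain, qTarget, gamma) {
  const xs = []; // shape [batchSize, 3]: one state per row (assumed encoding)
  const ys = []; // shape [batchSize, 4]: one label vector per row
  for (const { state, action, reward, nextState, done } of batch) {
    // Start from the training network's own output so that only the
    // entry for the action actually taken contributes to the error.
    const label = qTrain(state).slice();
    const maxNext = Math.max(...qTarget(nextState));
    label[action] = done ? reward : reward + gamma * maxNext;
    xs.push(state);
    ys.push(label);
  }
  return { xs, ys };
}
```

With TensorFlow.js loaded, the two nested arrays would then be wrapped as tf.tensor2d(xs) and tf.tensor2d(ys) before being handed to fit.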

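[Note] The functions used in (1) and (2) above (fit, save, and tf.loadLayersModel) belong to the Layers API, the higher-level API of TensorFlow.js, and return Promises, which is why they must be awaited. Some other calls, such as predict, run synchronously and can be used without await.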

■ Neural network configuration and training performance

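The two neural networks described above (one for training, one for prediction) were configured as follows. I settled on this after trial and error. I found that two hidden layers work better than one, and that a somewhat larger number of nodes in the first hidden layer is better (128 here). The activation function is relu in the hidden layers and linear in the output layer.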

_______________________________________________________________
Layer (type)            Activation   Output shape   Param #
===============================================================
dense_Dense1 (Dense)    relu         [null, 128]    512
_______________________________________________________________
dense_Dense2 (Dense)    relu         [null, 32]     4128
_______________________________________________________________
dense_Dense3 (Dense)    linear       [null, 4]      132
===============================================================
Total params: 4772
Trainable params: 4772
Non-trainable params: 0
_______________________________________________________________
(I also tried inserting a dropout layer after each hidden layer, but saw no particular benefit for this problem.)
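The parameter counts in the summary can be checked with a few lines of JavaScript. This is only a sketch: the input width of 3 is not stated in the article and is inferred here from the 512 parameters of the first layer, and the helper name denseParams is made up for illustration.

```javascript
// Parameters of a fully connected layer: one weight per input-output pair
// plus one bias per output unit.
function denseParams(inputs, outputs) {
  return (inputs + 1) * outputs;
}

// Layer widths: 3 inputs (an inference from the 512 parameters of the first
// layer, not stated in the article), then 128, 32 and 4 units.
const widths = [3, 128, 32, 4];
const perLayer = [];
for (let i = 1; i < widths.length; i++) {
  perLayer.push(denseParams(widths[i - 1], widths[i]));
}
const total = perLayer.reduce((a, b) => a + b, 0);

console.log(perLayer); // [ 512, 4128, 132 ]
console.log(total);    // 4772
```

An input width of 3 would be consistent with a state made of three positions (gem, obstacle, robot), and the 4 outputs with one Q-value per move direction.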

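For the 4x4 grid shown in Fig. 3, I trained by repeatedly using the trained model as the initial model for the next round of training (reducing the ε of ε-greedy by a fixed ratio each time). Using the result, I generated 100 random states (gem position, obstacle position, robot position) and tested whether the robot could reach the gem by the shortest route. The success rates were as follows.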

- Run 1: 0.97

- Run 2: 0.95

- Run 3: 0.97
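The ε-greedy selection and the fixed-ratio reduction of ε mentioned above can be sketched as follows. The function names, the optional lower bound minEpsilon, and any concrete ratio are assumptions for illustration; the article does not state the values it used.

```javascript
// Epsilon-greedy action selection over an array of Q-values.
// rng is injectable so that the behavior can be checked deterministically.
function chooseAction(qValues, epsilon, rng = Math.random) {
  if (rng() < epsilon) {
    return Math.floor(rng() * qValues.length); // explore: random action
  }
  return qValues.indexOf(Math.max(...qValues)); // exploit: best known action
}

// Multiply epsilon by a fixed ratio after each training round,
// optionally keeping it above a lower bound.
function decayEpsilon(epsilon, ratio, minEpsilon = 0) {
  return Math.max(minEpsilon, epsilon * ratio);
}
```

With a large ε the agent mostly explores; as ε shrinks from round to round it relies more and more on the Q-values learned so far.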

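■ Thoughts
This technique carries out an "approximate solution" using "inaccurate labels", which at first glance hardly inspires confidence. Yet when you actually run it, the labels gradually approach accurate values and the network comes to give correct answers, which feels rather mysterious.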

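As shown above, the results of training with this method were quite satisfactory. The conventional method, which learns while building a Q-table based on the Bellman equation, reached the exact solution almost without fail. In comparison, the neural-network approach with Experience Replay + Target Networks also achieved a success rate of about 95%. As the problem size grows further, Q-table learning clearly breaks down, so I expect this method to become all the more useful.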

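I built this implementation starting almost entirely from the information in Fig. 1 (without looking at the Python code in references [2] and [3]), working out the details myself and realizing it with TensorFlow.js. I was left with some worry that something might be wrong, but since it achieved a success rate of about 95% as described above, it is probably a reasonable implementation.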

References

[1] Running Q-Learning on your palm (Part3)

http://sparse-dense.blogspot.com/2021/09/running-q-learning-on-your-palm-part3.html

[2] Jordi TORRES.AI, Deep Q-Network (DQN)-II

Experience Replay and Target Networks, Aug.16, 2020

https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c

[3] Ketan Doshi, Reinforcement Learning Explained Visually (Part 5): Deep Q Networks, step-by-step

A Gentle Guide to DQNs with Experience Replay, in Plain English, Dec 20, 2020

https://towardsdatascience.com/reinforcement-learning-explained-visually-part-5-deep-q-networks-step-by-step-5a5317197f4b

Saturday, September 25, 2021

Running Q-Learning on your palm (Part3)

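Summary: In Part 1 and Part 2 of this series, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem in a straight corridor). This time I revised the app so that the robot can act on a two-dimensional grid. I then confirmed that, as the grid size grows, the existing approach of holding a Q-table and updating its Q-values breaks down because memory use and processing load rise sharply. As a promising alternative, I considered using a neural network, and I show the results of implementing it in Python on a PC (not on a smartphone).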

Abstract
In Part 1 and Part 2 of this series, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem in a straight corridor). This time, I revised this app so that the robot can act on the 2D grid. However, as the grid size increases, the traditional method of holding the Q-table and updating the Q value becomes difficult due to the increase in memory capacity and processing volume. As a promising alternative, I considered using a neural network. And here's the result of doing that with Python on a PC (not a smartphone).

● Revised version of the Q-Learning app
In the revised version of the smartphone app, as shown in Fig. 1(a), the robot is trained to reach the gem while avoiding barriers on the 4x4 grid. The Q-Learning algorithm is basically the same as last time. There is one gem and one barrier, and their positions change randomly for each play (every episode). The robot also starts from a random position. After sufficient training, the robot can always reach the gem in the shortest route. On the other hand, if not well trained, the robot often gets lost and hits a wall, as shown in Fig. 1 (b).

● Memory capacity required for Q-Learning
The size of the Q-table required for this learning can be calculated according to the grid size and the number of gems and barriers. See Fig.2. The number of Q-table entries (i.e., the number of keys) is the total number of possible states, which in Case 1 (4x4) is 3,360. At this level, it can be held sufficiently even on a smartphone, and the amount of calculation is within an acceptable range. However, in Case2 (5x5), the total number of states increases sharply to over 6,000,000, even though only one gem and one barrier have been added. In this situation, regardless of whether it is a smartphone or a PC, processing is almost impossible due to both the amount of memory and the amount of calculation.
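The growth in the number of states can be reproduced with a short calculation. This is a sketch under stated assumptions: the robot, gems, and barriers are treated as distinguishable objects that each occupy a different cell, and Case 2 is taken to hold five objects (one robot, two gems, two barriers). The function name countStates is made up for illustration, and Fig. 2 may count the states differently.

```javascript
// Number of ways to place `objects` distinguishable objects on distinct cells
// of a grid with `cells` cells: cells * (cells - 1) * ... (a falling factorial).
function countStates(cells, objects) {
  let n = 1;
  for (let i = 0; i < objects; i++) {
    n *= cells - i;
  }
  return n;
}

console.log(countStates(16, 3)); // 3360    (4x4: robot, one gem, one barrier)
console.log(countStates(25, 5)); // 6375600 (5x5: robot, two gems, two barriers)
```

Under these assumptions the first value matches the 3,360 states quoted for Case 1, and the second is consistent with "over 6,000,000" for Case 2.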

● Calculate Q-values with neural network (without holding Q-table)
For cases like Case2 above, you can think of a way to calculate the required Q-value with a neural network without holding the Q-table. To do this, transform the Q-value update formula for the neural network, as shown in Fig.3. This makes it possible to compare output and target ([1]). It can be used to solve this problem with common supervised machine learning. This machine learning iteration allows the output to be closer to the target and, as a result, the Q-value to be closer to the exact value.

Fig.4 clearly shows how to use this neural network in the case of Case1. Note that in this example, the action "W (west)" is taken in the current state S. In this way, one learning is done only for one action in one state. This learning should be repeated for as many actions as possible, in as many states as possible.

● Calculation example of Q-value by neural network
I implemented a learning method using such a neural network in Python and executed it on a PC. This program is based on the Python program (using Tensorflow / Keras) published by Dr. Makoto Ito in reference [1]. Fig.5 shows the learning process for Case1 (4x4), trained on 10,000 randomly generated episodes. In the upper graph, the average sum of rewards per episode has reached about 0.8. When the neural network is not used, the value is 0.8305, as shown in the figure on the right of Fig. 1(a) (although the characters are small and difficult to see), so the two results are almost the same. The lower graph shows that the average number of steps the robot takes to reach the gem is about 2.9. This value is also plausible given the situation in Fig.1.

I have omitted the details, but in the case of Case2 (5x5), I was able to train well with this neural network as well. It took only about 3 minutes to run on a general PC, so I was able to confirm the usefulness of this method. This time I've only used the most basic neural networks, but for more complex problems (for example, if you need to remember the location of an object), you may need other neural networks such as LSTMs.

Acknowledgments
I was able to create a Q-value calculation program using a neural network by referring to the Python program published in reference [1]. I would like to thank Dr. Makoto Ito, its author.

References
[1] Makoto Ito's Blog Article: M-note Program and Electronic Work (in Japanese)
http://itoshi.main.jp/tech/

Sunday, September 12, 2021

Running Q-Learning on your palm (Part2)

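Summary: In the previous article, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem). This time I gave the robot's actions some variations. Even so, apart from adding descriptions of the new actions, I hardly changed the app. In these examples too, as a result of Q-Learning the robot discovered by itself the optimal procedure for getting the gem.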

Abstract
In the previous article, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem). This time, I gave some variations to the behavior of the robot. Even so, I haven't changed much of this smartphone app, except to add a new behavioral description. In this example as well, as a result of learning by Q-Learning, the robot was able to discover the optimal procedure for obtaining the gem.

# For the case where the robot moves on the 2D grid, please see this revised version.

● New examples (two cases)
As before, the task is for the robot to move along the corridor and get the gem, as shown in the figure below. The actions available to the robot differ from last time, but the goal of learning the best steps to acquire the gem is the same.

Consider the following two cases regarding robot behavior and its rewards. In both cases, an episode ends when a "Take" is performed (regardless of success or failure) or the robot deviates from the corridor boundary.

Case1:

Take: Take the gem (reward = +5 if successful, otherwise -1)
Forward: Move forward one block in the corridor (reward = -1)
Jump: Move forward two blocks in the corridor (reward = -1)

Case2:

Take: Take the gem (reward = +5 if successful, otherwise -1)
Back: Go back one block in the corridor (reward = -1)
Skip2: Skip two blocks in the corridor (reward = -1)

● Learning results in Case1 and the robot moving example
As a result of fully executing Q-Learning for Case1, we obtained a highly accurate Q-table. Using it, the robot was able to discover the optimal procedure for obtaining the gem, as shown in Fig.1. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . . G . .". The corresponding maximum Q-table value is shared by "Forward" and "Jump" (both values are 3.0). Either choice leads to the same total reward, but here "Jump" was taken. At each subsequent state the action that maximizes the Q-table value was also taken, so the gem was successfully acquired. This is an optimal procedure.

● Learning results in Case2 and the robot moving example
The robot's possible actions in Case2 are different from those in Case1, but the robot was similarly able to discover a procedure for obtaining the gem. The situation is shown in Fig.2. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . G . . .". The optimal procedure combines "Skip2" and "Back".

Here's another slightly more complicated example in Case2. See the Gif animation below. In this example, the robot found the best steps to get the gem:

“R . . . G .” →Skip2→Back→Back→Skip2→Take[success]
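The sequence above can be replayed with a small corridor simulation. This is a sketch under assumptions that the article does not state explicitly: the corridor has 6 blocks (as in the diagrams), "Back" moves the robot one block to the left, and "Skip2" jumps over two blocks and lands on the third (a move of +3), which is the reading that makes both Case2 examples work. The function name runEpisode is made up for illustration.

```javascript
// Corridor sketch for Case2. Assumptions (not stated explicitly in the article):
// a corridor of 6 blocks, Back moves -1, Skip2 moves +3 (jumping over two blocks).
function runEpisode(robot, gem, actions, length = 6) {
  let total = 0;
  for (const action of actions) {
    if (action === "Take") {
      // Take always ends the episode: +5 on the gem, -1 anywhere else.
      const success = robot === gem;
      total += success ? 5 : -1;
      return { success, total, robot };
    }
    robot += action === "Skip2" ? 3 : -1; // any other action is treated as "Back"
    total += -1;
    if (robot < 0 || robot >= length) {
      return { success: false, total, robot }; // the robot left the corridor
    }
  }
  return { success: false, total, robot }; // ran out of actions without a Take
}

const result = runEpisode(0, 4, ["Skip2", "Back", "Back", "Skip2", "Take"]);
console.log(result); // { success: true, total: 1, robot: 4 }
```

Four moves at -1 each followed by a successful Take at +5 give a total reward of 1 for this episode.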

● Setting rewards according to purpose

The reward values shown above can be changed depending on the purpose. For example, unlike the above, let's say you want to get the gem in the best way, even if the robot starts at any position. In such cases, change the reward design of Case 2 as follows and name it Case 3.

Case3:

Take: Take the gem (reward = +5 if successful, otherwise -1)
Back: Go back one block in the corridor (reward = -1 if it is in the corridor after moving, otherwise -2)
Skip2: Skip two blocks in the corridor (reward = -1 if it is in the corridor after moving, otherwise -2)

This reward design will serve your purpose, as in the examples below:

Thursday, August 26, 2021

Running Q-Learning on your palm (Part1)

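Summary: This app lets you learn the basic ideas and mechanics of reinforcement learning (Q-Learning in particular), a field of artificial intelligence, by running simple examples. Reinforcement learning is used in AlphaGo (the game of Go), autonomous driving, autonomous robots and more, and is attracting attention. Before tackling such practical-scale applications, you can get a feel for the essence of the technology here using a familiar smartphone. Enjoy reinforcement learning on the smartphone in your palm! (This app was developed with MIT App Inventor.)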

This app won the "2021 October's MIT APP INVENTOR OF THE MONTH" award.

Abstract
This app explains the basic idea and mechanism of reinforcement learning (especially Q-Learning), which is a field of artificial intelligence technology, using simple examples. Reinforcement learning is used in AlphaGo, autonomous driving, autonomous control robots, etc., and is attracting a lot of attention. Here, you can get familiar with the essence of the technology using your familiar smartphone before confronting such a practical level. As an example, a robot is trained to go straight down the corridor and successfully acquire the gem placed along the way. For this purpose, I use the reinforcement learning method called Q-Learning. Results of the training can be confirmed by the animation of the robot movement. I have developed a smartphone app that realizes these using MIT App Inventor. Q-Learning runs on your palm!

# This application for Android is published below:
Source code: here

# This app can also be applied to robots with more complex behaviors than the examples below. In that case, it is enough to add a definition of new actions and rewards. For details, please see Part 2 here.

# For the case where the robot moves on the 2D grid, please see Part3 here.

● Overview of the Q-Learning app
The configuration of the developed application is explained in Fig. 1. You can see a robot and a green gem in the corridor at the bottom. Train the robot to pick up the gem at the right place. (The robot's behavior will be explained later.) To do that, first press the "init" button, then train with the "train" button. Each "train" will learn 100 steps. One step corresponds to the robot moving one block in the corridor or picking up (taking) the gem.

At each step, the action is rewarded. Positive actions that lead to success are highly rewarded. A series of actions of the robot starting from the left end to the final success or failure is called an episode. Although the position of the gem changes with each play (game), the average rewards sum for successful episodes can be calculated theoretically in this example. Repeat "train" until it approaches this theoretical value. If the theoretical value cannot be calculated, the iteration should be stopped when it is considered to have converged to a certain high value.

The data in the large yellow area (Q-table) on the screen indicates which action is preferable (worth it) depending on the situation along the way. As I will explain later, with repeated training, this Q-table will approach the correct values.

After training, press the "anim for test" button to watch the robot's behavior in animation. A major feature of this reinforcement learning is that the behavior of the robot can be determined by the contents of the Q-table. Therefore, if the Q-table is inaccurate, it often fails as follows:

On the other hand, if the contents of the Q-table are correct, gem acquisition will always succeed, as shown below:

● Challenges imposed on robots and possible actions
Let's take this example in a little more detail. The actions that the robot can take, the conditions for obtaining the gem, and the conditions for completing the task (play) are shown in Fig. 2. On the other hand, Figure 3 illustrates whether “Take” (the action of picking the gem) or “Forward” (moving to the right) leads to success in a certain situation. To get the gem, the robot must do a "Take" action at the same place as the gem. Obviously, in this example, the robot should select (2) "Forward" instead of selecting "Take" in (1). After that, select "Take" in (3) and it will succeed.

It is easy for humans to write a program to solve this problem (challenge). But here, instead, let the robot itself discover the solution through learning. Please note this point!

● How to choose an action
As mentioned above, I introduced rewards to decide between the “Take” action and the “Forward” action in a particular situation. As shown in Figure 4, the reward is +5 for a "Take" action that succeeds in acquiring the gem, and -1 for all other actions. It is advantageous to choose the action that leads to the larger sum of rewards.

● Update Q-table
To achieve the above, I use a Q-table that represents the value (in other words, worth) of both actions for the state at that time, as shown in Figure 5. In the figure, in the untrained state (train = 0 step), the action value of "Take" is higher (larger) than that of "Forward", so the "Take" action is selected. But this is wrong because the Q-table is inaccurate. On the other hand, in the fully trained state (train = 800 steps), the Q-table has been updated to a reasonable value, and the "Forward" action is taken correctly. Roughly speaking, Q-table is an estimate of the sum of rewards.

The Q-table update is based on a famous learning rule called Q-Learning as shown below:
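In its standard textbook form, the Q-Learning rule moves Q(s, a) toward r + γ max Q(s', a') by a fraction α. A minimal JavaScript sketch of one such update follows. The layout of the table as nested objects and the name updateQ are assumptions for illustration, not the structure used inside the app.

```javascript
// One tabular Q-Learning update:
//   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
// qTable maps a state key to an object of action values.
function updateQ(qTable, state, action, reward, nextState, done, alpha, gamma) {
  const current = qTable[state][action];
  // A terminal transition has no future value.
  const maxNext = done ? 0 : Math.max(...Object.values(qTable[nextState]));
  qTable[state][action] = current + alpha * (reward + gamma * maxNext - current);
  return qTable[state][action];
}

// Example: moving Forward from s0 costs -1 and leads to s1, where Take is worth 5.
const q = {
  s0: { Take: 0, Forward: 0 },
  s1: { Take: 5, Forward: -1 }
};
console.log(updateQ(q, "s0", "Forward", -1, "s1", false, 0.5, 1)); // 2
```

Here α is the learning rate and γ the discount rate, two of the hyperparameters the app exposes.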

This update formula means that the value of Q when taking an action in the current state is brought close to the highest value of Q that can be taken in the next state. It is known that such updates converge to the optimum Q value, assuming a sufficient number of episodes are attempted for all states. If you need a more detailed explanation, please see one of the references ([1][2][3][4][5]). And the relationship between this value of Q and the Q-table in this application is illustrated below:

In fact, the figure below shows that the contents of the Q-Table have almost reached optimal after 1100 steps of training.

● Expand the app a little
Finally, here is an example of how to expand this app. This app has three hyper parameters (α, γ, ε) to increase versatility. For example, how much will the convergence speed and stability be affected if the value of the learning rate α is changed?

To confirm this, you need to display it as a line graph instead of the slider as above. It's easy to achieve. I used ChartMaker (an extension created by Kate & Emily) in reference [7]. The result is shown in Fig.6.

Enjoy reinforcement learning (Q-Learning) with this smartphone app!

● Even deeper expansion
The above example is to familiarize you with the basic idea of "Q-Learning". In this example, the number of states is so small that the entire contents of the Q-table could be saved and updated. However, consider an example where the robot's range of movement is not a one-dimensional corridor, but a wide plane, or there are places where it cannot proceed due to obstacles. In such cases, the number of states will be enormous and will not be solved by basic Q-Learning. Therefore, the Q-table needs to be approximated by another method. One promising method is to use neural networks. The training itself will need to be done on a PC, but it will be possible to bring the trained model to a smartphone and run the animation for testing. I would like to discuss such advanced expansions in another article.

Acknowledgments
This app is my original work. However, I referred to the explanation of Q-Learning and the example Python program in reference [6]. I would like to thank Dr. Makoto Ito, the author of that article.

References
[1] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction, second edition", The MIT Press, 2018.
[2] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau, "An Introduction to Deep Reinforcement Learning", Now Publishers, 2019.
[3] Etsuji Nakai, "Reinforcement Learning for Software Engineers", Gijyutsu-Hyoron-Sha, 2020. (in Japanese)
[4] Azuma Ohuchi, Masahito Yamamoto, Hidenori Kawamura, "Theory and application of multi-agent systems - computing paradigm for complex systems engineering", Corona-sha, 2002. (in Japanese)
[5] Tomah Sogabe, "Introduction to reinforcement learning algorithm", Ohmsha, 2019. (in Japanese)
[6] Makoto Ito, “Learn Reinforcement Learning with Python”, Nikkei Software 2021.07, Nikkei BP, 2021, pp.24-39 (in Japanese)
[7] Kate Manning and Emily Kager,
https://github.com/MillsCS215AppInventorProj/chartmaker

Tuesday, August 17, 2021

The appeal of learning reinforcement learning with a smartphone app!

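[What is this] A recent issue of Nikkei Software carried an explanatory article, "Learn Reinforcement Learning with Python". It is carefully written and easy to follow, and I was able to run the provided Python programs in full. Rather than stopping there, to deepen my understanding I rebuilt it on my own on a different platform (an Android smartphone) and programming environment (MIT App Inventor).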

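■ The article: learning reinforcement learning with Python
Fig. 1 shows the article. Over 16 pages it explains reinforcement learning very carefully. The first 7 pages use a simple example to explain the concept of reinforcement learning (Q-Learning) and how it works in practice. The remaining 9 pages explain how to implement it in Python. Besides walking through the code, the article clearly explains the key Q-Table learning rule (including the learning rate and discount rate).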

Best of all, the provided Python programs (four Python files totalling about 970 lines) ran completely and without problems in my environment. Fig. 2 shows this.

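Wonderful! I feel I understand it! But do I really? Have I not merely traced through it? If I truly want to deepen my understanding, the best thing is to rebuild the program on my own, following the specification in the article. So that is what I did.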

■ Building the reinforcement learning program independently as a smartphone app

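In fact, I hardly read the Python code itself. Even so, I was able to create a smartphone app with equivalent functionality, shown in Fig. 3. That is entirely thanks to how good the article's explanation is. I developed the app with MIT App Inventor.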

■ "Learning (training)" and "evaluation (execution)" in the smartphone app

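I cannot go into detail here, but I will briefly show "learning" and "evaluation" with this app. Fig. 4 (a) and (b) show the task being executed when training is insufficient and when it is sufficient, respectively.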

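Here the robot's action is either "move forward to the right" or "take the gem". The robot is trained to "take the gem" when its position coincides with that of the green gem. In Fig. 4(a) the robot fails to get the gem. This is because training was insufficient and, in the current situation (red frame), the values were

Q-Table [action value of taking the gem, action value of moving forward to the right]

= [-0.71757, -1.01168]

so the robot performed "take the gem", which was rated as the more valuable action (that is, -0.71 > -1.01), even though it was not yet at the gem's position.

In Fig. 4(b), by contrast, training has progressed far enough that the contents of the Q-Table are sound, so the robot does not "take the gem" here and now but heads toward the gem instead.

Then, as shown in Fig. 5, the robot moves closer to the gem (one intermediate step is not illustrated), and once the positions coincide, it follows

Q-Table [action value of taking the gem, action value of moving forward to the right]

= [5.0, -1.9]

and finally performs "take the gem" with confidence (that is, 5.0 > -1.9), this time successfully.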
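The choice the robot makes from one Q-Table row can be written as a simple argmax. The sketch below uses the two rows quoted above; the array ACTIONS and the function name greedyAction are assumptions for illustration (the app itself is built with MIT App Inventor blocks, not JavaScript).

```javascript
// Pick the action with the highest action value in one Q-Table row.
// Row layout follows the text: [value of taking the gem, value of moving forward].
const ACTIONS = ["take the gem", "move forward"];

function greedyAction(row) {
  return ACTIONS[row.indexOf(Math.max(...row))];
}

// Under-trained row from Fig. 4(a): the premature "take" looks better.
console.log(greedyAction([-0.71757, -1.01168])); // take the gem
// Well-trained row at the gem's position from Fig. 5.
console.log(greedyAction([5.0, -1.9]));          // take the gem
```

The selection rule is the same in both cases; only the quality of the Q-Table values decides whether the chosen action is the right one.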

To confirm the behavior, an example evaluation run after sufficient training (with the Q-Table contents now sound) is shown below.

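■ Why reinforcement learning in a smartphone app matters
Programming in Python is fine, of course. But for developing an app on a smartphone, MIT App Inventor is very efficient. The article implements the Q-Table with Python dictionaries and NumPy arrays, and App Inventor can do the equivalent. And although it is not as full-featured as Python's matplotlib, App Inventor can also draw a line graph (the reward history), as shown in Fig. 3.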

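Run reinforcement learning on a smartphone, visualize its progress in a graph, check the Q-Table values that result from learning, and run an animation for evaluation: all of this can be done interactively in the palm of your hand with button presses. I am struck once again by how appealing that is!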

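For example, if you want to train with hyperparameters different from those used in Fig. 3, you can immediately check the effect (convergence speed, stability and so on) in a graph, as in Fig. 6.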

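■ Complexity of the MIT App Inventor program
As mentioned above, I created in MIT App Inventor a program roughly equivalent to the approximately 970 lines of Python. I cannot describe its complexity in detail here, but the whole program (block diagram) is shown for reference. I apologize that it is too small for the contents to be legible.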

Friday, June 4, 2021

Reinforcement learning: the Monte Carlo method and Q-Learning

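[What is this] This is a further follow-up on reinforcement learning. Unlike the car rental shop problem covered last time, it takes up a maze problem as an example where the environment model is unknown, and observes the performance of the Monte Carlo method and Q-Learning as ways to solve it.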

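■ The maze problem and the car rental shop problem
In the earlier car rental shop problem, the conditional probability of obtaining reward r and next state S' when taking action a in state S had to be known (computable). In real problems, however, this is often not the case. The maze problem described below (finding the shortest path from a start point to an end point) is one such problem.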

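■ The Monte Carlo method and Q-Learning
Here I examine the Monte Carlo method and Q-Learning as solutions to the maze problem. As before, this article is in fact a brief write-up of what I learned from Chapter 4 of Etsuji Nakai's book [1].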

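Instead of relying on the conditional probabilities mentioned above, the Monte Carlo method learns from the outcomes observed for each action a in simulation. It needs information from completed simulation episodes (an episode runs from the start point to the end point). In contrast, the method called Q-Learning can proceed with learning using the new information from a single step, without waiting for an episode to finish.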
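The difference in what each method needs before it can learn can be sketched in a few lines of JavaScript. This is an illustration only, not code from reference [1]: the function names episodeReturns and oneStepTarget and the example rewards are assumptions.

```javascript
// Monte Carlo: the discounted return G_t for every step of an episode.
// It can only be computed once the episode has ended.
function episodeReturns(rewards, gamma) {
  const returns = new Array(rewards.length);
  let g = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g; // G_t = r_t + gamma * G_{t+1}
    returns[t] = g;
  }
  return returns;
}

// Q-Learning: a learning target that is available after a single step,
// built from the reward and the current estimates for the next state.
function oneStepTarget(reward, gamma, nextQValues) {
  return reward + gamma * Math.max(...nextQValues);
}

console.log(episodeReturns([-1, -1, 5], 1)); // [ 3, 4, 5 ]
console.log(oneStepTarget(-1, 0.5, [0, 10])); // 4
```

The first function needs the whole reward sequence, while the second needs only one reward and the current Q-value estimates, which is why Q-Learning can update during an episode.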

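For details of the methods, please read reference [1] and similar sources; here I show only the results, as a starting point for learning more. Fig. 1 shows the shortest paths through the maze found by both methods. Starting from the start point (S) and heading for the goal (G), the agent moves up, down, left or right while avoiding walls (#). The paths taken by the two methods differ, but both should be optimal. What deserves attention is the time required for learning: Q-Learning performed overwhelmingly better.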

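Fig. 2 shows this performance difference even more clearly. It compares the lengths of the episodes obtained during learning. In the early stages the Monte Carlo method learns while generating very long episodes; that is, it wanders along long paths and struggles to reach the goal. In contrast, with Q-Learning the paths become sharply shorter soon after the start, showing that learning proceeds efficiently.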

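■ Thoughts

In Etsuji Nakai's book [1] mentioned at the start, Chapter 4 (around Q-Learning) can be called the climax. But the necessary foundations are already in place by the end of Chapter 3, "When the environment is known", which takes up about 60% of the book and covers how to find statistically exact solutions with the state-value function (based on the Bellman equation). In other words, if you read carefully through Chapter 3, you should be able to follow Chapter 4 smoothly. I feel strongly that very careful exposition is the hallmark of this book. The Python code provided for checking is also explained in detail, so anyone who has at least finished beginner-level Python should have little trouble understanding it.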

[References]

[1] Etsuji Nakai, "ITエンジニアのための強化学習理論入門" (Reinforcement Learning for Software Engineers), Gijyutsu-Hyoron-Sha, July 2020. (in Japanese)
