Thursday, March 3, 2022

Reinforcement learning

RL learns from interaction rather than from labeled data; the core idea is to gradually improve performance through experience.

1. Learning Through Trial and Error

  • The agent tries actions, observes results (state transitions and rewards), and updates its knowledge or policy.

  • Over time, it learns which actions lead to better outcomes.


2. Parameter Updates

  • Just like in supervised learning, the model (e.g., a Q-table or a neural network) has parameters (table entries or network weights).

  • During training, these parameters are updated to minimize a loss function (e.g., the temporal-difference error in Q-learning or the prediction loss in DQNs); see the sketch after this list.


3. Exploration vs. Exploitation

  • In training, the agent often explores new actions (e.g., epsilon-greedy strategy) to improve learning.

  • In the final (deployment) phase, it mainly exploits the learned policy.
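
To make the three points above concrete, here is a minimal sketch of a tabular Q-learning loop with an epsilon-greedy policy. It assumes a hypothetical environment object with reset() and step(action) returning (next_state, reward, done); states are assumed hashable (e.g., tuples), and the hyperparameter values are only illustrative.

import random
from collections import defaultdict

def train_tabular_q(env, n_actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Toy tabular Q-learning loop over a hypothetical reset()/step() environment."""
    # The Q-table is the "model" whose parameters get updated (point 2).
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation (point 3): epsilon-greedy action choice.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            # Trial and error (point 1): act, then observe the reward and next state.
            next_state, reward, done = env.step(action)
            # Parameter update (point 2): move Q(s, a) toward the temporal-difference target.
            td_target = reward + gamma * max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q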

====

There are many algorithms for reinforcement learning; see https://en.wikipedia.org/wiki/Reinforcement_learning for an overview.

A well-known algorithm is Q-learning.

Reinforcement learning involves an agent, a set of states S, and a set A of actions per state. By performing an action a ∈ A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).

Algorithm

After Δt steps into the future the agent will decide some next step. The weight for this step is calculated as γ^Δt, where γ (the discount factor) is a number between 0 and 1 (0 ≤ γ ≤ 1) and has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ may also be interpreted as the probability to succeed (or survive) at every step Δt.
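
As an illustration, with γ = 0.9 a reward received 3 steps in the future is weighted by 0.9^3 ≈ 0.73, while one received 10 steps ahead is weighted by 0.9^10 ≈ 0.35.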

The algorithm, therefore, has a function that calculates the quality of a state–action combination:

Q : S × A → ℝ.

Before learning begins, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t the agent selects an action a_t, observes a reward r_t, enters a new state s_{t+1} (that may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information.
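
The update itself is the standard Q-learning rule (see the Wikipedia page cited below):

Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)),

where α (0 < α ≤ 1) is the learning rate.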

Cf. https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning


ChatGPT:

Q-learning uses a Q-table to store Q-values; Q(s, a) represents the quality (expected future reward) the agent can achieve in state s when taking action a. Each row of the Q-table represents a state, each column represents an action, and each entry is a Q-value.

Deep Q-Learning (DQL) uses a neural network whose output approximates the current Q-values, instead of storing them in a table (a Q-table over discrete states/actions) as in Q-learning. The network is trained to reduce a loss between the target Q-value (derived from the Bellman equation) and the current Q-value. In Q-learning, by contrast, the Bellman equation is applied directly to update the Q-values stored in the table for each state–action pair.
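
Below is a minimal sketch of that loss computation, assuming PyTorch (the framework choice and layer sizes are my assumptions, not from the text above), an 11-dimensional state vector and 3 actions as one might use for a snake agent; it is one gradient step, not a full training loop.

import torch
import torch.nn as nn

# Small fully connected network approximating Q(s, ·); sizes are illustrative.
q_net = nn.Sequential(nn.Linear(11, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.9  # discount factor

def dql_step(state, action, reward, next_state, done):
    """One gradient step toward the Bellman target r + gamma * max_a' Q(s', a').

    state, next_state: 1-D float tensors; action: int; reward: float; done: 0.0 or 1.0.
    """
    q_current = q_net(state)[action]                        # current Q-value from the network
    with torch.no_grad():                                   # the target is treated as a constant
        best_next = q_net(next_state).max()                 # max over actions of Q(s_{t+1}, a)
        target = reward + gamma * best_next * (1.0 - done)  # no future reward after a terminal move
    loss = nn.functional.mse_loss(q_current, target)        # loss between current and target Q-value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Real DQN implementations also add a replay buffer and a separate target network, but the core idea is the same: the table lookup is replaced by a network forward pass plus a gradient step on the loss.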

Snake game:

You want the agent (snake) to learn how to survive and grow longer by playing many games.

The environment (game board) provides feedback through rewards (e.g., +1 for eating food, -1 for dying).

You want the AI to develop strategies like avoiding collisions, planning moves, or maximizing score over time.
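
A sketch of how the reward signal described above might be encoded; the function name and inputs are hypothetical, not taken from any particular snake implementation.

def snake_reward(ate_food: bool, crashed: bool) -> float:
    """Hypothetical reward for one snake move: +1 for eating food, -1 for dying, 0 otherwise."""
    if crashed:
        return -1.0   # collision with a wall or the snake's own body ends the game
    if ate_food:
        return 1.0    # the snake grows longer
    return 0.0        # neutral step; the agent must still learn to reach food efficiently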