RL learns from interaction rather than from labeled data; the core idea is to gradually improve performance through experience.
1. Learning Through Trial and Error
- The agent tries actions, observes results (state transitions and rewards), and updates its knowledge or policy.
- Over time, it learns which actions lead to better outcomes.
2. Parameter Updates
- Just like in supervised learning, the model (e.g., Q-table, neural network) has parameters (weights).
- During training, these parameters are updated to minimize a loss function (e.g., temporal difference error in Q-learning or prediction loss in DQNs).
3. Exploration vs. Exploitation
- In training, the agent often explores new actions (e.g., via an epsilon-greedy strategy) to improve learning; see the sketch after this list.
- In the final (deployment) phase, it mainly exploits the learned policy.
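For illustration, a minimal epsilon-greedy action selection could look like the sketch below; the names q_table, state, n_actions, and the value of EPSILON are assumptions for the example, not part of any fixed API.

```python
import random

EPSILON = 0.1  # assumed exploration rate; tune for your problem

def choose_action(q_table, state, n_actions):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        # Exploration: try a random action to gather new experience
        return random.randrange(n_actions)
    # Exploitation: take the action with the highest learned Q-value
    values = q_table[state]
    return max(range(n_actions), key=lambda a: values[a])
```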
====
There are many algorithms for reinforcement learning; see https://en.wikipedia.org/wiki/Reinforcement_learning
A well-known algorithm is Q-learning.
Reinforcement learning involves an agent, a set of states S, and a set A of actions per state. By performing an action a ∈ A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
Algorithm
After Δt steps into the future the agent will decide some next step. The weight for this step is calculated as γ^Δt, where γ (the discount factor) is a number between 0 and 1 (0 ≤ γ ≤ 1) and has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ may also be interpreted as the probability to succeed (or survive) at every step Δt.
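As a concrete illustration of how γ^Δt down-weights later rewards, the short sketch below computes a discounted return for a made-up reward sequence; the values of gamma and rewards are arbitrary assumptions.

```python
gamma = 0.9                     # discount factor, 0 <= gamma <= 1 (assumed value)
rewards = [1.0, 0.0, 0.0, 5.0]  # hypothetical rewards received at steps 0..3

# Discounted return: sum of gamma**t * r_t, so later rewards count for less
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)        # 1.0 + 0.9**3 * 5.0 = 4.645
```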
The algorithm, therefore, has a function Q that calculates the quality of a state–action combination:
- Q : S × A → ℝ.
Before learning begins, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t the agent selects an action a_t, observes a reward r_t, enters a new state s_{t+1} (that may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information:
- Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_t + γ · max_a Q(s_{t+1}, a) - Q(s_t, a_t)), where α is the learning rate (0 < α ≤ 1).
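A minimal tabular sketch of that update, assuming a defaultdict-based Q-table (rows = states, columns = actions) and illustrative values for the learning rate and discount factor:

```python
from collections import defaultdict

N_ACTIONS = 4   # assumed action count (e.g., up/down/left/right)
alpha = 0.1     # learning rate: weight given to the new information
gamma = 0.9     # discount factor

# Q-table: one row per state, one Q-value per action; unseen states start at 0.0
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward the Bellman target
    r + gamma * max_a Q(s', a), weighted by the learning rate."""
    best_next = max(Q[next_state])
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
```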
Cf. https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning
ChatGPT:
Q-learning uses a Q-table to store Q-values, each representing the quality of the reward the agent can achieve in state s when taking action a. The Q-table's rows represent states and its columns represent actions; each entry is a Q-value.
Deep Q-Learning (DQL) uses a neural network whose output approximates the current Q-values, instead of the tabular representation (a Q-table for discrete states/actions) used to store Q-values in Q-learning. The neural network is trained to reduce a loss between the target Q-value (derived from the Bellman equation) and the current Q-value. In Q-learning, the Bellman equation is applied directly to update the Q-values stored in the table for each state–action pair.
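A rough PyTorch sketch of that training step is shown below; the layer sizes, state encoding, optimizer settings, and function signature are assumptions for illustration, and a practical DQN would also add experience replay and a separate target network.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 11, 4   # assumed sizes (e.g., a small Snake state encoding)
gamma = 0.9

# The network replaces the Q-table: it maps a state vector to one Q-value per action
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(state, action, reward, next_state, done):
    """One gradient step toward the Bellman target for a single transition."""
    q_pred = q_net(state)[action]                      # current Q(s, a)
    with torch.no_grad():                              # target is treated as a constant
        q_next = q_net(next_state).max()               # max_a Q(s', a)
        target = reward + gamma * q_next * (1 - done)  # no bootstrapping at terminal states
    loss = loss_fn(q_pred, target)                     # squared error between current and target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```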
Snake game:
You want the agent (snake) to learn how to survive and grow longer by playing many games.
The environment (game board) provides feedback through rewards (e.g., +1 for eating food, -1 for dying).
You want the AI to develop strategies like avoiding collisions, planning moves, or maximizing score over time.
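Putting the pieces together for Snake, a training loop along these lines is typical; the environment object snake_env and its reset()/step() methods are hypothetical names, and choose_action/q_update refer to the tabular sketches above.

```python
# Hypothetical loop: snake_env.reset() is assumed to return the initial state,
# and snake_env.step(action) to return (next_state, reward, done).
N_EPISODES = 1000

for episode in range(N_EPISODES):
    state = snake_env.reset()
    done = False
    while not done:
        action = choose_action(Q, state, N_ACTIONS)        # epsilon-greedy (sketch above)
        next_state, reward, done = snake_env.step(action)  # e.g., +1 food, -1 death, 0 otherwise
        q_update(state, action, reward, next_state)        # tabular Q-learning update (sketch above)
        state = next_state
```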