There are many algorithms for reinforcement learning; see https://en.wikipedia.org/wiki/Reinforcement_learning
A well-known one is Q-learning.
Reinforcement learning involves an agent, a set of states $S$, and a set $A$ of actions per state. By performing an action $a \in A$, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
Algorithm
After $\Delta t$ steps into the future the agent will decide some next step. The weight for this step is calculated as $\gamma^{\Delta t}$, where $\gamma$ (the discount factor) is a number between 0 and 1 ($0 \le \gamma \le 1$) and has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). $\gamma$ may also be interpreted as the probability to succeed (or survive) at every step $\Delta t$.
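As a small illustration of the discount weighting, the sketch below computes the weights $\gamma^{\Delta t}$ and a discounted sum of rewards. The function names `discount_weights` and `discounted_return` and the sample values are hypothetical, chosen only for this example.

```python
def discount_weights(gamma, n_steps):
    """Return [gamma**0, gamma**1, ..., gamma**(n_steps - 1)]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma ** dt for dt in range(n_steps)]


def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay."""
    weights = discount_weights(gamma, len(rewards))
    return sum(w * r for w, r in zip(weights, rewards))


# Illustrative values (assumed): the same reward of 1.0 counts less the later it arrives.
print(discount_weights(0.5, 4))
print(discounted_return([1.0, 1.0, 1.0, 1.0], 0.5))
```

With a discount factor of 1 every reward counts equally; with a value near 0 only the earliest rewards matter.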
The algorithm, therefore, has a function that calculates the quality of a state–action combination: $Q: S \times A \to \mathbb{R}$.
Before learning begins, $Q$ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time $t$ the agent selects an action $A_t$, observes a reward $R_{t+1}$, enters a new state $S_{t+1}$ (that may depend on both the previous state $S_t$ and the selected action), and $Q$ is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information.
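Written out, the standard form of this update is shown below. It uses a learning rate $\alpha$ with $0 < \alpha \le 1$, a symbol these notes do not introduce. $S_t$ and $A_t$ are the state and action at time $t$, $R_{t+1}$ is the observed reward, $S_{t+1}$ is the new state, and $\gamma$ is the discount factor.

```latex
Q^{\text{new}}(S_t, A_t) \leftarrow (1 - \alpha)\, Q(S_t, A_t)
  + \alpha \left( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) \right)
```

The first term is the current value and the second is the new information, so $\alpha$ sets how far each step moves the estimate toward the new target.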
Cf. https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning
ChatGPT:
Q-learning uses a Q-table to store Q-values, each representing the quality (expected cumulative reward) the agent can achieve by taking action a in state s. Rows of the Q-table represent states and columns represent actions; each entry is a Q-value.
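A minimal sketch of tabular Q-learning along these lines follows. It assumes a tiny hypothetical corridor environment: `step`, `N_STATES`, the epsilon-greedy exploration with random tie-breaking, and all hyperparameter values (`alpha`, `gamma`, `epsilon`) are invented for illustration and do not come from the text. The table is a list of rows, one per state, with one column per action.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (assumed toy environment)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical corridor dynamics: reward 1.0 on reaching the goal, else 0.0."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q-table: rows are states, columns are actions, initialised to 0.0.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy selection; ties between equal Q-values are broken at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bellman target; no bootstrapping from a terminal state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            # Weighted average of the current value and the new information.
            q[state][action] = (1 - alpha) * q[state][action] + alpha * target
            state = next_state
    return q


q_table = train()
# Greedy action per non-terminal state, read off the learned table.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

Reading the best column in each row gives the greedy policy; in this toy corridor that should be "move right" in every non-terminal state.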
Deep Q-Learning (DQL) replaces the table (which suits discrete states and actions) with a neural network whose outputs approximate the current Q-values. The network is trained to reduce a loss between the target Q-value (derived from the Bellman equation) and the current Q-value. In Q-learning, by contrast, the Bellman equation is applied directly to update the Q-value stored in the table for each state-action pair.
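To make the loss concrete, here is a minimal sketch of the target and loss for a single transition. The Q-values are passed in as plain lists standing in for network outputs. `td_target` and `squared_td_loss` are hypothetical helper names, and squared error is one common choice of loss, not something the text specifies.

```python
def td_target(reward, next_q_values, gamma, done):
    """Bellman target: r + gamma * max_a' Q(s', a'), or just r at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_values, action, target):
    """Squared difference between the predicted Q(s, a) and the target."""
    return (q_values[action] - target) ** 2


# Assumed example numbers: stand-ins for network outputs at state s and next state s'.
q_s = [0.5, 2.0]        # approximated Q(s, a) for actions 0 and 1
q_next = [1.0, 3.0]     # approximated Q(s', a') for actions 0 and 1
target = td_target(reward=1.0, next_q_values=q_next, gamma=0.5, done=False)
loss = squared_td_loss(q_s, action=1, target=target)
print(target, loss)
```

In a full DQL implementation this loss would typically be averaged over a batch of transitions and minimised by gradient descent on the network weights, whereas tabular Q-learning blends the same target directly into the table entry.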