RL learns from interaction rather than from labeled data; the core idea is to gradually improve performance through experience.
1. Learning Through Trial and Error
- The agent tries actions, observes the results (state transitions and rewards), and updates its knowledge or policy.
- Over time, it learns which actions lead to better outcomes.
2. Parameter Updates
- Just like in supervised learning, the model (e.g., a Q-table or neural network) has parameters (weights).
- During training, these parameters are updated to minimize a loss function (e.g., the temporal difference error in Q-learning or the prediction loss in a DQN).
3. Exploration vs. Exploitation
- In training, the agent often explores new actions (e.g., with an epsilon-greedy strategy) to improve learning.
- In the final (deployment) phase, it mainly exploits the learned policy (see the sketch after this list).
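As a concrete illustration of all three points, here is a minimal sketch of an epsilon-greedy tabular Q-learning loop. The environment object (env) and its reset()/step() interface, as well as the hyperparameter values, are assumptions for illustration, not a specific library API.

```python
import random
from collections import defaultdict

# Assumed (hypothetical) environment interface:
#   env.reset() -> state
#   env.step(action) -> (next_state, reward, done)
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # illustrative hyperparameters
ACTIONS = [0, 1, 2, 3]                   # e.g., up, down, left, right

Q = defaultdict(lambda: [0.0] * len(ACTIONS))  # Q-table: state -> action values

def choose_action(state, epsilon=EPSILON):
    # Exploration vs. exploitation: random action with probability epsilon,
    # otherwise exploit the current estimates (use epsilon=0 at deployment).
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def train_episode(env):
    state = env.reset()
    done = False
    while not done:
        action = choose_action(state)                # trial and error
        next_state, reward, done = env.step(action)  # observe the result
        # Temporal difference error: gap between the current estimate and
        # the reward plus the discounted value of the best next action.
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (target - Q[state][action])  # parameter update
        state = next_state
```

In practice epsilon is often decayed toward zero during training, so the agent gradually shifts from exploring to exploiting.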
====
There are many algorithms for reinforcement learning; see https://en.wikipedia.org/wiki/Reinforcement_learning.
A well-known algorithm is Q-learning.
Reinforcement learning involves an agent, a set of states, and a set of actions per state. By performing an action, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score).
--ChatGPT
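The standard Q-learning update rule, which ties together the states, actions, and rewards described above (and which the tabular sketch earlier implements), is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

where s is the current state, a the chosen action, r the reward received, s' the next state, alpha the learning rate, and gamma the discount factor.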
Snake game:
You want the agent (snake) to learn how to survive and grow longer by playing many games.
The environment (game board) provides feedback through rewards (e.g., +1 for eating food, -1 for dying).
You want the AI to develop strategies like avoiding collisions, planning moves, or maximizing score over time.
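A minimal sketch of that reward scheme as a plain Python function; the function name and boolean flags are hypothetical, and only the +1 / -1 values come from the notes above.

```python
def snake_reward(ate_food: bool, died: bool) -> float:
    """Reward signal for one step of the Snake game."""
    if died:
        return -1.0  # dying (hitting a wall or the snake's own body)
    if ate_food:
        return 1.0   # eating food
    return 0.0       # an ordinary step: no reward
```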
Neural network used in a DQN for the Snake game:
Input: a representation of the environment’s state.
- Grid / image input → CNN-based DQN
- Feature vector input → MLP-based DQN
1. Grid input
Treat the Snake game board as a matrix (like an image).
Input:
0 = empty cell
1 = snake body
2 = snake head
3 = food
If the board is 20×20 → the input is a 20×20 matrix (sometimes flattened into 400 values).
Neural nets for this input usually use CNNs (as in the Atari DQN).
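A minimal sketch of such a CNN-based DQN in PyTorch. The layer sizes and the choice to one-hot encode the 0/1/2/3 cell codes into four input channels are assumptions for illustration, not a fixed recipe.

```python
import torch
import torch.nn as nn

class GridDQN(nn.Module):
    """CNN-based Q-network: input is the encoded board grid, output is one Q-value per action."""
    def __init__(self, board_size: int = 20, num_actions: int = 4):
        super().__init__()
        # Assumes the 0/1/2/3 cell codes are one-hot encoded into 4 input channels.
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * board_size * board_size, 256), nn.ReLU(),
            nn.Linear(256, num_actions),  # Q-values for up, down, left, right
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, 4, board_size, board_size)
        return self.head(self.conv(x))

net = GridDQN()
board = torch.zeros(1, 4, 20, 20)  # a batch containing one (empty) encoded board
q_values = net(board)              # shape (1, 4): one Q-value per action
```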
2. Feature vector
Simpler and often more efficient. Common features:
- Snake head position
- Food position
- Relative position of the food
- Snake direction (one-hot: [up, down, left, right])
- Danger information (is there a wall or body in the next cell up/down/left/right?)
- Snake length
Output: estimated Q-values for all possible actions.
A vector of 4 Q-values, one for each of moving up, down, left, and right.
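A minimal sketch of the MLP-based variant in PyTorch. The feature count of 15 is an assumption derived from the list above (2 head coordinates + 2 food coordinates + 2 relative food offsets + 4 direction flags + 4 danger flags + 1 length), and the hidden-layer sizes and action ordering are illustrative.

```python
import torch
import torch.nn as nn

class FeatureDQN(nn.Module):
    """MLP-based Q-network: input is a hand-crafted feature vector, output is 4 Q-values."""
    def __init__(self, num_features: int = 15, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_actions),  # Q-values for up, down, left, right
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, num_features)
        return self.net(x)

net = FeatureDQN()
features = torch.zeros(1, 15)         # placeholder feature vector
action = net(features).argmax(dim=1)  # 0=up, 1=down, 2=left, 3=right (assumed ordering)
```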