Arthur Danjou

Reinforcement Learning for Tennis Strategy Optimization

Academic Project · Completed

An academic project exploring the application of reinforcement learning to optimize tennis strategies. The project involves training RL agents on Atari Tennis (ALE) to evaluate strategic decision-making through competitive self-play and baseline benchmarking.

March 13, 2026
Reinforcement Learning · Python · Gymnasium · Atari · ALE

Comparison of Reinforcement Learning algorithms on Atari Tennis (ALE/Tennis-v5 via Gymnasium/PettingZoo).

Overview

This project implements and compares five RL agents playing Atari Tennis against the built-in AI and in head-to-head tournaments.

Algorithms

| Agent | Type | Policy | Update Rule |
|---|---|---|---|
| Random | Baseline | Uniform random | None |
| SARSA | TD(0), on-policy | ε-greedy | $W_a \leftarrow W_a + \alpha \cdot (r + \gamma \hat{q}(s', a') - \hat{q}(s, a)) \cdot \phi(s)$ |
| Q-Learning | TD(0), off-policy | ε-greedy | $W_a \leftarrow W_a + \alpha \cdot (r + \gamma \max_{a'} \hat{q}(s', a') - \hat{q}(s, a)) \cdot \phi(s)$ |
| Monte Carlo | First-visit MC | ε-greedy | $W_a \leftarrow W_a + \alpha \cdot (G_t - \hat{q}(s, a)) \cdot \phi(s)$ |
| DQN | Deep Q-Network | ε-greedy | MLP (256→256) with experience replay & target network |
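
The three linear update rules in the table can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's code: the function names, the `done` handling for terminal transitions, and the default values of `alpha` and `gamma` are assumptions made for the example.

```python
import numpy as np

N_FEATURES = 128  # size of the RAM observation phi(s)
N_ACTIONS = 18    # number of discrete actions


def sarsa_update(W, phi, a, r, phi_next, a_next, alpha=0.01, gamma=0.99, done=False):
    """On-policy TD(0): bootstrap from the action a' actually taken in s'."""
    # Assumption: no bootstrapping on terminal transitions.
    target = r if done else r + gamma * (W[a_next] @ phi_next)
    W[a] += alpha * (target - W[a] @ phi) * phi


def q_learning_update(W, phi, a, r, phi_next, alpha=0.01, gamma=0.99, done=False):
    """Off-policy TD(0): bootstrap from the greedy action in s'."""
    target = r if done else r + gamma * np.max(W @ phi_next)
    W[a] += alpha * (target - W[a] @ phi) * phi


def monte_carlo_update(W, phi, a, G, alpha=0.01):
    """Monte Carlo: move q_hat(s, a) toward the observed return G_t."""
    W[a] += alpha * (G - W[a] @ phi) * phi


# One weight row per action, so q_hat(s, a) = W[a] @ phi(s).
W = np.zeros((N_ACTIONS, N_FEATURES))
```

Each function modifies `W` in place and touches only the row of the action that was taken, which matches the per-action weight vector $W_a$ in the table.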

Architecture

  • Linear agents (SARSA, Q-Learning, Monte Carlo): $\hat{q}(s, a; \mathbf{W}) = \mathbf{W}_a^\top \phi(s)$ with $\phi(s) \in \mathbb{R}^{128}$ (RAM observation)
  • DQN: MLP network (128 → 128 → 64 → 18) trained with Adam optimizer, Huber loss, and periodic target network sync
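
A DQN training step with the ingredients listed above (Adam, Huber loss, periodic target sync) might look like the PyTorch sketch below. The layer sizes follow the 128 → 128 → 64 → 18 description; the learning rate, discount factor, ReLU activations, batch layout, and all function names are assumptions for illustration, not the notebook's actual values.

```python
import copy

import torch
import torch.nn as nn


def make_q_network(n_features=128, n_actions=18):
    """MLP mapping a RAM observation to one Q-value per action."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )


online_net = make_q_network()
target_net = copy.deepcopy(online_net)   # frozen copy used for bootstrapping
target_net.requires_grad_(False)
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)  # lr is assumed
loss_fn = nn.HuberLoss()


def dqn_step(batch, gamma=0.99):
    """One gradient step on a replay batch; returns the scalar loss."""
    states, actions, rewards, next_states, dones = batch
    # Q-value of the action actually taken in each sampled transition.
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_max = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_max
    loss = loss_fn(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def sync_target():
    """Periodic target network sync: copy the online weights."""
    target_net.load_state_dict(online_net.state_dict())
```

In a training loop, `dqn_step` would be called on batches sampled from the replay buffer and `sync_target` every fixed number of steps; the sync interval is a hyperparameter not stated here.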

Environment

  • Game: Atari Tennis via PettingZoo (tennis_v3)
  • Observation: RAM state (128 features)
  • Action Space: 18 discrete actions
  • Agents: 2 players (first_0 and second_0)

Project Structure

.
├── Project_RL_DANJOU_VON-SIEMENS.ipynb    # Main notebook
├── README.md                              # This file
├── checkpoints/                           # Saved agent weights
│   ├── sarsa.pkl
│   ├── q_learning.pkl
│   ├── montecarlo.pkl
│   └── dqn.pkl
└── plots/                                 # Training & evaluation plots
    ├── SARSA_training_curves.png
    ├── Q-Learning_training_curves.png
    ├── MonteCarlo_training_curves.png
    ├── DQN_training_curves.png
    ├── evaluation_results.png
    └── championship_matrix.png

Key Results

Win Rate vs Random Baseline

| Agent | Win Rate |
|---|---|
| SARSA | 88.9% |
| Q-Learning | 41.2% |
| Monte Carlo | 47.1% |
| DQN | 6.2% |

Championship Tournament

Full round-robin tournament where each agent faces every other agent in both positions (first_0/second_0).
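
As an illustration of the pairing logic, the schedule for such a tournament can be generated with ordered pairs, so that every matchup is played once in each seat. The function name and the list of participants below are assumptions; the notebook's match system may differ.

```python
from itertools import permutations


def round_robin_schedule(agent_names):
    """Return every ordered pairing, so each pair meets in both positions."""
    return [
        {"first_0": a, "second_0": b}
        for a, b in permutations(agent_names, 2)
    ]


# Hypothetical participant list for the example.
schedule = round_robin_schedule(
    ["Random", "SARSA", "Q-Learning", "Monte Carlo", "DQN"]
)
```

With n agents this yields n·(n−1) matches, since (A, B) and (B, A) are scheduled separately.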

Notebook Sections

  1. Configuration & Checkpoints: Incremental training workflow with pickle serialization
  2. Utility Functions: Observation normalization, ε-greedy policy
  3. Agent Definitions: RandomAgent, SarsaAgent, QLearningAgent, MonteCarloAgent, DQNAgent
  4. Training Infrastructure: train_agent(), plot_training_curves()
  5. Evaluation: Match system, random baseline, round-robin tournament
  6. Results & Visualization: Win rate plots, matchup matrix heatmap
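
The two utility functions named in section 2 could be sketched as below. This assumes normalization means scaling RAM bytes from 0..255 to [0, 1], which the text does not confirm; the function names and the use of a NumPy `Generator` are also illustrative choices.

```python
import numpy as np


def normalize_observation(ram):
    """Scale raw RAM bytes (0..255) to floats in [0, 1] (assumed scheme)."""
    return np.asarray(ram, dtype=np.float32) / 255.0


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.9, 0.3]), epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the policy is purely greedy, so `action` here is the index of the largest Q-value.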

Known Issues

  • Monte Carlo & DQN: Checkpoint loading issues — saved weights may not restore properly during evaluation (training works correctly)
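
One way to narrow down a restore problem like this is a save/load round-trip check. The sketch below is a diagnostic idea, not the authors' fix, and it assumes the checkpoint is a dict of NumPy arrays; the helper names are hypothetical.

```python
import pickle

import numpy as np


def save_checkpoint(path, weights):
    """Serialize a dict of weight arrays with pickle."""
    with open(path, "wb") as f:
        pickle.dump(weights, f)


def load_checkpoint(path):
    """Load a dict of weight arrays written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def weights_match(saved, restored):
    """True if both dicts have the same keys and identical arrays."""
    return saved.keys() == restored.keys() and all(
        np.array_equal(saved[k], restored[k]) for k in saved
    )
```

If `weights_match` passes right after saving but evaluation still misbehaves, the problem is more likely in how the restored values are assigned back to the agent than in serialization itself. For a PyTorch model, the analogous check would compare the entries of `state_dict()`.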

Dependencies

  • Python 3.13+
  • numpy, matplotlib
  • torch
  • gymnasium, ale-py
  • pettingzoo
  • tqdm

Authors

  • Arthur DANJOU
  • Moritz VON SIEMENS