Trainer

class pytorch_accelerated.trainer.Trainer(model, loss_func, optimizer, callbacks=(<class 'pytorch_accelerated.callbacks.MoveModulesToDeviceCallback'>, <class 'pytorch_accelerated.callbacks.TerminateOnNaNCallback'>, <class 'pytorch_accelerated.callbacks.PrintProgressCallback'>, <class 'pytorch_accelerated.callbacks.ProgressBarCallback'>, <class 'pytorch_accelerated.callbacks.LogMetricsCallback'>), run_history=None)[source]

The Trainer is designed to encapsulate an entire training loop for a specific task, bringing together the model, loss function and optimizer, and providing a specification of the behaviour to execute for each step of the training process.

The trainer has been implemented such that it provides (overridable) implementations of the parts of training that rarely change after they have been defined – such as creating a data loader, or how a batch of data is fed to the model – whilst remaining decoupled from components that are likely to change, such as the model, dataset, loss function and optimizer.

__init__(model, loss_func, optimizer, callbacks=(<class 'pytorch_accelerated.callbacks.MoveModulesToDeviceCallback'>, <class 'pytorch_accelerated.callbacks.TerminateOnNaNCallback'>, <class 'pytorch_accelerated.callbacks.PrintProgressCallback'>, <class 'pytorch_accelerated.callbacks.ProgressBarCallback'>, <class 'pytorch_accelerated.callbacks.LogMetricsCallback'>), run_history=None)[source]

Create a new trainer object which can be used to train the given model using the provided loss function and optimizer.

Parameters:
  • model – a subclass of nn.Module to be trained

  • loss_func – the loss function to use when training the model

  • optimizer – the optimizer to update the model’s parameters

  • callbacks – a list of callbacks to use during training runs. If a list of callbacks is not provided, the default selection will be used.

  • run_history – an instance of a RunHistory subclass to track training runs. If this is not provided, a new one will be created.

The callbacks used by default are: MoveModulesToDeviceCallback, TerminateOnNaNCallback, PrintProgressCallback, ProgressBarCallback, LogMetricsCallback.
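
For example, a Trainer for a simple classification task could be created as sketched below; the torchvision model, loss function and optimizer choices are purely illustrative:

from torch import nn, optim
from torchvision.models import resnet18

from pytorch_accelerated import Trainer

model = resnet18(num_classes=10)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

# the default callbacks and a new RunHistory instance will be used
trainer = Trainer(model=model, loss_func=loss_func, optimizer=optimizer)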

class pytorch_accelerated.trainer.TrainerWithTimmScheduler(*args, **kwargs)[source]

Subclass of the Trainer that works with timm schedulers instead of standard PyTorch learning rate schedulers.

Training a model

The main entrypoint for the Trainer is the train() method, which is used to launch a training run.

Trainer.train(train_dataset, num_epochs, eval_dataset=None, per_device_batch_size=8, max_num_train_steps=None, gradient_accumulation_steps=1, gradient_clip_value=None, create_scheduler_fn=None, train_dataloader_kwargs: dict | None = None, eval_dataloader_kwargs: dict | None = None, reset_run_history=True, collate_fn=None)[source]

Start a training run. If an evaluation dataset is provided, this routine will include both training and evaluation epochs.

Note

As the optimizer needs to be internally prepared prior to training, in order to use a learning rate scheduler, a factory function must be provided to create_scheduler_fn. This must be a function which accepts the optimizer as a single parameter and returns an instance of a learning rate scheduler. Passing an instance of a learning rate scheduler will not work here.

Parameters:
  • train_dataset – the dataset to use during training epochs

  • num_epochs – the number of epochs to train for

  • eval_dataset – the dataset to use during evaluation epochs; if this is not provided, evaluation is skipped.

  • per_device_batch_size – the batch size to use per device

  • max_num_train_steps – the maximum number of steps to train for. If provided, this will override num_epochs

  • gradient_accumulation_steps – accumulate gradients over the specified number of steps to simulate a larger batch size. By default, this is set to 1

  • gradient_clip_value – if specified, the gradients of the model’s parameters will be clipped to the range [-gradient_clip_value, gradient_clip_value]

  • create_scheduler_fn – a function which accepts an optimizer as an argument and returns a learning rate scheduler

  • train_dataloader_kwargs – a dictionary of keyword arguments to pass to the training dataloader constructor, for details see torch.utils.data.DataLoader

  • eval_dataloader_kwargs – a dictionary of keyword arguments to pass to the evaluation dataloader constructor, for details see torch.utils.data.DataLoader

  • reset_run_history – reset any run history saved by the trainer from previous training runs

  • collate_fn – the collate function to be used by the training and evaluation dataloaders
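
For example, assuming a trainer has been created as shown above and that train_dataset and eval_dataset are defined elsewhere, a training run could be launched as follows; the hyperparameter values shown are purely illustrative:

trainer.train(
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_epochs=10,
    per_device_batch_size=32,
)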

Using learning rate schedulers

As PyTorch schedulers are not all called in the same way, to enable maximum flexibility, PyTorch-accelerated’s Trainer expects that a given scheduler should be called after each optimizer update by default.

Note that, as the optimizer and dataloaders need to be internally prepared prior to training, in order to use a learning rate scheduler, a factory function must be provided to train() as the create_scheduler_fn argument. This must be a function which accepts the optimizer as a single parameter and returns an instance of a learning rate scheduler.

Note

Passing an instance of a PyTorch learning rate scheduler as the create_scheduler_fn argument to train() will not work as intended.

A simple way of creating a scheduler factory function is by using functools.partial(), like so:

from functools import partial
from torch.optim import lr_scheduler

create_scheduler_fn = partial(lr_scheduler.StepLR, step_size=7, gamma=0.1)

Note

The Trainer calls a step on the provided scheduler after every batch. This can lead to unexpected results as some PyTorch schedulers are expected to step only after every epoch.

For instance, in the example above, the learning rate would be multiplied by 0.1 every 7 batches, rather than every 7 epochs. As this particular scheduler is designed to be stepped once per epoch, this is not the desired behaviour! We can resolve this by expressing the step_size in terms of the number of update steps, like this:

from functools import partial
from torch.optim import lr_scheduler

from pytorch_accelerated import TrainerPlaceholderValues

epochs_step_size = 7

create_scheduler_fn = partial(
    lr_scheduler.StepLR,
    step_size=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH * epochs_step_size,
)

Here, to determine the number of update steps per epoch, we have used a TrainerPlaceholderValues placeholder; these are described below.

Using TrainerPlaceholderValues

class pytorch_accelerated.trainer.TrainerPlaceholderValues(value)[source]

Some learning rate schedulers require information such as the total number of steps that will take place during a training run. As this information is not accessible prior to creating the training dataloader, which is done as part of the train() method, a placeholder value can be used in these cases, as demonstrated below:

from functools import partial

from pytorch_accelerated import TrainerPlaceholderValues
from torch.optim.lr_scheduler import OneCycleLR

create_scheduler_fn = partial(
    OneCycleLR,
    max_lr=config.lr,
    epochs=TrainerPlaceholderValues.NUM_EPOCHS,
    steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
)

These placeholders will be replaced by the trainer with the correct values during training.

The available placeholders are:

  • NUM_EPOCHS

  • NUM_UPDATE_STEPS_PER_EPOCH

  • TRAIN_DATALOADER_LEN

  • EVAL_DATALOADER_LEN

Alternatively, the same outcome could be achieved by overriding the Trainer’s create_scheduler() method.
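
As a sketch, an overridden create_scheduler() could construct the scheduler directly from values available on the trainer; note that the run_config.num_update_steps_per_epoch attribute used here is an assumption about the trainer’s internals and should be verified against your version:

from torch.optim import lr_scheduler

from pytorch_accelerated import Trainer


class TrainerWithStepLRScheduler(Trainer):
    def create_scheduler(self):
        epochs_step_size = 7
        # assumed attribute: the number of optimizer updates per epoch,
        # available once the dataloaders have been prepared
        steps_per_epoch = self.run_config.num_update_steps_per_epoch
        return lr_scheduler.StepLR(
            self.optimizer,
            step_size=steps_per_epoch * epochs_step_size,
            gamma=0.1,
        )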

Using PyTorch-accelerated schedulers

PyTorch-accelerated includes some implementations of schedulers, which have the same interface as PyTorch schedulers, as well as base classes to easily create custom schedules; these are discussed in more detail in Schedulers.

These scheduler implementations have an alternative constructor, which can be passed directly to train() as the create_scheduler_fn argument, as demonstrated below:

from pytorch_accelerated.schedulers import CosineLrScheduler

trainer.train(
        train_dataset=train_dataset,
        num_epochs=num_epochs,
        per_device_batch_size=batch_size,
        create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(num_warmup_epochs=5,
                                                                  warmup_starting_lr=1e-6,
                                                                  num_cooldown_epochs=5),
        )

Using timm schedulers

The schedulers included in timm have a different interface to the native PyTorch schedulers, so they do not work with the base Trainer by default.

PyTorch-accelerated includes an alternative trainer, TrainerWithTimmScheduler, which is compatible with timm schedulers; schedulers should be passed to this trainer as a factory function in the same way as described above.
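
For example, a timm cosine scheduler could be used as sketched below; the scheduler arguments shown are illustrative, and the model, loss function, optimizer and dataset are assumed to be defined elsewhere:

from functools import partial

from timm.scheduler import CosineLRScheduler

from pytorch_accelerated.trainer import TrainerWithTimmScheduler

trainer = TrainerWithTimmScheduler(
    model=model, loss_func=loss_func, optimizer=optimizer
)

trainer.train(
    train_dataset=train_dataset,
    num_epochs=num_epochs,
    per_device_batch_size=batch_size,
    create_scheduler_fn=partial(
        CosineLRScheduler,
        t_initial=num_epochs,
        warmup_t=5,
        warmup_lr_init=1e-6,
    ),
)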

Evaluating a model

Once a model has been trained, or loaded from a checkpoint, the Trainer can also be used for evaluation, which consists of running a single epoch, using the Trainer’s evaluation loop logic, on the given dataset.

Trainer.evaluate(dataset=None, per_device_batch_size=8, dataloader_kwargs: dict | None = None, collate_fn=None)[source]

Start an evaluation run.

Note

Starting an evaluation run will reset the Trainer’s run history.

Note

During distributed evaluation, if per_device_batch_size multiplied by the number of processes used does not exactly divide the dataset, and drop_last=False has not been passed as a dataloader kwarg, the dataloader will repeat from the start in processes that run out of batches. This should be taken into consideration when calculating metrics.

Parameters:
  • dataset – the dataset to use during evaluation

  • per_device_batch_size – the batch size to use per device

  • dataloader_kwargs – a dictionary of keyword arguments to pass to the dataloader constructor, for details see torch.utils.data.DataLoader

  • collate_fn – the collate function to be used by the dataloader
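
For example, an evaluation run on a held-out test set could be started as follows; test_dataset is assumed to be defined elsewhere:

# evaluation uses the Trainer's evaluation loop logic for a single epoch
trainer.evaluate(
    dataset=test_dataset,
    per_device_batch_size=64,
)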

Utility Methods

Trainer.save_checkpoint(save_path, checkpoint_kwargs=None, save_optimizer=True, save_per_node=True)[source]

Save the model, optimizer and specified args as a checkpoint file.

Parameters:
  • save_path – the path to which to save the checkpoint; this should end in ‘.pt’

  • checkpoint_kwargs – additional objects to include in the checkpoint

  • save_optimizer – flag to indicate whether to include the optimizer in the checkpoint

  • save_per_node – flag to indicate whether to save the checkpoint once per machine; if False, the checkpoint will only be saved from the world process zero. This is True by default.

Trainer.load_checkpoint(checkpoint_path, load_optimizer=True)[source]

Load the model and optimizer from a checkpoint file.

Parameters:
  • checkpoint_path – the path of the checkpoint file to load

  • load_optimizer – flag to indicate whether to load the optimizer if it is included in the checkpoint
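
For example, a checkpoint could be saved and later restored as sketched below; the path and the extra ‘epoch’ entry are purely illustrative:

trainer.save_checkpoint(
    save_path="best_model.pt",
    checkpoint_kwargs={"epoch": 10},  # illustrative additional object
    save_optimizer=True,
)

# later, e.g. before resuming training or running evaluation
trainer.load_checkpoint("best_model.pt", load_optimizer=True)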

Trainer.print(*args, **kwargs)[source]

Use in place of print() to print only once per node.

Trainer.gather(tensor, padding_value=None)[source]

Gather the values in tensor across all processes and concatenate them on the first dimension. This can be useful to regroup the predictions from all processes when doing evaluation.

If a padding value is provided, padding will be applied along the first dimension where necessary, to ensure that tensors in all processes have the same shape.

Note

The given value of padding_value should ideally not appear in the expected range of values that the tensor may contain.

Parameters:
  • tensor – (torch.Tensor, or a nested tuple/list/dictionary of torch.Tensor) The tensors to gather across all processes.

  • padding_value – if provided, the value with which to pad tensors to ensure that all processes have the same shape

Returns:

The gathered tensor(s) (torch.Tensor, or a nested tuple/list/dictionary of torch.Tensor). The first dimension of the result is num_processes multiplied by the first dimension of the input tensors.

Note

This gather happens in all processes.
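
As a sketch, gather() could be used inside a Trainer subclass to collect per-batch predictions from all processes during evaluation; the choice of -1 as the padding value assumes it never appears in the predictions:

from pytorch_accelerated import Trainer


class TrainerWithGatheredPredictions(Trainer):
    def calculate_eval_batch_loss(self, batch):
        batch_output = super().calculate_eval_batch_loss(batch)
        preds = batch_output["model_outputs"].argmax(dim=-1)
        # the gather happens in all processes; the result's first dimension
        # is num_processes multiplied by the per-process batch dimension
        batch_output["gathered_preds"] = self.gather(preds, padding_value=-1)
        return batch_output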

Trainer.get_model()[source]

Extract the model in Trainer from its distributed containers. Useful before saving a model.

Returns:

the model managed by the Trainer, a subclass of torch.nn.Module
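
For example, the underlying weights could be saved directly using torch.save:

import torch

# unwrap the model from e.g. DistributedDataParallel before saving its weights
unwrapped_model = trainer.get_model()
torch.save(unwrapped_model.state_dict(), "model_weights.pt")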

Customizing Trainer Behaviour

Whilst the Trainer should work out of the box in straightforward use cases, subclassing the trainer and overriding its methods is intended and encouraged - think of the base implementation as a set of ‘sensible defaults’!

Note

Methods which are prefixed with a verb such as create or calculate expect a value to be returned; all other methods are used to set internal state (e.g. optimizer.step())

Setup Methods

Trainer.create_train_dataloader(batch_size: int, train_dl_kwargs: dict | None = None) Iterable[source]

Create a dataloader to be used during training. This is initialised with the train_dataset and collate function which have been passed to the Trainer.

If no arguments are passed, the arguments returned by Trainer.get_default_train_dl_kwargs() are used.

Note

If batch_size is included in train_dl_kwargs, it takes precedence over the batch_size argument.

Parameters:
  • batch_size – the batch size to use per device

  • train_dl_kwargs – a dictionary of keyword arguments to pass to the dataloader constructor, for details see torch.utils.data.DataLoader

Returns:

an instance of DataLoader

Trainer.get_default_train_dl_kwargs(batch_size) dict[source]

Return the default arguments that will be used by the training dataloader.

Parameters:

batch_size – the batch size to use during training

Returns:

a dictionary containing the default arguments for the training dataloader
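
As a sketch, the default training dataloader arguments could be customised by overriding this method and adjusting only the keys of interest; the num_workers value below is illustrative:

from pytorch_accelerated import Trainer


class TrainerWithMoreWorkers(Trainer):
    def get_default_train_dl_kwargs(self, batch_size) -> dict:
        dl_kwargs = super().get_default_train_dl_kwargs(batch_size)
        # any torch.utils.data.DataLoader keyword argument can be set here
        dl_kwargs["num_workers"] = 8
        return dl_kwargs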

Trainer.create_eval_dataloader(batch_size: int, eval_dl_kwargs: dict | None = None) Iterable[source]

Create a dataloader to be used during evaluation. This is initialised with the eval_dataset and collate function which have been passed to the Trainer.

If no arguments are passed, the arguments returned by Trainer.get_default_eval_dl_kwargs() are used.

Note

If batch_size is included in eval_dl_kwargs, it takes precedence over the batch_size argument.

Parameters:
  • batch_size – the batch size to use per device

  • eval_dl_kwargs – a dictionary of keyword arguments to pass to the dataloader constructor, for details see torch.utils.data.DataLoader

Returns:

an instance of torch.utils.data.DataLoader

Trainer.get_default_eval_dl_kwargs(batch_size) dict[source]

Return the default arguments that will be used by the evaluation dataloader.

Parameters:

batch_size – the batch size to use during evaluation

Returns:

a dictionary containing the default arguments for the evaluation dataloader

Trainer.create_scheduler()[source]

Create a learning rate scheduler based on the create_scheduler_fn function which has been passed to the Trainer.

Returns:

a learning rate scheduler instance

Training Run Methods

Trainer.training_run_start()[source]

This method is called at the start of a training run.

Trainer.training_run_epoch_end()[source]

This method is called during a training run after both training and evaluation epochs have been completed.

Trainer.training_run_end()[source]

This method is called at the end of a training run.

Training epoch Methods

Trainer.train_epoch_start()[source]

This method is called at the start of a training epoch.

The default behaviour of this method is to call self.model.train()

Trainer.calculate_train_batch_loss(batch) dict[source]

Calculates the training loss and returns it along with the batch size and model outputs. Any additional values returned will be available in the on_train_step_end() callback method.

Parameters:

batch – the output of the train dataloader

Returns:

A dictionary containing the training loss, model outputs and batch size. Can include any keys, but must include the keys ‘loss’, ‘model_outputs’ and ‘batch_size’
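
As a sketch, this method could be overridden to return additional values alongside the required keys; the assumption that each batch is an (inputs, labels) tuple is illustrative and depends on your dataset:

from pytorch_accelerated import Trainer


class TrainerWithLabelsInStepOutput(Trainer):
    def calculate_train_batch_loss(self, batch) -> dict:
        inputs, labels = batch  # assumed batch structure
        model_outputs = self.model(inputs)
        loss = self.loss_func(model_outputs, labels)
        return {
            # the keys 'loss', 'model_outputs' and 'batch_size' are required
            "loss": loss,
            "model_outputs": model_outputs,
            "batch_size": inputs.size(0),
            # extra keys are available in the on_train_step_end callback method
            "labels": labels,
        }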

Trainer.backward_step(loss)[source]

Use the accelerator to perform the backward pass on the calculated value of the loss returned by calculate_train_batch_loss(). If gradient accumulation is enabled, this loss has been scaled by 1 / accumulation steps.

Parameters:

loss – The loss tensor returned by calculate_train_batch_loss().

Trainer.optimizer_step()[source]

Performs a single optimization step and updates the parameters which have been passed to self.optimizer.

Trainer.scheduler_step()[source]

Performs a single scheduler step if self.scheduler has been assigned.

Trainer.optimizer_zero_grad()[source]

Sets the gradients of all optimized torch.Tensor objects to zero.

Trainer.train_epoch_end()[source]

This method is called at the end of each training epoch.

Evaluation epoch Methods

Trainer.eval_epoch_start()[source]

This method is called at the start of an evaluation epoch.

The default behaviour of this method is to call self.model.eval()

Trainer.calculate_eval_batch_loss(batch) dict[source]

Calculates the evaluation loss and returns it along with the batch size and model outputs. Any additional values returned will be available in the on_eval_step_end() callback.

Parameters:

batch – the output of the eval dataloader

Returns:

A dictionary containing the evaluation loss, model outputs and batch size. Can include any keys, but must include the keys loss, model_outputs and batch_size

Trainer.eval_epoch_end()[source]

This method is called at the end of an evaluation epoch.

Evaluation Run Methods

Trainer.evaluation_run_start()[source]

This method is called at the start of an evaluation run.

Trainer.evaluation_run_end()[source]

This method is called at the end of an evaluation run.

Internal Methods

Warning

In the spirit of Python, nothing is truly hidden within the Trainer. However, care must be taken because, by overriding these methods, you are fundamentally changing how the Trainer works internally, and this may have unintended consequences. When modifying one or more internal methods, it is the responsibility of the user to ensure that the Trainer continues to work as intended!

Internal Setup

Trainer._create_accelerator()[source]

Create an instance of accelerate.Accelerator which will be used to manage training.

Trainer._prepare_model_optimizer_and_dataloaders()[source]

Uses the trainer’s instance of accelerate.Accelerator to wrap the model, optimizer and dataloaders in any wrappers necessary for training (e.g. torch.nn.parallel.DistributedDataParallel) and ensures that the parameters are placed on the appropriate device.

By default, this will convert each dataloader to an instance of accelerate.data_loader.DataLoaderShard. Depending on the value of the dataloaders’ drop_last attribute, iteration will either stop at the first batch that would be too small or not present on all processes, or will loop back to batches from the beginning on processes which run out of data, so that all processes see batches of the same size.

Note

This may change the length of the dataloaders, so it should be called before the number of update steps per epoch is calculated, e.g. when initialising a learning rate scheduler.

Trainer._create_run_config(per_device_batch_size, num_epochs, gradient_accumulation_steps, max_num_train_steps, gradient_clip_value) TrainerRunConfig[source]

Create an instance of TrainerRunConfig representing the current state of the trainer.

Parameters:
  • per_device_batch_size – the batch size per device

  • num_epochs – the number of epochs in the current training run

  • gradient_accumulation_steps – the number of gradient accumulation steps which will be used during the training run

  • max_num_train_steps – If specified, the maximum number of steps to train for. If present, this will take precedence over num_epochs

  • gradient_clip_value – the value used to determine the threshold to clip the gradients of the model’s parameters

Training run behaviour

Trainer._run_training()[source]

The method responsible for the orchestration of the high level steps which will be executed during a training run.

Training epoch behaviour

Trainer._run_train_epoch(train_dl)[source]

The method responsible for the behaviour of each training epoch.

Parameters:

train_dl – the dataloader to be used during training

Trainer._clip_gradients()[source]

Clip the gradients of the model’s parameters that fall outside of the threshold specified in train().

By default, this clips the gradients using accelerate.Accelerator.clip_grad_value_()

Evaluation epoch behaviour

Trainer._run_eval_epoch(valid_dl, is_training: bool = True)[source]

The method responsible for the behaviour of each evaluation epoch.

Parameters:
  • valid_dl – the dataloader to be used during evaluation

  • is_training – signals whether the evaluation is being run as part of a training run

Should I subclass the Trainer or use a callback?

The behaviour of the Trainer can also be extended using Callbacks. All callback methods are prefixed with on_.

It is recommended that callbacks are used to contain ‘infrastructure’ code that is not essential to the operation of the training loop, such as logging; however, this decision is left to the judgement of the user based on the specific use case. If it seems overkill to subclass the Trainer for the modification you wish to make, it may be better to use a callback instead.

For more information on callbacks, see Callbacks.
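
As a brief sketch, a callback might look like the following; the TrainerCallback base class and the exact hook signatures should be checked against the Callbacks documentation:

from pytorch_accelerated.callbacks import TrainerCallback


class EpochAnnouncementCallback(TrainerCallback):
    def on_train_epoch_end(self, trainer, **kwargs):
        # trainer.print only prints once per node
        trainer.print("Finished a training epoch")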

Recording metrics

The Trainer contains an instance of RunHistory, which can be used to store and retrieve the values of any metrics to track during training. By default, the only metrics that are recorded by the Trainer are the losses observed during training and evaluation.

Note

If the callback PrintMetricsCallback is being used, any metrics recorded in the run history will be printed to the console automatically.

The API for RunHistory is detailed at RunHistory.

Here is an example of how we can subclass the Trainer and use the RunHistory to track metrics computed using TorchMetrics:

from torchmetrics import MetricCollection, Accuracy, Precision, Recall
from pytorch_accelerated import Trainer

class TrainerWithMetrics(Trainer):
    def __init__(self, num_classes, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # this will be moved to the correct device automatically by the
        # MoveModulesToDeviceCallback callback, which is used by default
        self.metrics = MetricCollection(
            {
                "accuracy": Accuracy(num_classes=num_classes),
                "precision": Precision(num_classes=num_classes),
                "recall": Recall(num_classes=num_classes),
            }
        )

    def calculate_eval_batch_loss(self, batch):
        batch_output = super().calculate_eval_batch_loss(batch)
        preds = batch_output["model_outputs"].argmax(dim=-1)

        self.metrics.update(preds, batch[1])

        return batch_output

    def eval_epoch_end(self):
        metrics = self.metrics.compute()
        self.run_history.update_metric("accuracy", metrics["accuracy"].cpu())
        self.run_history.update_metric("precision", metrics["precision"].cpu())
        self.run_history.update_metric("recall", metrics["recall"].cpu())

        self.metrics.reset()
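
This subclass can then be instantiated in the same way as the base Trainer, with the additional num_classes argument; the model, loss function and optimizer are assumed to be defined elsewhere:

trainer = TrainerWithMetrics(
    num_classes=10,
    model=model,
    loss_func=loss_func,
    optimizer=optimizer,
)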

Note

If you feel that subclassing the Trainer seems too excessive for this use case, this could also be done using a callback as demonstrated in Example: Tracking metrics using a callback.

What goes on inside the Trainer?

In pseudocode, the execution of a training run can be depicted as:

train_dl = create_train_dataloader()
eval_dl = create_eval_dataloader()
scheduler = create_scheduler()

training_run_start()
on_training_run_start()

for epoch in num_epochs:
    train_epoch_start()
    on_train_epoch_start()
    for batch in train_dl:
        on_train_step_start()
        batch_output = calculate_train_batch_loss(batch)
        on_train_step_end(batch, batch_output)
        backward_step(batch_output["loss"])
        optimizer_step()
        scheduler_step()
        optimizer_zero_grad()
    train_epoch_end()
    on_train_epoch_end()

    eval_epoch_start()
    on_eval_epoch_start()
    for batch in eval_dl:
        on_eval_step_start()
        batch_output = calculate_eval_batch_loss(batch)
        on_eval_step_end(batch, batch_output)
    eval_epoch_end()
    on_eval_epoch_end()

    training_run_epoch_end()
    on_training_run_epoch_end()

training_run_end()
on_training_run_end()

Similarly, the execution of an evaluation run can be depicted as:

eval_dl = create_eval_dataloader()

evaluation_run_start()
on_evaluation_run_start()

eval_epoch_start()
on_eval_epoch_start()
for batch in eval_dl:
    on_eval_step_start()
    batch_output = calculate_eval_batch_loss(batch)
    on_eval_step_end(batch, batch_output)
eval_epoch_end()
on_eval_epoch_end()

evaluation_run_end()
on_evaluation_run_end()

The best way to understand how the Trainer works internally is by examining the source code for the train() method; significant care has gone into making the internal methods as clean and clear as possible.