Fine-tune GPT (without LoRA) on some fine-tuning dataset

We already have the GPT-2 model, and we already have the classification head.

@laewen found a training loop in the Equinox documentation; maybe I can take inspiration from that: Examples → Advanced → BERT language model

Here we should apply the usual optimizations (e.g. @jax.jit).
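A rough sketch of what a jitted fine-tuning step could look like, assuming plain JAX (not the Equinox wrappers from the BERT example). The GPT-2 backbone, dataset, head shape, and learning rate below are all placeholders, not the real setup:

```python
# Hedged sketch: a jit-compiled training step for a classification head.
# The GPT-2 backbone is stood in by fixed toy "features"; all shapes,
# data, and hyperparameters are placeholders.
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Softmax cross-entropy over logits from a linear classification head.
    logits = x @ params["w"] + params["b"]
    logp = jax.nn.log_softmax(logits)
    return -jnp.mean(jnp.take_along_axis(logp, y[:, None], axis=1))

@jax.jit  # the optimization mentioned above: compile the whole step
def train_step(params, x, y, lr=1e-2):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Plain SGD update; an optimizer library would replace this.
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 4))        # toy stand-in for backbone features
y = jnp.array([0, 1, 0, 1, 0, 1, 0, 1])   # toy labels
params = {"w": jnp.zeros((4, 2)), "b": jnp.zeros(2)}

for _ in range(50):
    params, loss = train_step(params, x, y)
```

For the real thing, the loss would take the GPT-2 forward pass (backbone plus head) instead of the linear layer, and the loop would iterate over dataset batches.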

Edited by Jakob Moser