Generative Pretrained Transformer

Mihai-Cristian Farcaș / 11 February 2025

GPT-like model implementation built from scratch using PyTorch.


My PyTorch implementation of a GPT-like language model with text preprocessing utilities

Overview

This project implements a transformer-based language model similar to GPT, designed for character-level text generation. It includes utilities for vocabulary generation and dataset splitting.

In this example, I trained and tested it on the fabulous book The Brothers Karamazov, downloaded from Project Gutenberg. Feel free to swap in a different text file, or even try training it on an established dataset (OpenWebText, for example), though on larger datasets vocab.py and split.py might not work properly.

Features

  • Character-level language modeling
  • Multi-head self-attention mechanism
  • Memory-efficient data loading using memory mapping (see the sketch after this list)
  • Text preprocessing utilities
  • Configurable model architecture
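
To give an idea of what the memory-mapped loading looks like, here is a quick sketch; the file name train_split.txt, the hyperparameter values, and the use of raw bytes as token ids are placeholders of mine, not necessarily what GPT.ipynb does:

import random
import numpy as np
import torch

block_size = 128   # context length (illustrative value)
batch_size = 32

def get_batch(path="train_split.txt"):
    # np.memmap keeps the file on disk; only the slices we index are read into RAM.
    data = np.memmap(path, dtype=np.uint8, mode="r")
    ix = [random.randint(0, len(data) - block_size - 2) for _ in range(batch_size)]
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y

Each call returns a batch of inputs x and the same sequences shifted by one position as the targets y.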

Requirements

  • Python 3.9+
  • PyTorch
  • Jupyter Notebooks
  • CUDA (optional, for GPU acceleration on Windows)
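
To cover these requirements quickly, something along these lines should work (standard PyPI package names; for a CUDA-enabled PyTorch build, follow the official install selector on pytorch.org instead of the plain pip install):

python3 -m venv venv
source venv/bin/activate
pip install torch notebook ipykernel

On Windows, activate the environment with venv\Scripts\activate instead.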

Project Structure

  • vocab.py - Generates vocabulary from input text
  • split.py - Splits text data into training and validation sets
  • GPT.ipynb - Main model implementation and training

Usage

1. Initialization (details on my GitHub)

2. Prepare Your Data

  • First, add your desired data file and generate the vocabulary from your text:
python3 vocab.py
  • Then, split your data into training and validation sets:
python3 split.py
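
Roughly speaking, the two scripts boil down to something like this sketch; the file names and the 90/10 split ratio here are assumptions of mine, so check the actual scripts in the repo for the exact behaviour:

# vocab.py-style step: collect every distinct character in the corpus.
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
vocab = sorted(set(text))
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("".join(vocab))

# split.py-style step: carve the text into training and validation files.
n = int(0.9 * len(text))
with open("train_split.txt", "w", encoding="utf-8") as f:
    f.write(text[:n])
with open("val_split.txt", "w", encoding="utf-8") as f:
    f.write(text[n:])

After running both, you should end up with a vocabulary file plus separate training and validation text files.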

3. Train the Model

  • Install a new kernel to use in your Jupyter Notebook:
python3 -m ipykernel install --user --name=venv --display-name "GPTKernel"
  • Run Jupyter Notebook:
jupyter notebook
  • Open GPT.ipynb.

  • Select GPTKernel and run the cells sequentially.

The notebook contains:

  • Model architecture implementation
  • Training loop
  • Text generation functionality
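
For the last two items, here is roughly the shape they take. This sketch assumes a model whose forward pass returns (logits, loss) and accepts an optional targets argument, like the architecture sketch further down; the hyperparameter values are placeholders:

import torch

def train(model, get_batch, max_iters=3000, lr=3e-4, eval_interval=250, device="cpu"):
    # Minimal loop: sample a batch, compute the loss, backpropagate, update.
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(max_iters):
        xb, yb = get_batch()
        xb, yb = xb.to(device), yb.to(device)
        _, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        if step % eval_interval == 0:
            print(f"step {step}: train loss {loss.item():.4f}")

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=128):
    # Autoregressive sampling: feed the running sequence back in, one character at a time.
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits, _ = model(idx_cond)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

Pass device="cuda" to train if you have a GPU available.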

Model Architecture

The model implements a transformer architecture with:

  • Multi-head self-attention
  • Position embeddings
  • Layer normalization
  • Feed-forward networks
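
To make that list concrete, here is a minimal sketch of how those pieces can fit together in PyTorch. The layer sizes, dropout, and the use of nn.MultiheadAttention are placeholder choices of mine, not necessarily what GPT.ipynb does, so check the notebook for the real configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    # One pre-norm transformer block: multi-head self-attention, then a feed-forward network.
    def __init__(self, n_embd=384, n_head=6, block_size=128, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # Causal mask: True above the diagonal means "may not attend to future positions".
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.mask[:T, :T], need_weights=False)
        x = x + attn_out                      # residual connection around attention
        x = x + self.ffwd(self.ln2(x))        # residual connection around the feed-forward net
        return x

class MiniGPT(nn.Module):
    # Token + position embeddings, a stack of blocks, a final layer norm, and a vocabulary head.
    def __init__(self, vocab_size, n_embd=384, n_head=6, n_layer=6, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, block_size) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)     # (B, T, n_embd)
        x = self.ln_f(self.blocks(x))
        logits = self.lm_head(x)                      # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

A model built this way (for example MiniGPT(vocab_size=len(vocab))) is what the training and generation sketches above expect.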

You can find the source code on my GitHub.

Thanks for reading! Cheers! 🍻

P.S. If you enjoyed this post, consider giving me some feedback and subscribing to my newsletter for more insights and updates. You can do so from my contact page. 🚀