Resources
Some resources I’ve found interesting or useful (all are free)
- Eugene Yan: List of Applied ML Papers/tech blogs on ML in production
- Eugene Yan: Language Modeling Reading List (to Start Your Paper Club)
- How Prototyping Can Help You to Get Buy-In
- Chip Huyen: Building LLM applications for production
- Chip Huyen: Llama Police – A dashboard of Open Source LLM Tools
- Eugene Yan’s List of Open-source LLMs
- Erik Linder-Norén: Machine Learning From Scratch: Python implementations of fundamental algorithms from scratch
- Stas Bekman - Machine Learning Engineering
- Excerpt:
Tell how many GPUs you need in 5 secs
- Training in mixed half-precision: model_size_in_B * 18 * 1.25 / gpu_size_in_GB
- Inference in half precision: model_size_in_B * 2 * 1.25 / gpu_size_in_GB
That’s the minimum; you’ll need more for a bigger batch size and a longer sequence length. Here is the breakdown:
- Training: 8 bytes for AdamW states, 4 bytes for grads, 4+2 bytes for weights
- Inference: 2 bytes for weights (1 byte if you use quantization)
- 1.25 is 25% for activations (very very approximate)
For example: Let’s take an 80B param model and 80GB GPUs and calculate how many of them we will need for:
- Training: at least 23 GPUs (80 * 18 * 1.25 / 80 = 22.5)
- Inference: at least 3 GPUs (80 * 2 * 1.25 / 80 = 2.5)
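The rule of thumb above is easy to turn into a tiny Python helper (the function name and signature are my own, not from the book):

```python
import math

def gpus_needed(model_size_in_B, gpu_size_in_GB, bytes_per_param, activation_overhead=1.25):
    """Minimum GPU count: params (billions) * bytes/param * activation overhead / GPU memory (GB)."""
    return math.ceil(model_size_in_B * bytes_per_param * activation_overhead / gpu_size_in_GB)

# 80B-param model on 80GB GPUs
print(gpus_needed(80, 80, 18))  # training (mixed half-precision, AdamW): 23
print(gpus_needed(80, 80, 2))   # inference (half precision): 3
```

Pass `bytes_per_param=1` for quantized inference, per the breakdown above.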
- Excerpt: