9/21/25
Andrej Karpathy has stated, “Most of my ‘coding’ now is in English,” meaning he spends more time writing natural-language instructions for the AI about what to build rather than writing low-level code himself. I think this is nice, but what does this actually mean?
How does it change the way that I do something? How do I actually produce software differently?
Here are some points I've picked up from practice and from reading.
Generating code
Andrej Karpathy writes Python to orchestrate the final solution but lets the AI fill in boilerplate and suggest implementations. John Carmack uses it for refactoring, design, and cleaning up old codebases.
- provide context: what we're building here and what the specific code section is trying to do
- specify the task clearly: "I'm trying to rewrite this function so that it does X" (example prompt below)
- format queries as if you were asking a coworker you admire and respect for help; you don't want to waste their time
- hit enter
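For example, a generation prompt might read something like this (the project and function details here are made up):
"Context: this repo trains a small protein-embedding model in PyTorch. The function below loads FASTA files into batches for training.
Task: I'm trying to rewrite this loader so it streams records instead of reading the whole file into memory. Input: a path to a FASTA file and a batch size. Output: an iterator of (id, sequence) batches. Keep the function signature the same.
Here's the current function: [paste code]"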
Debugging code
Instead of asking something lazy like "hey, can you help me figure out why my code isn't working?", you ask: "Here is the code section that is causing the error, and here is the error. It should do X but it's doing Y. It should output Z. Why is it failing?"*
Just put some effort into asking the question and the result will be better. I know, crazy: it's almost like everything in all of history is just about putting some work in and focusing.
*I guess this applies when you already know what the program should output, so a question this specific might actually stifle the model from generating something new, which is what a lot of deep learning models have shown they can do: produce novelty.
"Fix the context, not the code." - Bret Taylor
- provide context: what we're trying to do in the code we're about to build
- state the goal of the code being added: its inputs and outputs
- show the error: paste the environment details and the terminal output into the chat (full example below)
- hit enter
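Put together, a debugging prompt might read something like this (everything in brackets is a placeholder):
"Context: this script fine-tunes a small classifier; the function below builds the training batches.
Goal: it takes [inputs] and should return [outputs].
Error: running Python [version] on [OS], here's the terminal output: [paste traceback].
It should do X but it's doing Y. Why is it failing?"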
Use models against each other: I've also found that copying and pasting outputs from one model into another model's chat thread is really helpful.
Scaffolding a repo
- create a guidelines.txt, which, yes, does actually produce better results if you write it and improve on it over time. It should cover:
- the goal of the project (this changes every time; hopefully the rest stays relatively similar from build to build) and why it's being built (to provide context on purpose)
- core technologies you want to use (JAX, SBX, Python 3, Triton, PyTorch, transformers, and/or tinygrad, depending on the project)
- versions: understand when the training-data cutoff was for the model you're using, since it hasn't indexed newer changes in the packages you use (for instance, an older version of Node was compromised, but the model doesn't know that)
- coding principles: tell the model to use principles like DRY, encapsulation, and modularity
- tests: [I'm not sure on this one yet]
- logging: what you want logged and why. For building DL models, Cursor likes to output everything; I don't care about everything. Give me input/output shapes passing between functions when first getting it to run, performance and accuracy stats from training epochs, and results expressed visually through graphs when done (there's a small logging sketch after the example file below)
# guidelines.txt
## 1. project goal
- Objective: [Fill in project-specific goal]
- Why: [Explain scientific/engineering purpose]
## 2. core technologies
- Python 3.10+
- JAX / SBX / Triton / PyTorch / Tinygrad
- Transformers
- Bio libs: Biopython, RDKit, Rosetta, ColabFold
## 3. versions
- track model training cutoff vs. package updates
- Pin versions:
python==3.11
jax==0.4.35
torch==2.2.0
triton==2.2.0
transformers==4.44.2
biopython==1.83
rdkit==2024.03.3
## 4. coding principles
- DRY, encapsulation, modularity
- Separate data, model, train, eval, utils
## 5. tests
- unit tests for core functions + hot paths
- smoke tests: FASTA → embedding shape, docking → score range
- regression test with fixed seed
## 6. logging
- always: tensor shapes, training loss/accuracy, GPU/mem usage
- optional: graphs of metrics, embeddings
## 7. project structure
project/
├── guidelines.txt
├── data/
├── notebooks/
├── src/
│ ├── dataloader.py
│ ├── model.py
│ ├── train.py
│ ├── eval.py
│ └── utils.py
├── tests/
├── scripts/
├── requirements.txt
└── README.md
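On the logging point: here's a minimal sketch of the kind of shape logging I mean, assuming PyTorch. The decorator and names are just illustrative, not from any particular library or tool.

import functools
import torch

def log_shapes(fn):
    # print input/output tensor shapes every time fn is called
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        in_shapes = [tuple(a.shape) for a in args if isinstance(a, torch.Tensor)]
        out = fn(*args, **kwargs)
        outs = out if isinstance(out, tuple) else (out,)
        out_shapes = [tuple(o.shape) for o in outs if isinstance(o, torch.Tensor)]
        print(f"{fn.__name__}: in={in_shapes} out={out_shapes}")
        return out
    return wrapper

@log_shapes
def embed(batch):
    # stand-in for a real forward pass
    return torch.nn.Linear(128, 64)(batch)

embed(torch.randn(32, 128))  # prints: embed: in=[(32, 128)] out=[(32, 64)]

Asking for something like this in guidelines.txt keeps the model from dumping every intermediate value while still surfacing the shapes I actually care about.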
Setting evals
TBD: I'm still getting better at this; it's more art than science. But someone who's good at writing evals would also write really good environment/state/action definitions in RL. There's an article below that I haven't read yet but that looks good.
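As a very rough sketch of where I'd start (just fixed cases plus a check function; nothing here is from the article below, and the names are made up):

# a minimal eval: fixed cases plus a pass/fail check
cases = [
    {"prompt": "Reverse the string 'abc'", "expect": "cba"},
    {"prompt": "What is 2 + 3?", "expect": "5"},
]

def run_eval(generate, cases):
    # generate: any callable mapping a prompt string to an answer string
    passed = sum(1 for c in cases if c["expect"] in generate(c["prompt"]))
    return passed / len(cases)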
Read later
add trial 1 later: Michelangelo van Dam, "Personal experience with JetBrains Junie: day 1" (in2it)
post about evals: https://hamel.dev/blog/posts/evals/
go through this guy's repos: https://github.com/r2d4
and his blog post on working with LLMs: https://blog.matt-rickard.com/p/literate-programming-with-llms