Real-life experiences using LLMs for programming in 2025
Strong areas
LLMs are effective as...
Autocomplete assistants and tools:
Fast, inline suggestions work best.
Aiming for small context windows reduces errors and hallucinations.
Low risk, mid-high reward: suggestions are easy to accept, edit, or ignore.
"Parental chill mode":
A restricted, tightly scoped, limited task executor with easily validated output.
Thinking-based tasks with short-to-medium execution time.
Medium risk, mid-high reward: investing in instructions may waste time, but shaping them as useful docs or non-AI tasks reduces that loss.
Weak areas
Fact checkers
Avoid using LLMs to validate assumptions that are hard to verify yourself and carry critical weight, such as “is this secure?” or “how does this really work?”.
Models lack higher-level goals:
Each prompt (each initiation of inference) is a local optimisation that ignores progress over time. The original objectives are reset on every prompt, which leads to useless paths of pseudo-improvements and rabbit holes.
Each prompt, even within the same context window, is treated as fresh, equal ground on which to please the user.
Current models and tools lack the capability to account for and prioritise real-world constraints like time, budget, or quality.
Finding the optimal context window for the task is tricky.
Model performance deteriorates even when the context window has plenty of space left.
As the complexity of the input grows, the accuracy of the results gets worse. This leads to breaking changes in code or in functionality. The problem feeds on itself.
Multiple undos in a row are a sign to pivot. Eager manual refinement is the best way to avoid sinking time into correcting the model with additional prompts.
In my experience, the most effective prompting happens with roughly under 1,000 lines of code in the same file. That is a magic number I came up with, just a ballpark figure.
Language models struggle with template literals and escaped markdown, which is revealing when considering the "intelligence" attributed to the technology.
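To make this concrete, here is a small TypeScript illustration of my own (the function and its contents are hypothetical): the escaped backticks inside the template literal are exactly the kind of detail that tends to get dropped or doubled when the model regenerates the code.

```typescript
// Hypothetical example: a template literal that embeds a fenced markdown block.
// The escaped backticks (\`) are the fragile part; regenerated code often loses
// or doubles the escapes, breaking either the string or the produced markdown.
function renderSnippet(lang: string, code: string): string {
  return `Here is the example:

\`\`\`${lang}
${code}
\`\`\`

Call it inline with \`${lang}\`.`;
}

console.log(renderSnippet("ts", "const x = 1;"));
```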
Tools can generate extra work when they try to be clever with prompt injection, e.g. enriching the user input with additional guidance. For instance, Codex addresses // TODO: comments in the code with a hard-coded interpretation that they are implementation tasks, while some of them are only reminders to write more comments. This leads to unwanted code and refactoring being generated where simple comments were expected.
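A hedged sketch of the TODO problem (a hypothetical file, not from any real project): both comments share the // TODO: prefix, but only one of them is meant to become code.

```typescript
// Hypothetical file with two TODO comments that have the same prefix but different intent.

// TODO: retry on HTTP 429 with exponential backoff  <- a genuine implementation task
export async function fetchReport(url: string): Promise<string> {
  const res = await fetch(url);
  return res.text();
}

// TODO: document the retention policy in more detail  <- only a reminder to write prose;
// a tool that treats every TODO as an implementation task generates unwanted code here.
export const RETENTION_DAYS = 30;
```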
LLMs have different "perceived values" than the user. For instance, a model is perfectly happy to use deprecated functions even when better modern alternatives exist. Addressing these at the system-prompt level feels wasteful, because you would need to negate the unknowns one by one.
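As a concrete illustration (my own example, assuming a JavaScript/TypeScript codebase): String.prototype.substr is deprecated, yet without explicit guidance generated code still reaches for it.

```typescript
// Hypothetical illustration of the "perceived values" gap.
const id = "user-12345";

// What a model often produces: works, but relies on a deprecated API.
const legacy = id.substr(5, 5);   // String.prototype.substr is deprecated

// What a reviewer would prefer: the modern equivalent.
const modern = id.slice(5, 10);

console.log(legacy === modern);   // true
```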
Starting point
Strong guardrails increase speed (parental chill mode).
A manually configured project setup with pre-defined scripts for building and testing is better than asking the model to do it.
A strict setup prevents a whole category of potential mistakes later.
Explicit request (see the sketch after this list):
Do not try to run shell scripts.
Do not try to install packages.
Focus only on the code itself.
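A minimal sketch of how these guardrails can be written down, assuming a Node project with pre-defined npm scripts and a repository instructions file such as .github/copilot-instructions.md (AGENTS.md or CLAUDE.md serve the same purpose for other tools). The wording and script names are illustrative, not a recommended standard:

```markdown
<!-- .github/copilot-instructions.md (illustrative) -->
- Do not run shell scripts. When verification is needed, use only the pre-defined
  `npm run build` and `npm run test` scripts.
- Do not install, add, or update packages.
- Focus only on the code itself; do not modify project configuration or CI files.
```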
Workflow
Splitting the workflow across different models and stages makes it easier to manage:
"Spec-driven" -approach can be followed without "formal" approaches like Github Speck Kit or Amazon's Kiro.
Contrary to what is commonly suggested: keeping everything in the same file for as long as possible improves the model's overall performance.
No premature splitting of code: it does not make sense to make conceptual splits while the idea remains vague during the rapid cycles.
Early-split logic tends to leave noise behind: unused files, partially implemented utilities, hard-coded values in duplicated places.
Language models are bad at rethinking splits and code organisation. The quality of refactoring is often very low without strong guidance.
I suspect the root cause is that language model inference is stateless by nature. This leads to a situation where previously LLM-generated content is treated as user-generated. The roles get mixed up, which in turn places the wrong emphasis on the characteristics of the content. This is sometimes visible when a response starts with "Let me see your code first...", even though the most recent change came from the LLM itself. The result would be conceptually different if the weight of the intention were placed not on the user but on the model's own, possibly incorrect, reasoning. I believe this connects to the "please the user" problem.
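A minimal TypeScript sketch of the mechanism I suspect (hypothetical tool code, not how any specific product works): each request is rebuilt from scratch, and the current file contents, including the model's own earlier edits, arrive inside a fresh user message, so the model reads its previous output as user-provided input.

```typescript
// Hypothetical sketch: how a coding tool might rebuild a stateless request.
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string }

function buildRequest(instructions: string, currentFile: string, task: string): Message[] {
  return [
    { role: "system", content: instructions },
    // currentFile already contains edits the model made in earlier turns,
    // but it arrives wrapped in a "user" message, so authorship is lost.
    { role: "user", content: `Current file:\n${currentFile}\n\nTask: ${task}` },
  ];
}
```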
Example flow
In ChatGPT, iterate and refine the goal and high-level needs. Ask for outputs in a format another LLM can digest. Explicitly prevent it from starting to generate code. Allow analysis of the existing codebase only as a reference. Ensure prompts are stateless and portable so later models need no shared context.
In GitHub Copilot, use that output as input. Push the model to clarify vague parts. Repeat “ask more questions” several times; models tend to default to pleasing instead of probing.
Ask Copilot for an executable, prioritized task list. Review the task list to understand what is about to happen and select the tasks you'll pursue. Save a copy of the task list.
Use a single task as the seed prompt for each new chat.
Expect the task list to go stale quickly. Trim auxiliary files and bloat aggressively to keep momentum and clarity. Commit relevant progress. Eagerly drop the task list if no progress is made.
Model selection
Models and providers differ, but not that much in the big picture
In terms of quality of output: Claude Sonnet 4 (and now 4.5) is the best model. Everything else is worse in different ways.
"GPT-5 Codex" responses has the best structure: concise, clear and no bloat.
Claude (especially the CLI) has the best interaction for programming.