I'd been dictating into AI tools for months before I noticed how much noise I was actually sending them.
Filler words, false starts, stuttered repetitions, half-formed thoughts — all of it pasted straight into the context window, expecting the model to parse signal from mess. It worked, mostly. But it's a sloppy input habit, and when you're trying to craft better prompts, starting with clean input matters.
The obvious fix is to talk to your computer instead of typing at it. The less obvious problem is that talking to your computer in an open-plan office makes you the weird one; doing it at home gets you strange looks from your wife and kids. Voice input is genuinely faster than typing — I feel limited by my fingers most of the time — but normalising it is still a work in progress.
I wanted to fix the pipeline at the source: clean up the transcription noise without losing the intent. One session, about three cents in API credits, and a macOS app called Sotto later, I had something working.
Models: Claude Haiku 4.5
Tools: Sotto, Anthropic Console
Platform: macOS
Complexity: Simple | Build time: <1 hour
The problem
I use Sotto for voice capture on macOS — speak, and it transcribes and pastes into whatever window has focus. Fast, frictionless. But raw voice transcription is messy by nature.
A sentence I'd think as…
Compare the error handling patterns in these two approaches and recommend which scales better.
would arrive as…
Okay so um I need to like compare the error handling in these two uh approaches and like which one scales better basically.
That's fine for casual notes. It's less fine when the transcript is a prompt destined for an LLM, where every word is a signal the model will try to interpret.
Two separate problems emerged:
- Cleanup — raw transcripts needed automatic scrubbing: filler words, stutters, false starts, grammar
- Prompt shaping — voice-dictated prompts needed optional restructuring into something an LLM could execute cleanly
These are contradictory tasks. Cleanup preserves your voice and intent; prompt shaping restructures it. Mixing them in a single pass would either over-edit transcripts or under-edit prompts.
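That separation can be expressed as a two-pass pipeline: cleanup always runs, shaping only when explicitly toggled. A minimal sketch of the idea (the function names are illustrative, not Sotto's actual API):

```python
from typing import Callable, Optional

def process(transcript: str,
            cleanup: Callable[[str], str],
            shape: Optional[Callable[[str], str]] = None) -> str:
    """Pass 1 (cleanup) always runs; pass 2 (prompt shaping) only when toggled on."""
    text = cleanup(transcript)
    return shape(text) if shape else text

# Stand-in cleanup pass for illustration:
strip_um = lambda t: t.replace("um ", "")
print(process("um compare these um approaches", strip_um))  # "compare these approaches"
```

Keeping the shaping pass optional is what prevents the over-editing problem: a transcript destined for casual notes never sees the restructuring step.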
The approach
Always-on cleanup rules
Sotto supports 'always-on rules' — instructions that run on every transcription via an API-connected model. I set up Claude Haiku 4.5 through the Anthropic API (entirely separate from a Claude Pro subscription; you need a Console account at console.anthropic.com).
Haiku 4.5 costs $1.00 per million input tokens. For short voice transcripts, that's roughly $0.0001 per capture — effectively free.
I configured five rules that fire on every transcription. Four are Sotto's out-of-the-box defaults, which I enhanced with tighter scoping and grounding instructions; the fifth is a custom addition:
- Fix Grammar & Spelling — catches transcription errors (enhanced)
- Remove Filler Words — strips 'um', 'uh', 'like', 'you know', etc. (enhanced)
- Smart Punctuation — adds sentence structure and capitalisation (enhanced)
- Be Concise — removes redundancy without rewriting (enhanced)
- Remove Stutters — cleans repeated words and false starts (custom)

Each rule needed careful scoping. 'Be Concise', for instance, originally had enough latitude that it hallucinated a model version number — turning 'Claude Haiku 4.5' into 'Claude Haiku 3.5'. I added an explicit constraint: 'Do not rewrite, reorder, or alter proper nouns, names, numbers, or technical terms'.
Two modes of prompt shaping
For the prompt restructuring problem, I leaned on some prompt engineering notes I'd been collecting. A framework by Nate Jones breaks prompting into four disciplines — prompt craft, context engineering, intent engineering, and specification engineering — but it assumes you already know what you want. It optimises for execution mode.
Voice-dictated prompts are often exploratory. You're thinking out loud, not issuing instructions. That's a fundamentally different mode.
I landed on two separate AI functions (manual toggles in Sotto, not always-on). Both are deliberately light-touch; the goal isn't heavy polish or letting the model reshape my thinking. It's cleaning up enough that the intent comes through clearly — making voice input easier for an LLM to ingest without changing what I actually meant.
Prompt Polish — for when you know what you want. Lightly restructures a messy voice prompt into a self-contained problem statement, surfacing any implied context and removing noise. The intent stays exactly as dictated; the structure just gets tidied.
Prompt Explore — for when you're thinking out loud. Structures the rough thought, then assesses whether it has meaningful gaps that would cause an LLM to guess. If so, it surfaces 2–3 targeted clarifying questions. If not, it just returns the cleaned version. No questions for the sake of questions.

The prompts
Here are the final prompts for all seven functions. Each cleanup rule includes a guard condition for Sotto's empty transcript bug and a grounding instruction to prevent the model from treating the input as conversation.
Cleanup rules (always-on)
Fix Grammar & Spelling (enhanced)
If the input text is empty, null, or missing, return nothing. Do not respond, acknowledge, or explain. Only process text that is explicitly provided.
Process the following transcribed text. Fix any grammar, spelling, and punctuation errors. Return only the corrected text — no commentary, no acknowledgement.
Remove Filler Words (enhanced)
If the input text is empty, null, or missing, return nothing. Do not respond, acknowledge, or explain. Only process text that is explicitly provided.
Process the following transcribed text. Remove filler words such as 'um', 'uh', 'like', 'you know', 'basically', 'actually', 'literally', and similar verbal fillers. Return only the corrected text — no commentary, no acknowledgement.
Smart Punctuation (enhanced)
If the input text is empty, null, or missing, return nothing. Do not respond, acknowledge, or explain. Only process text that is explicitly provided.
Process the following transcribed text. Add proper punctuation and sentence structure. Capitalise the first letter of each sentence. Return only the corrected text — no commentary, no acknowledgement.
Be Concise (enhanced)
If the input text is empty, null, or missing, return nothing. Do not respond, acknowledge, or explain. Only process text that is explicitly provided.
Process the following transcribed text. Remove unnecessary words, redundant phrases, and repeated content. Do not rewrite, reorder, or alter proper nouns, names, numbers, or technical terms. Return only the corrected text — no commentary, no acknowledgement.
Remove Stutters (custom)
If the input text is empty, null, or missing, return nothing. Do not respond, acknowledge, or explain. Only process text that is explicitly provided.
Process the following transcribed text. Remove stutters, repeated words, and false starts. Do not change meaning, vocabulary, or sentence structure beyond what is necessary for clarity. Return only the corrected text — no commentary, no acknowledgement.
Prompt-shaping functions (manual)
Prompt Polish
Process the following voice-dictated prompt. Transform it into a self-contained problem statement that an LLM could execute without asking any clarifying questions.
Apply these steps silently:
1. Identify the core task or question
2. Surface any implied context and make it explicit
3. Add a clear definition of what "done" looks like if it can be inferred
4. Remove filler, false starts, and repetition
5. Preserve the original intent exactly — do not add information that wasn't present
Return only the rewritten prompt. No preamble, no explanation, no commentary.
Prompt Explore
Process the following voice-dictated thought, question, or rough idea.
First, structure it clearly — remove filler, false starts, and repetition while preserving the intent exactly.
Then assess: does this prompt have meaningful gaps that would prevent a useful response — missing context, unclear scope, or ambiguous intent?
If yes: add 2–3 targeted clarifying questions below the structured prompt. Focus on gaps that would cause an LLM to guess or go in the wrong direction.
If no: return only the structured prompt. No questions needed.
Output format: structured prompt first, then clarifying questions if any. No preamble or commentary.
What broke
The transcription model trap
Sotto ships with several local transcription models. I'd been using Parakeet v2 English by NVIDIA (2.6 GB) — it's incredibly fast for on-device voice transcription, and the English-only variant keeps the model small. When I set up the Haiku integration, I also switched the transcription model to WhisperKit Turbo v3 by Apple, thinking a newer model might give better results.

Everything slowed to a crawl. I assumed Haiku's API latency was the bottleneck — reasonable hypothesis, since it was the new variable. It wasn't. Turbo v3 itself was the slow step; the transcription, not the cleanup, was the problem. Once I switched back to Parakeet v2, the pipeline was near-instant again. The accuracy difference between the two models was negligible for my use case; Parakeet v2 with Haiku cleanup downstream handles transcription errors just fine. If you're using Sotto and can get by with English-only transcription, Parakeet v2 is the one to use.
The grounding instruction problem
The most instructive failure: after configuring the rules, Sotto's first output wasn't a cleaned transcript. It was this:
'I'm ready to process text according to your rules. Please provide the text you'd like me to process.'
The model was treating the rules as a conversation, not as processing instructions. There was no grounding instruction telling it to act as a text processor. Every rule needed bookending: 'Process the following transcribed text.' at the start, and 'Return only the corrected text — no commentary, no acknowledgement.' at the end.
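In code terms, the fix amounts to wrapping each bare rule instruction between a grounding prefix and an output constraint before it ever reaches the model. A minimal sketch of that bookending (the helper name and structure are mine, not Sotto's):

```python
# Guard + grounding bookends applied to every cleanup rule before sending.
GUARD = ("If the input text is empty, null, or missing, return nothing. "
         "Do not respond, acknowledge, or explain. "
         "Only process text that is explicitly provided.")
PREFIX = "Process the following transcribed text."
SUFFIX = "Return only the corrected text — no commentary, no acknowledgement."

def bookend_rule(instruction: str) -> str:
    """Wrap a bare rule so the model treats input as text to process, not conversation."""
    return f"{GUARD}\n{PREFIX} {instruction} {SUFFIX}"

rule = bookend_rule("Remove stutters, repeated words, and false starts.")
```

The prefix reframes whatever follows as material to transform; the suffix suppresses the model's instinct to acknowledge or explain.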
A second instance was worse. I dictated a question:
'Can you tell me what the quantified impact section should contain?'
And the AI responded:
'This is a question/request, not text to process. According to the instructions, I should not answer questions or have conversations.'
Technically correct, practically useless. The grounding instruction helped but didn't fully resolve the edge case of question-shaped input.
Other issues:
- Transcription mangling proper nouns — 'Sotto' consistently transcribed as 'Soto' or 'SOTA'. Parakeet v2 (the transcription model) ignores dictionary entries, so this can only be caught downstream by the AI rules
- Empty transcript bug — Sotto occasionally fires a zero-length capture, still sends it to the API, and the model generates a response to nothing. I added a guard condition ('If the input text is empty, return nothing') but the root fix requires a Sotto update
- Be Concise hallucination — the version number swap mentioned earlier. Tightened the rule; monitoring ongoing
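Until Sotto fixes the empty-capture bug upstream, the same guard could in principle be enforced client-side, before any tokens are spent. A hypothetical sketch (Sotto's internals aren't exposed, so this is illustrative only):

```python
from typing import Callable

def clean_transcript(transcript: str, api_call: Callable[[str], str]) -> str:
    """Skip the cleanup API entirely for empty or whitespace-only captures."""
    if not transcript or not transcript.strip():
        return ""  # nothing to clean; don't spend a request on it
    return api_call(transcript)

# api_call would wrap the actual model request; any str -> str callable works here.
```

The prompt-level guard then becomes a second line of defence rather than the only one.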
Results and reflection
The pipeline works. Voice in, clean text out, roughly two seconds end-to-end with Parakeet v2 transcription and Haiku 4.5 cleanup. Four days of moderate usage has cost US$0.10 — about 55,000 tokens in and 8,800 out. At that rate I'll hit a dollar of API spend sometime next year.
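Those figures square with Haiku 4.5's published input rate; the output rate of $5.00 per million tokens is my assumption, so check current pricing:

```python
# Sanity-check the four-day spend against assumed Haiku 4.5 rates.
input_cost = 55_000 * 1.00 / 1_000_000   # $1.00/M input tokens -> $0.055
output_cost = 8_800 * 5.00 / 1_000_000   # assumed $5.00/M output tokens -> $0.044
total = input_cost + output_cost
print(f"${total:.2f}")  # prints $0.10
```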
The real takeaway wasn't the cleanup setup — that's plumbing. It was the execution-versus-exploration distinction in prompt shaping. Most prompt engineering advice assumes you know your goal and just need to express it better. But a lot of the prompting I do — especially by voice — is thinking out loud, not issuing orders. Building two separate functions for those two modes felt obvious in hindsight; I wouldn't have landed there without trying to force a single function to do both.
What's next
The cleanup pipeline is solid; the prompt-shaping modes are the part that still needs real-world testing. Prompt Polish and Prompt Explore work in principle, but I haven't put them through their paces with enough varied input to know where they break. I also want to test whether the grounding instruction fix holds long-term, or whether question-shaped transcripts keep tripping the model up.
And the whole setup will need revisiting when the next model ships — so, probably next month. Prompting as a discipline has changed drastically in the last six months alone; the rules and prompt-shaping functions I've built here are tuned for how these models behave right now. That'll shift. The framework should hold, but the specifics won't stay still for long.
Sources
- Nate Jones, 'Prompting just split into 4 different skills. You're probably practicing 1 of them' — the four-discipline framework referenced in this article
- Sotto — macOS voice capture and local transcription app by Kitze
