
feat: add llm funbox with in-browser GPT-2 word generation (@hunterpaulson)#7771

Closed
hunterpaulson wants to merge 2 commits into monkeytypegame:master from hunterpaulson:feat/llm-funbox

Conversation

@hunterpaulson

@hunterpaulson hunterpaulson commented Apr 5, 2026

demo

llm-funbox.mov

sequences seem best in the english 5k and 10k wordsets, since the llm has many more words to choose from.

intro

this is a draft pr. I know this is a lot to review and I can split it up into multiple prs if desired.

I tried to make the best decisions I could to support this funbox in a maintainable way, but I am happy to iterate until this is up to your standards.

motivation

monkeytype’s default word mode samples uniformly from a fixed wordset. that is great for raw speed practice, but it does not reflect the word-to-word transitions people type in normal text. this draft adds an llm funbox that replaces uniform random sampling with sampling (constrained to the selected wordset) from GPT-2 running in the browser, producing more plausible word sequences without turning the mode into arbitrary freeform text.

I tried to tune this so it lands somewhere in the middle of the spectrum between uniformly random sampling and quote mode. quote mode already covers fixed realistic text. my aim was to keep the randomness of word mode while making sequences of words statistically probable. I am happy to tune it more based on feedback.

Related discussion: #5188

what this PR adds

A new llm funbox that runs GPT-2 in the browser to sample words from the wordset.

since LLMs generate tokens instead of words, we need to constrain the logits so we only sample tokens that can lead to valid words in the wordset. to determine which tokens are valid, I use a lazily materialized state machine that tracks the current state of the input sequence and the valid tokens for each state.
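to illustrate the constraint idea (this is a simplified sketch with hypothetical names, not the PR's actual code): a trie over the wordset answers, for any generated character prefix, whether a candidate token can still extend to a valid word. the real implementation materializes these states lazily per token id; a character walk is enough to show the mechanics.

```typescript
type TrieNode = { children: Map<string, TrieNode>; isWord: boolean };

// Build a character trie over the wordset once, up front.
function buildTrie(wordset: string[]): TrieNode {
  const root: TrieNode = { children: new Map(), isWord: false };
  for (const word of wordset) {
    let node = root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch)!;
    }
    node.isWord = true;
  }
  return root;
}

// A token is valid for the current prefix if walking its characters
// stays inside the trie, i.e. some wordset word can still be completed.
// In the real funbox this check masks the logits before sampling.
function isTokenValid(root: TrieNode, prefix: string, token: string): boolean {
  let node: TrieNode = root;
  for (const ch of prefix + token) {
    const next = node.children.get(ch);
    if (next === undefined) return false;
    node = next;
  }
  return true;
}
```

tokens that fail the check get their logits masked out, so sampling can only ever produce words from the wordset.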

core monkeytype changes required

streaming word generation

test startup now supports a very small initial buffer plus background refill.

normal wordsets generate 100 words up front (the full buffer). The LLM wordset returns initialWords: 1, just the seed word. This lets the test start immediately instead of waiting for 100 LLM-generated words.

the "it's a feature not a bug" side effect of this is that it allows users to try to 'race the llm' and see if they can type faster than the llm can generate words. but I tried to make generation fast enough that this should be impossible (see video above).

wordset can now return words asynchronously

the LLM generator can't guarantee a word is ready synchronously: the model takes ~50ms per token and words take multiple tokens. rather than duplicating the word selection logic into separate sync/async paths, randomWordAsync() was added to the base Wordset class. the base implementation wraps the existing sync randomWord(), so non-LLM wordsets are unchanged.
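a minimal sketch of that base-class change (simplified from the description above; the real Wordset class has more members, and the LLM override here is a hypothetical stand-in that just simulates waiting on the model):

```typescript
class Wordset {
  constructor(public words: string[]) {}

  randomWord(): string {
    // existing synchronous path, unchanged
    return this.words[Math.floor(Math.random() * this.words.length)];
  }

  async randomWordAsync(): Promise<string> {
    // base implementation: wrap the sync method, so non-LLM
    // wordsets resolve immediately and behave exactly as before
    return this.randomWord();
  }
}

class LLMWordset extends Wordset {
  override async randomWordAsync(): Promise<string> {
    // stand-in for "wait until the model finishes a word"; the real
    // implementation pulls from the GPT-2 generation stream instead
    await new Promise((resolve) => setTimeout(resolve, 50));
    return this.words[0];
  }
}
```

callers only ever use the async method, so the sync/async split stays invisible outside the Wordset hierarchy.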

this required three supporting changes:

  • lifecycle hooks (dispose(), getInitialWordCount(), getStreamingBufferTarget()) so the test engine can manage async generators that hold GPU resources
  • background streaming with cancellation so the test starts with 1 word and fills the buffer while the user types, stopping cleanly on restart/finish
  • serialized word addition so concurrent addWord() calls don't race the DOM or test state

the existing addWord() -> getNextWord() -> TestUI.addWord() pipeline is preserved. the async orchestration wraps it.
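the serialization piece can be sketched like this (hypothetical names): every call is appended to a promise chain, so words reach test state and the DOM in generation order even when several are requested concurrently.

```typescript
let addWordChain: Promise<void> = Promise.resolve();

function addWordSerialized(
  getWord: () => Promise<string>,
  apply: (word: string) => void
): Promise<void> {
  // chain onto the previous call so apply() runs strictly in order;
  // apply stands in for the existing getNextWord() -> TestUI.addWord() pipeline
  addWordChain = addWordChain.then(async () => {
    const word = await getWord();
    apply(word);
  });
  return addWordChain;
}
```

a promise chain is the lightest way to get this ordering without a full queue abstraction; a rejected link would need a `.catch` in the real code so one failed word doesn't stall the chain.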

extracted full page loading screen controller

llm funbox init requires downloading the model weights, building the state machine for the wordset, initializing the WebGPU runtime, and starting generation.

This takes a couple seconds so I reused the existing full page loading screen controller that was previously only used on page navigation transitions to show a progress bar and loading text.

see #7697

decisions I made and why

can't get a PB

following the precedent set by zipf funbox. anything that changes the probability distribution of the wordset is ineligible for a PB or leaderboard.

allow repetition of the same word

monkeytype normally avoids repeating the previous 2 words. The LLM funbox skips this check because natural language contains valid repetitions (see below) and re-rolling already generated words would waste model output steps (which makes streaming look slow or even stopped) and break sequence coherence.

I know that that is correct.
They had had enough.
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

tbh I am on the fence about this, since incorrect repetitions are common in wordsets not represented in gpt2's training data. I am looking to hear other opinions on this and could probably be swayed either way.

vendoring WebGPT

WebGPT is the most performant small llm base model runtime I was able to find. However, because it is not a package, I needed to vendor its code so it can be used in monkeytype.

forking the weights

because we do not own the WebGPT project repo, we are susceptible to upstream changes, however unlikely, breaking our implementation. So I forked the weights into my own repo so I can guarantee future compatibility. see the discussion section for questions about this that I want to discuss further.

known issues / limitations

wordsets without common words

the english_25k wordset is still essentially random because it does not include the 10k most common words (#1303), so there are no glue words to string sequences of words together. this also impacts wordsets like english medical and english contractions. without 'transition' words the llm can sometimes fall into a loop of repeating the same word over and over.

languages other than english not in training data

GPT-2 was trained on WebText and the paper says they “deliberately removed non-English webpages” from WebText. I am not fluent in any other languages but from what I can tell this does not generate realistic sequences in languages other than english. note that this impacts programming language wordsets as well.

punctuation and numbers are not supported (yet)

currently this mode does not support punctuation or numbers. I felt the additional complexity wasn't worth it for the initial implementation, but if this is something you want I can look into adding it.

doesn't support words with spaces

the constraint engine uses spaces as word boundaries, so multi-word entries like French "est allé" or German "vor allem" are excluded from generation. this affects a small number of entries in some non-English wordsets. the remaining words still work normally.
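the exclusion rule is simple to state in code (function name is hypothetical): since the constraint engine treats the space character as a word boundary, multi-word entries are dropped before the state machine is built.

```typescript
// Drop wordset entries containing a space, since the constraint
// engine would otherwise see them as two separate words.
function filterSingleWordEntries(wordset: string[]): string[] {
  return wordset.filter((word) => !word.includes(" "));
}
```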

WebGPT license

the WebGPT license is essentially MIT but with an extra novelty clause:

If this software is used for any purpose that is substantially epic, awesome, or
incredible, notice is required to the Author, reachable at will@depue.net.

obviously I think this counts as a 'purpose that is substantially epic, awesome, or incredible' so I will reach out to Will DePue as a courtesy if this funbox is something monkeytype maintainers are interested in adding.

requires WebGPU/browser support

this mode is WebGPU only, so it will not run on unsupported browsers / devices.

the loading screen progress is fake

the progress indicator uses preset timings and doesn't actually track the steps it mentions. this is because I didn't make any changes to the current full-page loading screen controller mentioned above, other than moving it so it's usable for non-navigation loads. the timings I chose are based on benchmarks I ran on my machine.

discussion

where should we host the model weights?

Currently these are hosted in github.com/hunterpaulson/webgpt-gpt2-weights. Should we move them to the monkeytype org? or should we move them to the monkeytype CDN?

how can I make this easier to review?

should I split this into smaller PRs for easier review? if so which ones?

Closes #7697 (cherry-picks its only commit)

@monkeytypegeorge monkeytypegeorge added the frontend (User interface or web stuff) and packages (Changes in local packages) labels Apr 5, 2026
@Miodec
Member

Miodec commented Apr 7, 2026

Sorry, but I think for such a small change (the generated text doesn't seem that different) that only a small percentage of users will see as an option, let alone use, 100k lines added is way too much.

It's generally best to discuss big changes before you commit your time to them.

@Miodec Miodec closed this Apr 7, 2026