feat: add llm funbox with in-browser GPT-2 word generation (@hunterpaulson) #7771
Closed
hunterpaulson wants to merge 2 commits into monkeytypegame:master from
Conversation
Member
Sorry, but I think for such a small change (the generated text doesn't seem that different) that only a small percentage of users will see as an option, let alone use, 100k lines added is way too much. It's generally best to discuss big changes before you commit your time to it.
demo
llm-funbox.mov
sequences seem best in the english 5k and 10k wordsets since the llm has many more words to choose from.
intro
this is a draft PR. I know this is a lot to review, and I can split it up into multiple PRs if desired.
I tried to make the best decisions I could to support this funbox in a maintainable way, but I am happy to iterate until it is up to your standards.
motivation
monkeytype’s default word mode samples uniformly from a fixed wordset. that is great for raw speed practice, but it does not reflect the word-to-word transitions people type in normal text. this draft adds an llm funbox that replaces uniform random sampling with sampling (constrained to the selected wordset) from GPT-2 running in the browser, producing more plausible word sequences without turning the mode into arbitrary freeform text.

I tried to tune this so it lands somewhere in the middle of the spectrum between uniformly random sampling and quote mode. quote mode already covers fixed realistic text. my aim was to keep the randomness of word mode while making sequences of words statistically probable. I am happy to tune it more based on feedback.
Related discussion: #5188
what this PR adds
A new llm funbox that runs GPT-2 in the browser to sample words from the wordset. since LLMs generate tokens instead of words, we need to constrain the logits so that only tokens leading to valid words in the wordset can be sampled. to determine which tokens are valid, I use a lazily materialized state machine that tracks the current state of the input sequence and the set of valid tokens for each state.
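to make the idea concrete, here is a minimal sketch of a lazily materialized constraint state machine plus logit masking. all names here are illustrative rather than the actual identifiers in this PR, and the prefix check is brute-force for clarity (a real implementation would precompute a trie over the wordset):

```typescript
type TokenId = number;

// one state per word prefix; children are only built when first visited
class ConstraintState {
  private children = new Map<TokenId, ConstraintState>();

  constructor(
    private prefix: string,
    private wordset: Set<string>,
    private vocab: Map<TokenId, string> // token id -> token string
  ) {}

  // token ids that keep the partial word on a path to a valid wordset entry
  validTokens(): TokenId[] {
    const valid: TokenId[] = [];
    for (const [id, piece] of this.vocab) {
      const next = this.prefix + piece;
      for (const word of this.wordset) {
        if (word.startsWith(next)) {
          valid.push(id);
          break;
        }
      }
    }
    return valid;
  }

  // advance the state machine, materializing the child state lazily
  step(token: TokenId): ConstraintState {
    let child = this.children.get(token);
    if (child === undefined) {
      const piece = this.vocab.get(token) ?? "";
      child = new ConstraintState(this.prefix + piece, this.wordset, this.vocab);
      this.children.set(token, child);
    }
    return child;
  }
}

// invalid tokens get -Infinity so softmax assigns them zero probability
function maskLogits(logits: Float32Array, valid: TokenId[]): Float32Array {
  const masked = new Float32Array(logits.length).fill(-Infinity);
  for (const id of valid) masked[id] = logits[id];
  return masked;
}
```

sampling then proceeds normally over the masked logits, so the model's relative preferences among valid continuations are preserved.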
core monkeytype changes required
streaming word generation
test startup now supports a very small initial buffer plus background refill.

normal wordsets generate 100 words up front (the full buffer). the LLM wordset returns initialWords: 1, just the seed word. this lets the test start immediately instead of waiting for 100 LLM-generated words.

the "it's a feature, not a bug" side effect of this is that it lets users try to 'race the llm' to see if they can type faster than it can generate words. but I tried to make generation fast enough that this would be impossible (see video above).
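as a sketch, the buffer behavior described above (tiny initial fill, background top-up) could look like this. this is an assumed shape, not the PR's actual code:

```typescript
interface WordSource {
  nextWord(): Promise<string>;
}

class StreamingWordBuffer {
  private buffer: string[] = [];
  private refilling = false;

  constructor(
    private source: WordSource,
    private target: number // e.g. the value of getStreamingBufferTarget()
  ) {}

  // seed with just a few words so the test can start immediately,
  // then start topping up in the background without awaiting it
  async init(initialWords: number): Promise<void> {
    for (let i = 0; i < initialWords; i++) {
      this.buffer.push(await this.source.nextWord());
    }
    void this.refill();
  }

  private async refill(): Promise<void> {
    if (this.refilling) return; // only one refill loop at a time
    this.refilling = true;
    while (this.buffer.length < this.target) {
      this.buffer.push(await this.source.nextWord());
    }
    this.refilling = false;
  }

  // consume one word; kicks the background refill back on
  take(): string | undefined {
    const word = this.buffer.shift();
    void this.refill();
    return word;
  }
}
```

if the typist outruns the generator, `take()` returns `undefined`, which is exactly the "racing the llm" case mentioned above.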
wordset can now return words asynchronously
the LLM generator can't guarantee a word is ready synchronously; the model takes ~50ms per token and words take multiple tokens. rather than duplicating the word selection logic into separate sync/async paths, randomWordAsync() was added to the base Wordset class. the base implementation wraps the existing sync randomWord(), so non-LLM wordsets are unchanged. this required three supporting changes:

- a generator lifecycle API (dispose(), getInitialWordCount(), getStreamingBufferTarget()) so the test engine can manage async generators that hold GPU resources
- async orchestration so addWord() calls don't race the DOM or test state
- the existing addWord() -> getNextWord() -> TestUI.addWord() pipeline is preserved; the async orchestration wraps it

extracted full page loading screen controller
llm funbox init requires downloading the model weights, building the state machine for the wordset, initializing the WebGPU runtime, and starting generation. this takes a couple of seconds, so I reused the existing full-page loading screen controller, previously used only on page navigation transitions, to show a progress bar and loading text.
see #7697
decisions I made and why
can't get a PB
following the precedent set by the zipf funbox: anything that changes the probability distribution of the wordset is ineligible for a PB or the leaderboard.

allow repetition of the same word
monkeytype normally avoids repeating the previous 2 words. The LLM funbox skips this check because natural language contains valid repetitions (see below) and re-rolling already generated words would waste model output steps (which makes streaming look slow or even stopped) and break sequence coherence.
tbh I am on the fence about this, since incorrect repetitions are common in wordsets not represented in gpt2's training data. I am looking to hear other opinions on this and could probably be swayed either way.
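for reference, the bypass can be sketched like this (assumed shape, not monkeytype's actual selection code):

```typescript
// normal mode re-rolls a candidate that matches either of the previous
// two words; the llm funbox passes allowRepeats = true to skip that,
// since a re-roll would discard already generated model output
function pickWord(
  wordset: { randomWord(): string },
  previous: string[], // words already emitted, most recent last
  allowRepeats: boolean
): string {
  let word = wordset.randomWord();
  if (!allowRepeats) {
    let attempts = 0;
    // cap attempts so tiny wordsets can't loop forever
    while (previous.slice(-2).includes(word) && attempts++ < 100) {
      word = wordset.randomWord();
    }
  }
  return word;
}
```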
vendoring WebGPT
WebGPT is the most performant small llm base model runtime I was able to find. however, because it is not published as a package, I needed to vendor its code so it can be used in monkeytype.
forking the weights
because we do not own the WebGPT project repo, we are susceptible to upstream changes, however unlikely, breaking our implementation. so I forked the weights into my own repo so I can guarantee future compatibility. see the discussion section below for open questions about this.
known issues / limitations
wordsets without common words
the english_25k wordset is still essentially random because it does not include the 10k most common words (#1303), so there are no glue words to string sequences of words together. this also impacts wordsets like english medical and english contractions. without 'transition' words the llm can sometimes fall into a loop of repeating the same word over and over.
languages other than english not in training data
GPT-2 was trained on WebText, and the paper says they "deliberately removed non-English webpages" from it. I am not fluent in any other languages, but from what I can tell this does not generate realistic sequences in languages other than english. note that this impacts programming language wordsets as well.
punctuation and numbers are not supported (yet)
currently this mode does not support punctuation or numbers. I felt the additional complexity wasn't worth it for the initial implementation, but if this is something you want I can look into adding it.

doesn't support words with spaces
the constraint engine uses spaces as word boundaries, so multi-word entries like French "est allé" or German "vor allem" are excluded from generation. this affects a small number of entries in some non-English wordsets. the remaining words still work normally.
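the exclusion rule itself is simple; a hedged sketch of what such a filter could look like (illustrative, not the PR's actual code):

```typescript
// entries containing a space can never be produced by a generator that
// treats spaces as word boundaries, so drop them from the constraint
// wordset up front rather than letting generation dead-end on them
function filterGeneratableWords(words: string[]): string[] {
  return words.filter((w) => !w.includes(" "));
}
```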
WebGPT license
the WebGPT license is essentially MIT but with an extra novelty clause:
obviously I think this counts as a 'purpose that is substantially epic, awesome, or incredible' so I will reach out to Will DePue as a courtesy if this funbox is something monkeytype maintainers are interested in adding.
requires WebGPU/browser support
this mode is WebGPU only, so it will not work on unsupported browsers / devices.
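a hedged sketch of the kind of feature detection this implies. `navigator.gpu` and `requestAdapter()` are the standard WebGPU entry points; the function shape here is illustrative, written against a plain object so it is testable outside a browser:

```typescript
// returns false when the WebGPU API is absent, when requesting an
// adapter throws, or when the adapter comes back null (the spec's way
// of signaling unsupported hardware)
async function supportsWebGPU(nav: {
  gpu?: { requestAdapter(): Promise<object | null> };
}): Promise<boolean> {
  if (!nav.gpu) return false; // API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```

in the real funbox this check would presumably gate whether the mode is selectable at all, rather than failing at model-load time.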
the loading screen progress is fake
the progress indicator uses preset timings and doesn't actually track the steps it mentions. this is because I didn't make any changes to the current full-page loading screen controller mentioned above, other than moving it so it's usable for non-navigation loads. the timings I chose are based on benchmarks I ran on my machine.
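the preset-timing approach can be sketched as follows; step labels and durations are made up for illustration and are not the PR's actual values:

```typescript
type ProgressStep = { label: string; ms: number };

// advances the bar through hard-coded (label, duration) steps rather
// than tracking real progress; percent is cumulative time over total
function runFakeProgress(
  steps: ProgressStep[],
  onUpdate: (label: string, percent: number) => void
): Promise<void> {
  const total = steps.reduce((sum, s) => sum + s.ms, 0);
  let elapsed = 0;
  let chain = Promise.resolve();
  for (const step of steps) {
    chain = chain.then(
      () =>
        new Promise<void>((resolve) => {
          elapsed += step.ms;
          onUpdate(step.label, Math.round((elapsed / total) * 100));
          setTimeout(resolve, step.ms);
        })
    );
  }
  return chain;
}
```

the obvious downside, as noted above, is that the bar can finish early or lag behind the real work on machines slower or faster than the one used for the benchmark timings.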
discussion
where should we host the model weights?
currently these are hosted in github.com/hunterpaulson/webgpt-gpt2-weights. should we move them to the monkeytype org? or to the monkeytype CDN?
how can I make this easier to review?
should I split this into smaller PRs for easier review? if so which ones?
Closes #7697 (cherry picks its only commit)