Conversation
Wow, this is amazing. Thank you very much. Let me know if you need any help or want to discuss.
No problem :) I got irritated that there's no good polyfill for word segmentation with up-to-date Unicode rules, stumbled upon your project via e18e, and decided to contribute :)
Yeah, so far I think I'm good. It's a bit confusing to implement because I'm trying to base it off both grapheme.js and word.rs, and they are implemented completely differently 😂 It's all very new to me, but I think I'm getting the hang of it now. Will try to finish it this week.
#112 needs to be addressed before implementing other segmenters. |
Hey Hyeseong, sorry for the silence on my side. I'll be honest with you: I don't know if I'll be working on this PR any time soon. Since I last touched the code, my own needs and circumstances have changed, and I no longer need a word segmenter for the project I originally wanted it for. Because of this, I now have less motivation to finish it :) I might come back to it later when I find the time and inspiration to work with ICU again, but don't expect that to happen any time soon :/ If you, or anyone else, want to pick this up, be my guest 😅
Understandable. I was wondering whether people still need the word segmenter, but it seems to receive less attention than grapheme. I tried a different implementation on my end, but automated testing was difficult due to the fragmentation of Intl.Segmenter behavior on Node.js. I think I can postpone the development of additional segmenters a little longer.
@cometkim From my own tests, the word segmentation rules also changed between Node v24 and Node v25, so if you currently base the behavior off a single runtime's Intl.Segmenter, you will get different results. Snapshot tests using Node v25 as the ground truth might be better here, since it ships the most up-to-date Unicode libraries/versions.
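The snapshot idea could be sketched roughly like this (a minimal, hypothetical example — the input string and the snapshot shape are my own, not from the project): serialize each word segment together with its `isWordLike` flag on a reference runtime, store the JSON as a fixture, and diff later runs against it.

```javascript
// Minimal sketch: serialize Intl.Segmenter word-segmentation results so
// they can be stored as snapshot fixtures. The input string and the
// snapshot shape here are illustrative assumptions.
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });

function wordSnapshot(input) {
  // Each entry keeps the raw segment plus whether ICU considers it word-like.
  return Array.from(segmenter.segment(input), (s) => ({
    segment: s.segment,
    isWordLike: s.isWordLike,
  }));
}

const snap = wordSnapshot('Hello, world!');
console.log(JSON.stringify(snap));
// Per the ECMA-402 spec, concatenating all segments in order always
// reproduces the original input exactly.
```

Running this once under Node v25 and committing the JSON output would give a fixture that catches any rule change when the suite runs on other runtimes.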
@spaceemotion Yeah. I attempted to implement it in my local branch. Before proceeding, I need to decide whether to follow the Unicode standard strictly or match the Chrome/Node.js behavior (which seems more correct). If I decide to follow the Unicode standard strictly, it might be better to run some tests in Bun.
@cometkim The way Chrome splits words is different from Node and Firefox as well. I only have a screenshot, but here was my 'research' into it (I created a bunch of JSONs with debug data on each browser/platform, then compared them):
I then found this discussion on the related topic: unicode-org/icu4x#7373 (reply in thread)
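The per-platform JSON dumps described above could look something like this sketch (the sample strings and the dump format are illustrative assumptions, not the actual debug data): each runtime prints a JSON dump of its segmentation results, and the dumps are diffed offline.

```javascript
// Hypothetical sketch: dump word-segmentation results as JSON on each
// browser/runtime, then diff the dumps offline. Sample inputs are made up
// to exercise contractions, underscores, and a no-space script.
const samples = ["can't stop", 'foo_bar baz', '日本語のテキスト'];

const segmenter = new Intl.Segmenter(undefined, { granularity: 'word' });

const dump = Object.fromEntries(
  samples.map((text) => [
    text,
    // Record the raw segments in order for each sample string.
    [...segmenter.segment(text)].map((part) => part.segment),
  ])
);

// Label the dump with the runtime so diffs between files are attributable.
const runtime =
  typeof process !== 'undefined' ? `node ${process.version}` : navigator.userAgent;
console.log(JSON.stringify({ runtime, dump }, null, 2));
```

Running the same script in Chrome, Firefox, and Node and diffing the resulting JSON files would surface exactly the kind of divergence mentioned above.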

This PR tackles #25 and implements a word segmenter. This is still a WIP.