Conversation
Wow, this is amazing. Thank you very much. Let me know if you need any help or want to discuss.
No problem :) I got irritated that there's no good polyfill for word segmentation with up-to-date Unicode rules, stumbled upon your project via e18e, and decided to contribute :)
Yeah, so far I think I'm good. It's a bit confusing to implement because I'm trying to base it off both grapheme.js and word.rs, and they are implemented completely differently 😂 It's all very new to me, but I think I'm getting the hang of it now. Will try to finish it this week.
#112 needs to be addressed before implementing other segmenters. |
Hey Hyeseong, sorry for the silence on my side. I'll be honest with you: I don't know if I'll be working on this PR any time soon. Since I last touched the code, my own needs and circumstances have changed, and I no longer need a word segmenter for the project I originally wanted it for. Because of this, I now have less motivation to finish it :) I might come back to it later when I find the time and inspiration to work with ICU again, but don't expect that to happen any time soon :/ If you, or anyone else, want to pick this up, be my guest 😅
Understandable. I was wondering whether people still need the word segmenter, but it seems to receive less attention than grapheme. I tried a different implementation on my end, but automated testing was difficult due to the fragmentation of Intl.Segmenter behavior on Node.js. I think I can postpone the development of additional segmenters a little longer.
@cometkim From my own tests, the word segmentation rules also changed between Node v24 and Node v25, so if you currently base the behavior off a single runtime's Intl.Segmenter, you will get different results. Snapshot tests using Node v25 as the ground truth might be better here, since it ships the most up-to-date Unicode libraries/versions.
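The snapshot idea could be sketched roughly like this (a minimal, hypothetical example — the input string and the snapshot shape are my own, not from the project): serialize each word segment together with its `isWordLike` flag on a reference runtime, store the JSON as a fixture, and diff later runs against it.

```javascript
// Minimal sketch: serialize Intl.Segmenter word-segmentation results so
// they can be stored as snapshot fixtures. The input string and the
// snapshot shape here are illustrative assumptions.
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });

function wordSnapshot(input) {
  // Each entry keeps the raw segment plus whether ICU considers it word-like.
  return Array.from(segmenter.segment(input), (s) => ({
    segment: s.segment,
    isWordLike: s.isWordLike,
  }));
}

const snap = wordSnapshot('Hello, world!');
console.log(JSON.stringify(snap));
// Per the ECMA-402 spec, concatenating all segments in order always
// reproduces the original input exactly.
```

Running this once under Node v25 and committing the JSON output would give a fixture that catches any rule change when the suite runs on other runtimes.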
@spaceemotion Yeah. I attempted to implement it in my local branch. Before proceeding, I need to decide whether to follow the Unicode standard strictly or match the Chrome/Node.js behavior (which seems more correct). If I decide to follow the Unicode standard strictly, it might be better to run some tests in Bun.
@cometkim The way Chrome splits words is different from Node and Firefox as well. I only have a screenshot, but here was my 'research' into it (I created a bunch of JSONs with debug data on each browser/platform, then compared them):
I then found this discussion on the related topic: unicode-org/icu4x#7373 (reply in thread)
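The per-platform JSON dumps described above could look something like this sketch (the sample strings and the dump format are illustrative assumptions, not the actual debug data): each runtime prints a JSON dump of its segmentation results, and the dumps are diffed offline.

```javascript
// Hypothetical sketch: dump word-segmentation results as JSON on each
// browser/runtime, then diff the dumps offline. Sample inputs are made up
// to exercise contractions, underscores, and a no-space script.
const samples = ["can't stop", 'foo_bar baz', '日本語のテキスト'];

const segmenter = new Intl.Segmenter(undefined, { granularity: 'word' });

const dump = Object.fromEntries(
  samples.map((text) => [
    text,
    // Record the raw segments in order for each sample string.
    [...segmenter.segment(text)].map((part) => part.segment),
  ])
);

// Label the dump with the runtime so diffs between files are attributable.
const runtime =
  typeof process !== 'undefined' ? `node ${process.version}` : navigator.userAgent;
console.log(JSON.stringify({ runtime, dump }, null, 2));
```

Running the same script in Chrome, Firefox, and Node and diffing the resulting JSON files would surface exactly the kind of divergence mentioned above.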

This PR tackles #25 and implements a word segmenter. This is still a WIP.