fix(group_update): Handle last_seen integer overflow on groups#104297
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##            master  #104297   +/-   ##
=========================================
  Coverage    80.57%   80.58%
=========================================
  Files         9340     9337       -3
  Lines       399123   399487     +364
  Branches     25553    25553
=========================================
+ Hits        321581   321912     +331
- Misses       77093    77126      +33
  Partials       449      449
When using error upsampling, groups can cross the MAX_INT value of last_seen; handle these errors and cap the value to deal with them for now.
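The cap-and-retry approach the PR describes can be sketched without any framework. Everything below is illustrative: `OverflowingStore` is a hypothetical stand-in for a DB column that rejects out-of-range values, and `increment_capped` mirrors the try/except-then-cap shape of the change, not the actual Sentry code:

```python
MAX_INT32 = 2_147_483_647  # maximum value of a 32-bit signed integer column


class OverflowingStore:
    """Simulated storage that rejects values above MAX_INT32, like an int32 column."""

    def __init__(self, value: int) -> None:
        self.value = value

    def update(self, new_value: int) -> None:
        if new_value > MAX_INT32:
            raise OverflowError("integer out of range")
        self.value = new_value


def increment_capped(store: OverflowingStore, delta: int) -> None:
    # Hot path: plain increment, which works for almost every group.
    try:
        store.update(store.value + delta)
    except OverflowError:
        # Fallback: instead of failing the whole update, pin the value at the cap.
        store.update(MAX_INT32)


store = OverflowingStore(MAX_INT32 - 1)
increment_capped(store, 5)
assert store.value == MAX_INT32
```

The design keeps the common case a single cheap update and only pays the retry cost for the rare group that has hit the cap, which is the trade-off discussed in the review below.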
the update dict actually already holds values other than Expression, but the direct assignment we added causes it to error. Instead of creating a complex type for the update dict, just set it to Any
src/sentry/buffer/base.py
Outdated
update_kwargs["times_seen"] = MAX_INT32
try:
    group.update(using=None, **update_kwargs)
    logger.warning(
Could be worth logging this to Datadog instead. I think that warnings will end up as Sentry errors too, so you might want to drop these to info.
ohh right, actually these would just spam; I guess a Datadog metric is the cheapest way to keep track of it?
went ahead and replaced with metric, and added a metric for the retry failure as well
src/sentry/buffer/base.py
Outdated
    )
except Exception as retry_error:
    # If the capped update also fails, log and skip
    logger.warning(
Probably worth logging as an error if this fails
src/sentry/buffer/base.py
Outdated
- else:
-     group.update(using=None, **update_kwargs)
+ try:
+     group.update(using=None, **update_kwargs)
Curious whether we could just do something like this to save a db roundtrip?
if "times_seen" in update_kwargs:
    update_kwargs["times_seen"] = min(update_kwargs["times_seen"], MAX_INT32)
yea, I thought about it as well, and I did decide to go this route to maintain visibility and keep track of when this happens... but thinking about it some more, I don't know if it's worth the round trip and the ever-increasing metric just to be able to tell this is going on. We will always be able to check for groups with MAX_INT last_seen to know which are affected. It felt like bad practice, but now I'm actually thinking this is the cleaner option; I think I'll change to do this.
ah ok, it's not that simple: we don't use the times_seen value, we just tell the DB to increment the existing value, giving it an expression like V + 1 to handle race conditions and such. Looks like there is no easy way to do it at the DB level without overflowing; the only way is using Case, which would really slow down updates for all groups, and I am reluctant to do that. I think keeping updates fast for the 99.9% case and slow for these groups might be better.
However, we can actually just add a check for the current last_seen value and, if it is MAX_INT, not update times_seen, which would prevent the overflow
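The read-then-skip idea above can be sketched in plain Python. The function name and shape below are illustrative only (the real code builds kwargs for a Django `Group.update`), and as the thread implies, a separate read leaves a small race window between checking the value and writing:

```python
MAX_INT32 = 2_147_483_647  # maximum value of a 32-bit signed integer column


def build_update_kwargs(
    current_times_seen: int, columns: dict[str, int]
) -> dict[str, int]:
    """Build increment kwargs, skipping times_seen once it has hit the cap."""
    update_kwargs: dict[str, int] = {}
    for column, delta in columns.items():
        if column == "times_seen" and current_times_seen >= MAX_INT32:
            # Already at the cap: incrementing would overflow, so leave it out.
            continue
        update_kwargs[column] = delta
    return update_kwargs


assert build_update_kwargs(10, {"times_seen": 1}) == {"times_seen": 1}
assert build_update_kwargs(MAX_INT32, {"times_seen": 1}) == {}
```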
Is Least helpful here?
I think something like
for c, v in columns.items():
    if c == "times_seen":
        update_kwargs[c] = Least(F(c) + v, MAX_INT32)
runs atomically to set times_seen as the smallest of the values provided
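For reference, the atomic clamp the suggestion is after looks like this in plain SQL, here using SQLite's two-argument MIN as a stand-in for Postgres's LEAST (table and column names are illustrative). Note the caveat the thread raises still applies on Postgres: the intermediate `times_seen + v` can itself overflow an int32 column, whereas SQLite's 64-bit integers sidestep that here:

```python
import sqlite3

MAX_INT32 = 2_147_483_647

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE groups (id INTEGER PRIMARY KEY, times_seen INTEGER)")
conn.execute("INSERT INTO groups VALUES (1, ?)", (MAX_INT32 - 1,))

# Clamp the increment in a single statement, so the read-modify-write
# happens atomically inside the database.
conn.execute(
    "UPDATE groups SET times_seen = MIN(times_seen + ?, ?) WHERE id = 1",
    (5, MAX_INT32),
)
(value,) = conn.execute("SELECT times_seen FROM groups WHERE id = 1").fetchone()
assert value == MAX_INT32
```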
yea, I tried, but it turns out Least() has to actually compute that value, and it overflows and errors in the DB when it does, so no dice
the only thing that can be done at the DB level that I found is adding a CASE statement, but that would really slow down updates; even Least would, but I would have accepted that, I think
thetruecpaul
left a comment
Approving with nits.
This is better than current state, but I think (either before landing or as a followup) I'd suggest you look into @shashjar's comment about Least. Obviously we can't do the impossible, but if there is a way to do this in fewer DB hits then that'd be awesome
src/sentry/buffer/base.py
Outdated
logger = logging.getLogger(__name__)


# Maximum value for a 32-bit signed integer
MAX_INT32 = 2_147_483_647
Nit: could you use BoundedPositiveIntegerField.MAX_VALUE instead of a new constant to avoid duplication & match the Group field definition?
if not signal_only:
-    update_kwargs: dict[str, Expression] = {c: F(c) + v for c, v in columns.items()}
+    update_kwargs: dict[str, Any] = {c: F(c) + v for c, v in columns.items()}
Nit: Can we keep the type as constrained as possible? Maybe dict[str, Expression | int] instead of opening the door to Any?
the original Expression annotation is actually not accurate: that dict does not contain only Expression values, mypy just doesn't detect it. The real type is more convoluted and not actually enforced, so I wanted that to be clear here. The dict can contain datetime or any other arbitrary type, so Any is probably warranted.
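A small illustration of why the value type can't be pinned to Expression: the same dict also carries plain values like datetime. The `Expression` class below is just a stand-in for Django's `django.db.models.Expression`:

```python
from datetime import datetime, timezone
from typing import Any


class Expression:
    """Stand-in for django.db.models.Expression, for illustration only."""


# With dict[str, Expression], mypy would reject the datetime assignment below,
# even though the runtime dict has always mixed value types; Any reflects reality.
update_kwargs: dict[str, Any] = {"times_seen": Expression()}
update_kwargs["last_seen"] = datetime.now(timezone.utc)

assert isinstance(update_kwargs["times_seen"], Expression)
assert isinstance(update_kwargs["last_seen"], datetime)
```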