Add support for checking all living bthreads#3096

Merged
chenBright merged 1 commit into apache:master from ZhengweiZhu:master
Oct 22, 2025

@ZhengweiZhu
Contributor

@ZhengweiZhu ZhengweiZhu commented Sep 17, 2025

Users can check all living bthreads with curl ip:port/bthreads/all, or with curl ip:port/bthreads/all?st=1 to also show the bthread stack traces. This enhances the original /bthreads service, which can only inspect a single bthread by its bthread id, an id that users usually do not know.

Considering the performance cost of recording the bthread id on bthread startup and finish, this function is currently enabled only when BRPC_BTHREAD_TRACER is defined.

What problem does this PR solve?

brpc provides a bthreads_service to check a specified bthread by curl ip:port/bthreads/<bthread_id>. The problem is that we have no idea what the <bthread_id> is, as it is generated by code, which makes this service useless in practice.

Issue Number:
#3088

Sample output with stack traces (note that a bthread in jumping status is not displayed due to an implementation restriction, so the output may not include every bthread):

# curl ip:port/bthreads/all?st=1
bthread=11716670785433 :
stop=0
interrupted=0
about_to_quit=0
fn=0x48ba90
arg=0x7f5b0401e380
attr={stack_type=3 flags=0 specified_tag=0 keytable_pool=0x3953570}
has_tls=0
uptime_ns=90442
cputime_ns=0
nswitch=0
status=5
traced=0
worker_tid=140029305263872
bthread call stack:
No frame
Error message: Forbid to trace self=11716670785433

bthread=4294969344 :
stop=0
interrupted=0
about_to_quit=0
fn=0x507850
arg=0
attr={stack_type=3 flags=0 specified_tag=-1 keytable_pool=0}
has_tls=0
uptime_ns=65724290725
cputime_ns=202946
nswitch=66
status=6
traced=0
worker_tid=140029437884160
bthread call stack:
# 0 0x0x5f6de1 bthread::TaskGroup::sched()
# 1 0x0x5f6f79 bthread::TaskGroup::usleep()
# 2 0x0x5ed514 bthread_usleep
# 3 0x0x508155 brpc::GlobalUpdate()
# 4 0x0x5f7989 bthread::TaskGroup::task_runner()
# 5 0x0x603401 bthread_make_fcontext

bthread=4294969345 :
stop=0
interrupted=0
about_to_quit=0
fn=0x501770
arg=0x3a22768
attr={stack_type=3 flags=320 specified_tag=0 keytable_pool=0}
has_tls=0
uptime_ns=65722683480
cputime_ns=65677492951
nswitch=13278
status=5
traced=0
worker_tid=140028751640320
bthread call stack:
# 0 0x0x7f5b26b7cc20 __restore_rt
# 1 0x0x2 <unknown>

bthread=4294969346 :
stop=0
interrupted=0
about_to_quit=0
fn=0x479dc0
arg=0x7ffec814c040
attr={stack_type=3 flags=0 specified_tag=0 keytable_pool=0}
has_tls=0
uptime_ns=65722871376
cputime_ns=7259150
nswitch=66
status=6
traced=0
worker_tid=140029305263872
bthread call stack:
# 0 0x0x5f6de1 bthread::TaskGroup::sched()
# 1 0x0x5f6f79 bthread::TaskGroup::usleep()
# 2 0x0x5ed514 bthread_usleep
# 3 0x0x47a84d brpc::Server::UpdateDerivedVars()
# 4 0x0x5f7989 bthread::TaskGroup::task_runner()
# 5 0x0x603401 bthread_make_fcontext

bthread=11557756996679 :
stop=0
interrupted=0
about_to_quit=0
fn=0x48ba90
arg=0x7f5b0401e080
attr={stack_type=3 flags=0 specified_tag=0 keytable_pool=0x3953570}
has_tls=0
uptime_ns=327293
cputime_ns=0
nswitch=0
status=5
traced=0
worker_tid=140029338834688
bthread call stack:
# 0 0x0x7f5b26b7cc20 __restore_rt
# 1 0x0x527685 brpc::InputMessenger::OnNewMessages()
# 2 0x0x48baa0 brpc::Socket::ProcessEvent()
# 3 0x0x5f7989 bthread::TaskGroup::task_runner()
# 4 0x0x603401 bthread_make_fcontext

Problem Summary:

What is changed and the side effects?

Changed:

  1. add method butil::list_resources to list all objects in the specified typed ResourcePool
  2. filter out those bthreads created by bthread_start* functions
  3. add /bthreads/all and /bthreads/all?st=1 service for user

Side effects:
No side effect now

  • Breaking backward compatibility: N/A

Check List:

@ZhengweiZhu
Contributor Author

This PR will help diagnose bthread problems such as bthread deadlocks, which are hard or unrealistic to diagnose with gdb alone.
However, considering the performance cost of the lock, for now I only enable this feature when BRPC_BTHREAD_TRACER is defined. Do you have any suggestions to minimize the performance cost so that it can be applied in all situations? If so, I plan to add a method to set the bthread name, so users can check all living bthreads in more detail, reducing the need to enable BRPC_BTHREAD_TRACER for stack traces.

@wwbmmm
Contributor

wwbmmm commented Sep 18, 2025

Maybe you can try to get all bthread_id from the butil::ResourcePool<bthread::TaskMeta>? (needs to add some interface to butil::ResourcePool)

@wwbmmm wwbmmm requested a review from chenBright September 18, 2025 02:17
@chenBright
Contributor

> Maybe you can try to get all bthread_id from the butil::ResourcePool<bthread::TaskMeta>? (needs to add some interface to butil::ResourcePool)

I agree with the butil::ResourcePool<bthread::TaskMeta> approach.

@yanglimingcn
Contributor

Yes, I think listing all bthread_ids is better.

@ZhengweiZhu
Contributor Author

> Maybe you can try to get all bthread_id from the butil::ResourcePool<bthread::TaskMeta>? (needs to add some interface to butil::ResourcePool)

@wwbmmm @chenBright Yes, I have thought about this approach, but without success.

First, when we create a bthread, a ResourceId and a TaskMeta instance are acquired from butil::ResourcePool this way:
    butil::ResourceId<TaskMeta> slot;
    TaskMeta* m = butil::get_resource(&slot);
If we knew every ResourceId slot, we could get all living bthread ids via address_resource(slot)->tid. Inside the ResourcePool implementation, a slot is acquired in the following order:
  1. thread local FreeChunk
  2. global FreeChunk
  3. thread local Block
  4. global Block (from BlockGroup)
The problem is that the ResourcePool does not record the slots in use; it only knows a slot after it is no longer in use and has been returned to the ResourcePool, which does not satisfy our requirement here.

Second, if we add an interface to ResourcePool that records the slot on get_resource and removes it on return_resource, we would want to keep the record in the thread-local LocalPool for performance. The problem is that get_resource and return_resource may run in different pthreads (different TLS), because a bthread can switch to another worker. What's more, if we get all living bthread ids by summarizing all the thread-local records, the tricky part is that the ResourcePool interface can be called from both worker and non-worker threads, which makes the summary practically impossible.

Any ideas?

Contributor

@chenBright chenBright left a comment

I think you can traverse the global butil::ResourceId<TaskMeta>::_block_groups and get all tids.

@ZhengweiZhu
Contributor Author

> I think you can traverse the global butil::ResourceId<TaskMeta>::_block_groups and get all tids.

That's huge. And don't they represent every bthread id that ever existed, not just the living ones?

@chenBright
Contributor

> > I think you can traverse the global butil::ResourceId<TaskMeta>::_block_groups and get all tids.
>
> That's huge. And don't they represent every bthread id that ever existed, not just the living ones?

Maybe you can use TaskStatus to determine whether a bthread is alive.

@chenBright
Contributor

I think we can add a function to display all the live bthread ids and names, and click on the link to display the corresponding bthread details.

@wwbmmm
Contributor

wwbmmm commented Sep 23, 2025

> > I think you can traverse the global butil::ResourceId<TaskMeta>::_block_groups and get all tids.
>
> That's huge. And don't they represent every bthread id that ever existed, not just the living ones?

No. After a bthread exits, its TaskMeta is returned to the ResourcePool and reused by a new bthread. When you traverse the ResourcePool, you only need to traverse the slots in use; you don't need to traverse those in the free list.
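
A minimal sketch of the traversal idea above, using a toy pool rather than butil::ResourcePool. ToyPool, ToyMeta, get, put, and list_in_use are hypothetical names made up for illustration; the real pool also keeps thread-local free chunks, which this sketch ignores.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for bthread::TaskMeta; only the field used here.
struct ToyMeta {
    uint64_t tid = 0;
};

// Toy pool (hypothetical, not butil::ResourcePool). Slots are handed out
// by index and recycled through a single free list.
class ToyPool {
public:
    // Hand out a slot, preferring a recycled one.
    size_t get(uint64_t tid) {
        size_t id;
        if (!free_list_.empty()) {
            id = free_list_.back();
            free_list_.pop_back();
        } else {
            id = slots_.size();
            slots_.emplace_back();
        }
        slots_[id].tid = tid;
        return id;
    }

    // Return a slot to the free list; the object itself is kept for reuse.
    void put(size_t id) { free_list_.push_back(id); }

    // Walk every slot ever allocated and skip the ones on the free list.
    std::vector<uint64_t> list_in_use() const {
        std::vector<bool> is_free(slots_.size(), false);
        for (size_t id : free_list_) is_free[id] = true;
        std::vector<uint64_t> tids;
        for (size_t id = 0; id < slots_.size(); ++id) {
            if (!is_free[id]) tids.push_back(slots_[id].tid);
        }
        return tids;
    }

private:
    std::vector<ToyMeta> slots_;
    std::vector<size_t> free_list_;
};
```

The sketch only shows that recycled slots can be told apart from live ones without a separate per-slot "in use" record; with thread-local free lists, as in the real pool, that check becomes harder.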

@ZhengweiZhu
Contributor Author

> > > I think you can traverse the global butil::ResourceId<TaskMeta>::_block_groups and get all tids.
> >
> > That's huge. And don't they represent every bthread id that ever existed, not just the living ones?
>
> No. After a bthread exits, its TaskMeta is returned to the ResourcePool and reused by a new bthread. When you traverse the ResourcePool, you only need to traverse the slots in use; you don't need to traverse those in the free list.

How do I judge whether a slot is in use? @chenBright The TaskStatus is only used when TaskTracer is enabled, but I want this pr to be applicable even when TaskTracer is not enabled.

If I can judge whether a slot is in use, do I then need to traverse in the same way as describe_resources does?

@ZhengweiZhu
Contributor Author

> I think we can add a function to display all the live bthread ids and names, and click on the link to display the corresponding bthread details.

If TaskTracer is enabled, we can easily see the call traces of all living bthreads and debug a deadlock, the same way gdb/gstack does. And in normal situations there won't be many living bthreads. Is it necessary then?

@chenBright
Contributor

> > I think we can add a function to display all the live bthread ids and names, and click on the link to display the corresponding bthread details.
>
> If TaskTracer is enabled, we can easily see the call traces of all living bthreads and debug a deadlock, the same way gdb/gstack does. And in normal situations there won't be many living bthreads. Is it necessary then?

Yes. Because it allows users to view the call stack of a specified bthread, not all of them.

@chenBright
Contributor

> The TaskStatus is only used when TaskTracer is enabled, but I want this pr to be applicable even when TaskTracer is not enabled.

Perhaps the default support for TaskStatus can meet your needs.

@ZhengweiZhu ZhengweiZhu reopened this Sep 25, 2025
@ZhengweiZhu
Contributor Author

ZhengweiZhu commented Sep 25, 2025

> > The TaskStatus is only used when TaskTracer is enabled, but I want this pr to be applicable even when TaskTracer is not enabled.
>
> Perhaps the default support for TaskStatus can meet your needs.

Currently the TaskStatus is only set when TaskTracer is enabled. Do you mean I need to change the code to set the status even when TaskTracer is not enabled? For example, set the status to TASK_STATUS_CREATED when a bthread is created and to TASK_STATUS_UNKNOWN when it is destroyed, so that a bthread is judged alive when its status is not TASK_STATUS_UNKNOWN?

@ZhengweiZhu
Contributor Author

> > > I think we can add a function to display all the live bthread ids and names, and click on the link to display the corresponding bthread details.
> >
> > If TaskTracer is enabled, we can easily see the call traces of all living bthreads and debug a deadlock, the same way gdb/gstack does. And in normal situations there won't be many living bthreads. Is it necessary then?
>
> Yes. Because it allows users to view the call stack of a specified bthread, not all of them.

I can add a link to display the corresponding bthread details, and still display all living bthreads by default. The bthread name will not be shown; I can make another PR to add support for naming bthreads (and even execution_queue). Is that OK?

@chenBright
Contributor

> > > The TaskStatus is only used when TaskTracer is enabled, but I want this pr to be applicable even when TaskTracer is not enabled.
> >
> > Perhaps the default support for TaskStatus can meet your needs.
>
> Currently the TaskStatus is only set when TaskTracer is enabled. Do you mean I need to change the code to set the status even when TaskTracer is not enabled? For example, set the status to TASK_STATUS_CREATED when a bthread is created and to TASK_STATUS_UNKNOWN when it is destroyed, so that a bthread is judged alive when its status is not TASK_STATUS_UNKNOWN?

Yes.
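
The scheme agreed on here can be sketched roughly as follows. This is a toy model, not the brpc code: ToyTaskMeta, the TOY_STATUS_* values, and the three helper functions are hypothetical, and the use of std::atomic is an assumption of this sketch rather than something stated in the thread.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical subset of task states; the names echo the thread,
// the numeric values are made up.
enum ToyTaskStatus {
    TOY_STATUS_UNKNOWN = 0,  // slot is free or was never used
    TOY_STATUS_CREATED = 1,  // bthread created and not yet finished
};

// Toy stand-in for bthread::TaskMeta.
struct ToyTaskMeta {
    uint64_t tid = 0;
    std::atomic<int> status{TOY_STATUS_UNKNOWN};
};

// Mark the slot as alive when a bthread starts using it.
inline void on_bthread_start(ToyTaskMeta& m, uint64_t tid) {
    m.tid = tid;
    m.status.store(TOY_STATUS_CREATED, std::memory_order_release);
}

// Mark the slot as dead before it goes back to the pool.
inline void on_bthread_finish(ToyTaskMeta& m) {
    m.status.store(TOY_STATUS_UNKNOWN, std::memory_order_release);
}

// A slot holds a living bthread iff its status is not UNKNOWN.
inline bool is_alive(const ToyTaskMeta& m) {
    return m.status.load(std::memory_order_acquire) != TOY_STATUS_UNKNOWN;
}
```

With a check like is_alive, a pool traversal can visit every allocated slot and skip the dead ones without knowing anything about the free lists.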

@chenBright
Contributor

> > > > I think we can add a function to display all the live bthread ids and names, and click on the link to display the corresponding bthread details.
> > >
> > > If TaskTracer is enabled, we can easily see the call traces of all living bthreads and debug a deadlock, the same way gdb/gstack does. And in normal situations there won't be many living bthreads. Is it necessary then?
> >
> > Yes. Because it allows users to view the call stack of a specified bthread, not all of them.
>
> I can add a link to display the corresponding bthread details, and still display all living bthreads by default. The bthread name will not be shown; I can make another PR to add support for naming bthreads (and even execution_queue). Is that OK?

No problem.

@ZhengweiZhu
Contributor Author

@wwbmmm @chenBright PTAL. The PR is updated and now works whether or not BRPC_BTHREAD_TRACER is defined, with no performance side effect. Thanks for the suggestions. As for the idea of adding links to display the corresponding bthread details, it's nice to have and will probably be implemented later.

Comment thread src/brpc/builtin/bthreads_service.cpp Outdated
Comment thread src/brpc/builtin/bthreads_service.cpp Outdated
Comment thread src/brpc/builtin/bthreads_service.cpp Outdated
Comment thread src/butil/resource_pool_inl.h Outdated
Comment thread src/bthread/task_group.cpp
Comment thread src/bthread/task_meta.h Outdated
Comment thread src/bthread/task_control.h Outdated
Comment thread src/bthread/task_group.cpp
@ZhengweiZhu
Contributor Author

> Maybe you can try to get all bthread_id from the butil::ResourcePool<bthread::TaskMeta>? (needs to add some interface to butil::ResourcePool)

There's a catch here. Each task group internally creates a TaskMeta object to run its main task, so these TaskMeta will also be traversed. I will filter them out, as they are opaque to users.
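
As a rough illustration of that filter, assuming each traversed record carries some marker that distinguishes a task group's internal main task from a user-started bthread. ToyTaskRecord, is_main_task, and visible_tasks are hypothetical; the thread does not say how the PR actually tells the two apart.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Toy record; `is_main_task` is a hypothetical marker, not a brpc field.
struct ToyTaskRecord {
    uint64_t tid;
    bool is_main_task;  // true for a TaskGroup's internal main task
};

// Keep only the records a user would recognize as their own bthreads.
std::vector<ToyTaskRecord> visible_tasks(const std::vector<ToyTaskRecord>& all) {
    std::vector<ToyTaskRecord> out;
    std::copy_if(all.begin(), all.end(), std::back_inserter(out),
                 [](const ToyTaskRecord& r) { return !r.is_main_task; });
    return out;
}
```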

@ZhengweiZhu ZhengweiZhu force-pushed the master branch 2 times, most recently from 4106c6f to 252a2d2 Compare October 14, 2025 07:58
Comment thread src/bthread/task_control.cpp Outdated
Comment thread src/bthread/task_group.cpp Outdated
Comment thread src/bthread/bthread.cpp Outdated
Comment thread src/bthread/task_control.cpp Outdated
@ZhengweiZhu
Contributor Author

ZhengweiZhu commented Oct 16, 2025

All the comments have been resolved, but one problem is left: when the bthread status is set to TASK_STATUS_READY, the status actually stored is TASK_STATUS_FIRST_READY.

    if (TASK_STATUS_READY == s || NULL == m->stack) {
        m->status = TASK_STATUS_FIRST_READY;
    } else {
        m->status = s;
    }

So the TASK_STATUS_READY branch of the following condition is never taken?

    } else if (TASK_STATUS_SUSPENDED == status || TASK_STATUS_READY == status) {
        return ContextTrace(m->stack->context);
    }

According to my test, there always seems to be one bthread in TASK_STATUS_FIRST_READY status that is not traceable. This bthread has flags=320, which means "BTHREAD_NEVER_QUIT | BTHREAD_GLOBAL_PRIORITY", so it seems to be the EventDispatcher. Is this expected?

@chenBright
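
For readers decoding flags=320 by hand, a small sketch. The two constants are assumed values, chosen so that their OR equals the 320 reported above; check bthread/types.h for the authoritative definitions. describe_flags is a hypothetical helper.

```cpp
#include <cstdint>
#include <string>

// Assumed values for illustration only: 64 | 256 == 320, matching the
// flags value reported in the thread. Not copied from the brpc headers.
constexpr uint32_t TOY_BTHREAD_NEVER_QUIT = 64;
constexpr uint32_t TOY_BTHREAD_GLOBAL_PRIORITY = 256;

// Render the known bits of a flags word as "A | B"; "0" if none is set.
std::string describe_flags(uint32_t flags) {
    std::string out;
    auto append = [&out](const char* name) {
        if (!out.empty()) out += " | ";
        out += name;
    };
    if (flags & TOY_BTHREAD_NEVER_QUIT) append("BTHREAD_NEVER_QUIT");
    if (flags & TOY_BTHREAD_GLOBAL_PRIORITY) append("BTHREAD_GLOBAL_PRIORITY");
    if (out.empty()) out = "0";
    return out;
}
```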

Comment thread src/bthread/task_control.cpp Outdated
@chenBright
Contributor

> All the comments have been resolved, but one problem is left: when the bthread status is set to TASK_STATUS_READY, the status actually stored is TASK_STATUS_FIRST_READY.
>
>     if (TASK_STATUS_READY == s || NULL == m->stack) {
>         m->status = TASK_STATUS_FIRST_READY;
>     } else {
>         m->status = s;
>     }
>
> So the TASK_STATUS_READY branch of the following condition is never taken?
>
>     } else if (TASK_STATUS_SUSPENDED == status || TASK_STATUS_READY == status) {
>         return ContextTrace(m->stack->context);
>     }
>
> According to my test, there always seems to be one bthread in TASK_STATUS_FIRST_READY status that is not traceable. This bthread has flags=320, which means "BTHREAD_NEVER_QUIT | BTHREAD_GLOBAL_PRIORITY", so it seems to be the EventDispatcher. Is this expected?
>
> @chenBright

It should be if (TASK_STATUS_READY == s && NULL == m->stack), please fix it.
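
The difference between the two conditions can be reproduced in isolation. This is a toy model with hypothetical names (StatusDemo, StatusDemoMeta, set_status_buggy, set_status_fixed); only the two if conditions are taken from the thread.

```cpp
// Toy status values; names echo the thread, the enum itself is made up.
enum StatusDemo { TOY_READY, TOY_FIRST_READY, TOY_SUSPENDED, TOY_RUNNING };

// Toy stand-in for the two TaskMeta fields involved.
struct StatusDemoMeta {
    void* stack;
    StatusDemo status;
};

// Condition as originally written: READY is always rewritten to
// FIRST_READY, and any status is rewritten when there is no stack.
inline void set_status_buggy(StatusDemoMeta& m, StatusDemo s) {
    if (TOY_READY == s || nullptr == m.stack) {
        m.status = TOY_FIRST_READY;
    } else {
        m.status = s;
    }
}

// Condition as suggested in review: only a READY task that has no stack
// yet is recorded as FIRST_READY.
inline void set_status_fixed(StatusDemoMeta& m, StatusDemo s) {
    if (TOY_READY == s && nullptr == m.stack) {
        m.status = TOY_FIRST_READY;
    } else {
        m.status = s;
    }
}
```

With the `||` form, a task that already has a stack can never be observed as READY, which is why the READY branch of the tracing condition was unreachable.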

@ZhengweiZhu
Contributor Author

> > All the comments have been resolved, but one problem is left: when the bthread status is set to TASK_STATUS_READY, the status actually stored is TASK_STATUS_FIRST_READY.
> >
> >     if (TASK_STATUS_READY == s || NULL == m->stack) {
> >         m->status = TASK_STATUS_FIRST_READY;
> >     } else {
> >         m->status = s;
> >     }
> >
> > So the TASK_STATUS_READY branch of the following condition is never taken?
> >
> >     } else if (TASK_STATUS_SUSPENDED == status || TASK_STATUS_READY == status) {
> >         return ContextTrace(m->stack->context);
> >     }
> >
> > According to my test, there always seems to be one bthread in TASK_STATUS_FIRST_READY status that is not traceable. This bthread has flags=320, which means "BTHREAD_NEVER_QUIT | BTHREAD_GLOBAL_PRIORITY", so it seems to be the EventDispatcher. Is this expected?
> >
> > @chenBright
>
> It should be if (TASK_STATUS_READY == s && NULL == m->stack), please fix it.

Fixed! BTW, I also fixed another bug: _enable_priority_queue was not initialized. 😂

@ZhengweiZhu
Contributor Author

@wwbmmm @chenBright PTAL

@wwbmmm
Contributor

wwbmmm commented Oct 16, 2025

LGTM

Comment thread src/bthread/task_control.cpp Outdated
User can check all living bthreads by `curl ip:port/bthreads/all`
or when BRPC_BTHREAD_TRACER is enabled by `curl ip:port/bthreads/all?st=1`
to show bthread stack trace.
This is an enhancement of the original /bthreads service which
provides a method to check a specified bthread by designated
bthread id, as user has no idea what the bthread id is.

BTW, fix _enable_priority_queue not initialized bug and fix task status
incorrectly set to TASK_STATUS_FIRST_READY bug.
Contributor

@chenBright chenBright left a comment

LGTM

@chenBright chenBright merged commit 5f1d893 into apache:master Oct 22, 2025
16 checks passed
4 participants