Skip to content

Latest commit

 

History

History

README.md

Synopsis

This is an intrusive performance profiler that is used to profile the engine and some of my other tools.

It has many advantages over your more typical style of profiler like the one included with Visual Studio or Perf.

Features

  • Allow viewing data for every invocation of a traced function
    • The ability to see values of arguments
      • Can highlight arguments that are slower
  • Per thread flame graphs which is critical for multithreaded software
  • Lock contention
    • Makes it obvious when one thread took much longer than expected because it was waiting on a lock
    • See which thread had the lock
  • Can be started,paused,resumed without application exit
  • No analysis stage to wait for, profiling data can be viewed live as it's recorded
  • Stream traces to server over TCP
    • Single server can consume data from multiple sessions
    • Can handle 100,000's of events a second per session
    • Multiple clients can connect and view live or completed traces
    • Data is compressed with LZ4 to minimize bandwidth requirements
    • Automatic debug symbol submission
      • Runtime sends missing pdb files to the server for storage
    • Ability to handle large amounts of tracing data
      • Designed to store billions of events
      • A single run of an application can be profiled for months
        • Makes investigating very rare performance bugs in production easy as you simply profile all the time.
  • Event grouping
    • Functions can be grouped to give statistics about each group
  • Counters
  • Memory stats
  • Low overhead runtime
    • Zero allocations (excluding TCP system buffers)
      • Host application passes a single buffer at start time.
    • Collection of data is lock free
      • Single InterlockedExchangeAdd
    • Care is taken to make sure no two threads write collected data to the same cache lane
      • Prevents CPU stalls from false sharing
    • Thread local state for stack tracking
    • Self tracing
      • Easily see how much of your application CPU time is being used by the profiling runtime
  • C API
    • Internally the code is C++ but a C API is exposed for greater portability

Architecture

architecture overview

Screenshots

telem-0

Trace with 220 million zones over 10 minutes (about 350k events a second).

telem-1

Video

Brief demo video showing a sample trace: https://www.youtube.com/watch?v=4Tp5ttDyFuk

Example

telem::ContexHandle ctx;

void foo(int val)
{
    // Zone this function and record the argument
    ttZoneFunction(ctx, "Foo with %d", val);

    while (val > 0) {
        core::Thread::sleep(val);
        val--;
    }
}

core::Thread::ReturnValue threadFunc(const core::Thread& thread)
{
    ttZoneFunction(ctx, "FooBar");

    core::Thread::sleep(rand());

    return core::Thread::ReturnValue(0);
}

int main()
{
    // All Telemetry memory usage will take place in this buffer
    // Size requirements of the buffer depend on a number of factors but 2-4MB is typically enough for most scenarios.
    // A large buffer is useful if latency is high to the receiving server, as too small a buffer can result in stalls.
    // it can also prevents stalls if the application emits an abnormally large burst of data in a short window.
    const size_t telemBufSize = 1024 * 1024 * 4;
    alignas(64) static uint8_t telemBuf[telemBufSize];

    ttInitializeContext(&ctx, telemBuf, sizeof(telemBuf));

    // Connect to server
    // Send some meta like app name and build version that is stored with the trace.
    auto res = ttOpen(
        ctx,
        "My app name",
        X_BUILD_STRING " Version: " X_ENGINE_VERSION_STR " Rev: " X_ENGINE_BUILD_REF_STR,
        "127.0.0.1",
        TtConnectionTypeTcp,
        telem::DEFAULT_PORT,
        1000,
        TtConnectionFlagNone
    );

    if (res != TtErrorOk) {
        return 1;
    }

    // Setup some meta info
    ttSetThreadName(ctx, 0, "Main");
    ttSetThreadGroup(ctx, 0, 0);

    ttSetThreadGroupName(ctx, 0, "Main");
    ttSetThreadGroupName(ctx, 1, "Workers");
    ttSetThreadGroupName(ctx, 2, "IO");

    // Pause collection
    ttPause(ctx, true);

    // Example trace of a scope
    {
        ttZone(ctx, "Sleep");

        core::Thread::sleep(16);
    }

    // Resume
    ttPause(ctx, false);

    // Zone some functions
    foo(0);
    foo(1);
    foo(2);
    foo(3);

    // Start some threads
    const int32_t numThreads = 4;
    core::Thread thread[numThreads];

    core::StackString256 name;
    for (int32_t i = 0; i < numThreads; i++) {
        name.setFmt("Worker %" PRIi32, i);

        thread[i].create(name.c_str());
        thread[i].start(threadFunc);

        // Give the threads names and place all in the workers thread group.
        ttSetThreadName(ctx, thread[i].getID(), "%s", name.c_str());
        ttSetThreadGroup(ctx, thread[i].getID(), 1);
    }

    // join
    for (int32_t i = 0; i < numThreads; i++) {
        thread[i].join();
        thread[i].destroy();
    }

    ttShutdownContext(ctx);
    return 0;
}