When using Tracy's ON_DEMAND mode, it is in most cases acceptable to drop measurements and perform a GPU clock synchronization (which may stall) during the first tracyCollect. This is not enabled by default in the CMakeLists, both for backward compatibility and because it may be somewhat intrusive.
This commit also makes the OpenGL TracyGpuZone* macros slightly more efficient by not calling the thread-local GetGpuCtx(). It is also more resilient when no context has been declared on the current thread: the application no longer crashes if a context is used on several threads but was declared on only one (so that GetGpuCtx().ptr == nullptr on the others). Tracy does not support this scenario, so this is a trade-off: users are spared a crash, but what is really an error now passes silently.
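
As a rough illustration of the null-context behaviour described above, here is a minimal sketch. It is not the code from this commit: GpuCtxSketch and GpuZoneScopeSketch are hypothetical names, and the sketch only assumes that a zone holds a context pointer that is checked once before use.

```cpp
// Hypothetical stand-in for a GPU context; not Tracy's real type.
struct GpuCtxSketch
{
    int zonesBegun = 0;
    int zonesEnded = 0;
    void BeginZone() { ++zonesBegun; }
    void EndZone() { ++zonesEnded; }
};

// Hypothetical scoped zone. The context pointer is read once and cached,
// and a null context turns the zone into a silent no-op instead of a crash.
class GpuZoneScopeSketch
{
public:
    explicit GpuZoneScopeSketch( GpuCtxSketch* ctx )
        : m_ctx( ctx )
    {
        if( !m_ctx ) return;   // no context declared on this thread
        m_ctx->BeginZone();
    }

    ~GpuZoneScopeSketch()
    {
        if( !m_ctx ) return;
        m_ctx->EndZone();
    }

    GpuZoneScopeSketch( const GpuZoneScopeSketch& ) = delete;
    GpuZoneScopeSketch& operator=( const GpuZoneScopeSketch& ) = delete;

    bool IsActive() const { return m_ctx != nullptr; }

private:
    GpuCtxSketch* const m_ctx;
};
```

The cost of this design is the one noted above: a zone created without a context records nothing and reports nothing.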
The C++11 spec states in [basic.stc.thread] thread storage duration:
2. A variable with thread storage duration shall be initialized before its
first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.
Previously Tracy relied on the TLS data being initialized:
- During thread creation (MSVC).
- Or during first use in a thread, but the initialization was performed for
the whole TLS block.
It seems that new compilers are more granular with how they perform the
initialization, hence rpmalloc init has to be checked before each allocation,
as it cannot be "folded" into, for example, initialization of the profiler
itself.
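
The per-allocation check can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual change: all names ending in Sketch are hypothetical, std::malloc stands in for the real allocator, and the counter exists only to make the behaviour observable.

```cpp
#include <cstddef>
#include <cstdlib>

// Constant-initialized thread-local flag, so it does not itself depend on
// when the compiler chooses to run dynamic TLS initializers.
static thread_local bool s_allocInitDoneSketch = false;

// Counts how many times the per-thread init ran (for illustration only).
static thread_local int s_allocInitCountSketch = 0;

// Hypothetical placeholder for the per-thread allocator setup.
inline void InitAllocatorForThisThreadSketch()
{
    ++s_allocInitCountSketch;
}

// Checked on every allocation, because initialization of some other
// thread-local object can no longer be assumed to have covered it.
inline void EnsureAllocInitSketch()
{
    if( !s_allocInitDoneSketch )
    {
        InitAllocatorForThisThreadSketch();
        s_allocInitDoneSketch = true;
    }
}

inline void* AllocSketch( std::size_t size )
{
    EnsureAllocInitSketch();
    return std::malloc( size );   // stand-in for the real allocator call
}
```

After the first call on a thread, the check reduces to a single thread-local load and a predictable branch.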
There shouldn't be any changes in generated code on modern
architectures, as the memcpy will be reduced to a store/load operation
identical to the one generated with plain struct member access.
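
A minimal sketch of the memcpy-based access pattern referred to above follows. The helper names are hypothetical and the byte buffer stands in for a packed struct. With optimizations enabled, mainstream compilers typically lower a fixed-size memcpy like this to a single load or store.

```cpp
#include <cstdint>
#include <cstring>

// Write a value to a possibly unaligned address without undefined behaviour.
template<typename T>
inline void MemWriteSketch( void* ptr, T val )
{
    std::memcpy( ptr, &val, sizeof( T ) );
}

// Read a value from a possibly unaligned address.
template<typename T>
inline T MemReadSketch( const void* ptr )
{
    T val;
    std::memcpy( &val, ptr, sizeof( T ) );
    return val;
}

// Round-trips a 64-bit value through offset 1 of a byte buffer, which is
// where a field following a one-byte tag would sit in a packed struct.
inline int64_t RoundTripUnalignedSketch( int64_t value )
{
    unsigned char buffer[16] = {};
    MemWriteSketch( buffer + 1, value );
    return MemReadSketch<int64_t>( buffer + 1 );
}
```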
GetTime( cpu ) needs special handling, as the MSVC intrinsic for rdtscp
can't store the CPU identifier in a register. Using an intermediate
variable would cause a store to the stack, a read from the stack, and a
store to the destination address. Since rdtscp is only available on x86,
which handles unaligned stores without any problems, we can keep this one
place with direct struct member access.