Intro

One big difference between modern APIs such as DirectX 12 and older APIs such as DirectX 11 is that there is little to no validation of the commands you submit. This makes it much easier to write bugs that cause errors in the driver, which can make your application lose the device (and thus be unable to issue any more DirectX operations) without an obvious reason why.

In this blog post I will describe a case study in how I used DRED (Device Removed Extended Data) to debug a TDR (Timeout Detection and Recovery) device removal error in my QhenkiX glTFViewer example. This post will be very similar to the articles at:

but with a focus on my specific issue, so if you want to learn more about using DRED, I recommend reading those.

Background

The glTF viewer uses my RHI to render glTF models using DirectX 12 and 11. Models are loaded at runtime asynchronously on a separate thread, including the creation of GPU resources and copying data to them from system memory. Sometimes when loading a model the app’s framerate would drop significantly and then the app would freeze for a few seconds before displaying a TDR device removal error:

D3D12 ERROR: ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason 
(DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the 
hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. 
The current Device Context was executing commands when the hang occurred. The application may want to 
respawn and fallback to less aggressive use of the display hardware). 
[ EXECUTION ERROR #232: DEVICE_REMOVAL_PROCESS_AT_FAULT]

There was no earlier warning or error message from the D3D12 Debug Layer to hint at what was wrong. In comes DRED. DRED can insert “breadcrumbs” into a GPU command stream to track the progress of an executing command list. After a TDR, you can inspect this data to see which commands were executing when the TDR occurred, which resources were in use, and more.

Using DRED

Breadcrumbs need to be enabled before you create the ID3D12Device.

// From https://devblogs.microsoft.com/directx/debugger-extension-for-dred/
CComPtr<ID3D12DeviceRemovedExtendedDataSettings> pDredSettings;
if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&pDredSettings))))
{
    pDredSettings->SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
    pDredSettings->SetPageFaultEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
}

After a TDR occurs, there are two ways you can access the information from these breadcrumbs. You can write your own code (for custom logging, dumping, or analysis):

// From https://devblogs.microsoft.com/directx/debugger-extension-for-dred/
void MyDeviceRemovedHandler(ID3D12Device * pDevice)
{
    CComPtr<ID3D12DeviceRemovedExtendedData> pDred;
    VERIFY_SUCCEEDED(pDevice->QueryInterface(IID_PPV_ARGS(&pDred)));
    D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT DredAutoBreadcrumbsOutput;
    D3D12_DRED_PAGE_FAULT_OUTPUT DredPageFaultOutput;
    VERIFY_SUCCEEDED(pDred->GetAutoBreadcrumbsOutput(&DredAutoBreadcrumbsOutput));
    VERIFY_SUCCEEDED(pDred->GetPageFaultAllocationOutput(&DredPageFaultOutput));
    // Custom processing of DRED data can be done here.
    // Produce telemetry...
    // Log information to console...
    // break into a debugger...
}

or use an extension for WinDbg. In case you don’t know WinDbg, it is a Windows debugger often used for kernel-mode code such as drivers, but it can also debug user-mode apps like the one here. After loading the extension, we can use the !d3ddred command to inspect the removal data. In my case it looks like this:

windbg
@$d3ddred () : Devicestate: 3 [Type: D3D12_DEVICE_REMOVED_EXTENDED_DATA3]
  [<Raw View>]      [Type: D3D12_DEVICE_REMOVED_EXTENDED_DATA3]
  DeviceState  : D3D12_DRED_DEVICE_STATE_HUNG (3) [Type: D3D12_DRED_DEVICE_STATE]
  DeviceRemovedReason : Error: Unexpected internal error [get DeviceRemovedReason @D3DDred (line 328 col 36)]
  AutoBreadcrumbNodes : Count: 2
  PageFaultVA : 0x0
  ExistingAllocations : Count: 0
  RecentFreedAllocations: Count: 0

DRED includes a list of completed and outstanding operations. Upon closer inspection we can see that the device hung before or during a plain old DrawIndexedInstanced call.
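As a rough illustration of how that list can be read in code, here is a minimal sketch of finding the first outstanding operation. It assumes a simplified, hypothetical BreadcrumbNode struct standing in for D3D12_AUTO_BREADCRUMB_NODE so that it compiles without the Windows SDK; HangSite and FindFirstOutstandingOp are made-up names as well. The real node exposes BreadcrumbCount, pLastBreadcrumbValue, and pCommandHistory, and links to the next node through pNext.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for D3D12_AUTO_BREADCRUMB_NODE (not the real type).
struct BreadcrumbNode
{
    std::string command_list_name;    // stand-in for pCommandListDebugNameA
    std::vector<std::string> history; // stand-in for pCommandHistory (size plays the role of BreadcrumbCount)
    std::uint32_t last_completed = 0; // stand-in for *pLastBreadcrumbValue
};

// Hypothetical result type describing where the GPU appears to have stopped.
struct HangSite
{
    bool found = false;
    std::string command_list_name;
    std::uint32_t op_index = 0;
    std::string op_name;
};

// Returns the first recorded operation that had not completed.
// Operations with an index below last_completed finished; the one at
// last_completed is the first that did not. A node with zero completed
// operations may simply not have started, so treat the result as a hint.
HangSite FindFirstOutstandingOp(const std::vector<BreadcrumbNode>& nodes)
{
    for (const BreadcrumbNode& node : nodes)
    {
        if (node.last_completed < node.history.size())
        {
            HangSite site;
            site.found = true;
            site.command_list_name = node.command_list_name;
            site.op_index = node.last_completed;
            site.op_name = node.history[node.last_completed];
            return site;
        }
    }
    return HangSite{}; // every command list ran to completion
}
```

In a real handler you would walk the linked list returned by GetAutoBreadcrumbsOutput instead of a vector, but the comparison between the completed count and the recorded history is the same idea.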

[Screenshot: DRED breadcrumb output]

So something is likely wrong with a specific mesh being drawn. The glTF viewer uses a scene graph where each node can have its own mesh to draw. Looking at the struct below:

struct Node
{
  std::string name;
  int parent_index = -1;
  int mesh_index; // The suspected culprit of the TDR
  qhenki::Transform local_transform;
  struct
  {
    qhenki::Transform transform;
    bool dirty = true;
  } global_transform;
  std::vector<int> children_indices;
};

I forgot to give the mesh_index field a default initializer. Adding = -1 seems to have fixed the issue, as I can no longer reproduce the TDR. My guess is that the app read the uninitialized index when choosing which mesh to draw, which then referred to bogus or invalid resources, such as the vertex/index buffers or the descriptors bound for the shader. With this bug fixed, I can now load and unload as many glTF models as I want without any issues.
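To make the fix concrete, here is a minimal sketch that pairs the defaulted index with a bounds check before drawing. The Mesh struct and the FindMesh and CountDrawableNodes helpers are hypothetical names for illustration, not the viewer's actual code, and the transforms are left out to keep the example self-contained.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified mesh type for illustration.
struct Mesh
{
    int index_count = 0;
};

// Simplified Node: same index fields as the struct above, transforms omitted.
struct Node
{
    int parent_index = -1;
    int mesh_index = -1; // -1 means "this node has no mesh"
    std::vector<int> children_indices;
};

// Returns the mesh a node refers to, or nullptr when the node has no mesh
// or its index falls outside the loaded mesh array.
const Mesh* FindMesh(const Node& node, const std::vector<Mesh>& meshes)
{
    if (node.mesh_index < 0)
    {
        return nullptr;
    }
    const std::size_t index = static_cast<std::size_t>(node.mesh_index);
    if (index >= meshes.size())
    {
        return nullptr;
    }
    return &meshes[index];
}

// Counts the nodes that would actually issue a draw call.
int CountDrawableNodes(const std::vector<Node>& nodes, const std::vector<Mesh>& meshes)
{
    int count = 0;
    for (const Node& node : nodes)
    {
        if (FindMesh(node, meshes) != nullptr)
        {
            ++count;
        }
    }
    return count;
}
```

The default initializer removes the undefined read, and the range check turns any other bad index into a skipped draw on the CPU instead of a hang on the GPU.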

Alternatives and Complements to DRED

DRED is not the only way to debug TDR errors. A good complement is GPU-based validation, which can catch the use of invalid descriptors or descriptors that refer to invalid resources. If you are OK with a platform-specific solution, NVIDIA Nsight Aftermath is a library that can be integrated into your app to generate crash dumps, which can then be inspected in Nsight Graphics. Happy debugging!