[dxvk] Periodically flush handles for long-running queries #3273

CFSworks · 2023-03-02T02:33:24Z

Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does.

While this is misuse of the D3D API and is definitely a bug in the app, DXVK ought to tolerate this situation at least as well as D3D itself apparently does. This change will try to free up Vulkan query handles if they're in the available state, tracking their totals on the DxvkGpuQuery object itself instead.

As a bonus, this change should also reduce the CPU time consumed by repeated GetData(...) calls, since it avoids redundantly summing query handle results.

Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does. While this is misuse of the D3D API and is definitely a bug in the app, DXVK ought to tolerate this situation at least as well as D3D itself apparently does. This change will try to free up Vulkan query handles if they're in the available state, tracking their totals on the DxvkGpuQuery object itself instead. As a bonus, this change should also reduce the CPU time consumed by repeated GetData(...) calls, since it avoids redundantly summing query handle results.

doitsujin

Thanks for debugging that and coming up with a solution; I think the concept is good but I'll want to do some refactoring on the query code before addressing this.

Do you happen to have an apitrace or something to test the changes with? If not, I can hack a quick test app as well in a few minutes as well, would just be nice to see how bad the problem is in practice.

doitsujin · 2023-03-02T13:18:00Z

src/dxvk/dxvk_gpu_query.cpp

@@ -78,13 +73,44 @@ namespace dxvk {


  void DxvkGpuQuery::addQueryHandle(const DxvkGpuQueryHandle& handle) {
+    // Some apps forget to End() their queries; instead of accumulating query
+    // handles indefinitely, try to clean them up periodically.
+    if (m_handles.size() >= 256)


Any reason not to do this on every call? 256 queries as a threshold is a lot for a single query object, and games tend to have thousands of query objects. At the same time, we'll very rarely have more than one Vulkan query per D3D query in well-behaved games, so there won't really be any overhead.

You guessed it in your edit: the reason is to avoid overhead. Checking a deque's size, comparing against a constant, and conditionally branching over a call is one or two orders of magnitude cheaper than a vkGetQueryPoolResults which will likely only come back with VK_NOT_READY if run on a query handle so recently allocated.

Yeah, 256 is a pretty big threshold! The vast, vast majority of D3D queries out there in the wild are used responsibly and don't come anywhere near that number, which I picked since I didn't want to spend time seeking out the not-as-well-behaved games which do use multiple Vulkan queries per D3D query to make sure this new codepath wasn't causing some regression somewhere. I chose 256 conservatively to be sure that this would only affect "leaked" D3D queries. (And, let's remember that the old threshold was actually infinity, and while games do tend to have thousands of query objects, I only observed about a dozen being leaked in this manner.)

If m_handles is to become a small_vector, it does make sense to make the threshold the same as its preallocated region, however.

doitsujin · 2023-03-02T13:19:48Z

src/dxvk/dxvk_gpu_query.h

-
+    mutable DxvkQueryData                  m_queryData;
+    mutable DxvkGpuQueryHandle             m_handle;
+    mutable std::deque<DxvkGpuQueryHandle> m_handles;


I strongly dislike this mutable spam; if methods modify internal state then they shouldn't be const, especially since doing it like this hides potential thread safety concerns.

If anything, this highlights that the current implementation is bugged (in that accesses to m_ended do not enforce memory order).

I don't really like mutable either, though I wanted to avoid a pointer to an extra struct that would have just meant yet one more allocation.

Since DxvkGpuQuery::getData() is part of the "public" (D3D layer facing) API, it seemed appropriate to keep it const as it is indeed logically const. If that isn't The DXVK Way™, it's easy enough to make getData() non-const.

I can replace m_ended with an atomic if you'd like, though this is a separate change and I'd rebase this PR to put that commit deeper than this one.

doitsujin · 2023-03-02T13:25:21Z

src/dxvk/dxvk_gpu_query.cpp

+      m_handles.pop_front();
+    }
+
+    if (m_handle.queryPool) {


Probably easier to turn this whole mess into a small_vector. The main reason why there are two structures at all was to avoid allocations in the common case (1 query), but we have better ways to deal with this now.

The concern that I have here is that the small_vector's equivalent to pop_front() (erase(0)) is O(n), which would make the while loop O(n²). I suppose this could be mitigated if small_vector::erase() got a range overload, which could then be called once after the loop finishes; flushHandles() would be back to O(n).

Either way, that's a separate change so I'd be rebasing the PR to add a commit if I do it.

CFSworks · 2023-03-02T21:17:34Z

Thank you for the expedient and thorough review!

Do you happen to have an apitrace or something to test the changes with? If not, I can hack a quick test app as well in a few minutes as well, would just be nice to see how bad the problem is in practice.

The best I have is a RenderDoc capture, but I haven't found a way to get it to replay the Begin(Query ...) events. All of my repro work has been in Halo MCC itself. It renders full frames even with the pause menu active.

But, the problem is pretty bad: RSS growth of ~130MiB/minute at 60FPS and a new query pool (and associated GPU buffer mmap) every two seconds or so.

vimpostor · 2023-05-19T21:12:32Z

Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does.

@CFSworks Just out of curiosity, is this the memory leak that only happens in the Halo 1 campaign?

If yes, than it is pretty awesome that this got fixed in dxvk, because the memory leak even happens on Windows and hasn't been fixed there for months, even though the devs are aware of the issue.

CFSworks · 2023-05-19T21:30:09Z

@CFSworks Just out of curiosity, is this the memory leak that only happens in the Halo 1 campaign?

It sounds about right for the timeframe and everything. I was stuck at home sick for most of February and decided to play Halo 1's campaign to pass the time, but was plagued with OOMs after about 30 minutes of play (and progressively-worsening stuttering, probably from Linux's memory management trying to cope with the leak). I haven't tried the other campaigns or multiplayer, so I can't confirm whether my problem was unique only to Halo 1.

I suppose another way of confirming that it's the same issue would be to run Halo 1 on Windows but with dxvk instead of D3D11, and then see if it resolves the leak?

If yes, than it is pretty awesome that this got fixed in dxvk, because the memory leak even happens on Windows and hasn't been fixed there for months, even though the devs are aware of the issue.

Wow, I had no idea that this was affecting the Windows players too. I'm used to things like this being, "well the game itself is technically bugged because it's misusing the Windows API, but Windows itself seems to tolerate it so [dxvk/wine/whatever] should as well" so I never checked my (wrong) assumption that that would be the case here... Guess D3D11 itself doesn't tolerate this query misuse any better, which means dxvk is doing something better than the library it's trying to emulate! :)

I'd thought about sending in a report to the devs, pointing out their exact misuse of the API and giving some tips to finding the actual culprit code. Perhaps I/we still should? Maybe the reason it's gone unfixed for so long is they don't know where to look?

vimpostor · 2023-05-19T23:02:00Z

It sounds about right for the timeframe and everything. I was stuck at home sick for most of February and decided to play Halo 1's campaign to pass the time, but was plagued with OOMs after about 30 minutes of play (and progressively-worsening stuttering, probably from Linux's memory management trying to cope with the leak).

Yeah from your description it's pretty clear that we are indeed talking about the same memory leak, I don't think we have to do any further testing.

I'm used to things like this being, "well the game itself is technically bugged because it's misusing the Windows API, but Windows itself seems to tolerate it so [dxvk/wine/whatever] should as well" so I never checked my (wrong) assumption that that would be the case here...

Wow, pretty impressive that you still were able to find the issue, even without the Halo source code available, while not even 343I was able to find it with the code available :D Reminds me of the time where this happened in GTA V.

I'd thought about sending in a report to the devs, pointing out their exact misuse of the API and giving some tips to finding the actual culprit code. Perhaps I/we still should?

I agree, contacting the devs would be a good idea. Do you want to go ahead? :)

P.S. Stuff like this is why I enjoy the Linux gaming ecosystem. Sometimes it feels like a lot of workarounds, but then occasionally we have a gem like this one, where we are able to fix issues that seem unfixable even on Windows.

CFSworks · 2023-05-20T01:58:39Z

Wow, pretty impressive that you still were able to find the issue, even without the Halo source code available, while not even 343I was able to find it with the code available :D Reminds me of the time where this happened in GTA V.

Counterintuitively, I've identified more of these kinds of bugs in closed-source software than open-source. My guess is it's because they're actually pretty low-hanging fruit when you have the source and know what you need to look for (i.e. you are a user who is actively being bitten by the bug, not a developer who has only seen reports), so in the open-source world, these things get fixed pretty quickly -- long before I come along to encounter them. Even your GTA V example would fit the pattern: building the app with debug information and then using a profiler would have revealed strlen() eating a lot of CPU at load time straight away; no need for the tenacity and skillset of someone like t0st.

But then in the closed-source world, you either need a developer to prioritize that issue enough to take the time to nail it down (hard when there's more issues than developers, harder still when the most skilled devs have even more on their plates, and the developers aren't necessarily also users/players so they might not have quite the same personal motivation to fix it)... or you need the tenacity and skillset of someone like t0st. :)

I agree, contacting the devs would be a good idea. Do you want to go ahead? :)

Well, I replied to the 343I employee on the Reddit thread you linked. My hope is this will generate a few upvotes and make this enough of a "squeaky wheel with an easy fix" that someone over there either finally nails it down or shoots me a message to ask for more details. I'd thought about also opening a support ticket but those teams usually don't/can't escalate something directly to engineering, no matter how technical the reporter is -- any other way you think I should be getting their attention besides just Reddit?

vimpostor · 2023-05-20T08:40:17Z

Counterintuitively, I've identified more of these kinds of bugs in closed-source software than open-source.

Yeah, low hanging fruit are definitely easier to get fixed in free software.

any other way you think I should be getting their attention besides just Reddit?

For now this should be enough. If they still don't bother to reply in a few weeks, we can try to contact them in another way.

Personally I am also very interested in how the hell you debugged this, debugging software without source code available always felt like impossible to me.

So there is also the alternative strategy of writing up a blogpost for this, reach Hackernews front page, which will surely get the devs attention then. :D

(Seriously though, don't feel obliged to write a blog post just because I said this)

vimpostor · 2023-07-25T17:44:00Z

Looks like this might have indeed helped 343I fix the memory leak upstream: The latest patch fixes the problem:

Resolved a memory leak in Halo: Combat Evolved Campaign missions.

doitsujin reviewed Mar 2, 2023

View reviewed changes

doitsujin mentioned this pull request Mar 3, 2023

Rework available query processing #3278

Merged

doitsujin closed this Mar 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dxvk] Periodically flush handles for long-running queries #3273

[dxvk] Periodically flush handles for long-running queries #3273

CFSworks commented Mar 2, 2023

doitsujin left a comment

doitsujin Mar 2, 2023 •

edited

Loading

CFSworks Mar 2, 2023

doitsujin Mar 2, 2023 •

edited

Loading

CFSworks Mar 2, 2023

doitsujin Mar 2, 2023

CFSworks Mar 2, 2023

CFSworks commented Mar 2, 2023

vimpostor commented May 19, 2023

CFSworks commented May 19, 2023

vimpostor commented May 19, 2023 •

edited

Loading

CFSworks commented May 20, 2023

vimpostor commented May 20, 2023

vimpostor commented Jul 25, 2023

[dxvk] Periodically flush handles for long-running queries #3273

[dxvk] Periodically flush handles for long-running queries #3273

Conversation

CFSworks commented Mar 2, 2023

doitsujin left a comment

Choose a reason for hiding this comment

doitsujin Mar 2, 2023 • edited Loading

Choose a reason for hiding this comment

CFSworks Mar 2, 2023

Choose a reason for hiding this comment

doitsujin Mar 2, 2023 • edited Loading

Choose a reason for hiding this comment

CFSworks Mar 2, 2023

Choose a reason for hiding this comment

doitsujin Mar 2, 2023

Choose a reason for hiding this comment

CFSworks Mar 2, 2023

Choose a reason for hiding this comment

CFSworks commented Mar 2, 2023

vimpostor commented May 19, 2023

CFSworks commented May 19, 2023

vimpostor commented May 19, 2023 • edited Loading

CFSworks commented May 20, 2023

vimpostor commented May 20, 2023

vimpostor commented Jul 25, 2023

doitsujin Mar 2, 2023 •

edited

Loading

doitsujin Mar 2, 2023 •

edited

Loading

vimpostor commented May 19, 2023 •

edited

Loading