[PATCH] drm/amdkfd: Restore all process on post reset
Eric Huang
jinhuieric.huang at amd.com
Tue Aug 3 15:02:23 UTC 2021
On 2021-07-30 5:26 p.m., Felix Kuehling wrote:
> Am 2021-07-28 um 1:31 p.m. schrieb Eric Huang:
>> It is to fix a bug of gpu_recovery on multiple GPUs,
>> When one gpu is reset, the application running on other
>> gpu hangs, because kfd post reset doesn't restore the
>> running process.
> This will resume all processes, even those that were affected by the GPU
> reset.
>
> The assumption we made here is that KFD processes can use all GPUs. So
> when one GPU is reset, all processes are affected. If we want to refine
> that, we'll need some more complex logic to identify the processes
> affected by a GPU reset and keep only those processes suspended. This
> could be based on the GPUs that a process has created queues on, or
> allocated memory on.
>
> What we don't want, is processes continuing with bad data or
> inconsistent state after a GPU reset.
>
The current code doesn't take care of this assumption. When one GPU hangs,
evicting queues fails on it, the partial eviction is rolled back, and the
process's queues on the other GPUs are restored, so those processes continue
running with unclear state and data after the GPU reset.
The original idea behind this patch is to call kfd_suspend_all_processes and
kfd_restore_all_processes as a pair in pre_reset and post_reset. That keeps
the behavior consistent between amdgpu_gpu_recover and hang_hws.
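For reference, the symmetry I'm after looks roughly like this (a rough sketch
based on the dkms 5.11 branch, not the exact code on either branch; the
kgd2kfd_resume() parameters differ on amd-staging-drm-next):

/* Sketch only: suspend everything in pre_reset, restore everything in
 * post_reset, so both gpu_recovery and hang_hws go through the same
 * kfd_suspend_all_processes()/kfd_restore_all_processes() pair.
 */
int kgd2kfd_pre_reset(struct kfd_dev *kfd)
{
        if (!kfd->init_complete)
                return 0;

        kfd->dqm->ops.pre_reset(kfd->dqm);
        kgd2kfd_suspend(kfd, false);            /* -> kfd_suspend_all_processes() */
        kfd_signal_reset_event(kfd);

        return 0;
}

int kgd2kfd_post_reset(struct kfd_dev *kfd)
{
        int ret;

        if (!kfd->init_complete)
                return 0;

        ret = kgd2kfd_resume(kfd, false, true); /* -> kfd_restore_all_processes() */
        if (ret)
                return ret;

        atomic_dec(&kfd_locked);

        return 0;
}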
>> And it also fixes a bug in the function
>> kfd_process_evict_queues, when one gpu hangs, process
>> running on other gpus can't be evicted.
> This should be a separate patch. The code you're removing was there as
> an attempt to make kfd_process_evict_queues transactional. That means,
> it either succeeds completely or it fails completely. I wanted to avoid
> putting the system into an unknown state where some queues are suspended
> and others are not and the caller has no idea how to proceed. So I roll
> back a partial queue eviction if something failed.
Can we let the caller decide whether a roll-back is needed? Not all callers
need to roll back, e.g. kfd_suspend_all_processes and kgd2kfd_quiesce_mm.
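For instance, something along these lines (purely illustrative; the roll_back
parameter is made up for this discussion and doesn't exist in the current
code, and which callers would pass true or false is exactly the policy
question):

/* Hypothetical: let the caller decide whether a partial eviction gets
 * rolled back, or whether we evict as much as we can and report the error.
 */
int kfd_process_evict_queues(struct kfd_process *p, bool roll_back)
{
        unsigned int n_evicted = 0;
        int r = 0;
        int i;

        for (i = 0; i < p->n_pdds; i++) {
                struct kfd_process_device *pdd = p->pdds[i];
                int ret;

                ret = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
                                                              &pdd->qpd);
                if (ret) {
                        pr_err("Failed to evict process queues\n");
                        r = ret;
                        if (roll_back)
                                goto fail;
                        continue;       /* best effort: keep evicting the rest */
                }
                n_evicted++;
        }

        return r;

fail:
        /* Same roll-back the current code does */
        for (i = 0; i < p->n_pdds; i++) {
                struct kfd_process_device *pdd = p->pdds[i];

                if (n_evicted == 0)
                        break;
                if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
                                                              &pdd->qpd))
                        pr_err("Failed to restore queues\n");
                n_evicted--;
        }

        return r;
}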
>
> You're changing this to "try to evict as much as possible". Then a failure
> does not mean "eviction failed" but "eviction completed but something
> hung". Then the GPU reset can take care of the hanging part. I think
> that's a reasonable strategy. But we need to be sure that there are no
> other error conditions (other than hangs) that could cause a queue
> eviction to fail.
>
> There were some recent changes in pqm_destroy_queue that check the
> return value of dqm->ops.destroy_queue, which indirectly comes from
> execute_queues (same as in the eviction case). -ETIME means a hang. Other
> errors are possible. Those other errors may still need the roll-back.
> Otherwise we'd leave stuff running on a non-hanging GPU after we
> promised that we evicted everything.
I think it depends on the scenario. A GPU reset doesn't need to know the
return state; memory eviction may. Does the MMU notifier invalidate-range
path need it?
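If we do need the distinction, the eviction loop could separate a hang from
other failures, roughly like this (an illustrative fragment of
kfd_process_evict_queues(), treating -ETIME as "queues are hung, the reset
will clean up" and rolling back only for other errors):

        for (i = 0; i < p->n_pdds; i++) {
                struct kfd_process_device *pdd = p->pdds[i];
                int ret;

                ret = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
                                                              &pdd->qpd);
                if (ret == -ETIME) {
                        /* Hang: report it, but don't roll back; the GPU
                         * reset will take care of the hanging queues. */
                        pr_err("Process queues hung during eviction\n");
                        r = ret;
                } else if (ret) {
                        /* Real failure on a healthy GPU: roll back so we
                         * don't leave a half-evicted process behind. */
                        pr_err("Failed to evict process queues\n");
                        r = ret;
                        goto fail;
                } else {
                        n_evicted++;
                }
        }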
>
> See one more comment inline.
>
>
>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 24 +-----------------------
>> 2 files changed, 2 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 24b5e0aa1eac..daf1c19bd799 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -984,7 +984,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>> if (!kfd->init_complete)
>> return 0;
>>
>> - ret = kfd_resume(kfd);
>> + ret = kgd2kfd_resume(kfd, false, true);
> Which branch is this for? On amd-staging-drm-next kgd2kfd_resume only
> has two parameters.
Sorry, it is based on the dkms staging 5.11 branch. I didn't notice there is
a difference between the two branches.
Regards,
Eric
>
> Regards,
> Felix
>
>
>> if (ret)
>> return ret;
>> atomic_dec(&kfd_locked);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 38a9dee40785..9272a12c1db8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1879,36 +1879,14 @@ int kfd_process_evict_queues(struct kfd_process *p)
>> {
>> int r = 0;
>> int i;
>> - unsigned int n_evicted = 0;
>>
>> for (i = 0; i < p->n_pdds; i++) {
>> struct kfd_process_device *pdd = p->pdds[i];
>>
>> r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
>> &pdd->qpd);
>> - if (r) {
>> + if (r)
>> pr_err("Failed to evict process queues\n");
>> - goto fail;
>> - }
>> - n_evicted++;
>> - }
>> -
>> - return r;
>> -
>> -fail:
>> - /* To keep state consistent, roll back partial eviction by
>> - * restoring queues
>> - */
>> - for (i = 0; i < p->n_pdds; i++) {
>> - struct kfd_process_device *pdd = p->pdds[i];
>> -
>> - if (n_evicted == 0)
>> - break;
>> - if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
>> - &pdd->qpd))
>> - pr_err("Failed to restore queues\n");
>> -
>> - n_evicted--;
>> }
>>
>> return r;