[PATCH] drm/amdkfd: Restore all process on post reset

Tue Aug 3 15:02:23 UTC 2021


On 2021-07-30 5:26 p.m., Felix Kuehling wrote:
> Am 2021-07-28 um 1:31 p.m. schrieb Eric Huang:
>> It is to fix a bug of gpu_recovery on multiple GPUs,
>> When one gpu is reset, the application running on other
>> gpu hangs, because kfd post reset doesn't restore the
>> running process.
> This will resume all processes, even those that were affected by the GPU
> reset.
>
> The assumption we made here is, that KFD processes can use all GPUs. So
> when one GPU is reset, all processes are affected. If we want to refine
> that, we'll need some more complex logic to identify the processes
> affected by a GPU reset and keep only those processes suspended. This
> could be based on the GPUs that a process has created queues on, or
> allocated memory on.
>
> What we don't want, is processes continuing with bad data or
> inconsistent state after a GPU reset.
>
Current code doesn't take care of this assumption. When a GPU hangs, 
evicting queues will fail on it and roll back to restore all processes 
queues on other GPUs, and continue to running with unclear state and 
data after a GPU reset.

The original thought about this patch is to call 
kfd_suspend_all_processes and kfd_restore_all_processes in pair on 
pre_reset and post_reset. And It keeps the consistent behavior for both 
amdgpu_gpu_recover and hang_hws.
>>   And it also fixes a bug in the function
>> kfd_process_evict_queues, when one gpu hangs, process
>> running on other gpus can't be evicted.
> This should be a separate patch. The code you're removing was there as
> an attempt to make kfd_process_evict_queues transactional. That means,
> it either succeeds completely or it fails completely. I wanted to avoid
> putting the system into an unknown state where some queues are suspended
> and others are not and the caller has no idea how to proceed. So I roll
> back a partial queue eviction if something failed.
Can we let the caller to decide if roll-back is needed? because no all 
the callers need to roll back, e.g. kfd_suspend_all_processes and 
kgd2kfd_quiesce_mm.
>
> Your changing this to "try to evict as much as possible". Then a failure
> does not mean "eviction failed" but "eviction completed but something
> hung". Then the GPU reset can take care of the hanging part. I think
> that's a reasonable strategy. But we need to be sure that there are no
> other error conditions (other than hangs) that could cause a queue
> eviction to fail.
>
> There were some recent changes in pqm_destroy_queue that check the
> return value of dqm->ops.destroy_queue, which indirectly comes from
> execute_queues (sam as in the eviction case). -ETIME means a hang. Other
> errors are possible. Those other errors may still need the roll-back.
> Otherwise we'd leave stuff running on a non-hanging GPU after we
> promised that we evicted everything.
I think it depends on case scenario. GPU reset doesn't need to know the 
return state. Memory eviction may need. Does Memory notifier invalidate 
range need?
>
> See one more comment inline.
>
>
>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c  |  2 +-
>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 24 +-----------------------
>>   2 files changed, 2 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 24b5e0aa1eac..daf1c19bd799 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -984,7 +984,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>   	if (!kfd->init_complete)
>>   		return 0;
>>   
>> -	ret = kfd_resume(kfd);
>> +	ret = kgd2kfd_resume(kfd, false, true);
> Which branch is this for? On amd-staging-drm-next kgd2kfd_resume only
> has two parameters.
Sorry, it is based on dkms staging 5.11. I didn't notice there is 
difference between two branches.

Regards,
Eric
>
> Regards,
>    Felix
>
>
>>   	if (ret)
>>   		return ret;
>>   	atomic_dec(&kfd_locked);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 38a9dee40785..9272a12c1db8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1879,36 +1879,14 @@ int kfd_process_evict_queues(struct kfd_process *p)
>>   {
>>   	int r = 0;
>>   	int i;
>> -	unsigned int n_evicted = 0;
>>   
>>   	for (i = 0; i < p->n_pdds; i++) {
>>   		struct kfd_process_device *pdd = p->pdds[i];
>>   
>>   		r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
>>   							    &pdd->qpd);
>> -		if (r) {
>> +		if (r)
>>   			pr_err("Failed to evict process queues\n");
>> -			goto fail;
>> -		}
>> -		n_evicted++;
>> -	}
>> -
>> -	return r;
>> -
>> -fail:
>> -	/* To keep state consistent, roll back partial eviction by
>> -	 * restoring queues
>> -	 */
>> -	for (i = 0; i < p->n_pdds; i++) {
>> -		struct kfd_process_device *pdd = p->pdds[i];
>> -
>> -		if (n_evicted == 0)
>> -			break;
>> -		if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
>> -							      &pdd->qpd))
>> -			pr_err("Failed to restore queues\n");
>> -
>> -		n_evicted--;
>>   	}
>>   
>>   	return r;