[PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
Felix Kuehling
felix.kuehling at amd.com
Thu Aug 5 15:10:25 UTC 2021
On 2021-08-05 at 4:51 a.m., Zhu, Changfeng wrote:
> [AMD Official Use Only]
>
> Hi Felix,
>
> Can we set noretry=1 for the dGPU path (ignore_crat=1), which doesn’t go through the IOMMUv2 path on Raven, as below:
There are other possible reasons than ignore_crat for Raven to work in
dGPU mode (broken CRAT, disabled IOMMU). However, those are not known
until later in the initialization.
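
To illustrate the timing problem, here is a minimal, self-contained C sketch. It is a model only, not driver code: every `sketch_*` name, the `sketch_late_state` struct and its fields are hypothetical, and treating noretry=1 as the preferred value whenever IOMMUv2 is not in use is an assumption taken from the proposal quoted below, not something the driver implements.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the "auto" value of the amdgpu_noretry parameter (-1) in the patch. */
#define SKETCH_NORETRY_AUTO (-1)

/*
 * Early decision, as in the quoted proposal: an explicit parameter value
 * (0 or 1) always wins; in auto mode only ignore_crat is consulted.
 */
static int sketch_noretry_early(int noretry_param, bool ignore_crat)
{
	assert(noretry_param >= -1 && noretry_param <= 1);
	if (noretry_param != SKETCH_NORETRY_AUTO)
		return noretry_param;
	return ignore_crat ? 1 : 0;
}

/*
 * Hypothetical view of state that is only known later in initialization.
 * The field names are assumptions made for this sketch.
 */
struct sketch_late_state {
	bool ignore_crat;
	bool crat_valid;    /* false models a broken CRAT */
	bool iommu_enabled; /* false models a disabled IOMMU */
};

/* In this model, IOMMUv2 is in use only if nothing forces dGPU mode. */
static bool sketch_uses_iommuv2(const struct sketch_late_state *s)
{
	return !s->ignore_crat && s->crat_valid && s->iommu_enabled;
}

/*
 * Decision that would be possible if the late state were available:
 * keep retry enabled (noretry = 0) whenever IOMMUv2 is in use.
 * Choosing 1 otherwise is an assumption based on the proposal.
 */
static int sketch_noretry_late(int noretry_param,
			       const struct sketch_late_state *s)
{
	assert(noretry_param >= -1 && noretry_param <= 1);
	if (noretry_param != SKETCH_NORETRY_AUTO)
		return noretry_param;
	return sketch_uses_iommuv2(s) ? 0 : 1;
}
```

With `ignore_crat` false but `iommu_enabled` false, the early function returns 0 while the late one returns 1, which is the kind of mismatch an ignore_crat-only check cannot detect.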
Regards,
Felix
>> + case CHIP_RAVEN:
>> + /*
>> + * TODO: Raven currently can fix most SVM issues with
>> + * noretry =1. However it has two issues with noretry = 1
>> + * on kfd migrate tests. It still needs to root causes
>> + * with these two migrate fails on raven with noretry = 1.
>> + */
>> if (amdgpu_noretry == -1) {
>> if (ignore_crat)
>> gmc->noretry = 1;
>> else
>> gmc->noretry = 0;
>> }
>> else
>> gmc->noretry = amdgpu_noretry;
>> break;
> BR,
> Changfeng.
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Wednesday, July 28, 2021 10:22 PM
> To: Zhu, Changfeng <Changfeng.Zhu at amd.com>; amd-gfx at lists.freedesktop.org; Huang, Ray <Ray.Huang at amd.com>; Zhang, Yifan <Yifan1.Zhang at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: set default noretry=1 to fix kfd SVM issues for raven
>
> Doesn't this break IOMMUv2? Applications that run using IOMMUv2 for system memory access depend on correct retry handling in the SQ.
> Therefore noretry must be 0 on Raven.
>
> I believe the reason that SVM has trouble with retry enabled is that
> IOMMUv2 is catching the page faults, so the driver never gets to handle
> the page fault interrupts. That breaks page-fault based migration in the
> SVM code. I think the better solution is to disable SVM on APUs where
> IOMMUv2 is enabled.
>
> Alternatively, we could give up on IOMMUv2 entirely and always rely on SVM to provide that functionality. But that requires more changes in the amdgpu_vm code.
>
> Regards,
> Felix
>
>
> On 2021-07-28 at 2:36 a.m., Changfeng wrote:
>> From: changzhu <Changfeng.Zhu at amd.com>
>>
>> From: Changfeng <Changfeng.Zhu at amd.com>
>>
>> No issues are found with noretry=1 except two SVM migrate issues.
>> Conversely, noretry=0 causes most SVM cases to fail.
>> The two SVM migrate issues also happen with noretry=0. So set the
>> default to noretry=1 for Raven for now to fix most SVM failures.
>>
>> Change-Id: Idb5cb3c1a04104013e4ab8aed2ad4751aaec4bbc
>> Signed-off-by: Changfeng <Changfeng.Zhu at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++-------
>> 1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index 09edfb64cce0..d7f69dbd48e6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -606,19 +606,20 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>> * noretry = 0 will cause kfd page fault tests fail
>> * for some ASICs, so set default to 1 for these ASICs.
>> */
>> + case CHIP_RAVEN:
>> + /*
>> + * TODO: Raven currently can fix most SVM issues with
>> + * noretry =1. However it has two issues with noretry = 1
>> + * on kfd migrate tests. It still needs to root causes
>> + * with these two migrate fails on raven with noretry = 1.
>> + */
>> if (amdgpu_noretry == -1)
>> gmc->noretry = 1;
>> else
>> gmc->noretry = amdgpu_noretry;
>> break;
>> - case CHIP_RAVEN:
>> default:
>> - /* Raven currently has issues with noretry
>> - * regardless of what we decide for other
>> - * asics, we should leave raven with
>> - * noretry = 0 until we root cause the
>> - * issues.
>> - *
>> + /*
>> * default this to 0 for now, but we may want
>> * to change this in the future for certain
>> * GPUs as it can increase performance in