CUDA illegal memory access error when using array indices stored in another array
I am using CUDA Fortran and have been struggling with this problem in a simple kernel without finding a solution. Is it not possible to use integer values stored in one array as indices into another array?
Here's a simple example (structured the same way as the main program):
program test
  use cudafor
  integer :: ncell, i
  integer, allocatable :: values(:)
  integer, allocatable, device :: values_d(:)
  ncell = 10
  allocate(values(ncell), values_d(ncell))
  do i = 1, ncell
    values(i) = i
  enddo
  values_d = values
  call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d, ncell)
  values = values_d
  write (*,*) values
end program test
!////////////////////////////////////////////////////
attributes(global) subroutine multipleindices_kernel(valu, ncell)
  use cudafor
  implicit none
  integer, value :: ncell ! ncell = 10
  integer :: valu(ncell)
  integer :: tempind(10)
  integer :: i
  tempind(1) = 10
  tempind(2) = 3
  tempind(3) = 5
  tempind(4) = 7
  tempind(5) = 9
  tempind(6) = 2
  tempind(7) = 4
  tempind(8) = 6
  tempind(9) = 8
  tempind(10) = 1
  i = (blockidx%x - 1) * blockdim%x + threadidx%x
  if (i .LE. ncell) then
    valu(tempind(i)) = 1
  endif
end subroutine
I understand that if the tempind array contained repeated values, different threads could read or write the same memory location, but that is not the case here. Nevertheless, the program fails with the error "0: copying Memcpy (host = 0x303610, dev = 0x3e20000, size = 40) FAILED: 77 (illegal memory access was encountered)."
Does anyone know whether it is possible to use indices that come from another array in CUDA?
After some additional tests, I noticed that the error is not reported at the kernel launch itself, but when the data is transferred back to the CPU (if I remove "values = values_d", no error is visible). Also, if I replace valu(tempind(i)) with valu(i) in the kernel, it works fine, but I want the indices to come from an array, because the purpose of this test is to parallelize a CFD code in which the indices are stored that way.
The problem is that the generated executable is not passing the correct value of ncell to the kernel. Running the application through cuda-memcheck shows that threads outside 1-10 are taking the branch, and adding a print statement to print ncell inside the kernel also gives strange values.
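The failure only shows up at the copy back to the host because kernel launches are asynchronous: an error raised while the kernel runs is reported by the next CUDA call that synchronizes with the device. The following is a minimal sketch of a host-side check that surfaces such an error directly after the launch. It assumes the cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString functions provided by the cudafor module; the names checkmod, fill_kernel and check_launch are hypothetical and not part of the original code.

```fortran
module checkmod
contains
  ! Hypothetical kernel used only to illustrate the error check.
  attributes(global) subroutine fill_kernel(a, n)
    implicit none
    integer, value :: n
    integer :: a(n)
    integer :: i
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= n) a(i) = 1
  end subroutine fill_kernel
end module checkmod

program check_launch
  use cudafor
  use checkmod
  implicit none
  integer, parameter :: n = 10
  integer :: istat
  integer :: a(n)
  integer, device :: a_d(n)

  a_d = 0
  call fill_kernel<<<1, 32>>>(a_d, n)

  ! Errors in the launch itself (for example an invalid configuration)
  ! are returned by cudaGetLastError.
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) write (*,*) 'launch failed: ', trim(cudaGetErrorString(istat))

  ! Wait for the kernel to finish so that an execution error such as an
  ! illegal memory access is reported here, not at the next memory copy.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) write (*,*) 'kernel failed: ', trim(cudaGetErrorString(istat))

  a = a_d
  write (*,*) a
end program check_launch
```

Placing the same two checks after the multipleindices_kernel launch should attribute the illegal memory access to the kernel rather than to the assignment "values = values_d".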
It used to be a requirement that all attributes(global) subroutines be placed inside a module. This requirement seems to have been relaxed in later versions of CUDA Fortran (I cannot find references to it in the programming manual). I believe that having the kernel outside a module is causing the error here. A likely reason is that a kernel defined outside a module has no explicit interface in the caller, so the compiler cannot see that ncell is declared with the value attribute. By putting multipleindices_kernel in a module and using that module in test, I consistently get correct answers without errors. The code for this is below:
module testmod
contains
  attributes(global) subroutine multipleindices_kernel(valu, ncell)
    use cudafor
    implicit none
    integer, value :: ncell ! ncell = 10
    integer :: valu(ncell)
    integer :: tempind(10)
    integer :: i
    tempind(1) = 10
    tempind(2) = 3
    tempind(3) = 5
    tempind(4) = 7
    tempind(5) = 9
    tempind(6) = 2
    tempind(7) = 4
    tempind(8) = 6
    tempind(9) = 8
    tempind(10) = 1
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i .LE. ncell) then
      valu(tempind(i)) = 1
    endif
  end subroutine
end module testmod
program test
  use cudafor
  use testmod
  integer :: ncell, i
  integer, allocatable :: values(:)
  integer, allocatable, device :: values_d(:)
  ncell = 10
  allocate(values(ncell), values_d(ncell))
  do i = 1, ncell
    values(i) = i
  enddo
  values_d = values
  call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d, ncell)
  values = values_d
  write (*,*) values
end program test
!////////////////////////////////////////////////////