GPU processing


For recording how we get on with the experimental NVIDIA cards

Hardware

The Quadro FX 5600 fits nicely into the WoC midi tower cases (after removing the hard drive bay and relocating the hard drive to the floppy bay, where it is somewhat less than securely mounted, so treat the hardware gently). We replaced the PSU with an Amtec 560W one to provide enough power for the card.

We (well, Daan and Mark) now have zero, a cluster with 12 NVIDIA Tesla C870 cards and CUDA 2.0.

Software install

http://developer.nvidia.com/object/cuda.html

I installed the CUDA toolkit in my home directory and then copied it to the NFS server (the installer script does not do any configuration, so this should be safe). There is a 'cuda' module to set the required environment variables (and a few others that I think will be useful).
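For reference, the environment the module sets up looks roughly like this. This is only a sketch: the toolkit path is an assumption modelled on the SDK path given below, and the real modulefile may set more.

# Assumed toolkit location; the real path may differ
export CUDA_INSTALL_PATH=/usr/local/shared/suse-10.2/x86_64/cuda/toolkit
export PATH=$CUDA_INSTALL_PATH/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_INSTALL_PATH/lib:$LD_LIBRARY_PATH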

The NVIDIA driver itself requires a bit more decision-making. The NVIDIA site says to install it via YaST, by adding NVIDIA as a package installation source. This is against our policy of never mixing third-party RPMs with the SuSE ones, so I installed the kernel source package

zypper install kernel-source

and then ran the driver installation by hand. The -e flag is great as it shows you exactly what's going on. Unfortunately it seems you need a version of the driver which exactly matches the CUDA version; mine is a few minor releases newer. All seems to be working, but the logs are full of whining from the kernel. I'll sort that out in a bit.
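For the record, the by-hand installation was along these lines (the .run filename is a placeholder for whichever driver release matches your CUDA version):

sh NVIDIA-Linux-x86_64-<version>.run -e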

The install backed up and replaced a few libraries and other things. This is a worry, as any future SuSE update will trash it; I need a new version of the image without those SuSE libraries to use on these machines. The kernel module got dropped into the running kernel's /lib/modules directory tree, unsurprisingly. I am inclined to live with the need to update this by hand: the auto-patching script doesn't do kernels on 10.2, so a new kernel will always need a computer officer's intervention anyway.
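After a kernel update the module therefore needs re-installing by hand. A quick sanity check that there is a module matching the running kernel:

modinfo -F version nvidia
find /lib/modules/$(uname -r) -name nvidia.ko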

The SDK is currently in /usr/local/shared/suse-10.2/x86_64/cuda/sdk and requires the 'cuda' module to be loaded.

Device special problems

I wasn't sure whether the nvidia.ko kernel module would actually get loaded on boot, so for the moment I modprobed it by hand. Does it even need loading on boot? Apparently yes. It looks to me as if SuSE's modprobe.conf actually has the right hooks in it anyway; the reason it wasn't working turned out to be the lack of device specials (see below).
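For the record, loading it by hand and checking it took:

modprobe nvidia
lsmod | grep nvidia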

Unsurprisingly I had to do nasty things to permissions in /dev to get the stuff to work: /dev/nvidiactl and /dev/nvidia0 need to be read-write by the user. Also, weirdly, the device specials weren't there to start with, even though I had loaded the driver. I ran sax and they were created. Will this stick? Nope: you have to start X to get them created after a reboot, and then chmod them.

Surely it must be possible to arrange for them to be handled by udev (and possibly PAM) in a sensible way? Apparently not. udev relies on sysfs to tell it what devices to create, and non-GPLed drivers may not use sysfs, so the NVIDIA drivers can't do this; the NVIDIA X driver creates the device specials by hand on startup. Well, I don't want to have to start and stop X every time we reboot, so I need a better method. It seems that if you create the nodes you want under /lib/udev/devices, udev will make the corresponding entries in /dev at boot.

For future reference:

crw-rw-rw- 1 root video 195, 255 2007-10-17 16:56 /dev/nvidiactl
crw-rw-rw- 1 root video 195, 0 2007-10-17 16:56 /dev/nvidia0
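So the fix is to create the nodes once under /lib/udev/devices, using the numbers from the listing above:

mknod -m 666 /lib/udev/devices/nvidiactl c 195 255
mknod -m 666 /lib/udev/devices/nvidia0 c 195 0
chown root:video /lib/udev/devices/nvidiactl /lib/udev/devices/nvidia0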

Tests

You will need the cuda module loaded, which sets up the toolkit and libraries needed.
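With the environment modules setup that is just:

module load cuda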

NVIDIA supply a set of test/example programs, which can be downloaded here: CUDA resources. Note that you can only use v1.0 of the SDK, as we haven't installed a later version of the drivers yet: in v1.1, libcuda.so (which is rather vital) moved from the toolkit into the driver. See the NVIDIA forum thread for more information.
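A quick way to check which libcuda.so the runtime linker will pick up:

ldconfig -p | grep libcuda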

To compile, you just need to run the SDK binary file and then run make in the base of the resulting directory structure.
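Concretely, something like the following (the name of the unpacked directory may differ):

sh NVIDIA_CUDA_SDK_1.0.run
cd NVIDIA_CUDA_SDK
make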

scepter does not have the glut library installed, so cannot compile some of the OpenGL tests. We're not so interested in this application of the card anyway, but you do get errors from these tests. To build everything else, do:

SDK> make lib/libcutil.so
SDK/projects> for dir in *; do grep lglut "$dir/Makefile" >/dev/null 2>&1; if [ "$?" -ne "0" ]; then cd "$dir"; \
make; cd ..; fi; done

where SDK is the directory to which you extracted NVIDIA_CUDA_SDK_1.0.run.

The test binaries are in SDK/bin/linux/release. Some accept the help flag. All pass. You can see the test output in scepter:~jss43/GPGPU/src/SDK_1.0/bin/linux/release/test-output.
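For example (sample names assumed to be present in the v1.0 SDK):

cd SDK/bin/linux/release
./deviceQuery
./bandwidthTest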

--james 16:42, 5 March 2008 (GMT)

It turns out that the makefile isn't so hot at finding the glut library. There are libglut.so.3 and libglut.so.3.8.0 installed in /usr/lib64 (thanks to Greg for pointing this out), but the SDK wants libglut.so. If you really want to use this, you can. Just do:

SDK/lib> ln -s /usr/lib64/libglut.so.3.8.0 libglut.so

(This is already set up for us on ajax, but not on scepter.) This allows all of the test codes to pass.
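A quick way to confirm the symlink is being picked up is ldd on one of the GL sample binaries (simpleGL as an example, assuming it built):

ldd SDK/bin/linux/release/simpleGL | grep glut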

--james 18:00, 5 March 2008 (GMT)

Useful links