Setting Up a GPU Development Environment

Preface

I have an =Alienware= laptop and a few books on =CUDA= development, and I plan to learn something about =GPU= programming.

A quick look showed that my MacBook Pro uses =Intel graphics=, so it can only run CUDA programs in =emulation mode=; it can run =OpenCL= programs natively, though. After all, OpenCL was proposed by Apple.

I then installed Fedora on the Alienware laptop, got nowhere fighting the wireless driver, switched to Ubuntu, and failed again when the system seemed to have no Nvidia GPU.

Then I booted into the Win8 system, which promptly asked me to update the Nvidia driver. So there is an Nvidia GPU after all, I thought.

So I wiped all the data and reinstalled Ubuntu from scratch, putting the system on the SSD and keeping the other Samsung disk for data.

Ubuntu's desktop support is comparatively polished, so things went fairly smoothly, though it still took a few hours of fiddling. Next I can write some programs in CUDA C, try PyCUDA, and of course not forget OpenCL.

Enabling Wireless Networking on Ubuntu

For some reason, almost every fresh Linux install runs into wireless networking trouble. Endlessly annoying!

I installed the Ubuntu 14.04 LTS release. Source: http://blog.csdn.net/zhaoqiaoshi/article/details/7736936

Reinstall all b43-related drivers and firmware

$ sudo apt-get install bcmwl-kernel-source    # Broadcom 802.11 Linux STA wireless driver source
$ sudo apt-get install broadcom-sta-common
$ sudo apt-get install broadcom-sta-source
$ sudo apt-get install b43-fwcutter           # Broadcom 43xx firmware extraction tool
$ sudo apt-get install firmware-b43-installer

Activate the wireless card

$ sudo modprobe -r b43 ssb
$ sudo modprobe b43
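
As a quick sanity check, confirm the module actually loaded and that the interface shows up (=nmcli= comes with Ubuntu's stock NetworkManager):

$ lsmod | grep b43
$ nmcli dev status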

Activate automatically at boot

Add the following to /etc/rc.local (the file is executed as root at boot, so no prompt or =sudo= is needed):

modprobe b43

Adding a Disk to Linux

Since I put the system straight onto the SSD during installation, the other Samsung disk still has to be added to the system. Source: http://blog.csdn.net/ymj7150697/article/details/5810683

Find the device file

$ sudo fdisk -l

It turned out that the disk in use is =/dev/sdb=, and the unused one is =/dev/sda=.

Partition the disk

$ sudo fdisk /dev/sda

Press =m= for help, =n= to create a new partition, choose =p= for a primary partition, and use =1= as the primary partition number.

I wasn't planning anything elaborate; I may well reinstall again in a few days anyway. Press =w= to write the changes and exit, and the device file =/dev/sda1= appears.
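
To double-check the new partition table before formatting it:

$ sudo fdisk -l /dev/sda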

Format the disk

$ sudo mkfs -t ext4 /dev/sda1

Mount the partition

$ sudo mkdir /Yi
$ sudo mount /dev/sda1 /Yi
$ sudo chmod -R g+w /Yi

To have it mounted automatically at boot as well, add to /etc/fstab:

/dev/sda1 /Yi ext4 defaults 0 1
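
A quick way to verify that the entry parses correctly without rebooting is to let mount process fstab and then check the mount point:

$ sudo mount -a
$ df -h /Yi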

Installing the Nvidia Driver

I went to Nvidia's website to download the driver file; the installation turned out to be fairly involved. Source: http://wiki.ubuntu.org.cn/NVIDIA

Throughout this process, remember to run =sudo apt-get update= regularly and to reboot so changes take effect.

Download the driver file

From the Nvidia website, download the driver file (.run) for your platform and GPU model.

Install build dependencies

$ sudo apt-get install build-essential pkg-config xserver-xorg-dev linux-headers-`uname -r`

Blacklist the open-source nouveau driver

The drivers conflict with each other; to avoid that, add the following to /etc/modprobe.d/blacklist.conf:

blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
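
After updating the blacklist, the initramfs usually needs regenerating as well, so that nouveau stays excluded during early boot (a standard step on Ubuntu, folded here into the general update/reboot advice above):

$ sudo update-initramfs -u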

Stop the graphical environment

$ sudo stop lightdm

Install the driver from a tty console

With the graphical environment stopped, press Ctrl+Alt+F2 to switch to a text console, log in there, then:

$ cd /home/username
$ sudo bash NVIDIA*.run

During the installation:

  • When it reports an existing old driver and asks whether to remove it, choose Yes
  • When it reports missing modules and asks whether to download them from the network, choose No
  • When it asks whether to compile the kernel module, choose Yes
  • When it asks for permission to modify xorg.conf, choose Yes

Start the graphical environment

$ sudo start lightdm

Extra: removing old drivers

$ sudo apt-get --purge remove nvidia-*
$ sudo apt-get --purge remove xserver-xorg-video-nouveau

Reboot to take effect

After rebooting you may have to reinstall the driver, and after installing the driver you may have to reconfigure the display resolution and other settings.
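
A quick check that the proprietary driver is actually active after the reboot (on an Optimus laptop, =nvidia-smi= may only work through =optirun=, which is set up later in this post):

$ lsmod | grep nvidia
$ nvidia-smi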

Resolution problems

Somewhere along the way, having downloaded several driver versions, the xorg configuration file got overwritten and the display resolution broke. Quite a struggle.

For reference:

If the only problem is that Nvidia modified =xorg.conf=, you can restore it with =nvidia-uninstall=, or generate a default configuration file with =Xorg -configure=.
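
A minimal sketch of those two recovery paths, run from a text console with X stopped (=Xorg -configure= writes the generated file to =/root/xorg.conf.new=):

$ sudo nvidia-uninstall
$ sudo Xorg -configure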

After restoring the default configuration with Xorg, logging in showed only the mouse pointer and the desktop background; the panels and everything else were gone. Completely broken.

After much digging around, this turned out to be the infamous =Unity freeze= bug in =Ubuntu 14.04=. Hard to believe a release shipped like this. What else was there to do but reinstall?!

Trying a CUDA Program

// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>
#include <stdlib.h>   // for EXIT_SUCCESS

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
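
Why those offsets turn =Hello = into =World!=: each byte of the input is shifted by the matching entry of =b=, working in ASCII:

'H'(72)+15 = 'W'    'e'(101)+10 = 'o'    'l'(108)+6 = 'r'
'l'(108)+0 = 'l'    'o'(111)-11 = 'd'    ' '(32)+1  = '!'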

Compile:

$ nvcc source.cu
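
Running the resulting binary should print the original string immediately followed by the transformed one:

$ ./a.out
Hello World!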

Trying an OpenCL Program

// Hello World for OpenCL - the real thing!
// Like my CUDA Hello World, it computes, in parallel, on the GPU,
// the string "World!" from "Hello " and an array of offsets.
// By Ingemar Ragnemalm, based on the hello.c demo.
// Updated 2013 for newer OpenCL versions and #ifdef for Linux

#include <stdio.h>
#include <math.h>
#include <unistd.h>   // for sleep()
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

const char *KernelSource = "\n" \
"__kernel void hello(                \n" \
"    __global char* a,               \n" \
"    __global char* b,               \n" \
"    __global char* c,               \n" \
"    const unsigned int count)       \n" \
"{                                   \n" \
"    int i = get_global_id(0);       \n" \
"    if(i < count)                   \n" \
"        c[i] = a[i] + b[i];         \n" \
"}                                   \n" \
"\n";

#define DATA_SIZE (16)

int main(int argc, char** argv)
{
    int err;                    // error code returned from api calls
    cl_device_id device_id;     // compute device id
    cl_context context;         // compute context
    cl_command_queue commands;  // compute command queue
    cl_program program;         // compute program
    cl_kernel kernel;           // compute kernel
    cl_mem input;               // device memory used for the input array
    cl_mem input2;              // device memory used for the input array
    cl_mem output;              // device memory used for the output array
    size_t global;              // global domain size for our calculation
    size_t local;               // local domain size for our calculation
    unsigned int count = DATA_SIZE;

    // Input data
    char a[DATA_SIZE] = "Hello \0\0\0\0\0\0";
    char b[DATA_SIZE] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    // Output data
    char c[DATA_SIZE];

    // Print original data
    printf("%s", a);

    cl_platform_id platform;
    unsigned int no_plat;
    err = clGetPlatformIDs(1, &platform, &no_plat);

    // Where to run
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    if (err != CL_SUCCESS) return -1;
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
    if (!context) return -1;
    commands = clCreateCommandQueue(context, device_id, 0, &err);
    if (!commands) return -1;

    // What to run
    program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);
    if (!program) return -1;
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (err != CL_SUCCESS) return -1;
    kernel = clCreateKernel(program, "hello", &err);
    if (!kernel || err != CL_SUCCESS) return -1;

    // Create space for data and copy a and b to device (note that we could also use clEnqueueWriteBuffer to upload)
    input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(char) * DATA_SIZE, a, NULL);
    input2 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(char) * DATA_SIZE, b, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(char) * DATA_SIZE, NULL, NULL);
    if (!input || !output) return -1;

    // Send data
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &input2);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &output);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &count);
    if (err != CL_SUCCESS) return -1;

    // Run kernel!
    local = DATA_SIZE;
    global = DATA_SIZE; // count;
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    if (err != CL_SUCCESS) return -1;
    clFinish(commands);

    // Read result
    err = clEnqueueReadBuffer( commands, output, CL_TRUE, 0, sizeof(char) * count, c, 0, NULL, NULL );
    if (err != CL_SUCCESS) return -1;

    // Print result
    printf("%s\n", c);

    // Clean up
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    sleep(1); // Leopard pty bug workaround.
    return 0;
}

Compile:

$ gcc hello_world_cl.c -lOpenCL -I/usr/include/CL/
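
The output should match the CUDA version:

$ ./a.out
Hello World!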

Trying a PyCUDA Program

import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print dest-a*b

Before running, install Bumblebee (for the Optimus hybrid-graphics setup) and confirm the GPU device:

$ sudo apt-get install bumblebee bumblebee-nvidia
$ lspci |grep -i Nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GK104M [GeForce GTX 780M] (rev ff)

Run Python on the GPU; the =(rev ff)= in the lspci output above indicates that Optimus currently has the discrete card powered down, which is exactly why =optirun= is needed:

$ optirun python
>>> import pycuda.driver as cuda
>>> import pycuda.autoinit
>>> from pycuda.compiler import SourceModule

If all of these imports succeed, the CUDA environment is installed and usable.
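
To run the full PyCUDA example from above the same way, save it to a file (the filename here is made up) and launch it through =optirun=; the printed array should be all zeros, since =dest= holds the product of =a= and =b= computed on the GPU:

$ optirun python pycuda_multiply.py
[ 0.  0.  0. ...,  0.  0.  0.]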

Trying a PyOpenCL Program

# example provided by Eilif Muller
from __future__ import division
KERNEL_CODE = """
// Thread block size
#define BLOCK_SIZE %(block_size)d
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA %(w_a)d // Matrix A width
#define HA %(h_a)d // Matrix A height
#define WB %(w_b)d // Matrix B width
#define HB WA // Matrix B height
#define WC WB // Matrix C width
#define HC HA // Matrix C height
/*
* Copyright 1993-2009 NVIDIA Corporation. All rights reserved.
*
* NVIDIA Corporation and its licensors retain all intellectual property and
* proprietary rights in and to this software and related documentation.
* Any use, reproduction, disclosure, or distribution of this software
* and related documentation without an express license agreement from
* NVIDIA Corporation is strictly prohibited.
*
* Please refer to the applicable NVIDIA end user license agreement (EULA)
* associated with this source code for terms and conditions that govern
* your use of this NVIDIA software.
*
*/
/* Matrix multiplication: C = A * B.
* Device code.
*/
#define AS(j, i) As[i + j * BLOCK_SIZE]
#define BS(j, i) Bs[i + j * BLOCK_SIZE]
////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! WA is A's width and WB is B's width
////////////////////////////////////////////////////////////////////////////////
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))
void
matrixMul( __global float* C, __global float* A, __global float* B)
{
    __local float As[BLOCK_SIZE*BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE*BLOCK_SIZE];

    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    // Index of the first sub-matrix of A processed by the block
    int aBegin = WA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + WA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * WB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0.0f;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + WA * ty + tx];
        BS(ty, tx) = B[b + WB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
"""

import pyopencl as cl
from time import time
import numpy

block_size = 16

ctx = cl.create_some_context()

for dev in ctx.devices:
    assert dev.local_mem_size > 0

queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.PROFILING_ENABLE)
#queue = cl.CommandQueue(ctx)

if False:
    a_height = 4096
    #a_height = 1024
    a_width = 2048
    #a_width = 256
    #b_height == a_width
    b_width = a_height
elif False:
    # like PyCUDA
    a_height = 2516
    a_width = 1472
    b_height = a_width
    b_width = 2144
else:
    # CL SDK
    a_width = 50*block_size
    a_height = 100*block_size
    b_width = 50*block_size
    b_height = a_width

c_width = b_width
c_height = a_height

h_a = numpy.random.rand(a_height, a_width).astype(numpy.float32)
h_b = numpy.random.rand(b_height, b_width).astype(numpy.float32)
h_c = numpy.empty((c_height, c_width)).astype(numpy.float32)

kernel_params = {"block_size": block_size,
        "w_a":a_width, "h_a":a_height, "w_b":b_width}

if "NVIDIA" in queue.device.vendor:
    options = "-cl-mad-enable -cl-fast-relaxed-math"
else:
    options = ""
prg = cl.Program(ctx, KERNEL_CODE % kernel_params,
        ).build(options=options)
kernel = prg.matrixMul
#print prg.binaries[0]

assert a_width % block_size == 0
assert a_height % block_size == 0
assert b_width % block_size == 0

# transfer host -> device -----------------------------------------------------
mf = cl.mem_flags

t1 = time()
d_a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_a)
d_b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_b)
d_c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_c.nbytes)
push_time = time()-t1

# warmup ----------------------------------------------------------------------
for i in range(5):
    event = kernel(queue, h_c.shape[::-1], (block_size, block_size),
            d_c_buf, d_a_buf, d_b_buf)
    event.wait()

queue.finish()

# actual benchmark ------------------------------------------------------------
t1 = time()
count = 20
for i in range(count):
    event = kernel(queue, h_c.shape[::-1], (block_size, block_size),
            d_c_buf, d_a_buf, d_b_buf)
    event.wait()

gpu_time = (time()-t1)/count

# transfer device -> host -----------------------------------------------------
t1 = time()
cl.enqueue_copy(queue, h_c, d_c_buf)
pull_time = time()-t1

# timing output ---------------------------------------------------------------
gpu_total_time = gpu_time+push_time+pull_time

print "GPU push+compute+pull total [s]:", gpu_total_time
print "GPU push [s]:", push_time
print "GPU pull [s]:", pull_time
print "GPU compute (host-timed) [s]:", gpu_time
print "GPU compute (event-timed) [s]: ", (event.profile.end-event.profile.start)*1e-9

gflop = h_c.size * (a_width * 2.) / (1000**3.)
gflops = gflop / gpu_time

print
print "GFlops/s:", gflops

# cpu comparison --------------------------------------------------------------
t1 = time()
h_c_cpu = numpy.dot(h_a,h_b)
cpu_time = time()-t1

print
print "GPU==CPU:",numpy.allclose(h_c, h_c_cpu)
print
print "CPU time (s)", cpu_time
print
print "GPU speedup (with transfer): ", cpu_time/gpu_total_time
print "GPU speedup (without transfer): ", cpu_time/gpu_time

Result of running it on the Mac:

Choose device(s):
[0] <pyopencl.Device 'Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz' on 'Apple' at 0xffffffff>
[1] <pyopencl.Device 'Iris' on 'Apple' at 0x1024500>
Choice, comma-separated [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
GPU push+compute+pull total [s]: 0.0473513484001
GPU push [s]: 0.000606060028076
GPU pull [s]: 0.00265288352966
GPU compute (host-timed) [s]: 0.0440924048424
GPU compute (event-timed) [s]: 0.0406256
GFlops/s: 46.4479088251
GPU==CPU: True
CPU time (s) 0.0363118648529
GPU speedup (with transfer): 0.76686020736
GPU speedup (without transfer): 0.823540130839

Afterword

The Ubuntu Unity freeze bug left me exhausted and wasted a huge amount of time. Learning GPU development wasn't in this year's plan anyway, so consider this just an opening move. I'll be back!
