Posted by 友好月饼 on 靠谱客. This article covers Halide Lesson 05: vectorizing, parallelizing, unrolling, and tiling, shared here as a reference.

Halide Lesson 05: Vectorizing, Parallelizing, Unrolling, and Tiling

Note: by default Halide walks an image in row-major order, so x is the inner loop and y is the outer loop.

    Func gradient("gradient");
    gradient(x, y) = x + y;

gradient.trace_stores();

Prints every store to gradient, that is, each value as it is computed and written.

gradient.print_loop_nest();

Prints the loop nest that Halide generates for gradient:

    produce gradient:
      for y:
        for x:
          gradient(...) = ...

gradient.reorder(y, x);

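Reorders the loops. The variables are listed from innermost to outermost, so y becomes the inner loop and x the outer loop. The loop nest is now: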

    Pseudo-code for the schedule:
    produce gradient:
      for x:
        for y:
          gradient(...) = ...
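To make the effect of reorder concrete, here is a minimal sketch in plain C++ (it does not use Halide) that records the order in which the two loop nests visit the points. The helper name visit_order is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Halide): records the (x, y) points in the
// order a two-level loop nest visits them.
//   x_innermost = true   -> default order       (for y: for x:)
//   x_innermost = false  -> after reorder(y, x) (for x: for y:)
std::vector<std::pair<int, int>> visit_order(int width, int height,
                                             bool x_innermost) {
    assert(width > 0 && height > 0);
    std::vector<std::pair<int, int>> order;
    if (x_innermost) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                order.emplace_back(x, y);
    } else {
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                order.emplace_back(x, y);
    }
    return order;
}
```

Both orders visit exactly the same set of points; only the sequence differs. For a 2x2 domain the second point visited is (1, 0) in the default order and (0, 1) after the reorder.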

gradient.split(x, x_outer, x_inner, 2);

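x -> the dimension to split
x_outer -> the outer loop produced by the split
x_inner -> the inner loop produced by the split
2 -> the split factor. The inner loop runs from 0 to factor, the outer loop runs from 0 to extent / factor (where extent is the size of the x dimension), and the original index is recovered as x = x_outer * factor + x_inner. For a 4x4 output the equivalent loop is: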

    for (int y = 0; y < 4; y++) {
        for (int x_outer = 0; x_outer < 2; x_outer++) {
            for (int x_inner = 0; x_inner < 2; x_inner++) {
                int x = x_outer * 2 + x_inner;
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
    }

If the split factor does not divide the extent evenly, Halide still handles the split correctly. Suppose the factor is 3 and the x extent is 7:

    gradient.split(x, x_outer, x_inner, 3);
    Buffer<int> output = gradient.realize(7, 2);

The equivalent loop is now:

    for (int y = 0; y < 2; y++) {
        for (int x_outer = 0; x_outer < 3; x_outer++) {  // Now runs from 0 to 2
            for (int x_inner = 0; x_inner < 3; x_inner++) {
                int x = x_outer * 3;
                // Before we add x_inner, make sure we don't
                // evaluate points outside of the 7x2 box. We'll
                // clamp x to be at most 4 (7 minus the split
                // factor).
                if (x > 4) x = 4;
                x += x_inner;
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
    }
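The clamping in the loop above means some points are evaluated more than once. The following is a minimal sketch in plain C++ (not Halide) that lists the x coordinates visited along one row, using the same clamping rule as that loop. The helper name split_visits is hypothetical, and the sketch assumes the extent is at least as large as the factor.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not Halide API): returns the x coordinates evaluated
// along one row when a dimension of size `extent` is split by `factor`,
// mirroring the clamping in the loop shown above.
std::vector<int> split_visits(int extent, int factor) {
    assert(factor > 0 && extent >= factor);  // assumption of this sketch
    const int outer_extent = (extent + factor - 1) / factor;  // round up
    std::vector<int> xs;
    for (int x_outer = 0; x_outer < outer_extent; x_outer++) {
        int base = x_outer * factor;
        // Shift the last block back so it ends exactly at the boundary.
        if (base > extent - factor) base = extent - factor;
        for (int x_inner = 0; x_inner < factor; x_inner++)
            xs.push_back(base + x_inner);
    }
    return xs;
}
```

For extent 7 and factor 3 this yields 0 1 2 3 4 5 4 5 6: the last block is shifted back to start at x = 4, so x = 4 and x = 5 are computed twice. Recomputing a point is harmless here because gradient(x, y) = x + y has no side effects and always produces the same value.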

gradient.fuse(x, y, fused);

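Fuses two loop variables into a single one. Here x is the inner variable and y the outer one, so a single loop over fused covers the whole 4x4 domain: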

    for (int fused = 0; fused < 4*4; fused++) {
        int y = fused / 4;
        int x = fused % 4;
        printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
    }
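The index arithmetic in the fused loop above can be checked in isolation. This is a minimal sketch in plain C++ (not Halide); fuse_index and unfuse_index are hypothetical names used only to show that the division and modulo recover exactly the (x, y) pair that was fused.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not Halide API): maps (x, y) to the fused index,
// with x as the inner (fast-varying) variable.
int fuse_index(int x, int y, int x_extent) {
    assert(x_extent > 0 && x >= 0 && x < x_extent && y >= 0);
    return y * x_extent + x;
}

// Hypothetical helper: recovers {x, y} from a fused index, as the loop
// above does with `fused % 4` and `fused / 4`.
std::pair<int, int> unfuse_index(int fused, int x_extent) {
    assert(x_extent > 0 && fused >= 0);
    return {fused % x_extent, fused / x_extent};
}
```

With an x extent of 4, the fused index 6 maps back to x = 2, y = 1, and fusing that pair again gives 6.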

gradient.vectorize(x_inner);

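Vectorizes a loop: the loop over x_inner is replaced by a single vector operation that covers all of its iterations. Split x by the vector width first, then vectorize the inner variable:

    Var x_outer, x_inner;
    gradient.split(x, x_outer, x_inner, 4);
    gradient.vectorize(x_inner);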

Alternatively, gradient.vectorize(x, 4); is shorthand for the split followed by the vectorize above.

For an 8x4 output the equivalent loop is:

    for (int y = 0; y < 4; y++) {
        for (int x_outer = 0; x_outer < 2; x_outer++) {
            // The loop over x_inner has gone away, and has been
            // replaced by a vectorized version of the
            // expression. On x86 processors, Halide generates SSE
            // for all of this.
            int x_vec[] = {x_outer * 4 + 0,
                           x_outer * 4 + 1,
                           x_outer * 4 + 2,
                           x_outer * 4 + 3};
            int val[] = {x_vec[0] + y,
                         x_vec[1] + y,
                         x_vec[2] + y,
                         x_vec[3] + y};
            printf("Evaluating at <%d, %d, %d, %d>, <%d, %d, %d, %d>:"
                   " <%d, %d, %d, %d>\n",
                   x_vec[0], x_vec[1], x_vec[2], x_vec[3],
                   y, y, y, y,
                   val[0], val[1], val[2], val[3]);
        }
    }
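As a rough model of what the vectorized loop above computes, here is a minimal sketch in plain C++ (not Halide). A std::array of four ints stands in for a 4-wide SIMD register; whether a compiler actually emits SIMD instructions for it is not guaranteed. The names Lanes4, add_lanes, and gradient_row_vec4 are hypothetical, and the sketch assumes the row width is a multiple of 4.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Lanes4 = std::array<int, 4>;  // stand-in for a 4-wide SIMD register

// Lane-wise addition, modelling a single vector add instruction.
Lanes4 add_lanes(const Lanes4& a, const Lanes4& b) {
    Lanes4 r{};
    for (int i = 0; i < 4; i++) r[i] = a[i] + b[i];
    return r;
}

// Hypothetical helper: computes one row of gradient(x, y) = x + y,
// four x values at a time.
std::vector<int> gradient_row_vec4(int y, int width) {
    assert(width > 0 && width % 4 == 0);  // assumption of this sketch
    std::vector<int> row(width);
    const Lanes4 y_vec = {y, y, y, y};  // y broadcast to every lane
    for (int x_outer = 0; x_outer < width / 4; x_outer++) {
        const int base = x_outer * 4;
        const Lanes4 x_vec = {base, base + 1, base + 2, base + 3};
        const Lanes4 val = add_lanes(x_vec, y_vec);
        for (int i = 0; i < 4; i++) row[base + i] = val[i];  // vector store
    }
    return row;
}
```

The result is identical to the scalar loop: element x of the returned row holds x + y. Only the grouping of the work changes.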

gradient.unroll(x_inner);

Unrolls the loop over a given dimension:

    Var x_outer, x_inner;
    gradient.split(x, x_outer, x_inner, 2);
    gradient.unroll(x_inner);

The split and unroll calls above can be replaced by gradient.unroll(x, 2);
For a 4x4 output the unrolled loop is:

    for (int y = 0; y < 4; y++) {
        for (int x_outer = 0; x_outer < 2; x_outer++) {
            // Instead of a for loop over x_inner, we get two
            // copies of the innermost statement.
            {
                int x_inner = 0;
                int x = x_outer * 2 + x_inner;
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
            {
                int x_inner = 1;
                int x = x_outer * 2 + x_inner;
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
    }

Fusing, tiling, and parallelizing.

This combines three operations: tiling, fusing, and parallelizing.

    // First we'll tile, then we'll fuse the tile indices and
    // parallelize across the combination.
    Var x_outer, y_outer, x_inner, y_inner, tile_index;
    gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
    gradient.fuse(x_outer, y_outer, tile_index);
    gradient.parallel(tile_index);
    Buffer<int> output = gradient.realize(8, 8);

The scheduling calls can also be chained into a single statement:

    gradient
        .tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4)
        .fuse(x_outer, y_outer, tile_index)
        .parallel(tile_index);
    Buffer<int> output = gradient.realize(8, 8);

The equivalent loop is:

    // The loop over tile_index is the one that runs in parallel.
    // Within each tile the traversal is still row-major
    // (x_inner is the innermost loop).
    for (int tile_index = 0; tile_index < 4; tile_index++) {
        int y_outer = tile_index / 2;
        int x_outer = tile_index % 2;
        for (int y_inner = 0; y_inner < 4; y_inner++) {
            for (int x_inner = 0; x_inner < 4; x_inner++) {
                int y = y_outer * 4 + y_inner;
                int x = x_outer * 4 + x_inner;
                printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
            }
        }
    }
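Parallelizing over tile_index is safe because each tile writes a disjoint 4x4 region of the output. The following minimal sketch in plain C++ (not Halide, and deliberately without threads) illustrates this by processing the tiles in an arbitrary serial order; any order gives the same 8x8 result. The names compute_tile and realize_tiles are hypothetical.

```cpp
#include <cassert>
#include <vector>

constexpr int kSize = 8;                      // output is kSize x kSize
constexpr int kTile = 4;                      // tiles are kTile x kTile
constexpr int kTilesPerRow = kSize / kTile;   // 2 tiles per row

// Hypothetical helper: evaluates one tile of gradient(x, y) = x + y into a
// row-major output buffer, using the same index arithmetic as the loop above.
void compute_tile(int tile_index, std::vector<int>& out) {
    assert(tile_index >= 0 && tile_index < kTilesPerRow * kTilesPerRow);
    assert(out.size() == static_cast<std::vector<int>::size_type>(kSize * kSize));
    const int y_outer = tile_index / kTilesPerRow;
    const int x_outer = tile_index % kTilesPerRow;
    for (int y_inner = 0; y_inner < kTile; y_inner++) {
        for (int x_inner = 0; x_inner < kTile; x_inner++) {
            const int y = y_outer * kTile + y_inner;
            const int x = x_outer * kTile + x_inner;
            out[y * kSize + x] = x + y;
        }
    }
}

// Hypothetical helper: runs the tiles in the given order and returns the
// filled buffer. Entries left at -1 would indicate a tile that never ran.
std::vector<int> realize_tiles(const std::vector<int>& tile_order) {
    std::vector<int> out(kSize * kSize, -1);
    for (int t : tile_order) compute_tile(t, out);
    return out;
}
```

Calling realize_tiles with the orders {0, 1, 2, 3} and {3, 1, 0, 2} produces identical buffers, which is the property a parallel loop relies on: no tile reads or overwrites what another tile produces.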

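Open questions

  • What happens when vectorize is given a width that does not match a SIMD register size, for example 3, 5, or 7?
  • How do these operations compare in performance?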
