Second fortnightly blog update:
It's almost a month into the coding phase of GSoC. The new global optimizer is built, and the clean-up work on the PR (pull request) is also done. The PR should be merged next week, and a few follow-up tasks remain on it.
There is another significant improvement in the profiling results that I shared earlier. After a few simplifications in the computation of the convolutional operators, the optimizer is about 10sec faster, and the optimization time for training the SBRNN is now ~20sec.
Currently, there are a few clean-up tasks on this PR. If a node stays on the CPU, its output variables are on the CPU too, and those variables are the inputs of other nodes. Because the input variables of those downstream nodes are not on the GPU, the nodes are not transferred to the GPU either, and the same holds for every node up to the graph's output node, which makes the compilation time large. There are two ways to fix this. The first is to be aggressive: transfer all the nodes to the GPU, irrespective of whether their input variables are GPUVariables. The second is to run a backward pass over the graph that lifts nodes to the GPU if their Ops have a GPU implementation, continuing the transfer from the node that hasn't been transferred. The current thinking is to use one method in the fast_compile mode and the other in the fast_run mode.
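To make the problem concrete, here is a minimal toy model of it. This is not Theano code: the `Node` tuple, the node names and both functions are hypothetical, and the "lift only if an input is already on the GPU" rule is my simplified assumption of the behaviour described above. The backward-pass variant is not modelled.

```python
# Toy model only; every name below is hypothetical and not part of Theano.
from collections import namedtuple

Node = namedtuple("Node", ["name", "inputs", "has_gpu_impl"])

# Nodes are listed in topological order; "x" is a graph input already on the GPU.
GRAPH = [
    Node("conv", ["x"], True),
    Node("cpu_only_op", ["conv"], False),   # no GPU implementation
    Node("elemwise", ["cpu_only_op"], True),
    Node("reduce", ["elemwise"], True),
]


def lift_conservative(graph, gpu_inputs):
    """Lift a node only if it has a GPU implementation AND an input is on the GPU."""
    on_gpu = set(gpu_inputs)
    for node in graph:
        if node.has_gpu_impl and any(i in on_gpu for i in node.inputs):
            on_gpu.add(node.name)
    return on_gpu - set(gpu_inputs)


def lift_aggressive(graph):
    """Lift every node that has a GPU implementation, whatever its inputs are."""
    return {node.name for node in graph if node.has_gpu_impl}


conservative = lift_conservative(GRAPH, {"x"})
aggressive = lift_aggressive(GRAPH)
print("conservative:", sorted(conservative))
print("aggressive:  ", sorted(aggressive))
```

In this toy graph the single CPU-only node stops the conservative rule from lifting anything downstream of it, while the aggressive rule still lifts `elemwise` and `reduce`. A real implementation would also have to insert the host/GPU transfers around the CPU-only node, which the sketch ignores.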
Here is a comparison of the new optimizer's efficiency a fortnight ago and at the end of last week.
The result from last fortnight:
361.601003s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 40438) - 0.000s
GraphToGPUOptimizer gpuarray_graph_optimization
time io_toposort 1.066s
Total time taken by local optimizers 344.913s
times - times applied - Node created - name:
337.134s - 455 - 9968 - local_abstractconv_cudnn_graph
7.127s - 1479 - 1479 - local_gpua_careduce
0.451s - 15021 - 19701 - local_gpu_elemwise
0.119s - 12 - 36 - local_gpuaalloc
0.044s - 4149 - 4149 - local_gpua_dimshuffle
0.020s - 84 - 84 - local_gpua_incsubtensor
0.015s - 1363 - 1642 - local_gpua_subtensor_graph
0.001s - 194 - 239 - local_gpureshape
0.000s - 6 - 6 - local_gpua_split
0.000s - 9 - 9 - local_gpua_join
0.000s - 1 - 2 - local_gpua_crossentropysoftmaxargmax1hotwithbias
0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
0.002s - in 2 optimization that were not used (display only those with a runtime > 0)
0.001s - local_lift_abstractconv2d
0.001s - local_gpua_shape
The result at the end of last week:
25.080994s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 31624) - 0.000s
GraphToGPUOptimizer gpuarray_graph_optimization
time io_toposort 1.204s
Total time taken by local optimizers 7.658s
times - times applied - Node created - name:
7.059s - 1479 - 1479 - local_gpua_careduce
0.498s - 14507 - 21118 - local_gpu_elemwise
0.038s - 2761 - 2761 - local_gpua_dimshuffle
0.022s - 84 - 84 - local_gpua_incsubtensor
0.020s - 455 - 455 - local_lift_abstractconv2d_graph
0.012s - 533 - 533 - local_gpua_shape_graph
0.004s - 57 - 114 - local_gpua_mrg1
0.002s - 104 - 104 - local_gpua_subtensor_graph
0.001s - 194 - 194 - local_gpureshape
0.001s - 12 - 24 - local_gpuaalloc
0.000s - 147 - 147 - local_gpua_dot22
0.000s - 6 - 6 - local_gpua_split
0.000s - 9 - 9 - local_gpua_join
0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
0.000s - 1 - 1 - local_gpua_crossentropysoftmaxargmax1hotwithbias
0.000s - in 1 optimization that were not used (display only those with a runtime > 0)
The improvement in the time taken by the optimizer is immense! I line-profiled all the functions that `local_gpu_elemwise` calls and found that the slow-down happens only with the high-verbosity flag, because of a call to a print method that runs even though no error is raised. Fixing it gave a great speedup of the optimizer!
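As a rough check on the size of the improvement, the totals reported in the two profiles above can be compared directly. The sketch below only restates numbers already printed in the logs; the dictionary keys and variable names are mine.

```python
# Totals copied from the two profiles above, in seconds.
before = {"optimizer_total": 361.601003, "local_optimizers": 344.913}
after = {"optimizer_total": 25.080994, "local_optimizers": 7.658}

# Ratio of old time to new time for each reported total.
speedup = {key: before[key] / after[key] for key in before}

for key, value in speedup.items():
    print(f"{key}: {value:.1f}x faster")
```

In the logs, the largest single entry in the first profile is `local_abstractconv_cudnn_graph` at 337.134s, and it no longer appears in the second profile.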
The plan for implementing a new global (and/or local) optimizer in Theano for node replacement is almost done, and the implementation will begin soon. It mainly targets speeding up the 'fast_run' mode.
Finally, I will also be working on removing ShapeOptimizer from the fast_compile phase. That work will proceed in parallel with building the new optimizer, on top of the GraphToGPU optimizer.
That's it for now!