Sunday, June 26, 2016

Second Fortnight update

The second fortnight blog post update:
It's almost a month into the coding phase of GSoC. The new Global Optimizer is built, and the clean-up work on the PR (Pull Request) is done as well. The PR should be merged next week, and there are a few follow-up tasks remaining on it.
There is another significant improvement over the profiling results I shared earlier. After a few simplifications in the computation of the convolutional operators, the optimizer is about 10 seconds faster, and the optimization time for training the SBRNN (Sparse Bayesian Recurrent Neural Network) is now ~20 seconds.
Currently, there are a few clean-up tasks on this PR. If a node stays on the CPU, its output variables also stay on the CPU, and those variables are the inputs to other nodes. Since the inputs to those next nodes are not on the GPU, those nodes are not transferred to the GPU either, and this propagates all the way to the graph's output node, which makes the compilation time large. There are two ways to fix this. The first is to be aggressive: transfer every node to the GPU, regardless of whether its input variables are GPU variables or not. The second is to do a backward pass over the graph, lifting nodes to the GPU if their Ops have a GPU implementation, and continuing the transfer from the nodes that were not transferred. The current thought is to use one method in the fast_compile mode and the other in the fast_run mode.
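To make the two strategies concrete, here is a minimal, hypothetical sketch in plain Python. None of these names (`has_gpu_impl`, `to_gpu`, `on_gpu`) are the actual Theano API; the sketch only illustrates the order in which each approach visits the nodes.

```python
# Hypothetical sketch only -- NOT the Theano API. It contrasts the traversal
# order of the two lifting strategies discussed above.

def lift_aggressive(topo_order, has_gpu_impl, to_gpu):
    """Aggressive: move every node whose Op has a GPU implementation,
    whether or not its inputs already live on the GPU."""
    for node in topo_order:
        if has_gpu_impl(node):
            to_gpu(node)

def lift_with_backward_pass(topo_order, has_gpu_impl, to_gpu, on_gpu):
    """Conservative: a backward pass from the outputs that lifts any node
    with a GPU implementation which an earlier pass left on the CPU, so the
    transfer can resume from where it previously stopped."""
    for node in reversed(topo_order):
        if has_gpu_impl(node) and not on_gpu(node):
            to_gpu(node)
```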

Here is a comparison of the new optimizer's efficiency as of the last fortnight and as of the end of last week.

The result from the last fortnight:
361.601003s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 40438) - 0.000s
         GraphToGPUOptimizer          gpuarray_graph_optimization
           time io_toposort 1.066s
         Total time taken by local optimizers 344.913s
           times - times applied - Node created - name:
           337.134s - 455 - 9968 - local_abstractconv_cudnn_graph
           7.127s - 1479 - 1479 - local_gpua_careduce
           0.451s - 15021 - 19701 - local_gpu_elemwise
           0.119s - 12 - 36 - local_gpuaalloc
           0.044s - 4149 - 4149 - local_gpua_dimshuffle
           0.020s - 84 - 84 - local_gpua_incsubtensor
           0.015s - 1363 - 1642 - local_gpua_subtensor_graph
           0.001s - 194 - 239 - local_gpureshape
           0.000s - 6 - 6 - local_gpua_split
           0.000s - 9 - 9 - local_gpua_join
           0.000s - 1 - 2 - local_gpua_crossentropysoftmaxargmax1hotwithbias
           0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
           0.002s - in 2 optimization that were not used (display only those with a runtime > 0)
             0.001s - local_lift_abstractconv2d
             0.001s - local_gpua_shape

The result by the end of last week:
       25.080994s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 31624) - 0.000s
         GraphToGPUOptimizer          gpuarray_graph_optimization
           time io_toposort 1.204s
         Total time taken by local optimizers 7.658s
           times - times applied - Node created - name:
           7.059s - 1479 - 1479 - local_gpua_careduce
           0.498s - 14507 - 21118 - local_gpu_elemwise
           0.038s - 2761 - 2761 - local_gpua_dimshuffle
           0.022s - 84 - 84 - local_gpua_incsubtensor
           0.020s - 455 - 455 - local_lift_abstractconv2d_graph
           0.012s - 533 - 533 - local_gpua_shape_graph
           0.004s - 57 - 114 - local_gpua_mrg1
           0.002s - 104 - 104 - local_gpua_subtensor_graph
           0.001s - 194 - 194 - local_gpureshape
           0.001s - 12 - 24 - local_gpuaalloc
           0.000s - 147 - 147 - local_gpua_dot22
           0.000s - 6 - 6 - local_gpua_split
           0.000s - 9 - 9 - local_gpua_join
           0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
           0.000s - 1 - 1 - local_gpua_crossentropysoftmaxargmax1hotwithbias
           0.000s - in 1 optimization that were not used (display only those with a runtime > 0)

The improvement in the time taken by the optimizer is immense! I line-profiled all the functions that `local_gpu_elemwise` calls and found that the slow-down happens only with the high verbosity flag, because of a call to a print method (without any error actually being raised). Fixing it gave a great speedup of the optimizer!
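As a toy illustration of why that matters (none of these names come from Theano), an unconditional diagnostic print inside a hot optimizer loop can easily dominate the runtime, so the message should only be built and printed when the verbosity flag asks for it:

```python
import time

def optimize_nodes(nodes, verbose=False):
    """Toy stand-in for a local-optimizer loop: the per-node work is cheap,
    so formatting and printing a message for every node dominates the cost."""
    lifted = []
    for node in nodes:
        if verbose:
            # Only build and print the diagnostic when explicitly requested.
            print("considering node %r" % (node,))
        lifted.append(("gpu", node))
    return lifted

start = time.time()
optimize_nodes(range(200000), verbose=False)  # quiet pass stays fast
print("quiet pass: %.3fs" % (time.time() - start))
```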
The plan for implementing a new global (and/or local) optimizer in Theano for node replacement is almost done, and the implementation will begin soon. It mainly targets speeding up the 'fast_run' mode.
Finally, I will also be working on removing the ShapeOptimizer from the fast_compile phase. That work will go on in parallel with building the new optimizer, and on top of the GraphToGPU optimizer.

That's it for now!

Sunday, June 12, 2016

GSoC Fortnight Update

Week 2 Update:
Apologies for the delayed post on the week 1 and week 2 progress. It is two weeks into the coding phase of GSoC, and the experience has been overwhelming in the best way. I have been learning a lot of new things every day, and the challenging tasks keep my motivation high to work and learn more.
In the first two weeks, I built a new Global Optimizer in Theano. This new global optimizer builds a graph in parallel to the existing graph and transfers the Ops to the GPU in a single pass. The new optimizer has been performing well so far; I am currently analysing the time it takes to optimize each node and working on speeding up the slower ones.
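To give an idea of what building a graph "in parallel, in a single pass" means, here is a toy sketch (not the actual Theano implementation): walk the nodes once in topological order, keep a mapping from every old node to its new, possibly GPU, counterpart, and assemble the new graph from the mapped inputs instead of rewriting the old graph in place over many passes.

```python
# Toy sketch of the single-pass idea; the Node class and helper names are
# illustrative stand-ins, not the Theano implementation.

class Node:
    def __init__(self, op, inputs):
        self.op = op          # name of the operation (stand-in for an Op)
        self.inputs = inputs  # parent Nodes

def graph_to_gpu(topo_order, has_gpu_impl):
    mapping = {}  # old node -> new (possibly GPU) node
    for node in topo_order:
        new_inputs = [mapping.get(i, i) for i in node.inputs]
        if has_gpu_impl(node.op):
            # Build the GPU version of the node directly from the mapped inputs.
            mapping[node] = Node("gpu_" + node.op, new_inputs)
        else:
            # Keep the node on the CPU, but reconnect it to the mapped inputs.
            mapping[node] = Node(node.op, new_inputs)
    return mapping

# Usage: x -> exp -> sum, where only `exp` has a GPU implementation here.
x = Node("input", [])
e = Node("exp", [x])
s = Node("sum", [e])
new = graph_to_gpu([x, e, s], has_gpu_impl=lambda op: op == "exp")
print(new[s].inputs[0].op)  # -> "gpu_exp"
```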
This optimizer is giving some amazing results: it halves the time taken for optimization. I am currently profiling the optimizer on a Sparse Bayesian Recurrent Neural Network (SBRNN) model. The optimization takes 32.56 seconds to reduce 10599 nodes to 9404 with the new optimizer, whereas the old optimizer used to take 67.7 seconds to reduce 10599 nodes to 9143. This has been a pleasing result.
Also, along the way, a few more tasks were done to speed up the existing optimizers. One was reducing the number of times a new instance of an Op is created, by pre-instantiating the Ops; this gave a speedup of ~2 seconds. Another was caching the GPUAlloc class (the class used to create initialized memory on the GPU), which reduced the optimizer timing by ~3 seconds.
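The two tricks look roughly like the following toy sketch; the class and function names here are illustrative, not the actual Theano code:

```python
# Illustrative sketch of pre-instantiation and caching; not the Theano code.

class GpuOpExample:
    """Stand-in for an Op whose construction is not free."""
    def __init__(self, inplace=False):
        self.inplace = inplace

# 1) Pre-instantiation: build the common instance once and reuse it everywhere,
#    instead of constructing a fresh Op inside the optimizer loop.
gpu_op_noinplace = GpuOpExample(inplace=False)

# 2) Caching: remember parameterized instances so each parameter combination
#    is constructed only once.
_gpu_op_cache = {}

def get_gpu_op(inplace=False):
    key = (inplace,)
    if key not in _gpu_op_cache:
        _gpu_op_cache[key] = GpuOpExample(inplace=inplace)
    return _gpu_op_cache[key]

assert get_gpu_op() is get_gpu_op()          # same cached instance reused
assert get_gpu_op() is not get_gpu_op(True)  # distinct parameterization
```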
I hit a few roadblocks while building this new optimizer. Thanks to my mentor (Frédéric), who helped me get past a few of them. It has been amazing to work with such a knowledgeable and experienced person.
I am profiling the results this new optimizer gives on a few Deep Learning models to evaluate its overall performance. In the next few days, I will write another blog post elaborating on its profiling results, and I will work on making the optimizer handle models that take an unusually long time to compile with the current optimizers when the model parameters are humongous.
That's it for now, folks. Stay tuned!