Saturday, July 23, 2016

Second Month Blog

Hello there,



The work on the GraphToGPU optimizer was finally merged into Theano's master, giving the bleeding edge approximately a 2-3x speed-up on a ResNet model. Well, that is a big thing :-). Now, the graph compilation time in the FAST_COMPILE mode still had one small bottleneck, created by the host_from_gpu(gpu_from_host(host_from_gpu(Variable))) and gpu_from_host(host_from_gpu(gpu_from_host(Variable))) patterns. These slowed down local_cut_gpu_transfers, and when I tried to investigate where the patterns come from, they turned out to be created by one of the AbstractConv2d optimizers. We (Fred and I) spent some time trying to filter out these patterns, but we finally concluded that the speed-up wouldn't be worth the effort and dropped the idea for now.
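To give a concrete picture of what such a pattern means, here is a tiny self-contained sketch, with toy classes rather than Theano's real graph objects, of how back-to-back transfers cancel out in the spirit of local_cut_gpu_transfers:

# Toy illustration of cancelling redundant transfer pairs. Var and transfer()
# are stand-ins for Theano variables and transfer Ops, not the real API.

class Var(object):
    def __init__(self, name, owner=None):
        self.name = name
        self.owner = owner              # the (op, input) pair that produced this variable

def transfer(kind, inp):
    # kind is 'gpu_from_host' or 'host_from_gpu'
    return Var(kind, owner=(kind, inp))

def cut_transfers(var):
    """Collapse gpu_from_host(host_from_gpu(x)) (and the mirror pattern) down to x."""
    inverse = {'gpu_from_host': 'host_from_gpu', 'host_from_gpu': 'gpu_from_host'}
    while var.owner is not None:
        kind, inp = var.owner
        if inp.owner is not None and inp.owner[0] == inverse[kind]:
            var = inp.owner[1]          # the two adjacent transfers cancel out
        else:
            break
    return var

x = Var('x')
v = transfer('gpu_from_host', transfer('host_from_gpu', transfer('gpu_from_host', x)))
print(cut_transfers(v).name)            # the triple collapses to a single gpu_from_host(x)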

There was also some work done on caching the Op classes via the base Op class, so that identical Op instances are not recreated. I tried to implement the caching in the Op class using a Singleton, and verified that instances with the same parameters are not recreated. But there are a few problems that require some higher-level refactoring. Currently, the __call__ method of an Op is implemented in PureOp, and when it makes the call to make_node it does not identify and pass all the parameters correctly. This parameter-passing issue would hopefully be resolved if all the Ops in Theano supported __props__, which would make it convenient to access the _props_dict and pass the parameters explicitly, instead of the generic and unconventional way through *args and **kwargs. Currently, most of the Ops in the old backend do not have __props__ implemented, so they cannot make use of the _props_dict. There is one more roadblock: instances of Elemwise require a dict to be passed as a parameter, which is an unhashable type and hence prevents implementing __props__. Early this week, work will begin on making that parameter a hashable type, paving the way for both of these PRs to get merged. Once they are merged, there should be at least a 0.5x speed-up in the optimization time.
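To make the caching idea concrete, here is a minimal sketch of it; the class names (CachedOpMixin, DemoOp) are hypothetical and this is not the actual Theano code. The cache key is built from the parameters named in __props__, so constructing an Op twice with the same parameters returns the same instance:

# Hypothetical sketch of caching Op instances keyed on their __props__ values.
# Assumes, for simplicity, that the props are always passed as keyword arguments.

class CachedOpMixin(object):
    _cache = {}

    def __new__(cls, **kwargs):
        key = (cls,) + tuple(sorted(kwargs.items()))   # parameters must be hashable
        if key not in cls._cache:
            cls._cache[key] = super(CachedOpMixin, cls).__new__(cls)
        return cls._cache[key]

class DemoOp(CachedOpMixin):
    __props__ = ('axis',)

    def __init__(self, axis):
        self.axis = axis

assert DemoOp(axis=1) is DemoOp(axis=1)      # same parameters -> same cached instance
assert DemoOp(axis=1) is not DemoOp(axis=2)  # different parameters -> a new instance

This also shows why an unhashable parameter, like the dict that Elemwise takes, breaks the scheme: it simply cannot be part of the cache key.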

Finally, work has begun on implementing a CGT-style optimizer. This new optimizer applies optimizations while walking the graph in topological order. In Theano, this is being implemented as a local optimizer, aimed at replacing the canonicalize phase. Currently, Theano optimizes each node only "once". The main advantage of the new optimizer is that it can optimize a node more than once: it tries all the possible optimizations on the node, applies one, then tries all the optimizations again on the newer (modified) node, and so on until none of them apply. There is one drawback to this approach: after two optimizations have been applied, the node being replaced no longer has the fgraph attribute, so optimizations that require this attribute cannot be tried. An example of how the new optimizer behaves is shown below.

Current theano master: x ** 4 = T.sqr(x ** 2)
This branch:           x ** 4 = T.sqr(T.sqr(x))
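The core loop of the new optimizer is roughly the following; this is a schematic sketch with toy stand-ins, not the real implementation:

# Schematic sketch: apply local optimizations to each node, in topological
# order, until none of them apply anymore. `local_opts` is a list of callables
# that return a replacement node or None; the Node objects are stand-ins.

def optimize_until_fixed_point(node, local_opts):
    changed = True
    while changed:
        changed = False
        for opt in local_opts:
            replacement = opt(node)     # a new node, or None if the opt doesn't apply
            if replacement is not None:
                node = replacement      # note: the replacement has no fgraph attribute,
                changed = True          # which is the drawback mentioned above
                break
    return node

def optimize_graph(toposorted_nodes, local_opts):
    return [optimize_until_fixed_point(n, local_opts) for n in toposorted_nodes]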

The drawback of this branch is that we won't be able to do this type of speed-up from x ** 8 onwards. When profiled on the SBRNN (Stick Based RNN), the initial version of the draft seems to give an approximately 20-second speed-up. Isn't that a good start? :D



That's it for now folks! :)



Friday, July 8, 2016

Third Fortnight blog post

Hello there,
The PR of the new optimizer is about to be merged, now that all the cleanup tasks are done. There has also been some progress on the CleanUp PR that I started last fortnight. In the CleanUp PR, the op_lifter has been tested with the TopoOptimizer; it seems to work well, paving the way for a possible implementation of the backward pass.

A quick summary of the work done over the last fortnight

1) On the new_graph2gpu PR,
  • I did some cleanups and addressed all the review comments regarding cleanups, refactoring, and optimizations.
  • Pascal helped fix the TestDnnConv2d test by figuring out that the `get_scalar_constant_value` method doesn't handle a SharedVariable that has dimensions (1, 1) and is broadcastable.
  • I fixed the failing GpuCumsum Op's test by replacing the flatten() operation with a corresponding call to GpuReshape.
  • Made a few changes to fix local_gpua_eye (handled that optimization similarly to local_gpuaalloc) and its test.
  • Applied the interface changes needed after the dilation PR was merged, by making the test cases inside theano/gpuarray/dnn run with the new filter dilation parameter.
  • Line profiled the more time-consuming local_gpua_careduce (a sketch of this kind of line profiling is shown right after this list). I initially thought the time was spent in a call to as_gpuarray_variable, until Fred pointed out that the actual cause is a call to gpuarray.GpuKernel. I am currently trying to find a fix for that.
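For reference, that kind of line profiling can be done roughly as below with the line_profiler package; the function here is only a toy stand-in, not the real local_gpua_careduce:

# Line profiling a suspect function with the line_profiler package
# (pip install line_profiler). suspect_optimizer is just a toy stand-in.

from line_profiler import LineProfiler

def suspect_optimizer(data):
    total = 0
    for x in data:                     # each line gets its own hit count and timing
        total += x * x
    return total

lp = LineProfiler()
lp.add_function(suspect_optimizer)     # register the function to be line-timed
lp.enable_by_count()
suspect_optimizer(range(100000))       # run the workload you want to measure
lp.disable_by_count()
lp.print_stats()                       # per-line report of hits and time spent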
2) On the CleanUp PR,
  •  Replaced the calls to HostFromGpu with transfer.
  •  Added a register_topo decorator to op_lifter. Created a new LocalGroupDB instance and registered in it all the optimizers to which op_lifter is applied. Finally, registered this LocalGroupDB into gpu_seqopt (the general registration pattern is sketched right after this list).
  • I had also tried creating a new TopoOptDB, but I got the implementation wrong: I had built it similarly to LocalGroupDB and that didn't work. I tried a few more ways of implementing it, similar to SequenceDB, but those didn't work out either.
  • Reverted local_gpua_subtensor to its previous version (as in the current master), as the change caused some expected transfers to the GPU not to happen.
  •  Removed the separate caching method, so that caching can be integrated with the __props__ of the class.
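The registration flow described in the register_topo bullet above looks roughly like this; the registry classes are simplified stand-ins that only mimic LocalGroupDB and gpu_seqopt, not Theano's real optdb objects:

# Toy registries mimicking the registration flow: local optimizers go into a
# LocalGroupDB-like group, and the group is registered into a SequenceDB-like
# container such as gpu_seqopt. All names here are illustrative stand-ins.

class ToyGroupDB(object):
    def __init__(self):
        self.local_opts = {}

    def register(self, name, opt, *tags):
        self.local_opts[name] = (opt, tags)

class ToySequenceDB(object):
    def __init__(self):
        self.stages = []

    def register(self, name, db, position, *tags):
        self.stages.append((position, name, db, tags))
        self.stages.sort(key=lambda stage: stage[0])

topo_group = ToyGroupDB()

def register_topo(*tags):
    # Decorator in the spirit of the register_topo mentioned above.
    def wrap(local_opt):
        topo_group.register(local_opt.__name__, local_opt, *tags)
        return local_opt
    return wrap

@register_topo('fast_run', 'gpuarray')
def local_demo_lifter(node):
    return None       # a real op_lifter-style optimizer returns a GPU replacement or None

gpu_seqopt_like = ToySequenceDB()
gpu_seqopt_like.register('topo_lifters', topo_group, 1, 'fast_run')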
3) On the Remove_ShapeOpt PR,
  • I was able to add exceptions in only one place to completely ignore the fgraph's shape_feature. There are a few optimizers that strictly need it, which I have commented on in the PR.
  •  Skipped all the tests that test infer_shape, contain MakeVector, Shape_i, T.second, T.fill, or other optimizations done by ShapeFeature.
  • The profiling results didn't show a significant improvement in optimization time; more work needs to be done on this case.
That's it for now!
Cheers,

Sunday, June 26, 2016

Second Fortnight update

The second fortnight blog post update:
It's almost a month into the coding phase of GSoC. The new global optimizer is built, and the cleanup work on the PR (Pull Request) is also done. The PR should be merged next week, and there are a few follow-up tasks on the current PR.
There is another significant improvement over the profiling results that I shared earlier. After a few simplifications in the computation of the convolution operators, there is a 10-second improvement in the optimizer, and the optimization time for training the SBRNN is now ~20 seconds.
Currently, there are a few clean-up tasks on this PR. If a node stays on the CPU, its output variables stay on the CPU, and those variables are the inputs of other nodes. Since the inputs of those next nodes are not on the GPU, those nodes don't get transferred to the GPU either, and this propagates all the way down to the graph's output node, making the compilation time large. There are two ways to fix it. The first is to be aggressive, meaning transferring all the nodes to the GPU, regardless of whether their input variables are GPU variables or not. The second is to do a backward pass over the graph, lifting nodes to the GPU if their Ops have a GPU implementation, and continuing the transfer from the node that hadn't been transferred. The current thought is to use one method in the fast_compile mode and the other in the fast_run mode.
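Rough sketches of the two options, over a toy graph interface; the helpers (is_on_gpu, to_gpu, gpu_version_of, lift) are illustrative stand-ins, not Theano's real API:

# Option 1: lift a node aggressively, forcing a transfer for every CPU input.
def lift_aggressively(node, is_on_gpu, to_gpu, gpu_version_of):
    gpu_op = gpu_version_of(node.op)
    if gpu_op is None:
        return None                    # no GPU implementation: leave the node alone
    gpu_inputs = [inp if is_on_gpu(inp) else to_gpu(inp) for inp in node.inputs]
    return gpu_op(*gpu_inputs)         # the lift never stalls on CPU inputs

# Option 2: walk the graph backwards (outputs towards inputs) and lift every
# node whose Op has a GPU implementation, so the forward transfer can resume
# from the node that was previously left behind.
def backward_lift_pass(toposorted_nodes, gpu_version_of, lift):
    for node in reversed(toposorted_nodes):
        if gpu_version_of(node.op) is not None:
            lift(node)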

A comparison of the efficiency of the new optimizer last fortnight versus at the end of last week:

The result of last fortnight
361.601003s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 40438) - 0.000s
         GraphToGPUOptimizer          gpuarray_graph_optimization
           time io_toposort 1.066s
         Total time taken by local optimizers 344.913s
           times - times applied - Node created - name:
           337.134s - 455 - 9968 - local_abstractconv_cudnn_graph
           7.127s - 1479 - 1479 - local_gpua_careduce
           0.451s - 15021 - 19701 - local_gpu_elemwise
           0.119s - 12 - 36 - local_gpuaalloc
           0.044s - 4149 - 4149 - local_gpua_dimshuffle
           0.020s - 84 - 84 - local_gpua_incsubtensor
           0.015s - 1363 - 1642 - local_gpua_subtensor_graph
           0.001s - 194 - 239 - local_gpureshape
           0.000s - 6 - 6 - local_gpua_split
           0.000s - 9 - 9 - local_gpua_join
           0.000s - 1 - 2 - local_gpua_crossentropysoftmaxargmax1hotwithbias
           0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
           0.002s - in 2 optimization that were not used (display only those with a runtime > 0)
             0.001s - local_lift_abstractconv2d
             0.001s - local_gpua_shape

The result by the end of last week,
       25.080994s - ('gpuarray_graph_optimization', 'GraphToGPU', 0, 24938, 31624) - 0.000s
         GraphToGPUOptimizer          gpuarray_graph_optimization
           time io_toposort 1.204s
         Total time taken by local optimizers 7.658s
           times - times applied - Node created - name:
           7.059s - 1479 - 1479 - local_gpua_careduce
           0.498s - 14507 - 21118 - local_gpu_elemwise
           0.038s - 2761 - 2761 - local_gpua_dimshuffle
           0.022s - 84 - 84 - local_gpua_incsubtensor
           0.020s - 455 - 455 - local_lift_abstractconv2d_graph
           0.012s - 533 - 533 - local_gpua_shape_graph
           0.004s - 57 - 114 - local_gpua_mrg1
           0.002s - 104 - 104 - local_gpua_subtensor_graph
           0.001s - 194 - 194 - local_gpureshape
           0.001s - 12 - 24 - local_gpuaalloc
           0.000s - 147 - 147 - local_gpua_dot22
           0.000s - 6 - 6 - local_gpua_split
           0.000s - 9 - 9 - local_gpua_join
           0.000s - 1 - 1 - local_gpua_crossentropysoftmax1hotwithbiasdx
           0.000s - 1 - 1 - local_gpua_crossentropysoftmaxargmax1hotwithbias
           0.000s - in 1 optimization that were not used (display only those with a runtime > 0)

The improvement in the time taken by the optimizer is immense! I line profiled all the functions that local_gpu_elemwise calls and found that the slow-down happens only with the high verbosity flag, because of a call to a print method that is made even when no error is raised. Fixing it gave a great speed-up of the optimizer!
The plan for implementing a new global and/or local optimizer in Theano for node replacement is almost done, and the implementation will begin soon. This will mainly target speeding up the 'fast_run' mode.
Finally, I'll also be working on removing the ShapeOptimizer from the fast_compile phase. That work will go on in parallel with building the new optimizer, and on top of the GraphToGPU optimizer.

That's it for now!

Sunday, June 12, 2016

GSoC Fortnight Update

Week 2 Update:
Apologies for the delayed post on the week 1 and week 2 progress. It is two weeks into the coding phase of GSoC, and my experience has been overwhelming in the best way. I have been learning a lot of new things every day, and I am facing challenging tasks that keep my motivation high to work and learn more.
In the first two weeks, I built a new global optimizer in Theano. This new global optimizer builds a graph in parallel to the existing graph and makes the transfer of the Ops to the GPU in one single pass. The new optimizer has been performing pretty well so far, and I am currently analysing the time it takes to optimize each node and working on speeding up the slower ones.
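Conceptually, the single pass looks something like this; it is a simplified sketch with placeholder helpers (gpu_version_of, to_gpu), not the real GraphToGPU code:

# Single-pass sketch: walk the graph once in topological order, building a
# parallel graph where every Op that has a GPU equivalent is replaced, while
# keeping a mapping from old variables to their new counterparts.

def graph_to_gpu(toposorted_nodes, graph_inputs, gpu_version_of, to_gpu):
    mapping = {inp: to_gpu(inp) for inp in graph_inputs}    # old var -> new var
    for node in toposorted_nodes:
        new_inputs = [mapping.get(inp, inp) for inp in node.inputs]
        gpu_op = gpu_version_of(node.op)
        new_outputs = (gpu_op or node.op)(*new_inputs)      # fall back to the CPU Op
        for old, new in zip(node.outputs, new_outputs):
            mapping[old] = new
    return mapping                     # the graph outputs are read back through this once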
This optimizer is giving some amazing results: it halves the time taken for optimization. I am currently profiling the optimizer on a Sparse Bayesian Recurrent Neural Network model. With the new optimizer, the optimization takes 32.56 seconds to reduce the graph from 10599 to 9404 nodes, whereas the old optimizer used to take 67.7 seconds to reduce 10599 nodes to 9143 nodes. This has been a pleasing result.
Also, along the way, a few more tasks were done to speed up the existing optimizers. One was reducing the number of times a new instance of an Op is created, by pre-instantiating them; this gave a speed-up of ~2 seconds. Another was caching the GpuAlloc class (the class that is used to create initialized memory on the GPU), which reduced the optimizer timing by ~3 seconds.
I had a few roadblocks while building this new optimizer. Thanks to my mentor (Frédéric), who helped me get past a few of them. It has been an amazing time working with such a highly knowledgeable and experienced person.
I am profiling the results this new optimizer gives on a few deep learning models to evaluate its overall performance. In the next few days, I will write another blog post elaborating on the profiling results of this optimizer, and make it work with models that take an unusually long time to compile with the current optimizers when the model parameters are humongous.
That's it for now folks, stay tuned!

Wednesday, May 18, 2016

GSoC: Week 0

A little late post, but better late than never. So, yay! My proposal got accepted for Google Summer of Code 2016 under Theano, a sub-organisation of the Python Software Foundation, under the mentorship of Frédéric Bastien and Pascal Lamblin!! 😄
 For those who are unaware of Theano, it is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.
My work for this summer will be focused on improving the traversal of large graphs, serialization of objects, moving computations to the GPU, creating a new optimizer_excluding flag, speeding up the slow optimization phase during compilation, and faster cycle detection in graphs.
The entire proposal, with the timeline of deliverables, can be viewed here[1].
As the community bonding period nears its end, I am finally done with my end-semester exams, which finished last week after a pretty hectic couple of weeks. I started my work in the reverse order with respect to my proposal, with "Faster cyclic detection in graphs". Fred had drafted the work on a new algorithm for detecting cycles in graphs last November; I resumed from there and carried it forward until we hit a roadblock where the graphs do not pass the consistency checks with the new algorithm. So I have moved on to the next task, and will come back to this once Fred's schedule eases and he can help me with it more rigorously, as the code complexity is a little beyond my current understanding.
The next (and current) task that I am working on is the optimization that moves computation to the GPU.
Stay tuned for more updates! 
[1]https://goo.gl/RBBoQl

Cheers.🍻