Saturday, July 23, 2016

Second Month Blog

Hello there,



The GraphToGPU optimizer work was finally merged into Theano's master, giving the bleeding-edge version approximately a 2-3x speed up on a ResNet model. Well, that is a big thing :-). Now, the graph compilation time in FAST_COMPILE mode still had one small bottleneck, created by nested host_from_gpu/gpu_from_host transfer patterns such as gpu_from_host(host_from_gpu(gpu_from_host(Variable))). These patterns slowed down local_cut_gpu_transfers, and when I investigated where they were created, they turned out to come from one of the AbstractConv2d optimizers. We (Fred and I) spent some time trying to filter out these patterns, but we finally concluded that the speedup wouldn't be worth the effort, so we dropped the idea for now.
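To make the pattern concrete, here is a toy illustration of how such a redundant transfer chain gets collapsed back to a single transfer. This is not Theano's actual local_cut_gpu_transfers code; the classes and names below are simplified stand-ins.

```python
from collections import namedtuple

# Minimal stand-ins for Theano's graph objects, only to show the shape of the
# rewrite; these are NOT Theano's real classes.
Apply = namedtuple('Apply', ['op', 'inputs'])
Var = namedtuple('Var', ['name', 'owner'])

GPU_FROM_HOST = 'gpu_from_host'
HOST_FROM_GPU = 'host_from_gpu'


def cut_transfer(node):
    """Collapse gpu_from_host(host_from_gpu(y)) to y, mirroring the intent of
    local_cut_gpu_transfers for this one pattern."""
    if node.op == GPU_FROM_HOST:
        inner = node.inputs[0].owner
        if inner is not None and inner.op == HOST_FROM_GPU:
            return inner.inputs[0]
    return None


# gpu_from_host(host_from_gpu(gpu_from_host(x)))  ->  gpu_from_host(x)
x = Var('x', owner=None)
g1 = Var('g1', owner=Apply(GPU_FROM_HOST, [x]))
h1 = Var('h1', owner=Apply(HOST_FROM_GPU, [g1]))
top = Apply(GPU_FROM_HOST, [h1])
assert cut_transfer(top) is g1
```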

There was also some work done on caching Op instances in the base Op class, so that creating an Op with the same parameters doesn't recreate a new instance every time. I tried to implement the caching in the Op class using a Singleton, and verified that instances with the same parameters are not recreated. But there are a few problems which require some higher-level refactoring. Currently, the __call__ method for an Op is implemented in PureOp, which, when making the call to make_node, does not identify and pass all the parameters correctly. This parameter-passing issue would hopefully be resolved if all the Ops in Theano supported __props__, which would make it convenient to access the _props_dict and pass the parameters, instead of using the generalized, unconventional way through *args and **kwargs. Currently, most of the Ops in the old backend do not have __props__ implemented, so they cannot make use of the _props_dict. There are a few road blocks to this: instances of Elemwise require a dict to be passed as a parameter, which is of an unhashable type, so __props__ could not be implemented for them. Early this week, work will begin on making that parameter a hashable type, paving the way for both of these PRs to get merged. Once they are merged, there should be at least a 0.5x speed up in the optimization time.
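Here is a minimal sketch of the caching idea, assuming the Op defines __props__ so its parameters can be turned into a hashable key. This is my own simplification, not Theano's implementation; MyOp and its axis parameter are hypothetical stand-ins for a real Theano Op.

```python
class CachedOpMeta(type):
    """Metaclass that caches instances keyed on the class and its __props__."""
    _cache = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance, then derive a hashable key from its __props__.
        instance = super(CachedOpMeta, cls).__call__(*args, **kwargs)
        key = (cls,) + tuple(getattr(instance, p)
                             for p in getattr(cls, '__props__', ()))
        # Return the previously cached instance for this key, if any.
        return CachedOpMeta._cache.setdefault(key, instance)


class MyOp(metaclass=CachedOpMeta):
    __props__ = ('axis',)

    def __init__(self, axis):
        # An unhashable parameter (e.g. the dict Elemwise takes) would break
        # the key above -- which is exactly the road block mentioned here.
        self.axis = axis


assert MyOp(axis=0) is MyOp(axis=0)      # same props -> cached instance reused
assert MyOp(axis=0) is not MyOp(axis=1)  # different props -> new instance
```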

Finally, work has begun on implementing a CGT-style optimizer. This new optimizer applies local optimizations while walking the graph in topological order. In Theano it is being implemented as a local optimizer, aimed at replacing the canonicalize phase. Currently, Theano optimizes each node only "once". The main advantage of the new optimizer is that it can optimize a node more than once: it tries all the possible optimizations on the node, applies one, then tries all the optimizations again on the replacement node (the modified one), and so on until none of them apply (a rough sketch of this loop follows the example below). There is one drawback to this approach: after an optimization has been applied, the replacement node does not yet have the fgraph attribute, so optimizations that require this attribute cannot be tried on it. An example of what the new optimizer does is shown below:

Current Theano master: x ** 4 = T.sqr(x ** 2)
This branch: x ** 4 = T.sqr(T.sqr(x))
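The sketch below is my simplification of the loop described above (not the actual branch): for each node, try every local optimizer, and whenever one fires, restart the whole list on the replacement node until none of them apply any more.

```python
def optimize_node(node, local_opts):
    """Repeatedly apply local optimizers to `node` and to its replacements."""
    changed = True
    while changed:
        changed = False
        for opt in local_opts:
            replacement = opt(node)      # returns a new node, or None
            if replacement is not None:
                node = replacement       # e.g. x ** 4 -> sqr(x ** 2) -> sqr(sqr(x))
                changed = True
                break                    # restart with the full optimizer list
    return node
```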

The drawback of this branch is that we won't be able to do this type of speed up for x ** 8 onwards. When profiled with the SBRNN (Stick Based RNN), the initial version of the draft seems to give an approximately 20-second speed up. Isn't that a good start? :D



That's it for now folks! :)



Friday, July 8, 2016

Third Fortnight blog post

Hello there,
The PR for the new optimizer is about to be merged, now that all the cleanup tasks are done. There has also been some progress on the CleanUp PR that I started last fortnight. In the CleanUp PR, the op_lifter has been tested with TopoOptimizer. The op_lifter seemed to work well with the TopoOptimizer, paving the way for a possible implementation of the backward pass.

A quick summary of the work done over the last fortnight

1) On the new_graph2gpu PR
  • I did some cleanups and addressed all the comments regarding cleanups, refactoring and optimizations.
  • Pascal helped in fixing the TestDnnConv2d test by figuring out that the `get_scalar_constant_value` method doesn't handle a SharedVariable of dimension (1, 1) that is broadcastable.
  • I fixed the failing Gpucumsum Op's test by replacing the flatten() operation with a corresponding call to GpuReshape.
  • Made a few changes to fix local_gpua_eye (handled that optimization similarly to local_gpuaalloc) and its test.
  • Applied the interface changes needed after the merging of the dilation PR, by making the test cases inside theano/gpuarray/dnn run with the new filter dilation parameter.
  • Line-profiled the more time-consuming local_gpua_careduce. I initially thought the slowdown was caused by a call to as_gpuarray_variable, until Fred pointed out the actual reason: a call to gpuarray.GpuKernel. I am currently trying to find a fix for that.
2) On the CleanUp PR
  •  Replaced the calls to HostFromGpu with transfer.
  • Added a register_topo decorator to op_lifter. Created a new LocalGroupDB instance and registered in it all the optimizers to which op_lifter is applied. Finally, registered this LocalGroupDB into gpu_seqopt (a toy sketch of this registration pattern is shown after this list).
  • I had also tried creating a new TopoOptDB, but I got the implementation wrong: I had modeled it on LocalGroupDB, and that didn't seem to work. I then tried a few more ways of implementing it, modeled on SequenceDB, but those didn't work out either.
  • Reverted local_gpua_subtensor to its previous version (as in the current master), as the change caused some expected transfers to the GPU not to happen.
  • Removed the separate caching method, so that caching can instead be integrated with the __props__ of the class.
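For context, here is a toy sketch of the registration pattern described in the op_lifter bullet above: a decorator collects local optimizers into a small group database, and the whole group is then registered into the GPU optimization sequence. All classes and names below are simplified stand-ins, not Theano's real LocalGroupDB, op_lifter, or gpu_seqopt.

```python
class ToyLocalGroupDB(object):
    """Collects local optimizers so they can be registered as one group."""
    def __init__(self):
        self.optimizers = []

    def register(self, name, opt):
        self.optimizers.append((name, opt))


class ToySequenceDB(object):
    """Stand-in for gpu_seqopt: an ordered sequence of optimization passes."""
    def __init__(self):
        self.passes = []

    def register(self, name, db, position):
        self.passes.append((position, name, db))
        self.passes.sort(key=lambda p: p[0])


topo_group_db = ToyLocalGroupDB()


def register_topo(local_opt):
    """Decorator: put `local_opt` into the group run by the topological pass."""
    topo_group_db.register(local_opt.__name__, local_opt)
    return local_opt


@register_topo
def local_example_lifter(node):
    # A real op_lifter-style optimizer would move node's op to the GPU here.
    return None


# Finally, the whole group is registered into the GPU optimization sequence.
gpu_seqopt = ToySequenceDB()
gpu_seqopt.register('topo_lifters', topo_group_db, position=2)
```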
3) On the Remove_ShapeOpt PR
  • I was able to add exceptions at only one place to completely ignore the fgraph's shape_feature. There are a few optimizers which strictly need it; I have commented on those in the PR.
  • Skipped all the tests that test infer_shape or contain MakeVector, Shape_i, T.second, T.fill, and other optimizations done by ShapeFeature.
  • The profiling results didn't show a significant improvement in optimization time; more work needs to be done on this case.
That's it for now!
Cheers,