Saturday, July 23, 2016

Second Month Blog

Hello there,



The work on the GraphToGPU optimizer was finally merged into the master branch of Theano, giving the bleeding edge approximately a 2-3x speed up on a ResNet model. Well, that is a big thing :-). Now, the graph compilation time in FAST_COMPILE mode had one small bottleneck, which came from the gpu_from_host(host_from_gpu(Variable)) and gpu_from_host(host_from_gpu(gpu_from_host(Variable))) patterns. These slowed down local_cut_gpu_transfers, and when I tried to investigate where these patterns were created, they turned out to come from one of the AbstractConv2d optimizers. We (Fred and I) spent some time trying to filter out these patterns, but we finally concluded that the speedup wouldn't be worth the effort and dropped the idea for now.
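To make the issue concrete, here is a tiny toy sketch (my own illustration, not Theano's actual implementation) of why such chains are pure overhead: adjacent gpu_from_host / host_from_gpu transfers cancel each other out, and cutting them is essentially the cleanup local_cut_gpu_transfers has to keep doing when an optimizer keeps generating these patterns.

def cut_transfers(op_chain):
    """Remove adjacent gpu_from_host / host_from_gpu pairs from a chain of
    transfer ops applied to a variable (listed innermost-first)."""
    result = []
    for op in op_chain:
        if result and {result[-1], op} == {"gpu_from_host", "host_from_gpu"}:
            result.pop()  # a round-trip transfer cancels out
        else:
            result.append(op)
    return result

# gpu_from_host(host_from_gpu(gpu_from_host(Variable))) collapses to gpu_from_host(Variable)
print(cut_transfers(["gpu_from_host", "host_from_gpu", "gpu_from_host"]))
# -> ['gpu_from_host']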

There was some work done on caching Op instances from the base Op class, so that instances of an Op with the same parameters are not recreated. I tried to implement the caching in the Op class using a singleton, and I verified that instances with the same parameters are not recreated. But there are a few problems which require some higher-level refactoring. Currently the __call__ method of an Op is implemented in PureOp, which, when making the call to make_node, does not identify and pass all the parameters correctly. This parameter-passing issue would hopefully be resolved if all the Ops in Theano supported __props__, which would make it convenient to access the _props_dict and pass the parameters, instead of using the generalized, unconventional way through *args and **kwargs. Currently, most of the Ops in the old backend do not have __props__ implemented, so they cannot make use of the _props_dict. There are a few roadblocks to this: the instances of Elemwise require a dict to be passed as a parameter, which is of an unhashable type, and hence __props__ could not be implemented for them. Early this week, work will begin on making that parameter a hashable type, paving the way for both of these PRs to get merged. Once they are merged, there should be at least a 0.5x speed up in the optimization time.
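As a rough illustration of the caching idea, here is a minimal sketch using a toy Op with hashable __props__ (my own simplified code, not the actual PR): a metaclass memoizes instances keyed on their __props__ values, so constructing the "same" Op twice returns the same object.

class CachedOpMeta(type):
    """Metaclass that memoizes instances keyed on the values of __props__."""
    _cache = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance first, then key on its declared __props__ values.
        instance = super(CachedOpMeta, cls).__call__(*args, **kwargs)
        key = (cls,) + tuple(getattr(instance, prop) for prop in cls.__props__)
        return CachedOpMeta._cache.setdefault(key, instance)

class ToyOp(metaclass=CachedOpMeta):
    """Hypothetical Op with hashable __props__ (a stand-in for a real Theano Op)."""
    __props__ = ('inplace',)

    def __init__(self, inplace=False):
        self.inplace = inplace

assert ToyOp(inplace=True) is ToyOp(inplace=True)        # same object, not recreated
assert ToyOp(inplace=True) is not ToyOp(inplace=False)   # different __props__ -> new instance

This is also why the unhashable dict parameter of Elemwise is a blocker: the cache key has to be hashable.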

Finally, work has begun on implementing a CGT-style optimizer. This new optimizer applies optimizations while walking the graph in topological order. In Theano, this is being implemented as a local optimizer, aimed at replacing the canonicalize phase. Currently, Theano optimizes a node only "once". The main advantage of the new optimizer is that it optimizes a node more than once, by trying all the possible optimizations on the node until none of them apply: it applies an optimization to a node, then tries all the optimizations again on the newer node (the one that was just introduced), and so on. There is one drawback to this approach: after two optimizations have been applied, the node that is being replaced no longer has the fgraph attribute, and hence the optimizations that require this attribute cannot be tried. An example of the new optimizer at work is shown below, followed by a toy sketch of the apply-until-fixpoint loop.

Current Theano master: x ** 4 = T.sqr(x ** 2)
This branch: x ** 4 = T.sqr(T.sqr(x))
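The sketch below (a made-up expression type and a single rewrite rule, not the real implementation) only illustrates how repeatedly re-trying rewrites on the replacement node gets from x ** 4 to sqr(sqr(x)).

class Pow(object):
    def __init__(self, base, exp):
        self.base, self.exp = base, exp
    def __repr__(self):
        return "(%r ** %d)" % (self.base, self.exp)

class Sqr(object):
    def __init__(self, x):
        self.x = x
    def __repr__(self):
        return "sqr(%r)" % (self.x,)

def split_even_pow(node):
    """Local rewrite: x ** (2k) -> sqr(x ** k); returns None when it does not apply."""
    if isinstance(node, Pow) and node.exp > 1 and node.exp % 2 == 0:
        inner = node.base if node.exp == 2 else Pow(node.base, node.exp // 2)
        return Sqr(inner)
    return None

def optimize(node, rewrites=(split_even_pow,)):
    """Try every rewrite on the node; when one fires, optimize the replacement
    again (including any newly introduced sub-nodes), until none apply."""
    for rw in rewrites:
        new = rw(node)
        if new is not None:
            return optimize(new, rewrites)
    # No rewrite fired on this node: recurse into its inputs.
    if isinstance(node, Sqr):
        node.x = optimize(node.x, rewrites)
    elif isinstance(node, Pow):
        node.base = optimize(node.base, rewrites)
    return node

print(optimize(Pow('x', 4)))   # prints: sqr(sqr('x'))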

The drawback of this branch is that we won't be able to do this type of speed up for x ** 8 onwards. When profiled with the SBRNN (Stick Based RNN), the initial version of the draft seems to give approximately a 20-second speed up. Isn't that a good start? :D



That's it for now folks! :)


