
Posted to dev@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2012/03/11 05:16:20 UTC

yslow optimizations

YSmart does some clever correlation-based optimizations to achieve
significant speedups. They have a good paper about it:
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Note the Hive/Pig numbers: it seems we are generating unnecessary jobs
and too much intermediate data (not sure which version of Pig they
ran).

D

Re: yslow optimizations

Posted by ASF - Maillists <bu...@gmail.com>.

The correlation-based optimization in YSmart looks good: it creates a minimal number of jobs by exploiting correlations between multiple jobs. The experiment section mentions that they used the CDH distribution for their experimental setup. Since the paper was published at ICDCS 2011 in June, a quick glance over the release history of CDH3 beta 4 (released in Feb 2011) shows Pig 0.8.0, so that is probably the version they ran.
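
As a rough illustration of why exploiting such correlations reduces the job count, here is a toy Python sketch. It is not YSmart's algorithm or code, and every name in it is hypothetical. It assumes two aggregations that read the same input and group on the same key, and compares running them as two separate passes with running them as one merged pass.

```python
from collections import defaultdict

# Hypothetical input: (key, amount) records standing in for a table scan.
RECORDS = [("a", 10), ("b", 5), ("a", 7), ("c", 1), ("b", 3)]


def run_separate_jobs(records):
    """Two independent 'jobs'; each one scans and groups the input itself."""
    scans = 0

    totals = defaultdict(int)
    for key, amount in records:  # job 1: SUM(amount) GROUP BY key
        totals[key] += amount
    scans += 1

    counts = defaultdict(int)
    for key, _ in records:       # job 2: COUNT(*) GROUP BY key
        counts[key] += 1
    scans += 1

    return dict(totals), dict(counts), scans


def run_merged_job(records):
    """One 'job'; both aggregates share the same input and grouping key."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for key, amount in records:  # single scan feeds both aggregates
        totals[key] += amount
        counts[key] += 1
    return dict(totals), dict(counts), 1


if __name__ == "__main__":
    print(run_separate_jobs(RECORDS))
    print(run_merged_job(RECORDS))
```

Both variants produce the same aggregates, but the merged one reads the input once instead of twice. On a real cluster each extra pass also means another job launch and another round of intermediate data, which is the overhead the Hive/Pig numbers point at.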

Looks like they have patched this in Hive:
http://code.google.com/p/ysmart/wiki/HivePatch
