Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/09/01 00:59:17 UTC

Re: Does Pig not optimize GROUP based on previous GROUPs?

I was wrong about this:

  - perform all the IO on the original file
  - sort all the data
  - send all the data over the network to the reducers

Option 2:
  - potentially read data from the wrong hosts over the network during
the reduce phase.

I think these are the real steps:

- read all the blocks off disk on the source nodes and send them to mappers
- sort the data, rewriting it to disk during the sort if necessary
- send all the data to the reducers over the network
- write the data to disk on the reducers

vs

- read all the data off disk from the previous reduction phase
- potentially (with a high probability) send it over the network during the
group by …

It seems that we could set up a simpler benchmark by measuring each phase
individually and then comparing the totals to quantify the performance
advantage.
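
As a rough illustration of measuring each phase individually, here is a
minimal Python sketch. It is not tied to Pig or Hadoop: the PhaseTimer class,
the speedup helper and the phase names are all hypothetical, and the actual
work inside each timed block would have to be replaced with the real read,
sort, shuffle and write steps.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase (hypothetical helper)."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def phase(self, name):
        # Time whatever runs inside the `with` block and add it to `name`.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def total(self, names):
        # Sum the recorded time for the given phases; unknown phases count as 0.
        return sum(self.seconds.get(n, 0.0) for n in names)


def speedup(timer, full_plan, reuse_plan):
    """Ratio of time for the full regroup to time for reusing prior output."""
    reuse = timer.total(reuse_plan)
    if reuse <= 0.0:
        raise ValueError("no time recorded for the reuse plan")
    return timer.total(full_plan) / reuse


# Assumed phase labels, mirroring the two lists of steps above.
FULL_PLAN = ["read_source", "sort_spill", "shuffle", "reduce_write"]
REUSE_PLAN = ["read_prev_output", "network_regroup"]
```

Each phase would be wrapped as `with timer.phase("shuffle"): ...`, and
`speedup(timer, FULL_PLAN, REUSE_PLAN)` then gives the ratio between the two
approaches for whatever was measured.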


-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*