Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/09/01 00:59:17 UTC
Re: Does Pig not optimize GROUP based on previous GROUPs?
I was wrong about this:
- perform all the IO on the original file
- sort all the data
- send all the data over the network to the reducers
Option 2:
- read data from potentially the incorrect hosts over the network during
the reduce phase.
I think these are the real steps:
- read all the blocks off disk on the source nodes and send them to mappers
- sort the data, spilling it to disk during the sort if necessary
- send all the data to the reducers over the network
- write the data to disk on the reducers
vs
- read all the data off disk from the previous reduction phase
- potentially (with a high probability) send it over the network during the
group by …
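
As a rough way to compare the two lists of steps above, each plan can be modelled as a sum of per-phase costs. This is only a sketch under stated assumptions: the names (PhaseRates, full_regroup_seconds, reuse_previous_group_seconds, remote_fraction) are hypothetical, the phases are assumed to run one after another even though a real job may overlap them, and the throughput values would have to come from measurement.

```python
from dataclasses import dataclass


@dataclass
class PhaseRates:
    """Hypothetical sustained throughput per phase, in bytes per second."""
    disk_read: float
    sort: float
    network: float
    disk_write: float


def full_regroup_seconds(input_bytes: float, rates: PhaseRates) -> float:
    """Estimate the first plan: read, sort, shuffle, then write on the reducers.

    Assumes every byte passes through each phase once and phases do not overlap.
    """
    return (
        input_bytes / rates.disk_read
        + input_bytes / rates.sort
        + input_bytes / rates.network
        + input_bytes / rates.disk_write
    )


def reuse_previous_group_seconds(
    grouped_bytes: float, remote_fraction: float, rates: PhaseRates
) -> float:
    """Estimate the second plan: read the previous reduce output, and send
    only the fraction that is not already on the right host over the network.

    remote_fraction is an assumed value between 0.0 (all local) and 1.0 (all remote).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0.0 and 1.0")
    return (
        grouped_bytes / rates.disk_read
        + remote_fraction * grouped_bytes / rates.network
    )
```

With measured rates plugged in, the ratio of the two estimates gives a first guess at whether reusing the previous grouping is worth pursuing; remote_fraction is the term that captures the "high probability" of network transfer.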
It seems that we could set up a simpler benchmark by measuring each phase
individually and then quantifying the performance advantage.
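
One possible shape for such a benchmark is to time each phase separately and then compare the totals. The helper names here (timed, speedup) are hypothetical, and the code wrapped by each timer would be whatever stands in for the read, sort, transfer, and write steps in a given test setup.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, results):
    """Add the wall-clock seconds spent in the with-block to results[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[phase] = results.get(phase, 0.0) + elapsed


def speedup(baseline_phases, candidate_phases):
    """Return total baseline time divided by total candidate time.

    Both arguments map a phase name to seconds; a value above 1.0 means
    the candidate plan was faster overall.
    """
    baseline = sum(baseline_phases.values())
    candidate = sum(candidate_phases.values())
    if candidate <= 0:
        raise ValueError("candidate total must be positive")
    return baseline / candidate
```

For example, wrapping a stand-in sort as `with timed("sort", results): sorted(data)` records that phase under the "sort" key, and calling `speedup` on the two result dictionaries gives the overall ratio between the plans.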
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
Skype-in: *(415) 871-0687*