You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by Santhosh Srinivasan <sm...@yahoo-inc.com> on 2009/06/12 23:19:37 UTC

Rewire and multi-query load/store optimization

With the implementation of rewire as part of the optimizer
infrastructure, a bug was exposed in the load/store optimization in the
multi-query feature. Below, I will articulate the bug and the
ramifications of a few possible solutions.

Load/store optimization in the multi-query feature?
---------------------------------------------------

If a script has an explicit store and a corresponding load which loads
the output of the store, the store-load combination can be optimized. An
example will illustrate the concept.

Pre-conditions:

1. The store location and the load location should match
2. The store format and the load format should be compatible

{code}

A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';

{code}

In the script above, the output of the first store serves as input of
the second load (C). In addition, the store and load use PigStorage() as
the store/load mechanism. In the logical plan this combination by
splitting B into the store and D.

Bug
---

When the load in the store/load combination was removed, the inner plans
of the load's successors (in this case D), were not updated correctly.
As a result, the projections in the inner plans still held references to
non-existing operators.

Consequence of the bug fix
---------------------------

During the map-reduce (M/R) compilation the split operator is compiled
into a store and a load. Prior to multi-query, for each M/R boundary
resulted in a temporary store using BinStorage. The subsequent load
could infer the type as BinStorage returns typed records, i.e., non-byte
array records.

With multi-query and the load/store optimization, the temporary
BinStorage data is not generated. Instead, the subsequent load uses the
output of the previous store as its input. Here, the loader can get
typed or untyped records based on the loader. As a result, the operators
in the map phase that rely on the type information (inferred from the
logical plan) will fail due to type mismatch.

Possible Solutions
------------------

Solution 1
==========
Switch the load/store optimization. Users were primarily storing
intermediate data within the same script to overcome Pig's limitation,
i.e., absence of the multi-query feature. Going forward, with
multi-query turned on, users who store intermediate data will not enjoy
all the benefits of the optimization.

Solution 2
==========
After the M/R compilation is completed, during the final pass of the
plan, fix the types of the projections to reflect typed/untyped data. In
other words, if the loader is returning typed data then retain the types
else change the types to bytearray. In order to make this decision,
loaders should support an interface to indicate if the records are typed
or untyped.


Thanks,
Santhosh

Re: Rewire and multi-query load/store optimization

Posted by Alan Gates <ga...@yahoo-inc.com>.

+1 on option one.  The use of store->load was only to overcome a  
temporary problem in Pig.  We've fixed the problem, so let's not  
propagate it.  We will need to document this very clearly (maybe even  
to the point of issuing warnings in the parser when we see this combo)  
so users understand that this is now a hinderance rather than a help.

Alan.

On Jun 12, 2009, at 2:19 PM, Santhosh Srinivasan wrote:

> With the implementation of rewire as part of the optimizer
> infrastructure, a bug was exposed in the load/store optimization in  
> the
> multi-query feature. Below, I will articulate the bug and the
> ramifications of a few possible solutions.
>
> Load/store optimization in the multi-query feature?
> ---------------------------------------------------
>
> If a script has an explicit store and a corresponding load which loads
> the output of the store, the store-load combination can be  
> optimized. An
> example will illustrate the concept.
>
> Pre-conditions:
>
> 1. The store location and the load location should match
> 2. The store format and the load format should be compatible
>
> {code}
>
> A = load 'input';
> B = group A by $0;
> store B into 'output';
> C = load 'output';
> D = group C by $0;
> store D into 'some_other_output';
>
> {code}
>
> In the script above, the output of the first store serves as input of
> the second load (C). In addition, the store and load use  
> PigStorage() as
> the store/load mechanism. In the logical plan this combination by
> splitting B into the store and D.
>
> Bug
> ---
>
> When the load in the store/load combination was removed, the inner  
> plans
> of the load's successors (in this case D), were not updated correctly.
> As a result, the projections in the inner plans still held  
> references to
> non-existing operators.
>
> Consequence of the bug fix
> ---------------------------
>
> During the map-reduce (M/R) compilation the split operator is compiled
> into a store and a load. Prior to multi-query, for each M/R boundary
> resulted in a temporary store using BinStorage. The subsequent load
> could infer the type as BinStorage returns typed records, i.e., non- 
> byte
> array records.
>
> With multi-query and the load/store optimization, the temporary
> BinStorage data is not generated. Instead, the subsequent load uses  
> the
> output of the previous store as its input. Here, the loader can get
> typed or untyped records based on the loader. As a result, the  
> operators
> in the map phase that rely on the type information (inferred from the
> logical plan) will fail due to type mismatch.
>
> Possible Solutions
> ------------------
>
> Solution 1
> ==========
> Switch the load/store optimization. Users were primarily storing
> intermediate data within the same script to overcome Pig's limitation,
> i.e., absence of the multi-query feature. Going forward, with
> multi-query turned on, users who store intermediate data will not  
> enjoy
> all the benefits of the optimization.
>
> Solution 2
> ==========
> After the M/R compilation is completed, during the final pass of the
> plan, fix the types of the projections to reflect typed/untyped  
> data. In
> other words, if the loader is returning typed data then retain the  
> types
> else change the types to bytearray. In order to make this decision,
> loaders should support an interface to indicate if the records are  
> typed
> or untyped.
>
>
> Thanks,
> Santhosh