You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Swati Jain <sw...@aggiemail.usu.edu> on 2010/07/05 11:34:15 UTC

PIG Logical Optimization: Use CNF in SplitFilter

Hi,

I am interested in implementing logical optimization rules and to target
this I have studied currently implemented logical rules and the rule
framework. In particular, I felt that rules dealing with LOfilter are not
able to handle complicated boolean expressions. I would like to share
suggestions to improve handling of boolean expressions in LOFilter to enable
better optimization.

1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into two by
"AND". However it will not be able to split LOFilter if the top level
operator is "OR". For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;

In the above example current rule is not able to any filter condition across
any join as it contains columns from all branches (inputs). But if we
convert this expression into "Conjunctive Normal Form" (CNF) then we would
be able to push filter condition c1< 10 and c2 == 5 below both join
conditions. Here is the CNF expression for highlighted line:

( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF, it would then be easy to push parts (conjuncts) of the
LOFilter boolean expression selectively.

I have started thinking about the design for implementing this conversion
(arbitrary boolean expression to CNF) and would appreciate any feedback or
ideas.

Thanks!
Swati

RE: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Yan Zhou <ya...@yahoo-inc.com>.
In the original expression, let (a3+b3 > 10) to be true, then it
transformed to (c1 < 10) OR (c2 == 5) ) since "TRUE" OR anything is
still "TRUE"; "TRUE" and anything is that "anything". You can write a
visitor to easily do this type of "partial evaluation". (a3+b3>10) is
chosen because it can not be determined from alias 'C'.

Thanks,

Yan

-----Original Message-----
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 5:40 PM
To: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

Hi Yan

Thanks for your prompt reply. I did not understand your statement "C1
and
C2, or their equivalent, above JOIN can be easily figured out without
resorting to CNF".

Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 <
10)
AND (a3+b3 > 10) ) OR (c2 == 5)

The schema for LOJoin:

A = (a1:int,a2:int,a3:int);
B = (b1:int,b2:int,b3:int);
C = (c1:int,c2:int,c3:int);

After CNF: ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

Now we can push ( (c1 < 10) OR (c2 == 5) ) above the JOIN (in the branch
leading up to the source C) while ( (a3+b3 > 10) OR (c2 ==5) ) stays put
below the JOIN.

Please let me know if there is a way of doing the above optimization
without
converting the original expression to CNF.

Thanks,

Swati


On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

> I see. There looks like some disconnect about "Scenario 1". To me, all
> filtering logics that can be pushed above JOIN can be figured out
> without use of CNF, which is scenario 1; while CNF helps to derive the
> filtering logic after (or, in your example, below) JOIN, which is
> Scenario 2.
>
>
>
> In your example, C1 and C2, or their equivalent, above JOIN can be
> easily figured out without resorting to CNF; C3 may have to be figured
> out with CNF, but evaluation cost of the post-Join filtering logic
thus
> generated may not be cheaper than the original one before pushing up.
>
>
>
> In summary, if we want to support scenario 2(and 1), we should use
CNF;
> if we JUST want to support scenario 1, which will push up all possible
> filters closer to source and have all benefits on pruned I/O, we
should
> not use CNF.
>
>
>
> Thanks,
>
>
>
> Yan
>
>
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 12, 2010 4:04 PM
> To: pig-dev@hadoop.apache.org
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> Yan,
>
>
>
> What I meant in my last email was that scenario 2 optimizations would
> lead
>
> to more opportunities for scenario 1 kind of optimizations.
>
>
>
> Consider the conjunct list [C1;C2;C3] as the source of a JOIN.
>
>
>
> (a)  Suppose none of these are computable on a join input, in this
case
> we
>
> retain the original expression and discard the CNF.
>
>
>
> (b)  Suppose C1 is computable on join input J1 and C2 is computable on
> join
>
> input J2 but C3 requires a combination of both join inputs. In this
> case, we
>
> push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
> that
>
> C1 and C2 may be further pushed up (with additional iterations of the
>
> optimizer). If they are now the source of single input operators, it
is
>
> similar to scenario 1.
>
>
>
> Thanks,
>
> Swati
>
>
>
>
>
> On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
>
>
> >  Hopefully by this week. I'm still in the debugging phase of the
work.
>
> > While you are welcome to reuse some of my algorithms, I doubt you
can
> reuse
>
> > the code as much as you want. It's basically for my DNF use. You
might
> need
>
> > to factor out some general codes which you can find
>
> >
>
> > reusable.
>
> >
>
> >
>
> >
>
> > I fully understand the I/O benefits as I put in my first message.
And
> it is
>
> > classified as "Scenario 1". There is no doubt that it should be
> considered
>
> > as part of your work. However, for this, CNF is not necessary.
>
> >
>
> >
>
> >
>
> > For scenario 2, the benefits will be from less in-core logical
> expression
>
> > evaluation costs and no I/O benefits as I can see. And use of CNF
may
> or may
>
> > not lead to cheaper evaluations as the example in my first message
> shows. In
>
> > other words, after use of CNF, you should
>
> >
>
> > compare the eval cost with that in the original expression eval
before
>
> > deciding either the CNF or the original form should be evaluated.
>
> >
>
> >
>
> >
>
> > Please let me know if I miss any of your points.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Monday, July 12, 2010 11:52 AM
>
> >
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > I was wondering if you are not going to check in your patch soon
then
> it
>
> > would be great if you could share it with me. I believe I might be
> able to
>
> > reuse some of your (utility) functionality directly or get some
ideas.
>
> >
>
> > About your cost-benefit question:
>
> > 1) I will control the complexity of CNF conversion by providing a
>
> > configurable threshold value which will limit the OR-nesting.
>
> > 2) One benefit of this conversion is that it will allow pushing
parts
> of a
>
> > filter (conjuncts) across the joins which is not happening in the
> current
>
> > PushUpFilter optimization. Moreover, it may result in a cascading
> effect to
>
> > push the conjuncts below other operators by other rules that may be
> fired as
>
> > a result. The benefit from this is really data dependent, but in
> big-data
>
> > workloads, any kind of predicate pushdown may eventually lead to big
> savings
>
> > in amount of data read or amount of data transfered/shuffled across
> the
>
> > network (I need to understand the LogicalPlan to PhysicalPlan
> conversion
>
> > better to give concrete examples).
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> > On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com>
wrote:
>
> >
>
> > Yes, I already implemented the "NOT push down" upfront, so you do
not
> need
>
> > to do that.
>
> >
>
> >
>
> >
>
> > The support of CNF will probably be the most difficulty part. But as
I
>
> > mentioned last time, you should compare the cost after the trimming
> CNF to
>
> > get the post-split filtering logic. Given the complexity of
> manipulating CNF
>
> > and undetermined benefits, I am not sure it should be in scope at
this
>
> > moment or not.
>
> >
>
> >
>
> >
>
> > To handle CNF, I think it's a good idea to create a new plan and
> connect
>
> > the nodes in the new plan to the base plan as you envisioned. In my
> changes,
>
> > which uses DNF instead of CNF but processing is similar otherwise, I
> use a
>
> > LogicalExpressionProxy, which contains a "source" member that is
just
> the
>
> > node in the original plan, to link the nodes in the new plan and old
> plan.
>
> >  The original LogicalExpression is enhanced with a counter to trace
> the # of
>
> > proxies of the original nodes since normal form creation will
"spread"
> the
>
> > nodes in the original tree across many normalized nodes. The
benefit,
> aside
>
> > from not setting the plan, is that the original expression is
trimmed
>
> > according to the processing results from DNF; while DNF is created
>
> > separately and as a kinda utility so that complex features can be
> used. In
>
> > my changes, I used multiple-child tree in DNF while not changing the
>
> > original binary expression tree structure. Another benefit is that
the
>
> > original tree is kept as much as it is at the start, i.e., I do not
> attempt
>
> > to optimize its overall structure beyond trimming based upon the
>
> > simplification logics. (I also control the size of DNF to 100
nodes.)
> The
>
> > down side of this is added complexity.
>
> >
>
> >
>
> >
>
> > But in your case, for scenario 2 which is the whole point to use
CNF,
> you
>
> > would need to change the original expression tree structurally
beyond
>
> > trimming for post-split filtering logic. The other benefit of using
>
> > multiple-child expression is depending upon if you plan to support
> such
>
> > expression to replace current binary tree
>
> >
>
> > in the final plan. Even though I think it's a good idea to support
> that,
>
> > but it is not in my scope now.
>
> >
>
> >
>
> >
>
> > I'll add my algorithm details soon to my jira. Please take a look
and
>
> > comment as you see appropriate.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >
>
> >
>
> >
>
> >
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Friday, July 09, 2010 11:00 PM
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > Hi Yan,
>
> >
>
> > I agree that the first scenario (filter logic applied to individual
> input
>
> > sources) doesn't need conversion to CNF and that it will be a good
> idea to
>
> > add CNF functionality for the second scenario. I was also planning
to
>
> > provide a configurable threshold value to control the complexity of
> CNF
>
> > conversion.
>
> >
>
> > As part of the above, I wrote a utility to push the "NOT" operator
in
>
> > predicates below the "AND" and "OR" operators (Scenario 2 in
> PIG-1399). I am
>
> > considering making this utility to push NOT a separate rule in
itself.
> Lmk
>
> > if you have already implemented this.
>
> >
>
> > While implementing this utility I am facing some trouble in
> maintaining
>
> > OperatorPlan consistent as I rewrite the expression. This is because
> each
>
> > operator is referencing the main filter logical plan. Here is my
> current
>
> > approach of implementation:
>
> >
>
> > 1. I am creating a new LogicalExpressionPlan for the converted
boolean
>
> > expression.
>
> > 2. I am creating new logical expressions while pushing the NOT
> operation,
>
> > converting AND into OR, OR into AND eliminating NOT NOT pairs.
>
> > 3. However, I am having trouble updating the LogicalExpressionPlan
if
> it
>
> > reaches the base case ( i.e. root operator is not NOT,AND,OR).
>
> >
>
> > D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) )
)
> );
>
> >
>
> > In the above, for example, I am not sure how to integrate base
> expression
>
> > (c2 == 5) into the new LogicalExpressionPlan. There is no routine to
> set the
>
> > plan for a given operator and its children. Also, there is currently
> no way
>
> > to deepCopy an expression into a new OperatorPlan. It would be great
> if you
>
> > could give me some suggestions on what approach to take for this.
>
> >
>
> > One approach I thought of is to visit the base expression and create
> and
>
> > connect the base expression to the LogicalExpressionPlan as I visit
> it.
>
> >
>
> > Thoughts?
>
> > Swati
>
> >
>
> > ps: About your other point regarding binary vs multi way trees, the
> way I
>
> > am creating the normal form is a list of conjuncts, where each
> conjunct is a
>
> > list of disjuncts. This is logically similar to a multi waytree.
> However,
>
> > the current modeling of boolean expressions (modeled as binary
> expressions)
>
> > requires a conversion back to the binary tree model when adding back
> to the
>
> > main plan.
>
> >
>
> > On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com>
wrote:
>
> >
>
> > Swati,
>
> >
>
> > I happen to be working on the logical expression simplification
effort
>
> > (https://issues.apache.org/jira/browse/PIG-1399), but not on the
> filter
>
> > split front. So I guess our interests will have some overlaps.
>
> >
>
> > I think the filter logic split problem can be divided into 2 parts:
>
> > 1) the filtering logic that can be applied to individual input
> sources;
>
> > and 2) the filtering logic that has to be applied when merged(or
> joined)
>
> > inputs are processed.
>
> >
>
> > The benefits for 1) are any benefits if the underlying storage
> supports
>
> > predicate pushdown; plus the memory/CPU savings by PIG for not
>
> > processing the unqualified rows.
>
> >
>
> > For 2), the purpose is not paying higher evaluation costs than
>
> > necessary.
>
> >
>
> > For 1), no normal form is necessary. The original logical expression
>
> > tree
>
> > can be trimmed off any sub-expressions that are not constants nor
just
>
> > from a particular input source. The complexity is linear with the
tree
>
> > size; while the use of normal form could potentially lead to
> exponential
>
> > complexity. The difficulty with this approach is how to generate the
>
> > filtering logic for 2); while CNF can be used to easily figure out
the
>
> > logic for 2). However, the exact logic in 2) might not be cheaper to
>
> > evaluate than the original logical expression. An example is "Filter
> J2
>
> > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2)
> the
>
> > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
>
> > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The
cost
>
> > will be 5 logical evaluations (3 OR plus 2 AND), which could be
> reduced
>
> > to 4, compared with 3 logical evaluations in the original form.
>
> >
>
> > In summary, if only 1) is desired, the tree trimming is enough. If
2)
> is
>
> > desired too, then CNF could be used but its complexity should be
>
> > controlled and the cost of the filtering logic evaluation in 2)
should
>
> > be computed and compared with the original expression evaluation
cost.
>
> > Further optimization is possible in this direction.
>
> >
>
> > Another potential optimization to consider is to support logical
>
> > expression tree of multiple children vs. the binary tree after
taking
>
> > into consideration of the commutative property of OR and AND
> operations.
>
> > The advantages are less tree traversal costs and easier to change
the
>
> > evaluation ordering within the same sub-tree in order to maximize
the
>
> > possibilities to short-cut the evaluations. Although this is general
> for
>
> > all logical expressions, this tends to be more suitable for normal
> form
>
> > handlings as normal forms group the sub-expressions by the operators
>
> > that act on the sub-expressions.
>
> >
>
> > Thanks,
>
> >
>
> > Yan
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > Sent: Monday, July 05, 2010 2:34 AM
>
> > To: pig-dev@hadoop.apache.org
>
> > Cc: Daniel Dai
>
> > Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> > Hi,
>
> >
>
> > I am interested in implementing logical optimization rules and to
> target
>
> > this I have studied currently implemented logical rules and the rule
>
> > framework. In particular, I felt that rules dealing with LOfilter
are
>
> > not
>
> > able to handle complicated boolean expressions. I would like to
share
>
> > suggestions to improve handling of boolean expressions in LOFilter
to
>
> > enable
>
> > better optimization.
>
> >
>
> > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter
into
>
> > two by
>
> > "AND". However it will not be able to split LOFilter if the top
level
>
> > operator is "OR". For example:
>
> >
>
> > *ex script:*
>
> > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
>
> > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
>
> > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
>
> > J1 = JOIN B by b1, C by c1;
>
> > J2 = JOIN J1 by $0, A by a1;
>
> > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
>
> > explain D;
>
> >
>
> > In the above example current rule is not able to any filter
condition
>
> > across
>
> > any join as it contains columns from all branches (inputs). But if
we
>
> > convert this expression into "Conjunctive Normal Form" (CNF) then we
>
> > would
>
> > be able to push filter condition c1< 10 and c2 == 5 below both join
>
> > conditions. Here is the CNF expression for highlighted line:
>
> >
>
> > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> >
>
> > *Suggestions:* It would be a good idea to convert LOFilter's boolean
>
> > expression into CNF, it would then be easy to push parts (conjuncts)
> of
>
> > the
>
> > LOFilter boolean expression selectively.
>
> >
>
> > I have started thinking about the design for implementing this
>
> > conversion
>
> > (arbitrary boolean expression to CNF) and would appreciate any
> feedback
>
> > or
>
> > ideas.
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> >
>
> >
>
> >
>
> >
>
>

RE: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Yan Zhou <ya...@yahoo-inc.com>.
Sorry, I was in hurry when wrote last response which is not complete for
demonstration purpose. And the missing part is as follows:

No use of CNF can not generate ((a3+b3 > 10) OR (c2 ==5)) below the
JOIN, at least not as easily as I can see; this is what I characterized
as Scenario 2, i.e., the post-JOIN filtering.
But for this filtering you could have used the original filtering logic,
which in this case is more expensive than evaluating (a3+b3 > 10) OR (c2
==5)). However the cheaper evaluation is not guaranteed for all
filtering logics.

Thanks,

Yan

-----Original Message-----
From: Yan Zhou 
Sent: Monday, July 12, 2010 5:48 PM
To: 'Swati Jain'
Cc: 'pig-dev@hadoop.apache.org'
Subject: RE: PIG Logical Optimization: Use CNF in SplitFilter

In the original expression, let (a3+b3 > 10) to be true, then it
transformed to (c1 < 10) OR (c2 == 5) ) since "TRUE" OR anything is
still "TRUE"; "TRUE" and anything is that "anything". You can write a
visitor to easily do this type of "partial evaluation". (a3+b3>10) is
chosen because it can not be determined from alias 'C'.

Thanks,

Yan

-----Original Message-----
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 5:40 PM
To: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

Hi Yan

Thanks for your prompt reply. I did not understand your statement "C1
and
C2, or their equivalent, above JOIN can be easily figured out without
resorting to CNF".

Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 <
10)
AND (a3+b3 > 10) ) OR (c2 == 5)

The schema for LOJoin:

A = (a1:int,a2:int,a3:int);
B = (b1:int,b2:int,b3:int);
C = (c1:int,c2:int,c3:int);

After CNF: ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

Now we can push ( (c1 < 10) OR (c2 == 5) ) above the JOIN (in the branch
leading up to the source C) while ( (a3+b3 > 10) OR (c2 ==5) ) stays put
below the JOIN.

Please let me know if there is a way of doing the above optimization
without
converting the original expression to CNF.

Thanks,

Swati


On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

> I see. There looks like some disconnect about "Scenario 1". To me, all
> filtering logics that can be pushed above JOIN can be figured out
> without use of CNF, which is scenario 1; while CNF helps to derive the
> filtering logic after (or, in your example, below) JOIN, which is
> Scenario 2.
>
>
>
> In your example, C1 and C2, or their equivalent, above JOIN can be
> easily figured out without resorting to CNF; C3 may have to be figured
> out with CNF, but evaluation cost of the post-Join filtering logic
thus
> generated may not be cheaper than the original one before pushing up.
>
>
>
> In summary, if we want to support scenario 2(and 1), we should use
CNF;
> if we JUST want to support scenario 1, which will push up all possible
> filters closer to source and have all benefits on pruned I/O, we
should
> not use CNF.
>
>
>
> Thanks,
>
>
>
> Yan
>
>
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 12, 2010 4:04 PM
> To: pig-dev@hadoop.apache.org
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> Yan,
>
>
>
> What I meant in my last email was that scenario 2 optimizations would
> lead
>
> to more opportunities for scenario 1 kind of optimizations.
>
>
>
> Consider the conjunct list [C1;C2;C3] as the source of a JOIN.
>
>
>
> (a)  Suppose none of these are computable on a join input, in this
case
> we
>
> retain the original expression and discard the CNF.
>
>
>
> (b)  Suppose C1 is computable on join input J1 and C2 is computable on
> join
>
> input J2 but C3 requires a combination of both join inputs. In this
> case, we
>
> push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
> that
>
> C1 and C2 may be further pushed up (with additional iterations of the
>
> optimizer). If they are now the source of single input operators, it
is
>
> similar to scenario 1.
>
>
>
> Thanks,
>
> Swati
>
>
>
>
>
> On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
>
>
> >  Hopefully by this week. I'm still in the debugging phase of the
work.
>
> > While you are welcome to reuse some of my algorithms, I doubt you
can
> reuse
>
> > the code as much as you want. It's basically for my DNF use. You
might
> need
>
> > to factor out some general codes which you can find
>
> >
>
> > reusable.
>
> >
>
> >
>
> >
>
> > I fully understand the I/O benefits as I put in my first message.
And
> it is
>
> > classified as "Scenario 1". There is no doubt that it should be
> considered
>
> > as part of your work. However, for this, CNF is not necessary.
>
> >
>
> >
>
> >
>
> > For scenario 2, the benefits will be from less in-core logical
> expression
>
> > evaluation costs and no I/O benefits as I can see. And use of CNF
may
> or may
>
> > not lead to cheaper evaluations as the example in my first message
> shows. In
>
> > other words, after use of CNF, you should
>
> >
>
> > compare the eval cost with that in the original expression eval
before
>
> > deciding either the CNF or the original form should be evaluated.
>
> >
>
> >
>
> >
>
> > Please let me know if I miss any of your points.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Monday, July 12, 2010 11:52 AM
>
> >
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > I was wondering if you are not going to check in your patch soon
then
> it
>
> > would be great if you could share it with me. I believe I might be
> able to
>
> > reuse some of your (utility) functionality directly or get some
ideas.
>
> >
>
> > About your cost-benefit question:
>
> > 1) I will control the complexity of CNF conversion by providing a
>
> > configurable threshold value which will limit the OR-nesting.
>
> > 2) One benefit of this conversion is that it will allow pushing
parts
> of a
>
> > filter (conjuncts) across the joins which is not happening in the
> current
>
> > PushUpFilter optimization. Moreover, it may result in a cascading
> effect to
>
> > push the conjuncts below other operators by other rules that may be
> fired as
>
> > a result. The benefit from this is really data dependent, but in
> big-data
>
> > workloads, any kind of predicate pushdown may eventually lead to big
> savings
>
> > in amount of data read or amount of data transfered/shuffled across
> the
>
> > network (I need to understand the LogicalPlan to PhysicalPlan
> conversion
>
> > better to give concrete examples).
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> > On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com>
wrote:
>
> >
>
> > Yes, I already implemented the "NOT push down" upfront, so you do
not
> need
>
> > to do that.
>
> >
>
> >
>
> >
>
> > The support of CNF will probably be the most difficulty part. But as
I
>
> > mentioned last time, you should compare the cost after the trimming
> CNF to
>
> > get the post-split filtering logic. Given the complexity of
> manipulating CNF
>
> > and undetermined benefits, I am not sure it should be in scope at
this
>
> > moment or not.
>
> >
>
> >
>
> >
>
> > To handle CNF, I think it's a good idea to create a new plan and
> connect
>
> > the nodes in the new plan to the base plan as you envisioned. In my
> changes,
>
> > which uses DNF instead of CNF but processing is similar otherwise, I
> use a
>
> > LogicalExpressionProxy, which contains a "source" member that is
just
> the
>
> > node in the original plan, to link the nodes in the new plan and old
> plan.
>
> >  The original LogicalExpression is enhanced with a counter to trace
> the # of
>
> > proxies of the original nodes since normal form creation will
"spread"
> the
>
> > nodes in the original tree across many normalized nodes. The
benefit,
> aside
>
> > from not setting the plan, is that the original expression is
trimmed
>
> > according to the processing results from DNF; while DNF is created
>
> > separately and as a kinda utility so that complex features can be
> used. In
>
> > my changes, I used multiple-child tree in DNF while not changing the
>
> > original binary expression tree structure. Another benefit is that
the
>
> > original tree is kept as much as it is at the start, i.e., I do not
> attempt
>
> > to optimize its overall structure beyond trimming based upon the
>
> > simplification logics. (I also control the size of DNF to 100
nodes.)
> The
>
> > down side of this is added complexity.
>
> >
>
> >
>
> >
>
> > But in your case, for scenario 2 which is the whole point to use
CNF,
> you
>
> > would need to change the original expression tree structurally
beyond
>
> > trimming for post-split filtering logic. The other benefit of using
>
> > multiple-child expression is depending upon if you plan to support
> such
>
> > expression to replace current binary tree
>
> >
>
> > in the final plan. Even though I think it's a good idea to support
> that,
>
> > but it is not in my scope now.
>
> >
>
> >
>
> >
>
> > I'll add my algorithm details soon to my jira. Please take a look
and
>
> > comment as you see appropriate.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >
>
> >
>
> >
>
> >
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Friday, July 09, 2010 11:00 PM
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > Hi Yan,
>
> >
>
> > I agree that the first scenario (filter logic applied to individual
> input
>
> > sources) doesn't need conversion to CNF and that it will be a good
> idea to
>
> > add CNF functionality for the second scenario. I was also planning
to
>
> > provide a configurable threshold value to control the complexity of
> CNF
>
> > conversion.
>
> >
>
> > As part of the above, I wrote a utility to push the "NOT" operator
in
>
> > predicates below the "AND" and "OR" operators (Scenario 2 in
> PIG-1399). I am
>
> > considering making this utility to push NOT a separate rule in
itself.
> Lmk
>
> > if you have already implemented this.
>
> >
>
> > While implementing this utility I am facing some trouble in
> maintaining
>
> > OperatorPlan consistent as I rewrite the expression. This is because
> each
>
> > operator is referencing the main filter logical plan. Here is my
> current
>
> > approach of implementation:
>
> >
>
> > 1. I am creating a new LogicalExpressionPlan for the converted
boolean
>
> > expression.
>
> > 2. I am creating new logical expressions while pushing the NOT
> operation,
>
> > converting AND into OR, OR into AND eliminating NOT NOT pairs.
>
> > 3. However, I am having trouble updating the LogicalExpressionPlan
if
> it
>
> > reaches the base case ( i.e. root operator is not NOT,AND,OR).
>
> >
>
> > D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) )
)
> );
>
> >
>
> > In the above, for example, I am not sure how to integrate base
> expression
>
> > (c2 == 5) into the new LogicalExpressionPlan. There is no routine to
> set the
>
> > plan for a given operator and its children. Also, there is currently
> no way
>
> > to deepCopy an expression into a new OperatorPlan. It would be great
> if you
>
> > could give me some suggestions on what approach to take for this.
>
> >
>
> > One approach I thought of is to visit the base expression and create
> and
>
> > connect the base expression to the LogicalExpressionPlan as I visit
> it.
>
> >
>
> > Thoughts?
>
> > Swati
>
> >
>
> > ps: About your other point regarding binary vs multi way trees, the
> way I
>
> > am creating the normal form is a list of conjuncts, where each
> conjunct is a
>
> > list of disjuncts. This is logically similar to a multi waytree.
> However,
>
> > the current modeling of boolean expressions (modeled as binary
> expressions)
>
> > requires a conversion back to the binary tree model when adding back
> to the
>
> > main plan.
>
> >
>
> > On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com>
wrote:
>
> >
>
> > Swati,
>
> >
>
> > I happen to be working on the logical expression simplification
effort
>
> > (https://issues.apache.org/jira/browse/PIG-1399), but not on the
> filter
>
> > split front. So I guess our interests will have some overlaps.
>
> >
>
> > I think the filter logic split problem can be divided into 2 parts:
>
> > 1) the filtering logic that can be applied to individual input
> sources;
>
> > and 2) the filtering logic that has to be applied when merged(or
> joined)
>
> > inputs are processed.
>
> >
>
> > The benefits for 1) are any benefits if the underlying storage
> supports
>
> > predicate pushdown; plus the memory/CPU savings by PIG for not
>
> > processing the unqualified rows.
>
> >
>
> > For 2), the purpose is not paying higher evaluation costs than
>
> > necessary.
>
> >
>
> > For 1), no normal form is necessary. The original logical expression
>
> > tree
>
> > can be trimmed off any sub-expressions that are not constants nor
just
>
> > from a particular input source. The complexity is linear with the
tree
>
> > size; while the use of normal form could potentially lead to
> exponential
>
> > complexity. The difficulty with this approach is how to generate the
>
> > filtering logic for 2); while CNF can be used to easily figure out
the
>
> > logic for 2). However, the exact logic in 2) might not be cheaper to
>
> > evaluate than the original logical expression. An example is "Filter
> J2
>
> > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2)
> the
>
> > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
>
> > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The
cost
>
> > will be 5 logical evaluations (3 OR plus 2 AND), which could be
> reduced
>
> > to 4, compared with 3 logical evaluations in the original form.
>
> >
>
> > In summary, if only 1) is desired, the tree trimming is enough. If
2)
> is
>
> > desired too, then CNF could be used but its complexity should be
>
> > controlled and the cost of the filtering logic evaluation in 2)
should
>
> > be computed and compared with the original expression evaluation
cost.
>
> > Further optimization is possible in this direction.
>
> >
>
> > Another potential optimization to consider is to support logical
>
> > expression tree of multiple children vs. the binary tree after
taking
>
> > into consideration of the commutative property of OR and AND
> operations.
>
> > The advantages are less tree traversal costs and easier to change
the
>
> > evaluation ordering within the same sub-tree in order to maximize
the
>
> > possibilities to short-cut the evaluations. Although this is general
> for
>
> > all logical expressions, this tends to be more suitable for normal
> form
>
> > handlings as normal forms group the sub-expressions by the operators
>
> > that act on the sub-expressions.
>
> >
>
> > Thanks,
>
> >
>
> > Yan
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > Sent: Monday, July 05, 2010 2:34 AM
>
> > To: pig-dev@hadoop.apache.org
>
> > Cc: Daniel Dai
>
> > Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> > Hi,
>
> >
>
> > I am interested in implementing logical optimization rules and to
> target
>
> > this I have studied currently implemented logical rules and the rule
>
> > framework. In particular, I felt that rules dealing with LOfilter
are
>
> > not
>
> > able to handle complicated boolean expressions. I would like to
share
>
> > suggestions to improve handling of boolean expressions in LOFilter
to
>
> > enable
>
> > better optimization.
>
> >
>
> > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter
into
>
> > two by
>
> > "AND". However it will not be able to split LOFilter if the top
level
>
> > operator is "OR". For example:
>
> >
>
> > *ex script:*
>
> > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
>
> > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
>
> > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
>
> > J1 = JOIN B by b1, C by c1;
>
> > J2 = JOIN J1 by $0, A by a1;
>
> > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
>
> > explain D;
>
> >
>
> > In the above example current rule is not able to any filter
condition
>
> > across
>
> > any join as it contains columns from all branches (inputs). But if
we
>
> > convert this expression into "Conjunctive Normal Form" (CNF) then we
>
> > would
>
> > be able to push filter condition c1< 10 and c2 == 5 below both join
>
> > conditions. Here is the CNF expression for highlighted line:
>
> >
>
> > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> >
>
> > *Suggestions:* It would be a good idea to convert LOFilter's boolean
>
> > expression into CNF, it would then be easy to push parts (conjuncts)
> of
>
> > the
>
> > LOFilter boolean expression selectively.
>
> >
>
> > I have started thinking about the design for implementing this
>
> > conversion
>
> > (arbitrary boolean expression to CNF) and would appreciate any
> feedback
>
> > or
>
> > ideas.
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> >
>
> >
>
> >
>
> >
>
>

Re: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Swati Jain <sw...@aggiemail.usu.edu>.
Hi Yan

Thanks for your prompt reply. I did not understand your statement “C1 and
C2, or their equivalent, above JOIN can be easily figured out without
resorting to CNF”.

Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 < 10)
AND (a3+b3 > 10) ) OR (c2 == 5)

The schema for LOJoin:

A = (a1:int,a2:int,a3:int);
B = (b1:int,b2:int,b3:int);
C = (c1:int,c2:int,c3:int);

After CNF: ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

Now we can push ( (c1 < 10) OR (c2 == 5) ) above the JOIN (in the branch
leading up to the source C) while ( (a3+b3 > 10) OR (c2 ==5) ) stays put
below the JOIN.

Please let me know if there is a way of doing the above optimization without
converting the original expression to CNF.

Thanks,

Swati


On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

> I see. There looks like some disconnect about "Scenario 1". To me, all
> filtering logics that can be pushed above JOIN can be figured out
> without use of CNF, which is scenario 1; while CNF helps to derive the
> filtering logic after (or, in your example, below) JOIN, which is
> Scenario 2.
>
>
>
> In your example, C1 and C2, or their equivalent, above JOIN can be
> easily figured out without resorting to CNF; C3 may have to be figured
> out with CNF, but evaluation cost of the post-Join filtering logic thus
> generated may not be cheaper than the original one before pushing up.
>
>
>
> In summary, if we want to support scenario 2(and 1), we should use CNF;
> if we JUST want to support scenario 1, which will push up all possible
> filters closer to source and have all benefits on pruned I/O, we should
> not use CNF.
>
>
>
> Thanks,
>
>
>
> Yan
>
>
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 12, 2010 4:04 PM
> To: pig-dev@hadoop.apache.org
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> Yan,
>
>
>
> What I meant in my last email was that scenario 2 optimizations would
> lead
>
> to more opportunities for scenario 1 kind of optimizations.
>
>
>
> Consider the conjunct list [C1;C2;C3] as the source of a JOIN.
>
>
>
> (a)  Suppose none of these are computable on a join input, in this case
> we
>
> retain the original expression and discard the CNF.
>
>
>
> (b)  Suppose C1 is computable on join input J1 and C2 is computable on
> join
>
> input J2 but C3 requires a combination of both join inputs. In this
> case, we
>
> push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
> that
>
> C1 and C2 may be further pushed up (with additional iterations of the
>
> optimizer). If they are now the source of single input operators, it is
>
> similar to scenario 1.
>
>
>
> Thanks,
>
> Swati
>
>
>
>
>
> On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
>
>
> >  Hopefully by this week. I'm still in the debugging phase of the work.
>
> > While you are welcome to reuse some of my algorithms, I doubt you can
> reuse
>
> > the code as much as you want. It's basically for my DNF use. You might
> need
>
> > to factor out some general codes which you can find
>
> >
>
> > reusable.
>
> >
>
> >
>
> >
>
> > I fully understand the I/O benefits as I put in my first message. And
> it is
>
> > classified as "Scenario 1". There is no doubt that it should be
> considered
>
> > as part of your work. However, for this, CNF is not necessary.
>
> >
>
> >
>
> >
>
> > For scenario 2, the benefits will be from less in-core logical
> expression
>
> > evaluation costs and no I/O benefits as I can see. And use of CNF may
> or may
>
> > not lead to cheaper evaluations as the example in my first message
> shows. In
>
> > other words, after use of CNF, you should
>
> >
>
> > compare the eval cost with that in the original expression eval before
>
> > deciding either the CNF or the original form should be evaluated.
>
> >
>
> >
>
> >
>
> > Please let me know if I miss any of your points.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Monday, July 12, 2010 11:52 AM
>
> >
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > I was wondering if you are not going to check in your patch soon then
> it
>
> > would be great if you could share it with me. I believe I might be
> able to
>
> > reuse some of your (utility) functionality directly or get some ideas.
>
> >
>
> > About your cost-benefit question:
>
> > 1) I will control the complexity of CNF conversion by providing a
>
> > configurable threshold value which will limit the OR-nesting.
>
> > 2) One benefit of this conversion is that it will allow pushing parts
> of a
>
> > filter (conjuncts) across the joins which is not happening in the
> current
>
> > PushUpFilter optimization. Moreover, it may result in a cascading
> effect to
>
> > push the conjuncts below other operators by other rules that may be
> fired as
>
> > a result. The benefit from this is really data dependent, but in
> big-data
>
> > workloads, any kind of predicate pushdown may eventually lead to big
> savings
>
> > in amount of data read or amount of data transfered/shuffled across
> the
>
> > network (I need to understand the LogicalPlan to PhysicalPlan
> conversion
>
> > better to give concrete examples).
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> > On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
> >
>
> > Yes, I already implemented the "NOT push down" upfront, so you do not
> need
>
> > to do that.
>
> >
>
> >
>
> >
>
> > The support of CNF will probably be the most difficulty part. But as I
>
> > mentioned last time, you should compare the cost after the trimming
> CNF to
>
> > get the post-split filtering logic. Given the complexity of
> manipulating CNF
>
> > and undetermined benefits, I am not sure it should be in scope at this
>
> > moment or not.
>
> >
>
> >
>
> >
>
> > To handle CNF, I think it's a good idea to create a new plan and
> connect
>
> > the nodes in the new plan to the base plan as you envisioned. In my
> changes,
>
> > which uses DNF instead of CNF but processing is similar otherwise, I
> use a
>
> > LogicalExpressionProxy, which contains a "source" member that is just
> the
>
> > node in the original plan, to link the nodes in the new plan and old
> plan.
>
> >  The original LogicalExpression is enhanced with a counter to trace
> the # of
>
> > proxies of the original nodes since normal form creation will "spread"
> the
>
> > nodes in the original tree across many normalized nodes. The benefit,
> aside
>
> > from not setting the plan, is that the original expression is trimmed
>
> > according to the processing results from DNF; while DNF is created
>
> > separately and as a kinda utility so that complex features can be
> used. In
>
> > my changes, I used multiple-child tree in DNF while not changing the
>
> > original binary expression tree structure. Another benefit is that the
>
> > original tree is kept as much as it is at the start, i.e., I do not
> attempt
>
> > to optimize its overall structure beyond trimming based upon the
>
> > simplification logics. (I also control the size of DNF to 100 nodes.)
> The
>
> > down side of this is added complexity.
>
> >
>
> >
>
> >
>
> > But in your case, for scenario 2 which is the whole point to use CNF,
> you
>
> > would need to change the original expression tree structurally beyond
>
> > trimming for post-split filtering logic. The other benefit of using
>
> > multiple-child expression is depending upon if you plan to support
> such
>
> > expression to replace current binary tree
>
> >
>
> > in the final plan. Even though I think it's a good idea to support
> that,
>
> > but it is not in my scope now.
>
> >
>
> >
>
> >
>
> > I'll add my algorithm details soon to my jira. Please take a look and
>
> > comment as you see appropriate.
>
> >
>
> >
>
> >
>
> > Thanks,
>
> >
>
> >
>
> >
>
> > Yan
>
> >
>
> >
>
> >
>
> >
>
> >  ------------------------------
>
> >
>
> > *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > *Sent:* Friday, July 09, 2010 11:00 PM
>
> > *To:* Yan Zhou
>
> > *Cc:* pig-dev@hadoop.apache.org
>
> > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> >
>
> >
>
> > Hi Yan,
>
> >
>
> > I agree that the first scenario (filter logic applied to individual
> input
>
> > sources) doesn't need conversion to CNF and that it will be a good
> idea to
>
> > add CNF functionality for the second scenario. I was also planning to
>
> > provide a configurable threshold value to control the complexity of
> CNF
>
> > conversion.
>
> >
>
> > As part of the above, I wrote a utility to push the "NOT" operator in
>
> > predicates below the "AND" and "OR" operators (Scenario 2 in
> PIG-1399). I am
>
> > considering making this utility to push NOT a separate rule in itself.
> Lmk
>
> > if you have already implemented this.
>
> >
>
> > While implementing this utility I am facing some trouble in
> maintaining
>
> > OperatorPlan consistent as I rewrite the expression. This is because
> each
>
> > operator is referencing the main filter logical plan. Here is my
> current
>
> > approach of implementation:
>
> >
>
> > 1. I am creating a new LogicalExpressionPlan for the converted boolean
>
> > expression.
>
> > 2. I am creating new logical expressions while pushing the NOT
> operation,
>
> > converting AND into OR, OR into AND eliminating NOT NOT pairs.
>
> > 3. However, I am having trouble updating the LogicalExpressionPlan if
> it
>
> > reaches the base case ( i.e. root operator is not NOT,AND,OR).
>
> >
>
> > D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) )
> );
>
> >
>
> > In the above, for example, I am not sure how to integrate base
> expression
>
> > (c2 == 5) into the new LogicalExpressionPlan. There is no routine to
> set the
>
> > plan for a given operator and its children. Also, there is currently
> no way
>
> > to deepCopy an expression into a new OperatorPlan. It would be great
> if you
>
> > could give me some suggestions on what approach to take for this.
>
> >
>
> > One approach I thought of is to visit the base expression and create
> and
>
> > connect the base expression to the LogicalExpressionPlan as I visit
> it.
>
> >
>
> > Thoughts?
>
> > Swati
>
> >
>
> > ps: About your other point regarding binary vs multi way trees, the
> way I
>
> > am creating the normal form is a list of conjuncts, where each
> conjunct is a
>
> > list of disjuncts. This is logically similar to a multi waytree.
> However,
>
> > the current modeling of boolean expressions (modeled as binary
> expressions)
>
> > requires a conversion back to the binary tree model when adding back
> to the
>
> > main plan.
>
> >
>
> > On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
> >
>
> > Swati,
>
> >
>
> > I happen to be working on the logical expression simplification effort
>
> > (https://issues.apache.org/jira/browse/PIG-1399), but not on the
> filter
>
> > split front. So I guess our interests will have some overlaps.
>
> >
>
> > I think the filter logic split problem can be divided into 2 parts:
>
> > 1) the filtering logic that can be applied to individual input
> sources;
>
> > and 2) the filtering logic that has to be applied when merged(or
> joined)
>
> > inputs are processed.
>
> >
>
> > The benefits for 1) are any benefits if the underlying storage
> supports
>
> > predicate pushdown; plus the memory/CPU savings by PIG for not
>
> > processing the unqualified rows.
>
> >
>
> > For 2), the purpose is not paying higher evaluation costs than
>
> > necessary.
>
> >
>
> > For 1), no normal form is necessary. The original logical expression
>
> > tree
>
> > can be trimmed off any sub-expressions that are not constants nor just
>
> > from a particular input source. The complexity is linear with the tree
>
> > size; while the use of normal form could potentially lead to
> exponential
>
> > complexity. The difficulty with this approach is how to generate the
>
> > filtering logic for 2); while CNF can be used to easily figure out the
>
> > logic for 2). However, the exact logic in 2) might not be cheaper to
>
> > evaluate than the original logical expression. An example is "Filter
> J2
>
> > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2)
> the
>
> > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
>
> > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
>
> > will be 5 logical evaluations (3 OR plus 2 AND), which could be
> reduced
>
> > to 4, compared with 3 logical evaluations in the original form.
>
> >
>
> > In summary, if only 1) is desired, the tree trimming is enough. If 2)
> is
>
> > desired too, then CNF could be used but its complexity should be
>
> > controlled and the cost of the filtering logic evaluation in 2) should
>
> > be computed and compared with the original expression evaluation cost.
>
> > Further optimization is possible in this direction.
>
> >
>
> > Another potential optimization to consider is to support logical
>
> > expression tree of multiple children vs. the binary tree after taking
>
> > into consideration of the commutative property of OR and AND
> operations.
>
> > The advantages are less tree traversal costs and easier to change the
>
> > evaluation ordering within the same sub-tree in order to maximize the
>
> > possibilities to short-cut the evaluations. Although this is general
> for
>
> > all logical expressions, this tends to be more suitable for normal
> form
>
> > handlings as normal forms group the sub-expressions by the operators
>
> > that act on the sub-expressions.
>
> >
>
> > Thanks,
>
> >
>
> > Yan
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
>
> > Sent: Monday, July 05, 2010 2:34 AM
>
> > To: pig-dev@hadoop.apache.org
>
> > Cc: Daniel Dai
>
> > Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> >
>
> > Hi,
>
> >
>
> > I am interested in implementing logical optimization rules and to
> target
>
> > this I have studied currently implemented logical rules and the rule
>
> > framework. In particular, I felt that rules dealing with LOfilter are
>
> > not
>
> > able to handle complicated boolean expressions. I would like to share
>
> > suggestions to improve handling of boolean expressions in LOFilter to
>
> > enable
>
> > better optimization.
>
> >
>
> > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
>
> > two by
>
> > "AND". However it will not be able to split LOFilter if the top level
>
> > operator is "OR". For example:
>
> >
>
> > *ex script:*
>
> > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
>
> > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
>
> > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
>
> > J1 = JOIN B by b1, C by c1;
>
> > J2 = JOIN J1 by $0, A by a1;
>
> > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
>
> > explain D;
>
> >
>
> > In the above example current rule is not able to any filter condition
>
> > across
>
> > any join as it contains columns from all branches (inputs). But if we
>
> > convert this expression into "Conjunctive Normal Form" (CNF) then we
>
> > would
>
> > be able to push filter condition c1< 10 and c2 == 5 below both join
>
> > conditions. Here is the CNF expression for highlighted line:
>
> >
>
> > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> >
>
> > *Suggestions:* It would be a good idea to convert LOFilter's boolean
>
> > expression into CNF, it would then be easy to push parts (conjuncts)
> of
>
> > the
>
> > LOFilter boolean expression selectively.
>
> >
>
> > I have started thinking about the design for implementing this
>
> > conversion
>
> > (arbitrary boolean expression to CNF) and would appreciate any
> feedback
>
> > or
>
> > ideas.
>
> >
>
> > Thanks!
>
> > Swati
>
> >
>
> >
>
> >
>
> >
>
> >
>
>

RE: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Yan Zhou <ya...@yahoo-inc.com>.
I see. There looks like some disconnect about "Scenario 1". To me, all
filtering logics that can be pushed above JOIN can be figured out
without use of CNF, which is scenario 1; while CNF helps to derive the
filtering logic after (or, in your example, below) JOIN, which is
Scenario 2.

 

In your example, C1 and C2, or their equivalent, above JOIN can be
easily figured out without resorting to CNF; C3 may have to be figured
out with CNF, but evaluation cost of the post-Join filtering logic thus
generated may not be cheaper than the original one before pushing up.

 

In summary, if we want to support scenario 2(and 1), we should use CNF;
if we JUST want to support scenario 1, which will push up all possible
filters closer to source and have all benefits on pruned I/O, we should
not use CNF.

 

Thanks,

 

Yan

 

-----Original Message-----
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 4:04 PM
To: pig-dev@hadoop.apache.org
Subject: PIG Logical Optimization: Use CNF in SplitFilter

 

Yan,

 

What I meant in my last email was that scenario 2 optimizations would
lead

to more opportunities for scenario 1 kind of optimizations.

 

Consider the conjunct list [C1;C2;C3] as the source of a JOIN.

 

(a)  Suppose none of these are computable on a join input, in this case
we

retain the original expression and discard the CNF.

 

(b)  Suppose C1 is computable on join input J1 and C2 is computable on
join

input J2 but C3 requires a combination of both join inputs. In this
case, we

push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
that

C1 and C2 may be further pushed up (with additional iterations of the

optimizer). If they are now the source of single input operators, it is

similar to scenario 1.

 

Thanks,

Swati

 

 

On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

 

>  Hopefully by this week. I'm still in the debugging phase of the work.

> While you are welcome to reuse some of my algorithms, I doubt you can
reuse

> the code as much as you want. It's basically for my DNF use. You might
need

> to factor out some general codes which you can find

> 

> reusable.

> 

> 

> 

> I fully understand the I/O benefits as I put in my first message. And
it is

> classified as "Scenario 1". There is no doubt that it should be
considered

> as part of your work. However, for this, CNF is not necessary.

> 

> 

> 

> For scenario 2, the benefits will be from less in-core logical
expression

> evaluation costs and no I/O benefits as I can see. And use of CNF may
or may

> not lead to cheaper evaluations as the example in my first message
shows. In

> other words, after use of CNF, you should

> 

> compare the eval cost with that in the original expression eval before

> deciding either the CNF or the original form should be evaluated.

> 

> 

> 

> Please let me know if I miss any of your points.

> 

> 

> 

> Thanks,

> 

> 

> 

> Yan

>  ------------------------------

> 

> *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]

> *Sent:* Monday, July 12, 2010 11:52 AM

> 

> *To:* Yan Zhou

> *Cc:* pig-dev@hadoop.apache.org

> *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter

> 

> 

> 

> I was wondering if you are not going to check in your patch soon then
it

> would be great if you could share it with me. I believe I might be
able to

> reuse some of your (utility) functionality directly or get some ideas.

> 

> About your cost-benefit question:

> 1) I will control the complexity of CNF conversion by providing a

> configurable threshold value which will limit the OR-nesting.

> 2) One benefit of this conversion is that it will allow pushing parts
of a

> filter (conjuncts) across the joins which is not happening in the
current

> PushUpFilter optimization. Moreover, it may result in a cascading
effect to

> push the conjuncts below other operators by other rules that may be
fired as

> a result. The benefit from this is really data dependent, but in
big-data

> workloads, any kind of predicate pushdown may eventually lead to big
savings

> in amount of data read or amount of data transfered/shuffled across
the

> network (I need to understand the LogicalPlan to PhysicalPlan
conversion

> better to give concrete examples).

> 

> Thanks!

> Swati

> 

> On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com> wrote:

> 

> Yes, I already implemented the "NOT push down" upfront, so you do not
need

> to do that.

> 

> 

> 

> The support of CNF will probably be the most difficulty part. But as I

> mentioned last time, you should compare the cost after the trimming
CNF to

> get the post-split filtering logic. Given the complexity of
manipulating CNF

> and undetermined benefits, I am not sure it should be in scope at this

> moment or not.

> 

> 

> 

> To handle CNF, I think it's a good idea to create a new plan and
connect

> the nodes in the new plan to the base plan as you envisioned. In my
changes,

> which uses DNF instead of CNF but processing is similar otherwise, I
use a

> LogicalExpressionProxy, which contains a "source" member that is just
the

> node in the original plan, to link the nodes in the new plan and old
plan.

>  The original LogicalExpression is enhanced with a counter to trace
the # of

> proxies of the original nodes since normal form creation will "spread"
the

> nodes in the original tree across many normalized nodes. The benefit,
aside

> from not setting the plan, is that the original expression is trimmed

> according to the processing results from DNF; while DNF is created

> separately and as a kinda utility so that complex features can be
used. In

> my changes, I used multiple-child tree in DNF while not changing the

> original binary expression tree structure. Another benefit is that the

> original tree is kept as much as it is at the start, i.e., I do not
attempt

> to optimize its overall structure beyond trimming based upon the

> simplification logics. (I also control the size of DNF to 100 nodes.)
The

> down side of this is added complexity.

> 

> 

> 

> But in your case, for scenario 2 which is the whole point to use CNF,
you

> would need to change the original expression tree structurally beyond

> trimming for post-split filtering logic. The other benefit of using

> multiple-child expression is depending upon if you plan to support
such

> expression to replace current binary tree

> 

> in the final plan. Even though I think it's a good idea to support
that,

> but it is not in my scope now.

> 

> 

> 

> I'll add my algorithm details soon to my jira. Please take a look and

> comment as you see appropriate.

> 

> 

> 

> Thanks,

> 

> 

> 

> Yan

> 

> 

> 

> 

>  ------------------------------

> 

> *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]

> *Sent:* Friday, July 09, 2010 11:00 PM

> *To:* Yan Zhou

> *Cc:* pig-dev@hadoop.apache.org

> *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter

> 

> 

> 

> Hi Yan,

> 

> I agree that the first scenario (filter logic applied to individual
input

> sources) doesn't need conversion to CNF and that it will be a good
idea to

> add CNF functionality for the second scenario. I was also planning to

> provide a configurable threshold value to control the complexity of
CNF

> conversion.

> 

> As part of the above, I wrote a utility to push the "NOT" operator in

> predicates below the "AND" and "OR" operators (Scenario 2 in
PIG-1399). I am

> considering making this utility to push NOT a separate rule in itself.
Lmk

> if you have already implemented this.

> 

> While implementing this utility I am facing some trouble in
maintaining

> OperatorPlan consistent as I rewrite the expression. This is because
each

> operator is referencing the main filter logical plan. Here is my
current

> approach of implementation:

> 

> 1. I am creating a new LogicalExpressionPlan for the converted boolean

> expression.

> 2. I am creating new logical expressions while pushing the NOT
operation,

> converting AND into OR, OR into AND eliminating NOT NOT pairs.

> 3. However, I am having trouble updating the LogicalExpressionPlan if
it

> reaches the base case ( i.e. root operator is not NOT,AND,OR).

> 

> D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) )
);

> 

> In the above, for example, I am not sure how to integrate base
expression

> (c2 == 5) into the new LogicalExpressionPlan. There is no routine to
set the

> plan for a given operator and its children. Also, there is currently
no way

> to deepCopy an expression into a new OperatorPlan. It would be great
if you

> could give me some suggestions on what approach to take for this.

> 

> One approach I thought of is to visit the base expression and create
and

> connect the base expression to the LogicalExpressionPlan as I visit
it.

> 

> Thoughts?

> Swati

> 

> ps: About your other point regarding binary vs multi way trees, the
way I

> am creating the normal form is a list of conjuncts, where each
conjunct is a

> list of disjuncts. This is logically similar to a multi waytree.
However,

> the current modeling of boolean expressions (modeled as binary
expressions)

> requires a conversion back to the binary tree model when adding back
to the

> main plan.

> 

> On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

> 

> Swati,

> 

> I happen to be working on the logical expression simplification effort

> (https://issues.apache.org/jira/browse/PIG-1399), but not on the
filter

> split front. So I guess our interests will have some overlaps.

> 

> I think the filter logic split problem can be divided into 2 parts:

> 1) the filtering logic that can be applied to individual input
sources;

> and 2) the filtering logic that has to be applied when merged(or
joined)

> inputs are processed.

> 

> The benefits for 1) are any benefits if the underlying storage
supports

> predicate pushdown; plus the memory/CPU savings by PIG for not

> processing the unqualified rows.

> 

> For 2), the purpose is not paying higher evaluation costs than

> necessary.

> 

> For 1), no normal form is necessary. The original logical expression

> tree

> can be trimmed off any sub-expressions that are not constants nor just

> from a particular input source. The complexity is linear with the tree

> size; while the use of normal form could potentially lead to
exponential

> complexity. The difficulty with this approach is how to generate the

> filtering logic for 2); while CNF can be used to easily figure out the

> logic for 2). However, the exact logic in 2) might not be cheaper to

> evaluate than the original logical expression. An example is "Filter
J2

> by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2)
the

> filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND

> ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost

> will be 5 logical evaluations (3 OR plus 2 AND), which could be
reduced

> to 4, compared with 3 logical evaluations in the original form.

> 

> In summary, if only 1) is desired, the tree trimming is enough. If 2)
is

> desired too, then CNF could be used but its complexity should be

> controlled and the cost of the filtering logic evaluation in 2) should

> be computed and compared with the original expression evaluation cost.

> Further optimization is possible in this direction.

> 

> Another potential optimization to consider is to support logical

> expression tree of multiple children vs. the binary tree after taking

> into consideration of the commutative property of OR and AND
operations.

> The advantages are less tree traversal costs and easier to change the

> evaluation ordering within the same sub-tree in order to maximize the

> possibilities to short-cut the evaluations. Although this is general
for

> all logical expressions, this tends to be more suitable for normal
form

> handlings as normal forms group the sub-expressions by the operators

> that act on the sub-expressions.

> 

> Thanks,

> 

> Yan

> 

> 

> -----Original Message-----

> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]

> Sent: Monday, July 05, 2010 2:34 AM

> To: pig-dev@hadoop.apache.org

> Cc: Daniel Dai

> Subject: PIG Logical Optimization: Use CNF in SplitFilter

> 

> Hi,

> 

> I am interested in implementing logical optimization rules and to
target

> this I have studied currently implemented logical rules and the rule

> framework. In particular, I felt that rules dealing with LOfilter are

> not

> able to handle complicated boolean expressions. I would like to share

> suggestions to improve handling of boolean expressions in LOFilter to

> enable

> better optimization.

> 

> 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into

> two by

> "AND". However it will not be able to split LOFilter if the top level

> operator is "OR". For example:

> 

> *ex script:*

> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);

> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);

> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);

> J1 = JOIN B by b1, C by c1;

> J2 = JOIN J1 by $0, A by a1;

> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*

> explain D;

> 

> In the above example current rule is not able to any filter condition

> across

> any join as it contains columns from all branches (inputs). But if we

> convert this expression into "Conjunctive Normal Form" (CNF) then we

> would

> be able to push filter condition c1< 10 and c2 == 5 below both join

> conditions. Here is the CNF expression for highlighted line:

> 

> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

> 

> *Suggestions:* It would be a good idea to convert LOFilter's boolean

> expression into CNF, it would then be easy to push parts (conjuncts)
of

> the

> LOFilter boolean expression selectively.

> 

> I have started thinking about the design for implementing this

> conversion

> (arbitrary boolean expression to CNF) and would appreciate any
feedback

> or

> ideas.

> 

> Thanks!

> Swati

> 

> 

> 

> 

> 


PIG Logical Optimization: Use CNF in SplitFilter

Posted by Swati Jain <sw...@aggiemail.usu.edu>.
Yan,

What I meant in my last email was that scenario 2 optimizations would lead
to more opportunities for scenario 1 kind of optimizations.

Consider the conjunct list [C1;C2;C3] as the source of a JOIN.

(a)  Suppose none of these are computable on a join input, in this case we
retain the original expression and discard the CNF.

(b)  Suppose C1 is computable on join input J1 and C2 is computable on join
input J2 but C3 requires a combination of both join inputs. In this case, we
push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note that
C1 and C2 may be further pushed up (with additional iterations of the
optimizer). If they are now the source of single input operators, it is
similar to scenario 1.

Thanks,
Swati


On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

>  Hopefully by this week. I’m still in the debugging phase of the work.
> While you are welcome to reuse some of my algorithms, I doubt you can reuse
> the code as much as you want. It’s basically for my DNF use. You might need
> to factor out some general codes which you can find
>
> reusable.
>
>
>
> I fully understand the I/O benefits as I put in my first message. And it is
> classified as “Scenario 1”. There is no doubt that it should be considered
> as part of your work. However, for this, CNF is not necessary.
>
>
>
> For scenario 2, the benefits will be from less in-core logical expression
> evaluation costs and no I/O benefits as I can see. And use of CNF may or may
> not lead to cheaper evaluations as the example in my first message shows. In
> other words, after use of CNF, you should
>
> compare the eval cost with that in the original expression eval before
> deciding either the CNF or the original form should be evaluated.
>
>
>
> Please let me know if I miss any of your points.
>
>
>
> Thanks,
>
>
>
> Yan
>  ------------------------------
>
> *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> *Sent:* Monday, July 12, 2010 11:52 AM
>
> *To:* Yan Zhou
> *Cc:* pig-dev@hadoop.apache.org
> *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> I was wondering if you are not going to check in your patch soon then it
> would be great if you could share it with me. I believe I might be able to
> reuse some of your (utility) functionality directly or get some ideas.
>
> About your cost-benefit question:
> 1) I will control the complexity of CNF conversion by providing a
> configurable threshold value which will limit the OR-nesting.
> 2) One benefit of this conversion is that it will allow pushing parts of a
> filter (conjuncts) across the joins which is not happening in the current
> PushUpFilter optimization. Moreover, it may result in a cascading effect to
> push the conjuncts below other operators by other rules that may be fired as
> a result. The benefit from this is really data dependent, but in big-data
> workloads, any kind of predicate pushdown may eventually lead to big savings
> in amount of data read or amount of data transfered/shuffled across the
> network (I need to understand the LogicalPlan to PhysicalPlan conversion
> better to give concrete examples).
>
> Thanks!
> Swati
>
> On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
> Yes, I already implemented the “NOT push down” upfront, so you do not need
> to do that.
>
>
>
> The support of CNF will probably be the most difficulty part. But as I
> mentioned last time, you should compare the cost after the trimming CNF to
> get the post-split filtering logic. Given the complexity of manipulating CNF
> and undetermined benefits, I am not sure it should be in scope at this
> moment or not.
>
>
>
> To handle CNF, I think it’s a good idea to create a new plan and connect
> the nodes in the new plan to the base plan as you envisioned. In my changes,
> which uses DNF instead of CNF but processing is similar otherwise, I use a
> LogicalExpressionProxy, which contains a “source” member that is just the
> node in the original plan, to link the nodes in the new plan and old plan.
>  The original LogicalExpression is enhanced with a counter to trace the # of
> proxies of the original nodes since normal form creation will “spread” the
> nodes in the original tree across many normalized nodes. The benefit, aside
> from not setting the plan, is that the original expression is trimmed
> according to the processing results from DNF; while DNF is created
> separately and as a kinda utility so that complex features can be used. In
> my changes, I used multiple-child tree in DNF while not changing the
> original binary expression tree structure. Another benefit is that the
> original tree is kept as much as it is at the start, i.e., I do not attempt
> to optimize its overall structure beyond trimming based upon the
> simplification logics. (I also control the size of DNF to 100 nodes.) The
> down side of this is added complexity.
>
>
>
> But in your case, for scenario 2 which is the whole point to use CNF, you
> would need to change the original expression tree structurally beyond
> trimming for post-split filtering logic. The other benefit of using
> multiple-child expression is depending upon if you plan to support such
> expression to replace current binary tree
>
> in the final plan. Even though I think it’s a good idea to support that,
> but it is not in my scope now.
>
>
>
> I’ll add my algorithm details soon to my jira. Please take a look and
> comment as you see appropriate.
>
>
>
> Thanks,
>
>
>
> Yan
>
>
>
>
>  ------------------------------
>
> *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> *Sent:* Friday, July 09, 2010 11:00 PM
> *To:* Yan Zhou
> *Cc:* pig-dev@hadoop.apache.org
> *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> Hi Yan,
>
> I agree that the first scenario (filter logic applied to individual input
> sources) doesn't need conversion to CNF and that it will be a good idea to
> add CNF functionality for the second scenario. I was also planning to
> provide a configurable threshold value to control the complexity of CNF
> conversion.
>
> As part of the above, I wrote a utility to push the "NOT" operator in
> predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am
> considering making this utility to push NOT a separate rule in itself. Lmk
> if you have already implemented this.
>
> While implementing this utility I am facing some trouble in maintaining
> OperatorPlan consistent as I rewrite the expression. This is because each
> operator is referencing the main filter logical plan. Here is my current
> approach of implementation:
>
> 1. I am creating a new LogicalExpressionPlan for the converted boolean
> expression.
> 2. I am creating new logical expressions while pushing the NOT operation,
> converting AND into OR, OR into AND eliminating NOT NOT pairs.
> 3. However, I am having trouble updating the LogicalExpressionPlan if it
> reaches the base case ( i.e. root operator is not NOT,AND,OR).
>
> D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) );
>
> In the above, for example, I am not sure how to integrate base expression
> (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the
> plan for a given operator and its children. Also, there is currently no way
> to deepCopy an expression into a new OperatorPlan. It would be great if you
> could give me some suggestions on what approach to take for this.
>
> One approach I thought of is to visit the base expression and create and
> connect the base expression to the LogicalExpressionPlan as I visit it.
>
> Thoughts?
> Swati
>
> ps: About your other point regarding binary vs multi way trees, the way I
> am creating the normal form is a list of conjuncts, where each conjunct is a
> list of disjuncts. This is logically similar to a multi waytree. However,
> the current modeling of boolean expressions (modeled as binary expressions)
> requires a conversion back to the binary tree model when adding back to the
> main plan.
>
> On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
> Swati,
>
> I happen to be working on the logical expression simplification effort
> (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
> split front. So I guess our interests will have some overlaps.
>
> I think the filter logic split problem can be divided into 2 parts:
> 1) the filtering logic that can be applied to individual input sources;
> and 2) the filtering logic that has to be applied when merged(or joined)
> inputs are processed.
>
> The benefits for 1) are any benefits if the underlying storage supports
> predicate pushdown; plus the memory/CPU savings by PIG for not
> processing the unqualified rows.
>
> For 2), the purpose is not paying higher evaluation costs than
> necessary.
>
> For 1), no normal form is necessary. The original logical expression
> tree
> can be trimmed off any sub-expressions that are not constants nor just
> from a particular input source. The complexity is linear with the tree
> size; while the use of normal form could potentially lead to exponential
> complexity. The difficulty with this approach is how to generate the
> filtering logic for 2); while CNF can be used to easily figure out the
> logic for 2). However, the exact logic in 2) might not be cheaper to
> evaluate than the original logical expression. An example is "Filter J2
> by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
> filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
> ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
> will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
> to 4, compared with 3 logical evaluations in the original form.
>
> In summary, if only 1) is desired, the tree trimming is enough. If 2) is
> desired too, then CNF could be used but its complexity should be
> controlled and the cost of the filtering logic evaluation in 2) should
> be computed and compared with the original expression evaluation cost.
> Further optimization is possible in this direction.
>
> Another potential optimization to consider is to support logical
> expression tree of multiple children vs. the binary tree after taking
> into consideration of the commutative property of OR and AND operations.
> The advantages are less tree traversal costs and easier to change the
> evaluation ordering within the same sub-tree in order to maximize the
> possibilities to short-cut the evaluations. Although this is general for
> all logical expressions, this tends to be more suitable for normal form
> handlings as normal forms group the sub-expressions by the operators
> that act on the sub-expressions.
>
> Thanks,
>
> Yan
>
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 05, 2010 2:34 AM
> To: pig-dev@hadoop.apache.org
> Cc: Daniel Dai
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> Hi,
>
> I am interested in implementing logical optimization rules and to target
> this I have studied currently implemented logical rules and the rule
> framework. In particular, I felt that rules dealing with LOfilter are
> not
> able to handle complicated boolean expressions. I would like to share
> suggestions to improve handling of boolean expressions in LOFilter to
> enable
> better optimization.
>
> 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
> two by
> "AND". However it will not be able to split LOFilter if the top level
> operator is "OR". For example:
>
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
>
> In the above example current rule is not able to any filter condition
> across
> any join as it contains columns from all branches (inputs). But if we
> convert this expression into "Conjunctive Normal Form" (CNF) then we
> would
> be able to push filter condition c1< 10 and c2 == 5 below both join
> conditions. Here is the CNF expression for highlighted line:
>
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> *Suggestions:* It would be a good idea to convert LOFilter's boolean
> expression into CNF, it would then be easy to push parts (conjuncts) of
> the
> LOFilter boolean expression selectively.
>
> I have started thinking about the design for implementing this
> conversion
> (arbitrary boolean expression to CNF) and would appreciate any feedback
> or
> ideas.
>
> Thanks!
> Swati
>
>
>
>
>

RE: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Yan Zhou <ya...@yahoo-inc.com>.
Hopefully by this week. I'm still in the debugging phase of the work.
While you are welcome to reuse some of my algorithms, I doubt you can
reuse the code as much as you want. It's basically for my DNF use. You
might need to factor out some general codes which you can find

reusable.

 

I fully understand the I/O benefits as I put in my first message. And it
is classified as "Scenario 1". There is no doubt that it should be
considered as part of your work. However, for this, CNF is not
necessary.

 

For scenario 2, the benefits will be from less in-core logical
expression evaluation costs and no I/O benefits as I can see. And use of
CNF may or may not lead to cheaper evaluations as the example in my
first message shows. In other words, after use of CNF, you should

compare the eval cost with that in the original expression eval before
deciding either the CNF or the original form should be evaluated.

 

Please let me know if I miss any of your points.

 

Thanks,

 

Yan

________________________________

From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 11:52 AM
To: Yan Zhou
Cc: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

 

I was wondering if you are not going to check in your patch soon then it
would be great if you could share it with me. I believe I might be able
to reuse some of your (utility) functionality directly or get some
ideas. 

About your cost-benefit question:
1) I will control the complexity of CNF conversion by providing a
configurable threshold value which will limit the OR-nesting.
2) One benefit of this conversion is that it will allow pushing parts of
a filter (conjuncts) across the joins which is not happening in the
current PushUpFilter optimization. Moreover, it may result in a
cascading effect to push the conjuncts below other operators by other
rules that may be fired as a result. The benefit from this is really
data dependent, but in big-data workloads, any kind of predicate
pushdown may eventually lead to big savings in amount of data read or
amount of data transfered/shuffled across the network (I need to
understand the LogicalPlan to PhysicalPlan conversion better to give
concrete examples).

Thanks!
Swati

On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com> wrote:

Yes, I already implemented the "NOT push down" upfront, so you do not
need to do that.

 

The support of CNF will probably be the most difficulty part. But as I
mentioned last time, you should compare the cost after the trimming CNF
to get the post-split filtering logic. Given the complexity of
manipulating CNF and undetermined benefits, I am not sure it should be
in scope at this moment or not.

 

To handle CNF, I think it's a good idea to create a new plan and connect
the nodes in the new plan to the base plan as you envisioned. In my
changes, which uses DNF instead of CNF but processing is similar
otherwise, I use a LogicalExpressionProxy, which contains a "source"
member that is just the node in the original plan, to link the nodes in
the new plan and old plan.  The original LogicalExpression is enhanced
with a counter to trace the # of proxies of the original nodes since
normal form creation will "spread" the nodes in the original tree across
many normalized nodes. The benefit, aside from not setting the plan, is
that the original expression is trimmed according to the processing
results from DNF; while DNF is created separately and as a kinda utility
so that complex features can be used. In my changes, I used
multiple-child tree in DNF while not changing the original binary
expression tree structure. Another benefit is that the original tree is
kept as much as it is at the start, i.e., I do not attempt to optimize
its overall structure beyond trimming based upon the simplification
logics. (I also control the size of DNF to 100 nodes.) The down side of
this is added complexity.

 

But in your case, for scenario 2 which is the whole point to use CNF,
you would need to change the original expression tree structurally
beyond trimming for post-split filtering logic. The other benefit of
using multiple-child expression is depending upon if you plan to support
such expression to replace current binary tree

in the final plan. Even though I think it's a good idea to support that,
but it is not in my scope now.

 

I'll add my algorithm details soon to my jira. Please take a look and
comment as you see appropriate.

 

Thanks,

 

Yan

 

 

________________________________

From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Friday, July 09, 2010 11:00 PM
To: Yan Zhou
Cc: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

 

Hi Yan,

I agree that the first scenario (filter logic applied to individual
input sources) doesn't need conversion to CNF and that it will be a good
idea to add CNF functionality for the second scenario. I was also
planning to provide a configurable threshold value to control the
complexity of CNF conversion.

As part of the above, I wrote a utility to push the "NOT" operator in
predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399).
I am considering making this utility to push NOT a separate rule in
itself. Lmk if you have already implemented this.

While implementing this utility I am facing some trouble in maintaining
OperatorPlan consistent as I rewrite the expression. This is because
each operator is referencing the main filter logical plan. Here is my
current approach of implementation:

1. I am creating a new LogicalExpressionPlan for the converted boolean
expression.
2. I am creating new logical expressions while pushing the NOT
operation, converting AND into OR, OR into AND eliminating NOT NOT
pairs.
3. However, I am having trouble updating the LogicalExpressionPlan if it
reaches the base case ( i.e. root operator is not NOT,AND,OR).

D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) )
);

In the above, for example, I am not sure how to integrate base
expression (c2 == 5) into the new LogicalExpressionPlan. There is no
routine to set the plan for a given operator and its children. Also,
there is currently no way to deepCopy an expression into a new
OperatorPlan. It would be great if you could give me some suggestions on
what approach to take for this.

One approach I thought of is to visit the base expression and create and
connect the base expression to the LogicalExpressionPlan as I visit it.

Thoughts?
Swati

ps: About your other point regarding binary vs multi way trees, the way
I am creating the normal form is a list of conjuncts, where each
conjunct is a list of disjuncts. This is logically similar to a multi
waytree. However, the current modeling of boolean expressions (modeled
as binary expressions) requires a conversion back to the binary tree
model when adding back to the main plan.

On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

Swati,

I happen to be working on the logical expression simplification effort
(https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
split front. So I guess our interests will have some overlaps.

I think the filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;
and 2) the filtering logic that has to be applied when merged(or joined)
inputs are processed.

The benefits for 1) are any benefits if the underlying storage supports
predicate pushdown; plus the memory/CPU savings by PIG for not
processing the unqualified rows.

For 2), the purpose is not paying higher evaluation costs than
necessary.

For 1), no normal form is necessary. The original logical expression
tree
can be trimmed off any sub-expressions that are not constants nor just
from a particular input source. The complexity is linear with the tree
size; while the use of normal form could potentially lead to exponential
complexity. The difficulty with this approach is how to generate the
filtering logic for 2); while CNF can be used to easily figure out the
logic for 2). However, the exact logic in 2) might not be cheaper to
evaluate than the original logical expression. An example is "Filter J2
by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
to 4, compared with 3 logical evaluations in the original form.

In summary, if only 1) is desired, the tree trimming is enough. If 2) is
desired too, then CNF could be used but its complexity should be
controlled and the cost of the filtering logic evaluation in 2) should
be computed and compared with the original expression evaluation cost.
Further optimization is possible in this direction.

Another potential optimization to consider is to support logical
expression tree of multiple children vs. the binary tree after taking
into consideration of the commutative property of OR and AND operations.
The advantages are less tree traversal costs and easier to change the
evaluation ordering within the same sub-tree in order to maximize the
possibilities to short-cut the evaluations. Although this is general for
all logical expressions, this tends to be more suitable for normal form
handlings as normal forms group the sub-expressions by the operators
that act on the sub-expressions.

Thanks,

Yan


-----Original Message-----
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
Sent: Monday, July 05, 2010 2:34 AM
To: pig-dev@hadoop.apache.org
Cc: Daniel Dai
Subject: PIG Logical Optimization: Use CNF in SplitFilter

Hi,

I am interested in implementing logical optimization rules and to target
this I have studied currently implemented logical rules and the rule
framework. In particular, I felt that rules dealing with LOfilter are
not
able to handle complicated boolean expressions. I would like to share
suggestions to improve handling of boolean expressions in LOFilter to
enable
better optimization.

1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
two by
"AND". However it will not be able to split LOFilter if the top level
operator is "OR". For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;

In the above example current rule is not able to any filter condition
across
any join as it contains columns from all branches (inputs). But if we
convert this expression into "Conjunctive Normal Form" (CNF) then we
would
be able to push filter condition c1< 10 and c2 == 5 below both join
conditions. Here is the CNF expression for highlighted line:

( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF, it would then be easy to push parts (conjuncts) of
the
LOFilter boolean expression selectively.

I have started thinking about the design for implementing this
conversion
(arbitrary boolean expression to CNF) and would appreciate any feedback
or
ideas.

Thanks!
Swati

 

 


Re: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Swati Jain <sw...@aggiemail.usu.edu>.
I was wondering if you are not going to check in your patch soon then it
would be great if you could share it with me. I believe I might be able to
reuse some of your (utility) functionality directly or get some ideas.

About your cost-benefit question:
1) I will control the complexity of CNF conversion by providing a
configurable threshold value which will limit the OR-nesting.
2) One benefit of this conversion is that it will allow pushing parts of a
filter (conjuncts) across the joins which is not happening in the current
PushUpFilter optimization. Moreover, it may result in a cascading effect to
push the conjuncts below other operators by other rules that may be fired as
a result. The benefit from this is really data dependent, but in big-data
workloads, any kind of predicate pushdown may eventually lead to big savings
in amount of data read or amount of data transfered/shuffled across the
network (I need to understand the LogicalPlan to PhysicalPlan conversion
better to give concrete examples).

Thanks!
Swati

On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <ya...@yahoo-inc.com> wrote:

>  Yes, I already implemented the “NOT push down” upfront, so you do not
> need to do that.
>
>
>
> The support of CNF will probably be the most difficulty part. But as I
> mentioned last time, you should compare the cost after the trimming CNF to
> get the post-split filtering logic. Given the complexity of manipulating CNF
> and undetermined benefits, I am not sure it should be in scope at this
> moment or not.
>
>
>
> To handle CNF, I think it’s a good idea to create a new plan and connect
> the nodes in the new plan to the base plan as you envisioned. In my changes,
> which uses DNF instead of CNF but processing is similar otherwise, I use a
> LogicalExpressionProxy, which contains a “source” member that is just the
> node in the original plan, to link the nodes in the new plan and old plan.
>  The original LogicalExpression is enhanced with a counter to trace the # of
> proxies of the original nodes since normal form creation will “spread” the
> nodes in the original tree across many normalized nodes. The benefit, aside
> from not setting the plan, is that the original expression is trimmed
> according to the processing results from DNF; while DNF is created
> separately and as a kinda utility so that complex features can be used. In
> my changes, I used multiple-child tree in DNF while not changing the
> original binary expression tree structure. Another benefit is that the
> original tree is kept as much as it is at the start, i.e., I do not attempt
> to optimize its overall structure beyond trimming based upon the
> simplification logics. (I also control the size of DNF to 100 nodes.) The
> down side of this is added complexity.
>
>
>
> But in your case, for scenario 2 which is the whole point to use CNF, you
> would need to change the original expression tree structurally beyond
> trimming for post-split filtering logic. The other benefit of using
> multiple-child expression is depending upon if you plan to support such
> expression to replace current binary tree
>
> in the final plan. Even though I think it’s a good idea to support that,
> but it is not in my scope now.
>
>
>
> I’ll add my algorithm details soon to my jira. Please take a look and
> comment as you see appropriate.
>
>
>
> Thanks,
>
>
>
> Yan
>
>
>
>
>  ------------------------------
>
> *From:* Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> *Sent:* Friday, July 09, 2010 11:00 PM
> *To:* Yan Zhou
> *Cc:* pig-dev@hadoop.apache.org
> *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter
>
>
>
> Hi Yan,
>
> I agree that the first scenario (filter logic applied to individual input
> sources) doesn't need conversion to CNF and that it will be a good idea to
> add CNF functionality for the second scenario. I was also planning to
> provide a configurable threshold value to control the complexity of CNF
> conversion.
>
> As part of the above, I wrote a utility to push the "NOT" operator in
> predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am
> considering making this utility to push NOT a separate rule in itself. Lmk
> if you have already implemented this.
>
> While implementing this utility I am facing some trouble in maintaining
> OperatorPlan consistent as I rewrite the expression. This is because each
> operator is referencing the main filter logical plan. Here is my current
> approach of implementation:
>
> 1. I am creating a new LogicalExpressionPlan for the converted boolean
> expression.
> 2. I am creating new logical expressions while pushing the NOT operation,
> converting AND into OR, OR into AND eliminating NOT NOT pairs.
> 3. However, I am having trouble updating the LogicalExpressionPlan if it
> reaches the base case ( i.e. root operator is not NOT,AND,OR).
>
> D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) );
>
> In the above, for example, I am not sure how to integrate base expression
> (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the
> plan for a given operator and its children. Also, there is currently no way
> to deepCopy an expression into a new OperatorPlan. It would be great if you
> could give me some suggestions on what approach to take for this.
>
> One approach I thought of is to visit the base expression and create and
> connect the base expression to the LogicalExpressionPlan as I visit it.
>
> Thoughts?
> Swati
>
> ps: About your other point regarding binary vs multi way trees, the way I
> am creating the normal form is a list of conjuncts, where each conjunct is a
> list of disjuncts. This is logically similar to a multi waytree. However,
> the current modeling of boolean expressions (modeled as binary expressions)
> requires a conversion back to the binary tree model when adding back to the
> main plan.
>
> On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:
>
> Swati,
>
> I happen to be working on the logical expression simplification effort
> (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
> split front. So I guess our interests will have some overlaps.
>
> I think the filter logic split problem can be divided into 2 parts:
> 1) the filtering logic that can be applied to individual input sources;
> and 2) the filtering logic that has to be applied when merged(or joined)
> inputs are processed.
>
> The benefits for 1) are any benefits if the underlying storage supports
> predicate pushdown; plus the memory/CPU savings by PIG for not
> processing the unqualified rows.
>
> For 2), the purpose is not paying higher evaluation costs than
> necessary.
>
> For 1), no normal form is necessary. The original logical expression
> tree
> can be trimmed off any sub-expressions that are not constants nor just
> from a particular input source. The complexity is linear with the tree
> size; while the use of normal form could potentially lead to exponential
> complexity. The difficulty with this approach is how to generate the
> filtering logic for 2); while CNF can be used to easily figure out the
> logic for 2). However, the exact logic in 2) might not be cheaper to
> evaluate than the original logical expression. An example is "Filter J2
> by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
> filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
> ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
> will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
> to 4, compared with 3 logical evaluations in the original form.
>
> In summary, if only 1) is desired, the tree trimming is enough. If 2) is
> desired too, then CNF could be used but its complexity should be
> controlled and the cost of the filtering logic evaluation in 2) should
> be computed and compared with the original expression evaluation cost.
> Further optimization is possible in this direction.
>
> Another potential optimization to consider is to support logical
> expression tree of multiple children vs. the binary tree after taking
> into consideration of the commutative property of OR and AND operations.
> The advantages are less tree traversal costs and easier to change the
> evaluation ordering within the same sub-tree in order to maximize the
> possibilities to short-cut the evaluations. Although this is general for
> all logical expressions, this tends to be more suitable for normal form
> handlings as normal forms group the sub-expressions by the operators
> that act on the sub-expressions.
>
> Thanks,
>
> Yan
>
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 05, 2010 2:34 AM
> To: pig-dev@hadoop.apache.org
> Cc: Daniel Dai
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> Hi,
>
> I am interested in implementing logical optimization rules and to target
> this I have studied currently implemented logical rules and the rule
> framework. In particular, I felt that rules dealing with LOfilter are
> not
> able to handle complicated boolean expressions. I would like to share
> suggestions to improve handling of boolean expressions in LOFilter to
> enable
> better optimization.
>
> 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
> two by
> "AND". However it will not be able to split LOFilter if the top level
> operator is "OR". For example:
>
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
>
> In the above example current rule is not able to any filter condition
> across
> any join as it contains columns from all branches (inputs). But if we
> convert this expression into "Conjunctive Normal Form" (CNF) then we
> would
> be able to push filter condition c1< 10 and c2 == 5 below both join
> conditions. Here is the CNF expression for highlighted line:
>
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> *Suggestions:* It would be a good idea to convert LOFilter's boolean
> expression into CNF, it would then be easy to push parts (conjuncts) of
> the
> LOFilter boolean expression selectively.
>
> I have started thinking about the design for implementing this
> conversion
> (arbitrary boolean expression to CNF) and would appreciate any feedback
> or
> ideas.
>
> Thanks!
> Swati
>
>
>

RE: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Yan Zhou <ya...@yahoo-inc.com>.
Yes, I already implemented the "NOT push down" upfront, so you do not
need to do that.

 

The support of CNF will probably be the most difficulty part. But as I
mentioned last time, you should compare the cost after the trimming CNF
to get the post-split filtering logic. Given the complexity of
manipulating CNF and undetermined benefits, I am not sure it should be
in scope at this moment or not.

 

To handle CNF, I think it's a good idea to create a new plan and connect
the nodes in the new plan to the base plan as you envisioned. In my
changes, which uses DNF instead of CNF but processing is similar
otherwise, I use a LogicalExpressionProxy, which contains a "source"
member that is just the node in the original plan, to link the nodes in
the new plan and old plan.  The original LogicalExpression is enhanced
with a counter to trace the # of proxies of the original nodes since
normal form creation will "spread" the nodes in the original tree across
many normalized nodes. The benefit, aside from not setting the plan, is
that the original expression is trimmed according to the processing
results from DNF; while DNF is created separately and as a kinda utility
so that complex features can be used. In my changes, I used
multiple-child tree in DNF while not changing the original binary
expression tree structure. Another benefit is that the original tree is
kept as much as it is at the start, i.e., I do not attempt to optimize
its overall structure beyond trimming based upon the simplification
logics. (I also control the size of DNF to 100 nodes.) The down side of
this is added complexity.

 

But in your case, for scenario 2 which is the whole point to use CNF,
you would need to change the original expression tree structurally
beyond trimming for post-split filtering logic. The other benefit of
using multiple-child expression is depending upon if you plan to support
such expression to replace current binary tree

in the final plan. Even though I think it's a good idea to support that,
but it is not in my scope now.

 

I'll add my algorithm details soon to my jira. Please take a look and
comment as you see appropriate.

 

Thanks,

 

Yan

 

 

________________________________

From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Friday, July 09, 2010 11:00 PM
To: Yan Zhou
Cc: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

 

Hi Yan,

I agree that the first scenario (filter logic applied to individual
input sources) doesn't need conversion to CNF and that it will be a good
idea to add CNF functionality for the second scenario. I was also
planning to provide a configurable threshold value to control the
complexity of CNF conversion.

As part of the above, I wrote a utility to push the "NOT" operator in
predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399).
I am considering making this utility to push NOT a separate rule in
itself. Lmk if you have already implemented this.

While implementing this utility I am facing some trouble in maintaining
OperatorPlan consistent as I rewrite the expression. This is because
each operator is referencing the main filter logical plan. Here is my
current approach of implementation:

1. I am creating a new LogicalExpressionPlan for the converted boolean
expression.
2. I am creating new logical expressions while pushing the NOT
operation, converting AND into OR, OR into AND eliminating NOT NOT
pairs.
3. However, I am having trouble updating the LogicalExpressionPlan if it
reaches the base case ( i.e. root operator is not NOT,AND,OR).

D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) )
);

In the above, for example, I am not sure how to integrate base
expression (c2 == 5) into the new LogicalExpressionPlan. There is no
routine to set the plan for a given operator and its children. Also,
there is currently no way to deepCopy an expression into a new
OperatorPlan. It would be great if you could give me some suggestions on
what approach to take for this.

One approach I thought of is to visit the base expression and create and
connect the base expression to the LogicalExpressionPlan as I visit it.

Thoughts?
Swati

ps: About your other point regarding binary vs multi way trees, the way
I am creating the normal form is a list of conjuncts, where each
conjunct is a list of disjuncts. This is logically similar to a multi
waytree. However, the current modeling of boolean expressions (modeled
as binary expressions) requires a conversion back to the binary tree
model when adding back to the main plan.

On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

Swati,

I happen to be working on the logical expression simplification effort
(https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
split front. So I guess our interests will have some overlaps.

I think the filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;
and 2) the filtering logic that has to be applied when merged(or joined)
inputs are processed.

The benefits for 1) are any benefits if the underlying storage supports
predicate pushdown; plus the memory/CPU savings by PIG for not
processing the unqualified rows.

For 2), the purpose is not paying higher evaluation costs than
necessary.

For 1), no normal form is necessary. The original logical expression
tree
can be trimmed off any sub-expressions that are not constants nor just
from a particular input source. The complexity is linear with the tree
size; while the use of normal form could potentially lead to exponential
complexity. The difficulty with this approach is how to generate the
filtering logic for 2); while CNF can be used to easily figure out the
logic for 2). However, the exact logic in 2) might not be cheaper to
evaluate than the original logical expression. An example is "Filter J2
by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
to 4, compared with 3 logical evaluations in the original form.

In summary, if only 1) is desired, the tree trimming is enough. If 2) is
desired too, then CNF could be used but its complexity should be
controlled and the cost of the filtering logic evaluation in 2) should
be computed and compared with the original expression evaluation cost.
Further optimization is possible in this direction.

Another potential optimization to consider is to support logical
expression tree of multiple children vs. the binary tree after taking
into consideration of the commutative property of OR and AND operations.
The advantages are less tree traversal costs and easier to change the
evaluation ordering within the same sub-tree in order to maximize the
possibilities to short-cut the evaluations. Although this is general for
all logical expressions, this tends to be more suitable for normal form
handlings as normal forms group the sub-expressions by the operators
that act on the sub-expressions.

Thanks,

Yan


-----Original Message-----
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
Sent: Monday, July 05, 2010 2:34 AM
To: pig-dev@hadoop.apache.org
Cc: Daniel Dai
Subject: PIG Logical Optimization: Use CNF in SplitFilter

Hi,

I am interested in implementing logical optimization rules and to target
this I have studied currently implemented logical rules and the rule
framework. In particular, I felt that rules dealing with LOfilter are
not
able to handle complicated boolean expressions. I would like to share
suggestions to improve handling of boolean expressions in LOFilter to
enable
better optimization.

1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
two by
"AND". However it will not be able to split LOFilter if the top level
operator is "OR". For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;

In the above example current rule is not able to any filter condition
across
any join as it contains columns from all branches (inputs). But if we
convert this expression into "Conjunctive Normal Form" (CNF) then we
would
be able to push filter condition c1< 10 and c2 == 5 below both join
conditions. Here is the CNF expression for highlighted line:

( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF, it would then be easy to push parts (conjuncts) of
the
LOFilter boolean expression selectively.

I have started thinking about the design for implementing this
conversion
(arbitrary boolean expression to CNF) and would appreciate any feedback
or
ideas.

Thanks!
Swati

 


Re: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Swati Jain <sw...@aggiemail.usu.edu>.
Hi Yan,

I agree that the first scenario (filter logic applied to individual input
sources) doesn't need conversion to CNF and that it will be a good idea to
add CNF functionality for the second scenario. I was also planning to
provide a configurable threshold value to control the complexity of CNF
conversion.

As part of the above, I wrote a utility to push the "NOT" operator in
predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am
considering making this utility to push NOT a separate rule in itself. Lmk
if you have already implemented this.

While implementing this utility I am facing some trouble in maintaining
OperatorPlan consistent as I rewrite the expression. This is because each
operator is referencing the main filter logical plan. Here is my current
approach of implementation:

1. I am creating a new LogicalExpressionPlan for the converted boolean
expression.
2. I am creating new logical expressions while pushing the NOT operation,
converting AND into OR, OR into AND eliminating NOT NOT pairs.
3. However, I am having trouble updating the LogicalExpressionPlan if it
reaches the base case ( i.e. root operator is not NOT,AND,OR).

D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) );

In the above, for example, I am not sure how to integrate base expression
(c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the
plan for a given operator and its children. Also, there is currently no way
to deepCopy an expression into a new OperatorPlan. It would be great if you
could give me some suggestions on what approach to take for this.

One approach I thought of is to visit the base expression and create and
connect the base expression to the LogicalExpressionPlan as I visit it.

Thoughts?
Swati

ps: About your other point regarding binary vs multi way trees, the way I am
creating the normal form is a list of conjuncts, where each conjunct is a
list of disjuncts. This is logically similar to a multi waytree. However,
the current modeling of boolean expressions (modeled as binary expressions)
requires a conversion back to the binary tree model when adding back to the
main plan.

On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <ya...@yahoo-inc.com> wrote:

> Swati,
>
> I happen to be working on the logical expression simplification effort
> (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
> split front. So I guess our interests will have some overlaps.
>
> I think the filter logic split problem can be divided into 2 parts:
> 1) the filtering logic that can be applied to individual input sources;
> and 2) the filtering logic that has to be applied when merged(or joined)
> inputs are processed.
>
> The benefits for 1) are any benefits if the underlying storage supports
> predicate pushdown; plus the memory/CPU savings by PIG for not
> processing the unqualified rows.
>
> For 2), the purpose is not paying higher evaluation costs than
> necessary.
>
> For 1), no normal form is necessary. The original logical expression
> tree
> can be trimmed off any sub-expressions that are not constants nor just
> from a particular input source. The complexity is linear with the tree
> size; while the use of normal form could potentially lead to exponential
> complexity. The difficulty with this approach is how to generate the
> filtering logic for 2); while CNF can be used to easily figure out the
> logic for 2). However, the exact logic in 2) might not be cheaper to
> evaluate than the original logical expression. An example is "Filter J2
> by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
> filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
> ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
> will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
> to 4, compared with 3 logical evaluations in the original form.
>
> In summary, if only 1) is desired, the tree trimming is enough. If 2) is
> desired too, then CNF could be used but its complexity should be
> controlled and the cost of the filtering logic evaluation in 2) should
> be computed and compared with the original expression evaluation cost.
> Further optimization is possible in this direction.
>
> Another potential optimization to consider is to support logical
> expression tree of multiple children vs. the binary tree after taking
> into consideration of the commutative property of OR and AND operations.
> The advantages are less tree traversal costs and easier to change the
> evaluation ordering within the same sub-tree in order to maximize the
> possibilities to short-cut the evaluations. Although this is general for
> all logical expressions, this tends to be more suitable for normal form
> handlings as normal forms group the sub-expressions by the operators
> that act on the sub-expressions.
>
> Thanks,
>
> Yan
>
> -----Original Message-----
> From: Swati Jain [mailto:swati.j@aggiemail.usu.edu]
> Sent: Monday, July 05, 2010 2:34 AM
> To: pig-dev@hadoop.apache.org
> Cc: Daniel Dai
> Subject: PIG Logical Optimization: Use CNF in SplitFilter
>
> Hi,
>
> I am interested in implementing logical optimization rules and to target
> this I have studied currently implemented logical rules and the rule
> framework. In particular, I felt that rules dealing with LOfilter are
> not
> able to handle complicated boolean expressions. I would like to share
> suggestions to improve handling of boolean expressions in LOFilter to
> enable
> better optimization.
>
> 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
> two by
> "AND". However it will not be able to split LOFilter if the top level
> operator is "OR". For example:
>
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
>
> In the above example current rule is not able to any filter condition
> across
> any join as it contains columns from all branches (inputs). But if we
> convert this expression into "Conjunctive Normal Form" (CNF) then we
> would
> be able to push filter condition c1< 10 and c2 == 5 below both join
> conditions. Here is the CNF expression for highlighted line:
>
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
>
> *Suggestions:* It would be a good idea to convert LOFilter's boolean
> expression into CNF, it would then be easy to push parts (conjuncts) of
> the
> LOFilter boolean expression selectively.
>
> I have started thinking about the design for implementing this
> conversion
> (arbitrary boolean expression to CNF) and would appreciate any feedback
> or
> ideas.
>
> Thanks!
> Swati
>

RE: PIG Logical Optimization: Use CNF in SplitFilter

Posted by Yan Zhou <ya...@yahoo-inc.com>.
Swati,

I happen to be working on the logical expression simplification effort
(https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
split front. So I guess our interests will have some overlaps.

I think the filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;
and 2) the filtering logic that has to be applied when merged(or joined)
inputs are processed.

The benefits for 1) are any benefits if the underlying storage supports
predicate pushdown; plus the memory/CPU savings by PIG for not
processing the unqualified rows.

For 2), the purpose is not paying higher evaluation costs than
necessary.

For 1), no normal form is necessary. The original logical expression
tree
can be trimmed off any sub-expressions that are not constants nor just
from a particular input source. The complexity is linear with the tree
size; while the use of normal form could potentially lead to exponential
complexity. The difficulty with this approach is how to generate the
filtering logic for 2); while CNF can be used to easily figure out the
logic for 2). However, the exact logic in 2) might not be cheaper to
evaluate than the original logical expression. An example is "Filter J2
by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
to 4, compared with 3 logical evaluations in the original form.

In summary, if only 1) is desired, the tree trimming is enough. If 2) is
desired too, then CNF could be used but its complexity should be
controlled and the cost of the filtering logic evaluation in 2) should
be computed and compared with the original expression evaluation cost.
Further optimization is possible in this direction.

Another potential optimization to consider is to support logical
expression tree of multiple children vs. the binary tree after taking
into consideration of the commutative property of OR and AND operations.
The advantages are less tree traversal costs and easier to change the
evaluation ordering within the same sub-tree in order to maximize the
possibilities to short-cut the evaluations. Although this is general for
all logical expressions, this tends to be more suitable for normal form
handlings as normal forms group the sub-expressions by the operators
that act on the sub-expressions.

Thanks,

Yan

-----Original Message-----
From: Swati Jain [mailto:swati.j@aggiemail.usu.edu] 
Sent: Monday, July 05, 2010 2:34 AM
To: pig-dev@hadoop.apache.org
Cc: Daniel Dai
Subject: PIG Logical Optimization: Use CNF in SplitFilter

Hi,

I am interested in implementing logical optimization rules and to target
this I have studied currently implemented logical rules and the rule
framework. In particular, I felt that rules dealing with LOfilter are
not
able to handle complicated boolean expressions. I would like to share
suggestions to improve handling of boolean expressions in LOFilter to
enable
better optimization.

1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
two by
"AND". However it will not be able to split LOFilter if the top level
operator is "OR". For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;

In the above example current rule is not able to any filter condition
across
any join as it contains columns from all branches (inputs). But if we
convert this expression into "Conjunctive Normal Form" (CNF) then we
would
be able to push filter condition c1< 10 and c2 == 5 below both join
conditions. Here is the CNF expression for highlighted line:

( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )

*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF, it would then be easy to push parts (conjuncts) of
the
LOFilter boolean expression selectively.

I have started thinking about the design for implementing this
conversion
(arbitrary boolean expression to CNF) and would appreciate any feedback
or
ideas.

Thanks!
Swati