Posted to dev@pig.apache.org by Gianmarco De Francisci Morales <gd...@apache.org> on 2012/08/02 15:52:50 UTC

illustrate

Hi,

The GSoC project that Allan is working on and that I am mentoring is at the
point where we are starting to refine things.
One question Allan asked was about the illustrate command:
Are we still supporting it?
If so, is there anybody with experience that could give some suggestions on
how to do it?

Thanks,
--
Gianmarco

Re: illustrate

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi Dmitriy,
I think there are at least a couple things that would be more difficult to
do with a UDF implementation, namely:
1) AFAIK, you don't have access to the MR task id within the UDF.
2) Using the counters between the two steps of the operations in order to
communicate the cumulative sums.


Cheers,
--
Gianmarco



On Fri, Aug 10, 2012 at 9:13 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I must be missing some tricky detail... Which of these operations could
> not be done by clever udfs?

Re: illustrate

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I must be missing some tricky detail... Which of these operations could not be done by clever udfs?

On Aug 9, 2012, at 9:01 AM, Gianmarco De Francisci Morales <gd...@apache.org> wrote:

> Hi Allan,
> I think I found an answer to your problem:

Re: illustrate

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi Allan,
I think I found an answer to your problem:

1) Modify PhysicalPlanResetter by adding:

    @Override
    public void visitCounter(POCounter counter) throws VisitorException {
        counter.reset();
    }

2) Modify POCounter by adding:

    @Override
    public void reset() {
        localCount = 0L;
        taskID = "-1";
        incrementer = 1;
    }



I get this result on this file + script:

------------------------------------------
| a     | id:int    | value:chararray    |
------------------------------------------
|       | 6         | g                  |
------------------------------------------
-----------------------------------------------------------
| b     | rank_a:long    | id:int    | value:chararray    |
-----------------------------------------------------------
|       | 1              | 6         | g                  |
-----------------------------------------------------------



grunt> cat file.txt
1 a
2 b
3 c
3 d
4 e
6 f
6 g
8 h

grunt> a = load 'file.txt' as (id:int, value:chararray);
grunt> b = rank a;
grunt> illustrate b
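
A small stand-alone sketch of why resetting the operator state matters. It rests on an assumption the thread does not spell out: that illustrate runs the same physical plan more than once, so a counter kept in an operator field carries over between runs. The Counter class and runOnce helper are hypothetical and are not Pig's POCounter.

```java
/** Hypothetical model of a stateful operator that is executed more than once. */
final class ResetSketch {

    /** Stand-in for a counting operator; NOT Pig's POCounter. */
    static final class Counter {
        private long localCount = 0L;

        /** Returns the next counter value for a tuple. */
        long next() {
            return ++localCount;
        }

        /** Clears the state so that a new run starts from scratch. */
        void reset() {
            localCount = 0L;
        }
    }

    /** Pushes the given number of tuples through the counter and returns the last value. */
    static long runOnce(Counter counter, int tuples) {
        long last = 0;
        for (int i = 0; i < tuples; i++) {
            last = counter.next();
        }
        return last;
    }

    public static void main(String[] args) {
        Counter counter = new Counter();
        runOnce(counter, 3);
        // Same instance, no reset: the second run continues from 3 and ends at 6.
        System.out.println("second run without reset ends at " + runOnce(counter, 3));
        counter.reset();
        // After reset the run ends at 3 again.
        System.out.println("run after reset ends at " + runOnce(counter, 3));
    }
}
```

Under that assumption, counter values such as 38, 39, 40 for the first three tuples would be what is left over from earlier runs of the same operator instance.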




Hope it helps.


Cheers,

--
Gianmarco



On Tue, Aug 7, 2012 at 4:34 PM, Allan <aa...@gmail.com> wrote:

> Hi to everybody!
>
> I'm working on the implementation of rank operator, which successfully
> passed all the e2e tests on a cluster.

Re: illustrate

Posted by Allan <aa...@gmail.com>.

Hi everybody!

I'm working on the implementation of the rank operator, which has
successfully passed all the e2e tests on a cluster.
The rank operator is composed of two physical operators, POCounter and
PORank, and it provides two functionalities:

1) The first functionality is similar to ROW NUMBER in SQL: it assigns a
sequential number to each tuple.
It is implemented as two map-only jobs (one for each physical operator).

- POCounter adds to each tuple the identifier of the task that is
processing it and a local counter. In addition, POCounter registers the
total number of tuples processed by each task through global counters.
After POCounter finishes, a cumulative sum is calculated for each task,
i.e. the total number of tuples processed by all previous tasks. For
task0 the cumulative sum is 0 (there are no tuples before it), for task1
it is the number of tuples processed by task0 (the only task before it),
and so on.

- Finally, PORank looks up the cumulative sum that corresponds to the task
id of each tuple and adds it to the tuple's local counter.

An input example for POCounter could be:

(1,n,5)
(8,a,0)
(0,b,9)

The result of POCounter, which is the input to PORank:

(0,1,1,n,5)
(0,2,8,a,0)
(0,3,0,b,9)

And the result after PORank processing:

(1,1,n,5)
(2,8,a,0)
(3,0,b,9)
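
The ROW NUMBER steps above can be sketched in plain Java as follows. This is only a stand-alone model under stated assumptions, not Pig code: the class RowNumberSketch, the Tagged type and the totalsPerTask array (which stands in for the global Hadoop counters) are hypothetical names, and tuples are represented as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone model of the two-phase row numbering; not Pig code. */
final class RowNumberSketch {

    /** Phase 1 output: a value tagged with its task id and task-local counter. */
    static final class Tagged {
        final int taskId;
        final long localCount;
        final String value;

        Tagged(int taskId, long localCount, String value) {
            this.taskId = taskId;
            this.localCount = localCount;
            this.value = value;
        }
    }

    /** Phase 1 (POCounter-like): tag every value with (taskId, localCount). */
    static List<Tagged> count(List<List<String>> splits, long[] totalsPerTask) {
        List<Tagged> out = new ArrayList<>();
        for (int task = 0; task < splits.size(); task++) {
            long local = 0;
            for (String value : splits.get(task)) {
                out.add(new Tagged(task, ++local, value));
            }
            totalsPerTask[task] = local; // stands in for a global counter
        }
        return out;
    }

    /** Between the phases: cumulative[i] = tuples processed by tasks 0..i-1. */
    static long[] cumulativeSums(long[] totalsPerTask) {
        long[] cumulative = new long[totalsPerTask.length];
        for (int i = 1; i < totalsPerTask.length; i++) {
            cumulative[i] = cumulative[i - 1] + totalsPerTask[i - 1];
        }
        return cumulative;
    }

    /** Phase 2 (PORank-like): rank = cumulative sum of the task + local counter. */
    static long rank(Tagged t, long[] cumulative) {
        return cumulative[t.taskId] + t.localCount;
    }

    public static void main(String[] args) {
        // Two tasks: task0 sees two tuples, task1 sees one.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("1,n,5", "8,a,0"),
                Arrays.asList("0,b,9"));
        long[] totals = new long[splits.size()];
        List<Tagged> tagged = count(splits, totals);
        long[] cumulative = cumulativeSums(totals);
        for (Tagged t : tagged) {
            System.out.println("(" + rank(t, cumulative) + "," + t.value + ")");
        }
    }
}
```

With the two splits in main, the tuple handled by task1 gets rank 3 because task0 processed two tuples before it.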


2) The second functionality is RANK BY, which ranks on a set of ordered
columns.
It requires a different approach:
First, the dataset is grouped by the desired columns. Then, the result is
sorted by the specified columns. Finally, the sorted result is processed
by POCounter and PORank.
As in the previous case, POCounter adds the task identifier and the local
counter to each tuple. Here, however, the local counter is not incremented
sequentially. Instead, it is incremented by the number of tuples in the
bag (produced by the previous "group by").
Likewise, the global counter is incremented by the size of the bag in
each tuple.

Finally, PORank works exactly as in the previous case. After that, the
rank column is spread to each element of the bag by a foreach operator.

An input example for POCounter (after sorting and grouping).
In this case, I would like to rank by the first column:

(0,{(0,b,9)})
(1,{(1,n,5)})
(8,{(8,a,0)})

After being processed by POCounter, which is also the input to PORank:

(0,1,0,{(0,b,9)})
(0,2,1,{(1,n,5)})
(0,3,8,{(8,a,0)})

Then, the result after PORank:

(1,0,{(0,b,9)})
(2,1,{(1,n,5)})
(3,8,{(8,a,0)})

Finally, the rank value is spread to each element of the bag by a foreach
operator, resulting in:

(1,0,b,9)
(2,1,n,5)
(3,8,a,0)
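
A minimal Java sketch of the RANK BY flow for a single task. It assumes that a group receives the counter value before its bag size is added, so tied rows share a rank and the next group skips ahead; the description above does not confirm this detail. RankBySketch and Group are hypothetical names, and rows are represented as strings. With several tasks, the per-task results would be combined through cumulative sums as in the ROW NUMBER case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical single-task model of RANK BY; not Pig code. */
final class RankBySketch {

    /** A group produced by "group by" and sort: the key and the bag of original rows. */
    static final class Group {
        final int key;
        final List<String> bag;

        Group(int key, List<String> bag) {
            this.key = key;
            this.bag = bag;
        }
    }

    /**
     * Ranks groups that are already sorted by key and spreads the rank to each row.
     * ASSUMPTION: a group gets the counter value before its bag size is added.
     */
    static List<String> rankBy(List<Group> sortedGroups) {
        List<String> out = new ArrayList<>();
        long counter = 1;
        for (Group g : sortedGroups) {
            long rank = counter;
            counter += g.bag.size();   // advance by the bag size, not by 1
            for (String row : g.bag) { // the final foreach: flatten the bag
                out.add(rank + "," + row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Group> groups = Arrays.asList(
                new Group(0, Arrays.asList("0,b,9")),
                new Group(1, Arrays.asList("1,n,5")),
                new Group(8, Arrays.asList("8,a,0")));
        for (String line : rankBy(groups)) {
            System.out.println("(" + line + ")");
        }
    }
}
```

Since every bag in this example holds a single row, the ranks come out as 1, 2 and 3, matching the tuples above.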

After testing some options, I found a way to illustrate the rank operator,
but I have some problems:

1.- I guess that, due to the illustrator algorithm, the tuples produced by
POCounter have counter values two or three times higher than expected,
for example:
(0,38,1,n,5)
(0,39,8,a,0)
(0,40,0,b,9)

2.- So far, I get only one example tuple from illustrate. How could I get
at least three or four tuples in the result?

Thanks in advance for your replies,

-- 

Allan Avendaño S.
Computer Engineer
Ex-SWY22 Participant
Rome - Italy
Gmail: aavendan@gmail.com
--

Re: illustrate

Posted by Allan <aa...@gmail.com>.

Hi everybody!

Thanks Thejas for your reply.

The current situation is the following:

The rank command is composed of two connected physical operators:
POCounter and PORank.

This operator has two implementations:
* The first one is composed of two map-only jobs (one for each physical
operator of the rank command). It was quite easy to support the illustrate
command for it.

* The second implementation is a chain of physical operators that ends
with POCounter, PORank and a POForEach. I have some problems illustrating
it.
This is what I get when I try to illustrate it:

java.lang.NullPointerException
    at org.apache.pig.pen.util.LineageTracer.link(LineageTracer.java:70)
    at org.apache.pig.pen.util.LineageTracer.union(LineageTracer.java:56)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.illustratorMarkup(POForEach.java:743)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.createTuple(POForEach.java:488)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:436)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:294)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:275)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:270)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:1)


While looking for a solution, I came to think that one possible cause is
the LineageTracer used in POForEach, which should probably be used in
PORank too (but I don't know how to use it).
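
The trace shows the exception inside link, called from union. The method names suggest a union-find structure, so here is a hypothetical, simplified tracer (not Pig's LineageTracer; tuples are plain strings) that shows how such a structure can throw a NullPointerException when union is called with a tuple that an upstream operator never inserted. Whether this is what actually happens between PORank and POForEach is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical union-find lineage tracer over string "tuples"; NOT Pig's class. */
final class LineageSketch {
    private final Map<String, String> parents = new HashMap<>();
    private final Map<String, Integer> ranks = new HashMap<>();

    /** Registers a tuple as its own set; must happen before the tuple is used in union(). */
    void insert(String t) {
        if (!parents.containsKey(t)) {
            parents.put(t, t);
            ranks.put(t, 0);
        }
    }

    /** Returns the representative of the tuple's set, or null for an unknown tuple. */
    String find(String t) {
        String parent = parents.get(t);
        if (parent == null || parent.equals(t)) {
            return parent;
        }
        String root = find(parent);
        parents.put(t, root); // path compression
        return root;
    }

    /** Merges the sets of the two tuples, i.e. records that they share lineage. */
    void union(String a, String b) {
        link(find(a), find(b));
    }

    private void link(String a, String b) {
        int rankA = ranks.get(a); // unboxing null throws NullPointerException
        int rankB = ranks.get(b);
        if (a.equals(b)) {
            return; // already in the same set
        }
        if (rankA > rankB) {
            parents.put(b, a);
        } else {
            parents.put(a, b);
            if (rankA == rankB) {
                ranks.put(b, rankB + 1);
            }
        }
    }

    public static void main(String[] args) {
        LineageSketch tracer = new LineageSketch();
        tracer.insert("in");
        tracer.insert("out");
        tracer.union("in", "out");
        System.out.println(tracer.find("in").equals(tracer.find("out")));
        try {
            tracer.union("in", "neverInserted");
        } catch (NullPointerException e) {
            System.out.println("NPE for a tuple that was never inserted");
        }
    }
}
```

If the real tracer behaves like this sketch, one thing to check would be that every tuple PORank emits is registered with the tracer before POForEach tries to link it to its outputs.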

Thanks in advance for your reply,

-- 

Allan Avendaño S.
Computer Engineer
Ex-SWY22 Participant
Rome - Italy
Gmail: aavendan@gmail.com
--

Re: illustrate

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi,
Thanks for your answer Thejas.
Here is the question:

"I made some progress today with the illustrate command. ROW NUMBER
shows one tuple from the load and the same tuple after rank. Do you think
that is enough?

But in the case of RANK BY, I think I have a problem with the
LineageTracer. While debugging, I saw that at some point in the last
foreach (after PORank) it threw an exception related to array bounds.
Do you know what exactly it is used for?
I was thinking of producing a DataBag in PORank; maybe that would work."

I am not familiar with the illustrator code, so I am unable to answer his
questions right now.
Is there any up-to-date documentation on it? Or anybody already familiar
with it willing to give a hand?

Cheers,
--
Gianmarco

On Fri, Aug 3, 2012 at 11:30 PM, Thejas Nair <th...@hortonworks.com> wrote:

> Yes, illustrate is still supported. We made a lot of improvements in 0.9
> in getting it working under more conditions.
>
> Can you forward Allan's question to the list?
>
> Thanks,
> Thejas

Re: illustrate

Posted by Thejas Nair <th...@hortonworks.com>.

Yes, illustrate is still supported. We made a lot of improvements in 0.9
in getting it working under more conditions.

Can you forward Allan's question to the list?

Thanks,
Thejas



On 8/2/12 6:52 AM, Gianmarco De Francisci Morales wrote:
> Hi,
>
> The GSoC project that Allan is working on and that I am mentoring is at the
> point where we are starting to refine things.
> One question Allan asked was about the illustrate command:
> Are we still supporting it?
> If so, is there anybody with experience that could give some suggestions on
> how to do it?
>
> Thanks,
> --
> Gianmarco
>