Posted to user@pig.apache.org by Rodrigo Ferreira <we...@gmail.com> on 2014/07/15 19:42:30 UTC

Thoughts

Hi everyone,

I'm developing a quite complex system using Pig and I'd like to confirm
some ideas if possible. They are not really questions. They are more like
thoughts.

1) I'm creating my "input" data using Pig itself. It means that the actual
input is a small file with a few rows (few means not Big Data). And for
each of these rows I create lots of data (my real input).

Well, in order to do that, considering that the creation of the real input
is CPU-bound, I decided to create a separate file for each row and LOAD
them separately, this way allowing Pig to fire a different Map process for
each of them and hopefully obtaining some parallelization. Is it OK?
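As an illustration only, the one-file-per-row idea might look like the sketch below; the paths and the GenerateData UDF are hypothetical stand-ins, not anything from this thread:

```pig
-- One tiny file per input row (hypothetical paths), loaded separately so
-- each LOAD can become its own map task.
row1 = LOAD 'input/row_0001.txt' AS (params:chararray);
row2 = LOAD 'input/row_0002.txt' AS (params:chararray);

-- GenerateData is a hypothetical CPU-heavy UDF that expands one row
-- into a bag of records (the "real" input).
data1 = FOREACH row1 GENERATE FLATTEN(GenerateData(params));
data2 = FOREACH row2 GENERATE FLATTEN(GenerateData(params));

big_input = UNION data1, data2;
```

Note that whether each file really gets its own map task also depends on Pig's split-combination behavior.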

2) I have a UDF that I call in a projection. This UDF communicates
with my S3 bucket, and the relation produced by this projection is
never used. Well, it seems that Pig's optimizer simply discards this UDF.
What I did was make this UDF return a boolean value, which I store on S3
(a lightweight file). This way it gets executed. Any thoughts on this?

Thank you. I'll come back later on with other ideas. I hope this reasoning
may help someone :)

Rodrigo Ferreira

Re: Thoughts

Posted by David McNelis <dm...@gmail.com>.
Regarding your UDF, you're creating a lot of overhead for storing something
outside of the Hadoop ecosystem, imho.

Why not create a dump of your booleans and then have a separate script push
them all to S3 at one time after your Pig script is complete? That way
you wouldn't be waiting on puts to S3 to complete your script.




On Wed, Jul 16, 2014 at 5:20 AM, Rodrigo Ferreira <we...@gmail.com> wrote:

> Hey Jacob, thanks for your reply.
>
> Well, congratulations on your blog. It seems very interesting. I'll take a
> look at your post later.
>
> Regarding the second question. I have something like this:
>
> B = FOREACH A GENERATE MyUDF(args);
>
> This UDF stores something in S3 and the return value is not important (in
> fact it's just a boolean value). So I don't use relation B after this
> point.
>
> Well, it seems that because of that, Pig's optimizer skips/discards this
> statement. So, in order to make it work I had to insert another statement
> just doing something silly like:
>
> STORE B INTO 's3n://mybucket/dump';
>
> And now it works. I think it's reasonable because everything is in AWS
> servers and the file is really small (just a boolean value). Is there
> another way to do that?
>
> Rodrigo.
>
>
> 2014-07-15 19:49 GMT+02:00 Jacob Perkins <ja...@gmail.com>:
>
> >
> > On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <we...@gmail.com> wrote:
> >
> > > 1) I'm creating my "input" data using Pig itself. It means that the
> > > actual input is a small file with a few rows (few means not Big Data).
> > > And for each of these rows I create lots of data (my real input).
> > >
> > > Well, in order to do that, considering that the creation of the real
> > > input is CPU-bound, I decided to create a separate file for each row
> > > and LOAD them separately, this way allowing Pig to fire a different Map
> > > process for each of them and hopefully obtaining some parallelization.
> > > Is it OK?
> > Seems totally reasonable to me, albeit laborious. Be sure to set
> > pig.splitCombination to false. Alternatively, you could try the approach
> > here and write your own simple inputFormat:
> > http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
> > Similar ideas in that the "input" is actually just a very small file and
> > numerous simulations are run in parallel using pig.
> >
> > >
> > > 2) I have a UDF that I call in a projection. This UDF communicates
> > > with my S3 bucket, and the relation produced by this projection is
> > > never used. Well, it seems that Pig's optimizer simply discards this
> > > UDF. What I did was make this UDF return a boolean value, which I store
> > > on S3 (a lightweight file). This way it gets executed. Any thoughts on
> > > this?
> > Can you explain further? It's not clear what you're trying to do/what
> > isn't working.
> >
> > --jacob
> > @thedatachef
> >
> >
>

Re: Thoughts

Posted by Paul Houle <on...@gmail.com>.
This is just the way Pig is.

Pig takes the network of relations you define and computes only what it
needs to produce the outputs you want to generate.

Pig is all about creating the desired outputs, so it reserves the right
to create a query plan that is entirely different from the network of
relations you defined. If you create a job with multiple outputs,
however, it is smart enough to share intermediate steps between the
outputs.

For instance, if you never use relation B, it won't compute relation B.
If relation Z depends on B, it will compute B on demand (or do
something equivalent) in the process of computing Z.

You certainly can materialize a relation, store it in HDFS or S3, and
load it later. This isn't hard to do, but then you have to write
different code for the case where you compute the relation and the case
where you LOAD it. It's one of the many "missing features" in Pig that
would make it easier to maintain bigger Pig systems.
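A minimal sketch of that materialize-then-load pattern (the path and schema here are hypothetical):

```pig
-- First script: force B to be computed by materializing it.
STORE B INTO 'hdfs:///tmp/b_snapshot';

-- A later script: reuse the stored result instead of recomputing B.
B = LOAD 'hdfs:///tmp/b_snapshot' AS (flag:boolean);
```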

On Wed, Jul 16, 2014 at 10:23 AM, Jacob Perkins
<ja...@gmail.com> wrote:
> Rodrigo,
>
> Write your own StoreFunc and use it instead of the UDF. PigStorage can already write to S3, so it should be straightforward to simply subclass it.
>
> --jacob
> @thedatachef
>
> On Jul 16, 2014, at 2:20 AM, Rodrigo Ferreira <we...@gmail.com> wrote:
>
>> Hey Jacob, thanks for your reply.
>>
>> Well, congratulations on your blog. It seems very interesting. I'll take a
>> look at your post later.
>>
>> Regarding the second question. I have something like this:
>>
>> B = FOREACH A GENERATE MyUDF(args);
>>
>> This UDF stores something in S3 and the return value is not important (in
>> fact it's just a boolean value). So I don't use relation B after this point.
>>
>> Well, it seems that because of that, Pig's optimizer skips/discards this
>> statement. So, in order to make it work I had to insert another statement
>> just doing something silly like:
>>
>> STORE B INTO 's3n://mybucket/dump';
>>
>> And now it works. I think it's reasonable because everything is in AWS
>> servers and the file is really small (just a boolean value). Is there
>> another way to do that?
>>
>> Rodrigo.
>>
>>
>> 2014-07-15 19:49 GMT+02:00 Jacob Perkins <ja...@gmail.com>:
>>
>>>
>>> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <we...@gmail.com> wrote:
>>>
>>>> 1) I'm creating my "input" data using Pig itself. It means that the
>>>> actual input is a small file with a few rows (few means not Big Data).
>>>> And for each of these rows I create lots of data (my real input).
>>>>
>>>> Well, in order to do that, considering that the creation of the real
>>>> input is CPU-bound, I decided to create a separate file for each row and
>>>> LOAD them separately, this way allowing Pig to fire a different Map
>>>> process for each of them and hopefully obtaining some parallelization.
>>>> Is it OK?
>>> Seems totally reasonable to me, albeit laborious. Be sure to set
>>> pig.splitCombination to false. Alternatively, you could try the approach
>>> here and write your own simple inputFormat:
>>> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
>>> Similar ideas in that the "input" is actually just a very small file and
>>> numerous simulations are run in parallel using pig.
>>>
>>>>
>>>> 2) I have a UDF that I call in a projection. This UDF communicates with
>>>> my S3 bucket, and the relation produced by this projection is never
>>>> used. Well, it seems that Pig's optimizer simply discards this UDF. What
>>>> I did was make this UDF return a boolean value, which I store on S3 (a
>>>> lightweight file). This way it gets executed. Any thoughts on this?
>>> Can you explain further? It's not clear what you're trying to do/what
>>> isn't working.
>>>
>>> --jacob
>>> @thedatachef
>>>
>>>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

Re: Thoughts

Posted by Jacob Perkins <ja...@gmail.com>.
Rodrigo,

Write your own StoreFunc and use it instead of the UDF. PigStorage can already write to S3, so it should be straightforward to simply subclass it.
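Usage might then look like the sketch below; the jar and class names are hypothetical, with MyS3Storage standing in for a PigStorage subclass that does the extra S3 work while writing:

```pig
-- Hypothetical jar and class; MyS3Storage would subclass PigStorage.
REGISTER 'my-udfs.jar';

-- Store A directly with the custom StoreFunc instead of routing it
-- through MyUDF; a STORE is a real sink, so the optimizer keeps it.
STORE A INTO 's3n://mybucket/out' USING com.example.MyS3Storage();
```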

--jacob
@thedatachef

On Jul 16, 2014, at 2:20 AM, Rodrigo Ferreira <we...@gmail.com> wrote:

> Hey Jacob, thanks for your reply.
> 
> Well, congratulations on your blog. It seems very interesting. I'll take a
> look at your post later.
> 
> Regarding the second question. I have something like this:
> 
> B = FOREACH A GENERATE MyUDF(args);
> 
> This UDF stores something in S3 and the return value is not important (in
> fact it's just a boolean value). So I don't use relation B after this point.
> 
> Well, it seems that because of that, Pig's optimizer skips/discards this
> statement. So, in order to make it work I had to insert another statement
> just doing something silly like:
> 
> STORE B INTO 's3n://mybucket/dump';
> 
> And now it works. I think it's reasonable because everything is in AWS
> servers and the file is really small (just a boolean value). Is there
> another way to do that?
> 
> Rodrigo.
> 
> 
> 2014-07-15 19:49 GMT+02:00 Jacob Perkins <ja...@gmail.com>:
> 
>> 
>> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <we...@gmail.com> wrote:
>> 
>>> 1) I'm creating my "input" data using Pig itself. It means that the
>>> actual input is a small file with a few rows (few means not Big Data).
>>> And for each of these rows I create lots of data (my real input).
>>>
>>> Well, in order to do that, considering that the creation of the real
>>> input is CPU-bound, I decided to create a separate file for each row and
>>> LOAD them separately, this way allowing Pig to fire a different Map
>>> process for each of them and hopefully obtaining some parallelization.
>>> Is it OK?
>> Seems totally reasonable to me, albeit laborious. Be sure to set
>> pig.splitCombination to false. Alternatively, you could try the approach
>> here and write your own simple inputFormat:
>> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
>> Similar ideas in that the "input" is actually just a very small file and
>> numerous simulations are run in parallel using pig.
>> 
>>>
>>> 2) I have a UDF that I call in a projection. This UDF communicates with
>>> my S3 bucket, and the relation produced by this projection is never used.
>>> Well, it seems that Pig's optimizer simply discards this UDF. What I did
>>> was make this UDF return a boolean value, which I store on S3 (a
>>> lightweight file). This way it gets executed. Any thoughts on this?
>> Can you explain further? It's not clear what you're trying to do/what
>> isn't working.
>> 
>> --jacob
>> @thedatachef
>> 
>> 


Re: Thoughts

Posted by Rodrigo Ferreira <we...@gmail.com>.
Hey Jacob, thanks for your reply.

Well, congratulations on your blog. It seems very interesting. I'll take a
look at your post later.

Regarding the second question. I have something like this:

B = FOREACH A GENERATE MyUDF(args);

This UDF stores something in S3 and the return value is not important (in
fact it's just a boolean value). So I don't use relation B after this point.

Well, it seems that because of that, Pig's optimizer skips/discards this
statement. So, in order to make it work I had to insert another statement
just doing something silly like:

STORE B INTO 's3n://mybucket/dump';

And now it works. I think it's reasonable because everything is in AWS
servers and the file is really small (just a boolean value). Is there
another way to do that?

Rodrigo.


2014-07-15 19:49 GMT+02:00 Jacob Perkins <ja...@gmail.com>:

>
> On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <we...@gmail.com> wrote:
>
> > 1) I'm creating my "input" data using Pig itself. It means that the
> > actual input is a small file with a few rows (few means not Big Data).
> > And for each of these rows I create lots of data (my real input).
> >
> > Well, in order to do that, considering that the creation of the real
> > input is CPU-bound, I decided to create a separate file for each row and
> > LOAD them separately, this way allowing Pig to fire a different Map
> > process for each of them and hopefully obtaining some parallelization.
> > Is it OK?
> Seems totally reasonable to me, albeit laborious. Be sure to set
> pig.splitCombination to false. Alternatively, you could try the approach
> here and write your own simple inputFormat:
> http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html
> Similar ideas in that the "input" is actually just a very small file and
> numerous simulations are run in parallel using pig.
>
> >
> > 2) I have a UDF that I call in a projection. This UDF communicates with
> > my S3 bucket, and the relation produced by this projection is never used.
> > Well, it seems that Pig's optimizer simply discards this UDF. What I did
> > was make this UDF return a boolean value, which I store on S3 (a
> > lightweight file). This way it gets executed. Any thoughts on this?
> Can you explain further? It's not clear what you're trying to do/what
> isn't working.
>
> --jacob
> @thedatachef
>
>

Re: Thoughts

Posted by Jacob Perkins <ja...@gmail.com>.
On Jul 15, 2014, at 10:42 AM, Rodrigo Ferreira <we...@gmail.com> wrote:

> 1) I'm creating my "input" data using Pig itself. It means that the actual
> input is a small file with a few rows (few means not Big Data). And for
> each of these rows I create lots of data (my real input).
> 
> Well, in order to do that, considering that the creation of the real input
> is CPU-bound, I decided to create a separate file for each row and LOAD
> them separately, this way allowing Pig to fire a different Map process for
> each of them and hopefully obtaining some parallelization. Is it OK?
Seems totally reasonable to me, albeit laborious. Be sure to set pig.splitCombination to false. Alternatively, you could try the approach here and write your own simple inputFormat: http://thedatachef.blogspot.com/2013/08/using-hadoop-to-explore-chaos.html Similar ideas in that the "input" is actually just a very small file and numerous simulations are run in parallel using pig.
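The pig.splitCombination setting can be applied in the script itself; a sketch (the LOAD path is hypothetical):

```pig
-- By default Pig may combine many small input files into one split,
-- which would collapse the per-file map tasks into a single mapper.
SET pig.splitCombination false;

rows = LOAD 'input/rows/' AS (params:chararray);
```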

> 
> 2) I have a UDF that I call in a projection. This UDF communicates
> with my S3 bucket, and the relation produced by this projection is
> never used. Well, it seems that Pig's optimizer simply discards this UDF.
> What I did was make this UDF return a boolean value, which I store on S3
> (a lightweight file). This way it gets executed. Any thoughts on this?
Can you explain further? It's not clear what you're trying to do/what isn't working.

--jacob
@thedatachef