Posted to user@pig.apache.org by Lai Will <la...@student.ethz.ch> on 2011/03/02 14:29:46 UTC

Shared resources

Hello,

I wrote an EvalFunc implementation that


1)      Parses a SQL Query

2)      Scans a folder for resource files and creates an index on these files

3)      According to certain properties of the SQL query, accesses the corresponding file and creates a Java object holding the relevant information from the file (for reuse).

4)      Does some computation with the SQL Query and the information found in the file

5)      Outputs a transformed SQL Query

Currently I'm doing local tests without Hadoop and the code works fine.

The problem I see is that right now I initialize my parser in the EvalFunc, so every time it gets instantiated a new instance of the parser is generated. Ideally only one instance per machine would be created.
Even worse, right now I create the index and parse the corresponding resource file once per call to exec() in the EvalFunc, and therefore do a lot of redundant computation.
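To make that concrete, here is a stripped-down sketch of roughly what the code looks like right now (the class name and the little map are made-up stand-ins for the real parser, index, and resource files):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch only; the map stands in for the real parser/index/resource-file work,
// which is far more expensive.
public class TransformQuery extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String query = (String) input.get(0);

        // All of this is rebuilt on every single call to exec():
        Map<String, String> index = new HashMap<String, String>();
        index.put("orders", "orders_2011");   // pretend this came from a resource file

        // Steps 3-5: look up the relevant entry and rewrite the query.
        return query.replace("orders", index.get("orders"));
    }
}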

This is just because I don't know where and how to put this shared computation.
Does anybody have a solution for that?

Best,
Will

Re: Shared resources

Posted by Alan Gates <ga...@yahoo-inc.com>.
Within a given task, a UDF is only instantiated once.  For maps and  
reduces this should mean one per map or reduce.  Since the combiner  
can be run multiple times there can be multiple instantiations per  
combine.  But the warning on number of instantiations is about how  
many times the UDF is constructed on the front end (by which I mean  
compile time).  Pig used to construct the UDF multiple times in the
front end.  Now we have it down to one construction on the front end  
and one per task in the backend.
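If you want to see this for yourself, a UDF that prints from its constructor and from exec() makes the counts easy to check in the task logs (trivial sketch, names made up):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch: log construction vs. invocation to observe "one instance per task".
public class CountingUdf extends EvalFunc<String> {
    private int calls = 0;

    public CountingUdf() {
        System.err.println("CountingUdf constructed");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        calls++;
        if (calls == 1) {
            System.err.println("first exec() call on this instance");
        }
        return (input == null || input.size() == 0) ? null : input.get(0).toString();
    }
}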

Alan.

On Mar 2, 2011, at 9:11 AM, Lai Will wrote:

> I understand that there is no inter-task communication at all.
> However, my question arises within one task. The documentation says
> that we should not make any assumptions about how many EvalFunc
> instances (of the same class) are instantiated.
> Therefore I assume that within the same task, there might be several
> instances of my EvalFunc, and if every one of them is doing the
> parsing of resource files into data structures, a lot of memory and
> computing power would be wasted. So it's not about inter-task
> communication but about inter-instance communication.
>
> Thank you for your help.
>
> Best,
> Will
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Wednesday, March 02, 2011 5:17 PM
> To: user@pig.apache.org
> Subject: Re: Shared resources
>
> There is no shared inter-task processing in Hadoop.  Each task runs  
> in a separate JVM and is locked off from all other tasks.  This is  
> partly because you do not know a priori which tasks will run  
> together in which nodes, and partly for security.  Data can be  
> shared by all tasks on a node via the distributed cache.  If all  
> your work could be done once on the front end and then serialized to  
> be later read by all tasks you could use this mechanism to share  
> it.  With the code in trunk UDFs can store data to the distributed  
> cache, though this feature is not in a release yet.
>
> Alan.
>
> On Mar 2, 2011, at 7:54 AM, Lai Will wrote:
>
>> So I still get the redundant work whenever the same clusternode/vm
>> creates multiple instances of my EvalFunc?
>> And is it usual to have several instances of the EvalFunc on the same
>> clusternode/vm?
>>
>> Will
>>
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>> Sent: Wednesday, March 02, 2011 4:49 PM
>> To: user@pig.apache.org
>> Subject: Re: Shared resources
>>
>> There is no method in the eval func that gets called on the backend
>> before any exec calls.  You can keep a flag that tracks whether you
>> have done the initialization so that you only do it the first time.
>>
>> Alan.
>>
>> On Mar 2, 2011, at 5:29 AM, Lai Will wrote:
>>
>>> Hello,
>>>
>>> I wrote an EvalFunc implementation that
>>>
>>>
>>> 1)      Parses a SQL Query
>>>
>>> 2)      Scans a folder for resource files and creates an index on
>>> these files
>>>
>>> 3)      According to certain properties of the SQL query, accesses
>>> the corresponding file and creates a Java object holding the relevant
>>> information from the file (for reuse).
>>>
>>> 4)      Does some computation with the SQL Query and the information
>>> found in the file
>>>
>>> 5)      Outputs a transformed SQL Query
>>>
>>> Currently I'm doing local tests without Hadoop and the code works
>>> fine.
>>>
>>> The problem I see is that right now I initialize my parser in the
>>> EvalFunc, so every time it gets instantiated a new instance of
>>> the parser is generated. Ideally only one instance per machine would
>>> be created.
>>> Even worse, right now I create the index and parse the corresponding
>>> resource file once per call to exec() in the EvalFunc, and therefore
>>> do a lot of redundant computation.
>>>
>>> This is just because I don't know where and how to put this shared
>>> computation.
>>> Does anybody have a solution for that?
>>>
>>> Best,
>>> Will
>>
>


Re: Shared resources

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
In practice, I have never seen an EvalFunc get instantiated twice on the
same task, except when you are actually using it in two different places
(foreach x generate Foo(field1), Foo(field2)), or when things like combiners
are involved. In any case, the number of invocations is >> the number of
instantiations, to the degree that the number of instantiations might as
well be one even when it's not actually 1.

Put whatever initialization logic you need into the constructor.  You can
pass (string) arguments to UDF constructors in Pig using the DEFINE keyword:

define myFoo com.mycompany.Foo('some', 'arguments');

foreach x generate myFoo(field1);
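On the Java side, the matching class just takes those strings in its constructor; roughly (this is only a sketch, not a real class):

package com.mycompany;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch of a UDF whose one-time setup happens in the constructor; the
// constructor arguments are the strings from the DEFINE statement above.
public class Foo extends EvalFunc<String> {
    private final String prefix;

    public Foo(String arg1, String arg2) {
        // Do expensive, one-time initialization here (build the parser,
        // scan the resource folder, etc.). Runs once per instantiation.
        this.prefix = arg1 + "-" + arg2;
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        return prefix + ":" + input.get(0);
    }
}

Keep in mind the constructor also runs once on the front end when the script is compiled, so whatever you do there has to be safe to run on the client as well.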

D

On Wed, Mar 2, 2011 at 9:11 AM, Lai Will <la...@student.ethz.ch> wrote:

> I understand that there is no inter-task communication at all.
> However, my question arises within one task. The documentation says that we
> should not make any assumptions about how many EvalFunc instances (of the same
> class) are instantiated.
> Therefore I assume that within the same task, there might be several
> instances of my EvalFunc, and if every one of them is doing the parsing of
> resource files into data structures, a lot of memory and computing power
> would be wasted. So it's not about inter-task communication but about
> inter-instance communication.
>
> Thank you for your help.
>
> Best,
> Will
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Wednesday, March 02, 2011 5:17 PM
> To: user@pig.apache.org
> Subject: Re: Shared resources
>
> There is no shared inter-task processing in Hadoop.  Each task runs in a
> separate JVM and is locked off from all other tasks.  This is partly because
> you do not know a priori which tasks will run together in which nodes, and
> partly for security.  Data can be shared by all tasks on a node via the
> distributed cache.  If all your work could be done once on the front end and
> then serialized to be later read by all tasks you could use this mechanism
> to share it.  With the code in trunk UDFs can store data to the distributed
> cache, though this feature is not in a release yet.
>
> Alan.
>
> On Mar 2, 2011, at 7:54 AM, Lai Will wrote:
>
> > So I still get the redundant work whenever the same clusternode/vm
> > creates multiple instances of my EvalFunc?
> > And is it usual to have several instances of the EvalFunc on the same
> > clusternode/vm?
> >
> > Will
> >
> > -----Original Message-----
> > From: Alan Gates [mailto:gates@yahoo-inc.com]
> > Sent: Wednesday, March 02, 2011 4:49 PM
> > To: user@pig.apache.org
> > Subject: Re: Shared resources
> >
> > There is no method in the eval func that gets called on the backend
> > before any exec calls.  You can keep a flag that tracks whether you
> > have done the initialization so that you only do it the first time.
> >
> > Alan.
> >
> > On Mar 2, 2011, at 5:29 AM, Lai Will wrote:
> >
> >> Hello,
> >>
> >> I wrote an EvalFunc implementation that
> >>
> >>
> >> 1)      Parses a SQL Query
> >>
> >> 2)      Scans a folder for resource files and creates an index on
> >> these files
> >>
> >> 3)      According to certain properties of the SQL query, accesses
> >> the corresponding file and creates a Java object holding the relevant
> >> information from the file (for reuse).
> >>
> >> 4)      Does some computation with the SQL Query and the information
> >> found in the file
> >>
> >> 5)      Outputs a transformed SQL Query
> >>
> >> Currently I'm doing local tests without Hadoop and the code works
> >> fine.
> >>
> >> The problem I see is that right now I initialize my parser in the
> >> EvalFunc, so every time it gets instantiated a new instance of
> >> the parser is generated. Ideally only one instance per machine would
> >> be created.
> >> Even worse, right now I create the index and parse the corresponding
> >> resource file once per call to exec() in the EvalFunc, and therefore
> >> do a lot of redundant computation.
> >>
> >> This is just because I don't know where and how to put this shared
> >> computation.
> >> Does anybody have a solution for that?
> >>
> >> Best,
> >> Will
> >
>
>

RE: Shared resources

Posted by Lai Will <la...@student.ethz.ch>.
I understand that there is no inter-task communication at all.
However, my question arises within one task. The documentation says that we should not make any assumptions about how many EvalFunc instances (of the same class) are instantiated.
Therefore I assume that within the same task, there might be several instances of my EvalFunc, and if every one of them is doing the parsing of resource files into data structures, a lot of memory and computing power would be wasted. So it's not about inter-task communication but about inter-instance communication.

Thank you for your help.

Best,
Will

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Wednesday, March 02, 2011 5:17 PM
To: user@pig.apache.org
Subject: Re: Shared resources

There is no shared inter-task processing in Hadoop.  Each task runs in a separate JVM and is locked off from all other tasks.  This is partly because you do not know a priori which tasks will run together in which nodes, and partly for security.  Data can be shared by all tasks on a node via the distributed cache.  If all your work could be done once on the front end and then serialized to be later read by all tasks you could use this mechanism to share it.  With the code in trunk UDFs can store data to the distributed cache, though this feature is not in a release yet.

Alan.

On Mar 2, 2011, at 7:54 AM, Lai Will wrote:

> So I still get the redundant work whenever the same clusternode/vm 
> creates multiple instances of my EvalFunc?
> And is it usual to have several instances of the EvalFunc on the same
> clusternode/vm?
>
> Will
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Wednesday, March 02, 2011 4:49 PM
> To: user@pig.apache.org
> Subject: Re: Shared resources
>
> There is no method in the eval func that gets called on the backend 
> before any exec calls.  You can keep a flag that tracks whether you 
> have done the initialization so that you only do it the first time.
>
> Alan.
>
> On Mar 2, 2011, at 5:29 AM, Lai Will wrote:
>
>> Hello,
>>
>> I wrote an EvalFunc implementation that
>>
>>
>> 1)      Parses a SQL Query
>>
>> 2)      Scans a folder for resource files and creates an index on
>> these files
>>
>> 3)      According to certain properties of the SQL query, accesses
>> the corresponding file and creates a Java object holding the relevant
>> information from the file (for reuse).
>>
>> 4)      Does some computation with the SQL Query and the information
>> found in the file
>>
>> 5)      Outputs a transformed SQL Query
>>
>> Currently I'm doing local tests without Hadoop and the code works 
>> fine.
>>
>> The problem I see is that right now I initialize my parser in the
>> EvalFunc, so every time it gets instantiated a new instance of
>> the parser is generated. Ideally only one instance per machine would
>> be created.
>> Even worse, right now I create the index and parse the corresponding
>> resource file once per call to exec() in the EvalFunc, and therefore
>> do a lot of redundant computation.
>>
>> This is just because I don't know where and how to put this shared
>> computation.
>> Does anybody have a solution for that?
>>
>> Best,
>> Will
>


Re: Shared resources

Posted by Alan Gates <ga...@yahoo-inc.com>.
There is no shared inter-task processing in Hadoop.  Each task runs in  
a separate JVM and is locked off from all other tasks.  This is partly  
because you do not know a priori which tasks will run together in  
which nodes, and partly for security.  Data can be shared by all tasks  
on a node via the distributed cache.  If all your work could be done  
once on the front end and then serialized to be later read by all  
tasks you could use this mechanism to share it.  With the code in  
trunk UDFs can store data to the distributed cache, though this  
feature is not in a release yet.
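Roughly, the idea is that the UDF declares which files it wants shipped to the distributed cache and then reads them as ordinary local files in exec(). A sketch of what that could look like (the hook name and the paths here are illustrative, not a guarantee of the final API):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hedged sketch: assumes a getCacheFiles()-style hook; file paths are made up.
public class CachedLookup extends EvalFunc<String> {
    private Map<String, String> lookup = null;

    @Override
    public List<String> getCacheFiles() {
        // HDFS file to ship, plus the local symlink name tasks will see.
        return Arrays.asList("/user/will/resources/index.txt#index.txt");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            // The file shows up in the task's working directory under the
            // symlink name, so it only has to be read once per task.
            lookup = new HashMap<String, String>();
            BufferedReader in = new BufferedReader(new FileReader("./index.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
            in.close();
        }
        if (input == null || input.size() == 0) {
            return null;
        }
        return lookup.get((String) input.get(0));
    }
}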

Alan.

On Mar 2, 2011, at 7:54 AM, Lai Will wrote:

> So I still get the redundant work whenever the same clusternode/vm  
> creates multiple instances of my EvalFunc?
> And is it usual to have several instances of the EvalFunc on the same
> clusternode/vm?
>
> Will
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Wednesday, March 02, 2011 4:49 PM
> To: user@pig.apache.org
> Subject: Re: Shared resources
>
> There is no method in the eval func that gets called on the backend  
> before any exec calls.  You can keep a flag that tracks whether you  
> have done the initialization so that you only do it the first time.
>
> Alan.
>
> On Mar 2, 2011, at 5:29 AM, Lai Will wrote:
>
>> Hello,
>>
>> I wrote an EvalFunc implementation that
>>
>>
>> 1)      Parses a SQL Query
>>
>> 2)      Scans a folder for resource files and creates an index on
>> these files
>>
>> 3)      According to certain properties of the SQL query, accesses
>> the corresponding file and creates a Java object holding the relevant
>> information from the file (for reuse).
>>
>> 4)      Does some computation with the SQL Query and the information
>> found in the file
>>
>> 5)      Outputs a transformed SQL Query
>>
>> Currently I'm doing local tests without Hadoop and the code works
>> fine.
>>
>> The problem I see is that right now I initialize my parser in the
>> EvalFunc, so every time it gets instantiated a new instance of
>> the parser is generated. Ideally only one instance per machine would
>> be created.
>> Even worse, right now I create the index and parse the corresponding
>> resource file once per call to exec() in the EvalFunc, and therefore
>> do a lot of redundant computation.
>>
>> This is just because I don't know where and how to put this shared
>> computation.
>> Does anybody have a solution for that?
>>
>> Best,
>> Will
>


RE: Shared resources

Posted by Lai Will <la...@student.ethz.ch>.
So I still get the redundant work whenever the same clusternode/vm creates multiple instances of my EvalFunc?
And is it usual to have several instances of the EvalFunc on the same clusternode/vm?

Will

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Wednesday, March 02, 2011 4:49 PM
To: user@pig.apache.org
Subject: Re: Shared resources

There is no method in the eval func that gets called on the backend before any exec calls.  You can keep a flag that tracks whether you have done the initialization so that you only do it the first time.

Alan.

On Mar 2, 2011, at 5:29 AM, Lai Will wrote:

> Hello,
>
> I wrote an EvalFunc implementation that
>
>
> 1)      Parses a SQL Query
>
> 2)      Scans a folder for resource files and creates an index on  
> these files
>
> 3)      According to certain properties of the SQL query, accesses
> the corresponding file and creates a Java object holding the relevant
> information from the file (for reuse).
>
> 4)      Does some computation with the SQL Query and the information  
> found in the file
>
> 5)      Outputs a transformed SQL Query
>
> Currently I'm doing local tests without Hadoop and the code works 
> fine.
>
> The problem I see is that right now I initialize my parser in the
> EvalFunc, so every time it gets instantiated a new instance of
> the parser is generated. Ideally only one instance per machine would
> be created.
> Even worse, right now I create the index and parse the corresponding
> resource file once per call to exec() in the EvalFunc, and therefore
> do a lot of redundant computation.
>
> This is just because I don't know where and how to put this shared
> computation.
> Does anybody have a solution for that?
>
> Best,
> Will


Re: Shared resources

Posted by Alan Gates <ga...@yahoo-inc.com>.
There is no method in the eval func that gets called on the backend  
before any exec calls.  You can keep a flag that tracks whether you  
have done the initialization so that you only do it the first time.
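For example, roughly (the little map is just a stand-in for whatever expensive setup you need):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch of the "initialize on first exec()" pattern; the map is a stand-in
// for the expensive work (building the parser, indexing resource files, ...).
public class LazyInitUdf extends EvalFunc<String> {
    private boolean initialized = false;
    private Map<String, String> index;

    @Override
    public String exec(Tuple input) throws IOException {
        if (!initialized) {
            // Done exactly once per task, on the first call to exec().
            index = new HashMap<String, String>();
            index.put("orders", "orders_2011");
            initialized = true;
        }
        if (input == null || input.size() == 0) {
            return null;
        }
        String query = (String) input.get(0);
        String replacement = index.get("orders");
        return replacement == null ? query : query.replace("orders", replacement);
    }
}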

Alan.

On Mar 2, 2011, at 5:29 AM, Lai Will wrote:

> Hello,
>
> I wrote an EvalFunc implementation that
>
>
> 1)      Parses a SQL Query
>
> 2)      Scans a folder for resource files and creates an index on  
> these files
>
> 3)      According to certain properties of the SQL query, accesses
> the corresponding file and creates a Java object holding the relevant
> information from the file (for reuse).
>
> 4)      Does some computation with the SQL Query and the information  
> found in the file
>
> 5)      Outputs a transformed SQL Query
>
> Currently I'm doing local tests without Hadoop and the code works  
> fine.
>
> The problem I see is that right now I initialize my parser in the
> EvalFunc, so every time it gets instantiated a new instance of
> the parser is generated. Ideally only one instance per machine would
> be created.
> Even worse, right now I create the index and parse the corresponding
> resource file once per call to exec() in the EvalFunc, and therefore
> do a lot of redundant computation.
>
> This is just because I don't know where and how to put this shared
> computation.
> Does anybody have a solution for that?
>
> Best,
> Will