Posted to user@pig.apache.org by jagaran das <ja...@yahoo.co.in> on 2011/06/10 20:57:37 UTC

PIG Explain Plan

Hi guys, can anyone please tell me how to read an EXPLAIN plan in Pig? When I
run EXPLAIN on any of my Pig queries it gives me a really nice flow diagram,
but it uses some internal Pig operators, so I didn't really understand what is
going on and what it means. Please let me know if there is any documentation
for this.

- Jagaran 

Re: Running multiple Pig jobs simultaneously on same data

Posted by jagaran das <ja...@yahoo.co.in>.
Hi,

Can't we append in hadoop-0.20.203.0?

Regards,
Jagaran

Re: Running multiple Pig jobs simultaneously on same data

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yong,

You can't. Hence, immutable. It's not a database. It's a write-once file system.

Approaches to solve updates include:
1) rewrite everything
2) write a separate set of "deltas" into other files and join them in
at read time
3) do 2, and occasionally run a "compaction" which does a complete
rewrite based on existing deltas
4) write to something like HBase that handles all of this under the covers
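
As an illustration of approach 2, here is a rough Pig Latin sketch (the paths
and schema are invented, and it assumes each record carries a timestamp so the
newest version of a key wins at read time):

```pig
-- base data plus whatever delta files have been written since
base   = LOAD '/data/users/base'     AS (id:long, name:chararray, ts:long);
deltas = LOAD '/data/users/deltas/*' AS (id:long, name:chararray, ts:long);
merged = UNION base, deltas;

-- per key, keep only the most recent record
by_id  = GROUP merged BY id;
latest = FOREACH by_id {
    sorted = ORDER merged BY ts DESC;
    newest = LIMIT sorted 1;
    GENERATE FLATTEN(newest);
};
```

STOREing `latest` back as a fresh base file from time to time is essentially
the compaction of approach 3.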

D

Re: Running multiple Pig jobs simultaneously on same data

Posted by 勇胡 <yo...@gmail.com>.
Jon,

If I want to modify data (insert or delete) in HDFS, how can I do it? From
the description, I cannot directly modify the data itself (i.e. update it),
and I cannot append new data to a file. So how does HDFS handle data
modification? I am a little confused.

Yong

Re: Running multiple Pig jobs simultaneously on same data

Posted by Jonathan Coveney <jc...@gmail.com>.
Yong,

Currently, HDFS does not support appending to a file. So once a file is
created, it literally cannot be changed (although it can be deleted). This
lets you avoid situations where one client runs a SELECT * over the entire
database while the DBA can't update a row, and other conflicts like that.
There are some append patches in the works, but I am not sure how they handle
the concurrency implications.

Make sense?
Jon

Re: Running multiple Pig jobs simultaneously on same data

Posted by 勇胡 <yo...@gmail.com>.
I read the link, and my impression is that HDFS is designed for read-frequent
operation, not write-frequent operation ("A file once created, written, and
closed need not be changed.").

From your description ("Immutable means that after creation it cannot be
modified."), if I understand correctly, you mean that HDFS cannot implement
"update" semantics the way databases do? A write operation cannot be applied
directly to a specific tuple or record; the result of a write is only appended
at the end of the file?

Regards

Yong

Re: Running multiple Pig jobs simultaneously on same data

Posted by Nathan Bijnens <na...@nathan.gs>.
Immutable means that after creation it cannot be modified.

HDFS applications need a write-once-read-many access model for files. A file
once created, written, and closed need not be changed. This assumption
simplifies data coherency issues and enables high throughput data access. A
MapReduce application or a web crawler application fits perfectly with this
model. There is a plan to support appending-writes to files in the future.
http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Simple+Coherency+Model

Best regards,
  Nathan
---
nathan@nathan.gs : http://nathan.gs : http://twitter.com/nathan_gs

Re: Running multiple Pig jobs simultaneously on same data

Posted by 勇胡 <yo...@gmail.com>.
How should I understand "immutable"? I mean, does HDFS implement a lock
mechanism to guarantee immutable data access when concurrent tasks process
the same set of data, or does it use some other strategy?

Thanks

Yong

Re: Running multiple Pig jobs simultaneously on same data

Posted by Bill Graham <bi...@gmail.com>.
Yes, this is possible. Data in HDFS is immutable and MR tasks are spawned in
their own JVMs, so multiple concurrent jobs acting on the same input data are
fine.
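
As a sketch, launching two jobs over the same input needs nothing special;
the script names here are hypothetical:

```shell
# both scripts LOAD the same HDFS path; because the input is immutable,
# neither job can disturb the other's reads
pig -f daily_report.pig &
pig -f weekly_rollup.pig &
wait    # block until both jobs have finished
```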


Running multiple Pig jobs simultaneously on same data

Posted by Pradipta Kumar Dutta <pr...@me.com>.
Hi All,

We have a requirement where we have to process the same set of data (in a Hadoop cluster) by running multiple Pig jobs simultaneously.

Any idea whether this is possible in Pig?

Thanks,
Pradipta

Re: PIG Explain Plan

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Alan has a section on the explain plan in his upcoming book:

http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html
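
For anyone who wants to experiment in the meantime, here is a minimal sketch
of invoking EXPLAIN from the Grunt shell (the file name, schema, and aliases
below are made up for illustration):

```pig
-- a tiny script to generate a plan for
clicks  = LOAD 'input.txt' AS (user:chararray, n:int);
grouped = GROUP clicks BY user;
totals  = FOREACH grouped GENERATE group AS user, SUM(clicks.n) AS total;

-- prints the logical, physical, and MapReduce plans for alias 'totals'
EXPLAIN totals;
```

Reading top to bottom, the logical plan mirrors the script as written, the
physical plan shows the concrete operators Pig will execute, and the
MapReduce plan shows how those operators are packed into map and reduce
phases.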
