Posted to user@pig.apache.org by Jae Lee <Ja...@forward.co.uk> on 2010/12/07 15:40:41 UTC

Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Hi, 

In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.

The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.

I'm in the process of converting the Hive query into a Pig query to see if it reads better.

What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.

STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.

Any thoughts?

J


Re: Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Posted by Jae Lee <Ja...@forward.co.uk>.
Hi Jeff,

It's the process we run on the result data from Hive (or equally from Pig) that is network-bound. Pig at the moment only allows "DUMP" and "STORE" to get at that result data, which makes it a bit inconvenient.

J

On 8 Dec 2010, at 01:19, Jeff Zhang wrote:

> Hi Jae,
> 
> I believe that even if you use Pig, the performance of fetching from HDFS
> won't be better than with Hive, because Pig and Hive both store result data
> in HDFS and the client fetches it from there. In most cases the result
> data won't be very large, so performance won't be a problem. But I
> guess your result data is very large, since you mention that it is
> network-bound; in that case I suggest running another Pig script or
> native MapReduce jobs on your result data.
> 
> 
> On Wed, Dec 8, 2010 at 2:26 AM, Jae Lee <Ja...@forward.co.uk> wrote:
>> Yeah, I came across openIterator(alias) on PigServer.
>> 
>> Basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a Pig script.
>> 
>> I'm currently writing a Ruby wrapper that will STORE the alias into a temporary location in HDFS and then do a Hadoop file fetch.
>> Any better ideas?
>> 
>> J
>> On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:
>> 
>>> I am not sure if I understood your requirements clearly, but if you
>>> are not looking for a pure Pig Latin solution and can work through
>>> Pig's Java API, then you may want to look at PigServer.
>>> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
>>> Something along the following lines:
>>> 
>>> PigServer pig = new PigServer(pc, true);
>>> pig.registerQuery("A = load 'mydata'; ");
>>> pig.registerQuery("B = filter A by $0 > 10;");
>>> Iterator<Tuple> itr = pig.openIterator("B");
>>> while (itr.hasNext()) {
>>>   if (itr.next().get(0).equals(25)) {
>>>     // trigger further processing.
>>>   }
>>> }
>>> 
>>> It's obviously not directly useful, but it conveys the general idea. Hope it helps.
>>> 
>>> Ashutosh
>>> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Ja...@forward.co.uk> wrote:
>>>> Hi,
>>>> 
>>>> In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.
>>>> 
>>>> The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.
>>>> 
>>>> I'm in the process of converting the Hive query into a Pig query to see if it reads better.
>>>> 
>>>> What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.
>>>> 
>>>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.
>>>> 
>>>> Any thoughts?
>>>> 
>>>> J
>>> 
>> 
>> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang
> 


Re: Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Jae,

I believe that even if you use Pig, the performance of fetching from HDFS
won't be better than with Hive, because Pig and Hive both store result data
in HDFS and the client fetches it from there. In most cases the result
data won't be very large, so performance won't be a problem. But I
guess your result data is very large, since you mention that it is
network-bound; in that case I suggest running another Pig script or
native MapReduce jobs on your result data.


On Wed, Dec 8, 2010 at 2:26 AM, Jae Lee <Ja...@forward.co.uk> wrote:
> Yeah, I came across openIterator(alias) on PigServer.
>
> Basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a Pig script.
>
> I'm currently writing a Ruby wrapper that will STORE the alias into a temporary location in HDFS and then do a Hadoop file fetch.
> Any better ideas?
>
> J
> On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:
>
>> I am not sure if I understood your requirements clearly, but if you
>> are not looking for a pure Pig Latin solution and can work through
>> Pig's Java API, then you may want to look at PigServer.
>> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
>> Something along the following lines:
>>
>> PigServer pig = new PigServer(pc, true);
>> pig.registerQuery("A = load 'mydata'; ");
>> pig.registerQuery("B = filter A by $0 > 10;");
>> Iterator<Tuple> itr = pig.openIterator("B");
>> while (itr.hasNext()) {
>>   if (itr.next().get(0).equals(25)) {
>>     // trigger further processing.
>>   }
>> }
>>
>> It's obviously not directly useful, but it conveys the general idea. Hope it helps.
>>
>> Ashutosh
>> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Ja...@forward.co.uk> wrote:
>>> Hi,
>>>
>>> In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.
>>>
>>> The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.
>>>
>>> I'm in the process of converting the Hive query into a Pig query to see if it reads better.
>>>
>>> What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.
>>>
>>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.
>>>
>>> Any thoughts?
>>>
>>> J
>>
>
>



-- 
Best Regards

Jeff Zhang

Re: Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Posted by Jae Lee <Ja...@forward.co.uk>.
Oh yes, it will definitely work... it's just that I don't want to write a Java wrapper around PigServer. I would rather have a solution that works with a plain vanilla Pig installation...

J
On 8 Dec 2010, at 16:41, Ashutosh Chauhan wrote:

> You didn't mention why PigServer.openIterator() won't work for you.
> One of its use cases is exactly what you are describing, and it would
> avoid the need to write the Ruby wrapper.
> 
> Ashutosh
> On Tue, Dec 7, 2010 at 10:26, Jae Lee <Ja...@forward.co.uk> wrote:
>> Yeah, I came across openIterator(alias) on PigServer.
>> 
>> Basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a Pig script.
>> 
>> I'm currently writing a Ruby wrapper that will STORE the alias into a temporary location in HDFS and then do a Hadoop file fetch.
>> Any better ideas?
>> 
>> J
>> On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:
>> 
>>> I am not sure if I understood your requirements clearly, but if you
>>> are not looking for a pure Pig Latin solution and can work through
>>> Pig's Java API, then you may want to look at PigServer.
>>> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
>>> Something along the following lines:
>>> 
>>> PigServer pig = new PigServer(pc, true);
>>> pig.registerQuery("A = load 'mydata'; ");
>>> pig.registerQuery("B = filter A by $0 > 10;");
>>> Iterator<Tuple> itr = pig.openIterator("B");
>>> while (itr.hasNext()) {
>>>   if (itr.next().get(0).equals(25)) {
>>>     // trigger further processing.
>>>   }
>>> }
>>> 
>>> It's obviously not directly useful, but it conveys the general idea. Hope it helps.
>>> 
>>> Ashutosh
>>> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Ja...@forward.co.uk> wrote:
>>>> Hi,
>>>> 
>>>> In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.
>>>> 
>>>> The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.
>>>> 
>>>> I'm in the process of converting the Hive query into a Pig query to see if it reads better.
>>>> 
>>>> What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.
>>>> 
>>>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.
>>>> 
>>>> Any thoughts?
>>>> 
>>>> J
>>> 
>> 
>> 
> 


Re: Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Posted by Ashutosh Chauhan <ha...@apache.org>.
You didn't mention why PigServer.openIterator() won't work for you.
One of its use cases is exactly what you are describing, and it would
avoid the need to write the Ruby wrapper.

Ashutosh
On Tue, Dec 7, 2010 at 10:26, Jae Lee <Ja...@forward.co.uk> wrote:
> Yeah, I came across openIterator(alias) on PigServer.
>
> Basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a Pig script.
>
> I'm currently writing a Ruby wrapper that will STORE the alias into a temporary location in HDFS and then do a Hadoop file fetch.
> Any better ideas?
>
> J
> On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:
>
>> I am not sure if I understood your requirements clearly, but if you
>> are not looking for a pure Pig Latin solution and can work through
>> Pig's Java API, then you may want to look at PigServer.
>> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
>> Something along the following lines:
>>
>> PigServer pig = new PigServer(pc, true);
>> pig.registerQuery("A = load 'mydata'; ");
>> pig.registerQuery("B = filter A by $0 > 10;");
>> Iterator<Tuple> itr = pig.openIterator("B");
>> while (itr.hasNext()) {
>>   if (itr.next().get(0).equals(25)) {
>>     // trigger further processing.
>>   }
>> }
>>
>> It's obviously not directly useful, but it conveys the general idea. Hope it helps.
>>
>> Ashutosh
>> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Ja...@forward.co.uk> wrote:
>>> Hi,
>>>
>>> In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.
>>>
>>> The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.
>>>
>>> I'm in the process of converting the Hive query into a Pig query to see if it reads better.
>>>
>>> What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.
>>>
>>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.
>>>
>>> Any thoughts?
>>>
>>> J
>>
>
>

Re: Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Posted by Jae Lee <Ja...@forward.co.uk>.
Yeah, I came across openIterator(alias) on PigServer.

Basically that's what I'd like to get (a dump of the alias and nothing else) when I execute a Pig script.

I'm currently writing a Ruby wrapper that will STORE the alias into a temporary location in HDFS and then do a Hadoop file fetch.
Any better ideas?
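
Roughly, the wrapper boils down to something like the sketch below (illustration only, not a tested implementation: the StoreAndFetch class name, the /tmp/pig_result path and the two queries are placeholders, and it assumes the usual Pig and Hadoop client jars are on the classpath). It does in Java, via PigServer.store() and Hadoop's FileSystem API, what the Ruby wrapper would do by shelling out to hadoop fs:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class StoreAndFetch {
    public static void main(String[] args) throws Exception {
        // Build the script and STORE the alias into a temporary HDFS directory.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("A = LOAD 'mydata';");
        pig.registerQuery("B = FILTER A BY $0 > 10;");
        String tmpDir = "/tmp/pig_result";   // placeholder output directory
        pig.store("B", tmpDir);

        // Pull the part files back over the network, which is roughly what a
        // wrapper shelling out to 'hadoop fs -cat' would do.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] parts = fs.globStatus(new Path(tmpDir, "part-*"));
        if (parts != null) {
            for (FileStatus part : parts) {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(part.getPath())));
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // one tab-separated tuple per line
                }
                in.close();
            }
        }
        pig.shutdown();
    }
}

The downside, of course, is that the result makes an extra round trip through HDFS before it ever reaches the client.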

J
On 7 Dec 2010, at 18:16, Ashutosh Chauhan wrote:

> I am not sure if I understood your requirements clearly, but if you
> are not looking for a pure Pig Latin solution and can work through
> Pig's Java API, then you may want to look at PigServer.
> http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
> Something along the following lines:
> 
> PigServer pig = new PigServer(pc, true);
> pig.registerQuery("A = load 'mydata'; ");
> pig.registerQuery("B = filter A by $0 > 10;");
> Iterator<Tuple> itr = pig.openIterator("B");
> while (itr.hasNext()) {
>   if (itr.next().get(0).equals(25)) {
>     // trigger further processing.
>   }
> }
> 
> It's obviously not directly useful, but it conveys the general idea. Hope it helps.
> 
> Ashutosh
> On Tue, Dec 7, 2010 at 06:40, Jae Lee <Ja...@forward.co.uk> wrote:
>> Hi,
>> 
>> In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.
>> 
>> The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.
>> 
>> I'm in the process of converting the Hive query into a Pig query to see if it reads better.
>> 
>> What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.
>> 
>> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.
>> 
>> Any thoughts?
>> 
>> J
> 


Re: Is there anything in Pig that supports an external client streaming out the contents of an alias? A bit like the Hive Thrift server...

Posted by Ashutosh Chauhan <ha...@apache.org>.
I am not sure if I understood your requirements clearly, but if you
are not looking for a pure Pig Latin solution and can work through
Pig's Java API, then you may want to look at PigServer.
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/PigServer.html
Something along the following lines:

PigServer pig = new PigServer(pc, true);
pig.registerQuery("A = load 'mydata'; ");
pig.registerQuery("B = filter A by $0 > 10;");
Iterator<Tuple> itr = pig.openIterator("B");
while (itr.hasNext()) {
  if (itr.next().get(0).equals(25)) {
    // trigger further processing.
  }
}

It's obviously not directly useful, but it conveys the general idea. Hope it helps.
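
For completeness, here is a hedged, self-contained variant of the same sketch, intended to compile against a Pig 0.7-era classpath; the AliasIteratorExample class name, the 'mydata' path, the MAPREDUCE exec type and the value 25 are placeholders rather than anything Pig requires:

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class AliasIteratorExample {
    public static void main(String[] args) throws Exception {
        // openIterator() runs the plan and streams the contents of the alias
        // back to this client process as Tuple objects, instead of mixing it
        // into the console output the way DUMP does.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("A = LOAD 'mydata';");
        pig.registerQuery("B = FILTER A BY $0 > 10;");

        Iterator<Tuple> itr = pig.openIterator("B");
        while (itr.hasNext()) {
            Tuple t = itr.next();
            // Field values come back as Objects; compare via String so the
            // check works regardless of how the loader typed the column.
            if ("25".equals(String.valueOf(t.get(0)))) {
                // trigger further processing here
            }
        }
        pig.shutdown();
    }
}

The point of openIterator() is that the tuples arrive in the client JVM as objects you can act on directly, which is what makes it a natural hook for triggering further processing.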

Ashutosh
On Tue, Dec 7, 2010 at 06:40, Jae Lee <Ja...@forward.co.uk> wrote:
> Hi,
>
> In our application Hive is used as a database, i.e. the result set from a SELECT query is consumed outside of the Hadoop cluster.
>
> The consuming process is not Hadoop-friendly, in that it is network-bound, not CPU/disk-bound.
>
> I'm in the process of converting the Hive query into a Pig query to see if it reads better.
>
> What I'm stuck on is extracting the contents of a specific alias's dump from all the other output being logged, so that further processing can be triggered.
>
> STREAM <alias> THROUGH <cmd> seems to be one way to trigger a process; it just doesn't seem suitable for the kind of process we are looking at, because the <cmd> gets run inside the Hadoop cluster.
>
> Any thoughts?
>
> J