Posted to user@pig.apache.org by Pankil Doshi <fo...@gmail.com> on 2009/08/26 19:21:53 UTC

Question Regarding Multiple Loads

Hello Everyone,

I am trying to write Pig scripts for my project. The problem I am facing is
that I want to load different files into the same variable. Is it possible
to do this without modifying the loader? I read about Hadoop globbing. Does
anyone have a solution for this?

I know I can load all files in a given directory into a single variable.
But is it possible to load specific files from that directory? Or specific
files from different directories into the same load variable?

I also know about the UNION strategy, but that adds one more map-reduce job
and I want to avoid that.

Any suggestions are welcome.

Pankil

RE: Question Regarding Multiple Loads

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Pankil,

You have a couple of options:

(1) If you disable multiquery support, you can take advantage of the
full Hadoop globbing capabilities, which are likely to be sufficient.
(2) If you need to use multiquery, only single-pattern globs are
supported, so you would not be able to specify multiple unrelated
directories. If that is not sufficient, you will need to use UNION, but
it might not significantly impact your performance. I would try that
first before trying a custom solution.
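
As a rough sketch of the UNION fallback mentioned in (2), the two loads
below are combined into one relation. The paths, the tab delimiter, and the
field names are assumptions made up for illustration, not details from this
thread.

```pig
-- Hypothetical paths and schema, for illustration only.
a = LOAD '/logs/2009-08-25/part-00000' USING PigStorage('\t')
        AS (ts:chararray, msg:chararray);
b = LOAD '/archive/extra/part-00003' USING PigStorage('\t')
        AS (ts:chararray, msg:chararray);

-- Both inputs end up in a single relation that later statements can use.
combined = UNION a, b;
```

Whether this costs an extra map-reduce job depends on the rest of the
script, so it is worth measuring before ruling it out.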

Olga


Re: Question Regarding Multiple Loads

Posted by Daniel Dai <da...@gmail.com>.
Pig passes the file name directly to Hadoop, so globbing support is
provided by the underlying Hadoop. Hadoop 0.18 only supports single-pattern
globs. Hadoop 0.19/0.20 support globbing for multiple unrelated directories.
The latest Pig release (0.3) bundles Hadoop 0.18, so you can only use
single-pattern globbing with that release.
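
To illustrate, here is a sketch of two kinds of glob in a LOAD statement.
The directory and file names are hypothetical. I am assuming the brace form
is the multi-directory case; the thread does not spell this out, and
whether it is accepted depends on the Hadoop version underneath Pig.

```pig
-- One directory, one wildcard (hypothetical path).
one_dir = LOAD '/data/logs/part-*' USING PigStorage();

-- Brace alternation selecting two sibling directories (hypothetical names).
-- Assumption: this needs a Hadoop version with the fuller glob support.
two_dirs = LOAD '/data/{logs,archive}/part-*' USING PigStorage();
```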




Re: Question Regarding Multiple Loads

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Pankil Doshi wrote:
> Which version of Hadoop supports globbing, or do I have to apply a patch
> for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?


Someone from the Pig team can give details on the actual versions.
But I have been using globbing for quite a while now, and I think every
version of Pig you can get your hands on should be able to support it!

Regards,
Mridul

PS: IIRC there are differences between Hadoop globbing and bash globbing,
so you might want to look at the javadoc.



RE: Question Regarding Multiple Loads

Posted by zjffdu <zj...@gmail.com>.
The current Hadoop version (0.18.3) that Pig uses will be OK for you.

e.g.   raw = LOAD '/data/*.log' USING PigStorage();

This statement loads all the files under /data that have the .log extension.
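
Besides `*`, the Hadoop globStatus javadoc also lists `?` (any single
character) and character classes such as `[0-4]`. A small sketch of how
these might pick specific files out of one directory, assuming a
hypothetical file layout under /data:

```pig
-- Hypothetical file names, for illustration only.

-- ? matches exactly one character: 2009-08-20.log through 2009-08-29.log.
one_decade = LOAD '/data/2009-08-2?.log' USING PigStorage();

-- [0-4] matches one character in the range: part-00000 through part-00004.
some_parts = LOAD '/data/part-0000[0-4]' USING PigStorage();
```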




Re: Question Regarding Multiple Loads

Posted by Pankil Doshi <fo...@gmail.com>.
Which version of Hadoop supports globbing, or do I have to apply a patch
for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?

Pankil


Re: Question Regarding Multiple Loads

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Hi Pankil,

   As Thejas pointed out in the other thread, you can use the globbing that
Hadoop supports:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)


Regards,
Mridul
