Posted to user@pig.apache.org by Pankil Doshi <fo...@gmail.com> on 2009/08/26 19:21:53 UTC
Question Regarding Multiple Loads
Hello Everyone,
I am trying to write Pig scripts for my project. The problem I am facing
is that I want to load different files into the same variable. Is it
possible to do this without modifying the loader? I read about Hadoop
globbing. Does anyone have a solution for this?
I know I can load all files of a given directory into a single variable.
But is it possible to load specific files from that directory? Or specific
files from different directories into the same load variable?
I also know about the UNION strategy, but that adds one map-reduce job and
I want to avoid that.
Any suggestions are welcome.
Pankil
RE: Question Regarding Multiple Loads
Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Pankil,
You have a couple of options:
(1) If you disable multiquery support, you can take advantage of the
full Hadoop globbing capabilities, which are likely to be sufficient.
(2) If you need to use multiquery, only single-pattern globs are
supported, so you would not be able to specify multiple unrelated
directories. If that is not sufficient, you will need to use UNION, but
it might not significantly impact your performance. I would try that
first before trying a custom solution.
Olga
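The two routes described above can be sketched roughly as follows. This is an illustration only: the paths, file names, and relation names are hypothetical, and PigStorage with its default delimiter is assumed as the loader.

```pig
-- Route 1: a single-pattern glob selects specific files from one directory.
-- '/data/logs' and the part-file naming are hypothetical.
raw = LOAD '/data/logs/part-0000[0-3]' USING PigStorage();

-- Route 2 (fallback): load each location separately, then combine with UNION.
-- Both inputs are assumed to have the same layout.
a = LOAD '/data/logs/2009-08-25.log' USING PigStorage();
b = LOAD '/archive/logs/2009-08-26.log' USING PigStorage();
combined = UNION a, b;
```

Whether the UNION route costs noticeably more than a single LOAD is worth measuring on real data before ruling it out.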
Re: Question Regarding Multiple Loads
Posted by Daniel Dai <da...@gmail.com>.
Pig passes the filename directly to Hadoop, so globbing support is
provided by the underlying Hadoop version. Hadoop 0.18 supports only
single-pattern globs. Hadoop 0.19 and 0.20 support globbing across
multiple unrelated directories. The latest Pig release (0.3) bundles
Hadoop 0.18, so you can only use single-pattern globbing with that release.
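As a rough sketch of what globbing across unrelated directories could look like, the statement below uses the {a,b} alternation from the Hadoop glob syntax. It assumes Pig is running on Hadoop 0.19 or later, as described above; the directory names are hypothetical, and it has not been checked against any particular Pig release.

```pig
-- Assumes Hadoop 0.19+ underneath Pig; all directory names are hypothetical.
-- The {x,y} alternation spans path components, so a single LOAD covers
-- /data/web/current/*.log and /data/archive/2009/*.log.
raw = LOAD '/data/{web/current,archive/2009}/*.log' USING PigStorage();
```

On Hadoop 0.18 the same pattern would not be expected to work, in which case separate LOAD statements combined with UNION remain the alternative.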
----- Original Message -----
From: "Mridul Muralidharan" <mr...@yahoo-inc.com>
To: <pi...@hadoop.apache.org>
Sent: Sunday, August 30, 2009 5:08 PM
Subject: Re: Question Regarding Multiple Loads
> Pankil Doshi wrote:
>> Which version of Hadoop supports Hadoop globbing? Or do I have to apply
>> a patch for it? And will it be compatible with Pig 0.3.0? Has anyone
>> tested it?
>
> Someone from the Pig team can give details of the actual versions.
> But I have been using globbing for quite a while now, and I think all
> versions of Pig that you can get your hands on should be able to support
> it!
>
> Regards,
> Mridul
>
> PS: IIRC there are differences between Hadoop globbing and bash globbing,
> so you might want to look at the javadoc.
Re: Question Regarding Multiple Loads
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Pankil Doshi wrote:
> Which version of Hadoop supports Hadoop globbing? Or do I have to apply a
> patch for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?
Someone from the Pig team can give details of the actual versions.
But I have been using globbing for quite a while now, and I think all
versions of Pig that you can get your hands on should be able to
support it!
Regards,
Mridul
PS: IIRC there are differences between Hadoop globbing and bash globbing,
so you might want to look at the javadoc.
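For reference, here is a small sketch of a few pattern forms listed in the FileSystem.globStatus javadoc, written as Pig LOAD statements. All paths and relation names are hypothetical. One point to check against the javadoc: it lists negation inside brackets as [^a], while the [!a] form familiar from shells is not listed there.

```pig
-- Pattern forms as listed in the FileSystem.globStatus javadoc.
-- All paths below are hypothetical.
one_char = LOAD '/data/part-0000?' USING PigStorage();     -- ? matches exactly one character
in_set   = LOAD '/data/log-[abc].txt' USING PigStorage();  -- one character from the set a, b, c
in_range = LOAD '/data/log-[0-9].txt' USING PigStorage();  -- one character from the range 0-9
not_set  = LOAD '/data/log-[^x].txt' USING PigStorage();   -- one character other than x
```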
RE: Question Regarding Multiple Loads
Posted by zjffdu <zj...@gmail.com>.
The Hadoop version that Pig currently uses (0.18.3) will be OK for you.
e.g. raw = LOAD '/data/*.log' USING PigStorage();
This statement loads all files in /data that have the extension .log.
-----Original Message-----
From: Pankil Doshi [mailto:forpankil@gmail.com]
Sent: August 26, 2009 17:17
To: pig-user@hadoop.apache.org
Subject: Re: Question Regarding Multiple Loads
Which version of Hadoop supports Hadoop globbing? Or do I have to apply a
patch for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?
Pankil
Re: Question Regarding Multiple Loads
Posted by Pankil Doshi <fo...@gmail.com>.
Which version of Hadoop supports Hadoop globbing? Or do I have to apply a
patch for it? And will it be compatible with Pig 0.3.0? Has anyone tested it?
Pankil
On Wed, Aug 26, 2009 at 3:08 PM, Mridul Muralidharan
<mr...@yahoo-inc.com>wrote:
>
> Hi Pankil,
>
> As Thejas pointed out in the other thread, you can use the globbing that
> Hadoop supports:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
>
Re: Question Regarding Multiple Loads
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Hi Pankil,
As Thejas pointed out in the other thread, you can use the globbing that
Hadoop supports:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
Regards,
Mridul