Posted to user@pig.apache.org by Bill Graham <bi...@gmail.com> on 2010/01/21 20:02:55 UTC

LOAD from multiple directories

Hi,

I have summary data created in directories every 10 minutes, and I have a job
that needs to LOAD from all directories in a one-hour period. I was hoping to
use Hadoop file path globbing, but it doesn't seem to allow glob patterns with
slashes ('/') in them. If my directory structure looks like the one shown below,
does anyone have suggestions for how I could write a LOAD command that would
load all directories from 10:30 to 11:20, for example?


/20100121/10/00
/20100121/10/10
/20100121/10/20
/20100121/10/30 <--
/20100121/10/40 <--
/20100121/10/50 <--
/20100121/11/00 <--
/20100121/11/10 <--
/20100121/11/20 <--
/20100121/11/30
/20100121/11/40


thanks,
Bill

Re: LOAD from multiple directories

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Right, I was assuming 0.20. Sorry :(
I used UNION when I was on 0.18.

On Thu, Jan 21, 2010 at 12:24 PM, Bill Graham <bi...@gmail.com> wrote:
> Note to those that are interested. As of 0.19.0, globs with slashes do work:
>
> http://issues.apache.org/jira/browse/HADOOP-3498
>
> Of course we're on 0.18.3. Sigh...

Re: LOAD from multiple directories

Posted by Bill Graham <bi...@gmail.com>.
Note to those that are interested. As of 0.19.0, globs with slashes do work:

http://issues.apache.org/jira/browse/HADOOP-3498
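
For example, on 0.19+ a single LOAD with slashes inside the braces should be
able to cover the 10:30 to 11:20 range (an untested sketch; the relation name
and default PigStorage loader are just placeholders):

    hr_1030_1120 = LOAD '/20100121/{10/30,10/40,10/50,11/00,11/10,11/20}' USING PigStorage();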

Of course we're on 0.18.3. Sigh...


On Thu, Jan 21, 2010 at 12:09 PM, Bill Graham <bi...@gmail.com> wrote:

> Thanks for the union suggestion, Thejas.
>
> Dmitriy, how were you envisioning globs being used for this use case?
> Globs with slashes like this don't work:
>
> {10/30,10/40,10/50,11/00,11/10,11/20}

Re: LOAD from multiple directories

Posted by Bill Graham <bi...@gmail.com>.
Thanks for the union suggestion, Thejas.

Dmitriy, how were you envisioning globs being used for this use case?
Globs with slashes like this don't work:

{10/30,10/40,10/50,11/00,11/10,11/20}


On Thu, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> you should be able to use globs:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> {ab,c{de,fh}}
>    Matches a string from the string set {ab, cde, cfh}
>
> -D

Re: LOAD from multiple directories

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
you should be able to use globs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

{ab,c{de,fh}}
    Matches a string from the string set {ab, cde, cfh}
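
In a Pig LOAD that could look something like the sketch below (untested; the
relation name and default PigStorage loader are just placeholders). A brace set
on the minute component alone expands the same way as the string-set example above:

    -- '/20100121/10/{30,40,50}' matches the 10:30, 10:40 and 10:50 directories
    h10 = LOAD '/20100121/10/{30,40,50}' USING PigStorage();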

-D

On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair <te...@yahoo-inc.com> wrote:
> I was going to suggest "/20100121/{10,11}/{30,40,50,00,10,20}", but that would
> not work because it would also match, for example, "/20100121/10/00". I don't
> think Hadoop file path globbing can be used for this use case.
>
> You can use multiple LOADs followed by a UNION.
>
> -Thejas

Re: LOAD from multiple directories

Posted by Thejas Nair <te...@yahoo-inc.com>.
I was going to suggest "/20100121/{10,11}/{30,40,50,00,10,20}", but that would
not work because it would also match, for example, "/20100121/10/00". I don't
think Hadoop file path globbing can be used for this use case.

You can use multiple LOADs followed by a UNION.
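
Something along these lines, for example (an untested sketch; the relation names
and default PigStorage loader are placeholders, so adjust them to your data):

    d1030 = LOAD '/20100121/10/30' USING PigStorage();
    d1040 = LOAD '/20100121/10/40' USING PigStorage();
    d1050 = LOAD '/20100121/10/50' USING PigStorage();
    d1100 = LOAD '/20100121/11/00' USING PigStorage();
    d1110 = LOAD '/20100121/11/10' USING PigStorage();
    d1120 = LOAD '/20100121/11/20' USING PigStorage();
    -- UNION of the six ten-minute relations covers 10:30 through 11:20
    hr_1030_1120 = UNION d1030, d1040, d1050, d1100, d1110, d1120;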

-Thejas



On 1/21/10 11:02 AM, "Bill Graham" <bi...@gmail.com> wrote:

> Hi,
> 
> I have summary data created in directories every 10 minutes, and I have a job
> that needs to LOAD from all directories in a one-hour period. I was hoping to
> use Hadoop file path globbing, but it doesn't seem to allow glob patterns with
> slashes ('/') in them. If my directory structure looks like the one shown below,
> does anyone have suggestions for how I could write a LOAD command that would
> load all directories from 10:30 to 11:20, for example?
> 
> 
> /20100121/10/00
> /20100121/10/10
> /20100121/10/20
> /20100121/10/30 <--
> /20100121/10/40 <--
> /20100121/10/50 <--
> /20100121/11/00 <--
> /20100121/11/10 <--
> /20100121/11/20 <--
> /20100121/11/30
> /20100121/11/40
> 
> 
> thanks,
> Bill