Posted to user@pig.apache.org by "Meyer, Dennis" <de...@adtech.com> on 2012/01/02 15:27:15 UTC

Aggregating multiple files by pattern (regex possible?)

Hi,

We have a use case where it would be beneficial to "select" multiple files to process by a regex pattern (or by some loop-like functionality that dynamically adjusts which files to pick). We have files of different types, and within one type there are versions: each new version adds data to the records, but we never remove info. As the files of the same type are very similar, combining them would be a UNION. The files are stored in one directory and look like this:

type-A-v1—1.avro
type-A-v1—2.avro
type-A-v1—3.avro
type-A-v1—4.avro
type-A-v2—1.avro
type-A-v2—2.avro
type-A-v2—3.avro
type-A-v2—4.avro
type-A-v2—5.avro
type-B-v1—1.avro
type-B-v1—2.avro
type-B-v1—3.avro
….
Same with C etc…

As you can guess, v1 stands for version #1, so higher versions will have new fields in them. Different types contain different data.

It would be great if it were possible to address only certain files (for example, aggregate all files of type "A" for "v1" and "v2"). What would be the technique of choice here?
The aim is to be able to increment the version (adding fields to the records dynamically) without changing the aggregation itself. Of course, the aggregation will simply ignore the new fields.

Thanks,
Dennis

Re: Aggregating multiple files by pattern (regex possible?)

Posted by "Meyer, Dennis" <de...@adtech.com>.
That looks like what I've been searching for. I'll give it a try.

Thanks!
Dennis


PS: Thanks Daniel as well. The loader might also be very useful.

On 02.01.12 20:16, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

>Dennis,
>Hadoop and Pig support globs, which may be sufficient for what you want.
>The glob matching rules are described here:
>http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
>
>If those aren't sufficient, it's possible to write a custom loader to do
>more advanced regex handling in the input format, or you could
>alter your file naming conventions / directory structure so that globs do
>become sufficient.
>
>Hope this helps.
>-Dmitriy


Re: Aggregating multiple files by pattern (regex possible?)

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Dennis,
Hadoop and Pig support globs, which may be sufficient for what you want.
The glob matching rules are described here:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)

If those aren't sufficient, it's possible to write a custom loader to do
more advanced regex handling in the input format, or you could
alter your file naming conventions / directory structure so that globs do
become sufficient.

Hope this helps.
-Dmitriy
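
To make the glob suggestion concrete: the globStatus documentation linked above lists wildcards such as `*`, `?`, `[abc]` and `{ab,cd}`, so a pattern along the lines of `type-A-v{1,2}*.avro` should pick up every "A" file for v1 and v2, and in Pig such a pattern would normally be given as the path in the LOAD statement. The Python sketch below only emulates that selection locally so the effect of a pattern can be checked before running a job. It is not Hadoop's matcher: `expand_braces`, `select` and the sample file names (written with a plain hyphen before the sequence number) are assumptions made for illustration.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that patterns with several brace groups are handled too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def select(filenames, pattern):
    """Return the names matched by a glob that may contain {a,b} groups."""
    plain_patterns = expand_braces(pattern)
    return [name for name in filenames
            if any(fnmatch.fnmatchcase(name, p) for p in plain_patterns)]


# Hypothetical directory listing modeled on the one in the question.
files = [
    "type-A-v1-1.avro",
    "type-A-v1-2.avro",
    "type-A-v2-1.avro",
    "type-A-v3-1.avro",
    "type-B-v1-1.avro",
]

# All "A" files for v1 and v2; v3 and type B are left out.
selected = select(files, "type-A-v{1,2}*.avro")
print(selected)
```

Because the version list lives in the pattern only, a later v3 could be included by changing `{1,2}` to `{1,2,3}` (or to `*`) without touching the aggregation logic.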


Re: Aggregating multiple files by pattern (regex possible?)

Posted by Daniel Dai <da...@hortonworks.com>.
You can append the filename to each input record. See
https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%3F

Daniel
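
Once every record carries the name of the file it came from, the selection becomes an ordinary regex filter on that field. The Python sketch below illustrates the idea outside of Pig; it is not the loader described in the FAQ. The names `NAME_RE`, `tag_with_source` and `select_records`, the record layout, and the assumed file-name structure (type letter, version number, sequence number) are all hypothetical.

```python
import re

# Assumed naming scheme: type-<letter>-v<version><separator><sequence>.avro
NAME_RE = re.compile(
    r"^type-(?P<type>[A-Z])-v(?P<version>\d+)\D+(?P<part>\d+)\.avro$"
)


def tag_with_source(filename, records):
    """Yield (filename, record) pairs, i.e. append the source to each record."""
    for record in records:
        yield filename, record


def select_records(tagged, wanted_type, wanted_versions):
    """Keep records whose source file has the wanted type and version."""
    for filename, record in tagged:
        match = NAME_RE.match(filename)
        if match is None:
            continue  # file name does not follow the assumed scheme
        if (match.group("type") == wanted_type
                and int(match.group("version")) in wanted_versions):
            yield record


# Hypothetical inputs: v2 records carry an extra field that v1 lacks.
inputs = {
    "type-A-v1-1.avro": [{"id": 1}],
    "type-A-v2-1.avro": [{"id": 2, "extra": "x"}],
    "type-B-v1-1.avro": [{"id": 3}],
}

tagged = [pair
          for name, records in inputs.items()
          for pair in tag_with_source(name, records)]

# Aggregate over type "A", versions 1 and 2, using only the shared field.
ids = [record["id"] for record in select_records(tagged, "A", {1, 2})]
print(ids)
```

The aggregation reads only the fields common to all versions, so the extra field added in v2 is ignored, which matches the requirement stated in the question.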
