Posted to user@pig.apache.org by Kris Coward <kr...@melon.org> on 2011/01/28 01:40:47 UTC

load union store

Hi all,

I'm writing a bit of code to grab some logfiles, parse them, and run some
sanity checks on them (before subjecting them to further analysis).
Naturally, logfiles being logfiles, they accumulate, and I was wondering
how efficiently Pig would handle a request to add recently accumulated
log data to a bit of logfile that's already been started.

In particular, two approaches that I'm contemplating are:

raw = LOAD 'logfile' ...
-- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
oldclean = LOAD 'existing_log';
newclean = UNION oldclean, cleanfile;
STORE newclean INTO 'tmp_log';
-- rm and mv here are Grunt shell commands
rm existing_log;
mv tmp_log existing_log;

...ALTERNATELY...

raw = LOAD 'logfile' ...
-- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
STORE cleanfile INTO 'tmp_log';

followed by renumbering all the part files in tmp_log and copying them
to existing_log.

Is Pig clever enough to handle the first set of instructions reasonably
efficiently? And if not, are there any gotchas I'd have to watch out for
with the second approach, e.g. a catalogue file that'd have to be edited
when the new parts are added?

Thanks,
Kris

-- 
Kris Coward					http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3

Re: load union store

Posted by Ashutosh Chauhan <ha...@apache.org>.
Howl is coming. This is one of the basic use cases for Howl.

1) You create a partitioned table in it (in this case, partitioned by date).
2) You add some partitions to your table (in this case, the new log files
for each date).
3) You query your table in Howl using Pig to do analysis.
4) The next day, you add a new partition to the Howl table.
5) Run your Pig script either on the newly created partition (by using
filters) or on the full dataset.
6) Repeat.

Note that all the path munging, schema management, and load-union-store are
gone. You are relieved of all that. Howl presents you with a table-like
abstraction for your Hadoop datasets.
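
As a rough sketch of what steps 3 and 5 might look like from Pig (the
loader name and partition column here are illustrative guesses, not
necessarily Howl's actual API; see the wiki below for the real interface):

cleanlogs = load 'cleanlogs' using org.apache.howl.pig.HowlLoader();
-- analyze a single day's partition by filtering on the partition column...
jan28 = filter cleanlogs by datestamp == '20110128';
-- ...or skip the filter to run over the full table.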

More info: http://wiki.apache.org/pig/Howl

Ashutosh


Re: load union store

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Extra cool thing about globbing is that this kind of thing works:

mystuff = load '/logs/stuff/2011/01/{01,02,03}';
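
The braces are standard Hadoop glob syntax, so the usual wildcards should
work as well; for instance (a sketch along the same lines, not tested here):

january = load '/logs/stuff/2011/01/*';
first_quarter = load '/logs/stuff/2011/{01,02,03}/*';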

:-)

D


Re: load union store

Posted by Kris Coward <kr...@melon.org>.
I missed the globbing on my previous passes over the documentation for
LOAD. Having missed it, my objection would have been that with all the
files in a single directory, I can get them with a single LOAD command;
a wildcard solves that too. Thanks for pushing back hard enough to make
me re-read that.

Cheers,
Kris


Re: load union store

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
It's a pain to rename everything, especially since the number of renames
grows every day. You'll stress out the namenode at some point.

I am not sure why loading data back out of 8760 distinct directories is
worse than 8760 distinct files. There is no real difference.

That's what we do at Twitter, fwiw, and that's also the standard setup
for Hive logs. Can you explain in greater detail what your objection is
if this doesn't work for you?

D



Re: load union store

Posted by Kris Coward <kr...@melon.org>.
I want to flatten things at least a little, since I'm looking for
year-long trends in logfiles that are rotated hourly (and loading the
data back out of 8760 distinct directories isn't my idea of a good
time).

Any reason that moving/renaming the part-nnnn files wouldn't work?

Thanks,
Kris


Re: load union store

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Kris,
As logs accumulate over time the union will get slow since you have to read
all the data off disk and write it back to disk.

Why not just have a hierarchy in your cleaned log directory? You can do
something like this (%declare, rather than define, is what runs a
backquoted shell command and makes the result available as a parameter):

%declare newdir `date +%s`

store newclean into 'cleaned_files/$newdir/';


Then to load all logs you can just load 'cleaned_files'.

You can also format the date output differently and wind up with your
cleaned files nicely organized by year/month/day/hour/ ...
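
For instance, a minimal sketch of that hourly layout (same %declare
mechanism as above, assuming the cleaned output lives under cleaned_files/):

%declare hourdir `date +%Y/%m/%d/%H`
store newclean into 'cleaned_files/$hourdir';

-- one LOAD then picks up the whole tree, a single year, or a single day:
all_logs = load 'cleaned_files';
jan28 = load 'cleaned_files/2011/01/28';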

D
