Posted to user@pig.apache.org by Earl Cahill <ca...@yahoo.com> on 2008/10/18 07:17:21 UTC
jdbc driven store function(++)
My latest stuff looks at Apache logs, aggregates to text files, then I have a simple perl script that +='s into mysql tables. A few thoughts:
* Would sure be nice if I could just STORE my aggregations into any jdbc-friendly database, like mysql, instead of text files. Anyone work on such a thing? I could do the simple case(s), but would need some help with more complicated ones.
* How about a MOVE function? Would be nice to move files once done processing them.
* I have yet to get into hadoop, but it would be nice to have an incoming directory, then a processed directory. Really, I would like to have a daemon that watches a directory that churns through logs exactly once. That's kind of how hadoop works, right?
* How about a LOAD function that can read from S3, or maybe the MOVE could move from S3 to local storage, or vice versa? Thoughts?
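There's no MOVE in Pig today as far as I know, but the incoming/processed daemon idea above could be done outside Pig in plain Java. A rough sketch (the class name, directory layout, and the "run the job here" step are all made up, not existing Pig functionality):

```java
import java.io.File;

// Sketch of a sweeper that processes each log exactly once by moving it
// out of the incoming directory when done. A daemon would call sweep()
// in a loop with a sleep between passes.
public class LogMover {

    // Move a finished file from incoming/ to processed/, keeping its name.
    static boolean moveToProcessed(File incomingFile, File processedDir) {
        if (!processedDir.exists() && !processedDir.mkdirs()) {
            return false;
        }
        return incomingFile.renameTo(new File(processedDir, incomingFile.getName()));
    }

    // One pass over the incoming directory: process, then move.
    static int sweep(File incomingDir, File processedDir) {
        int moved = 0;
        File[] files = incomingDir.listFiles();
        if (files == null) return 0;
        for (File f : files) {
            if (f.isFile()) {
                // ... run the Pig/aggregation job on f here ...
                if (moveToProcessed(f, processedDir)) moved++;
            }
        }
        return moved;
    }
}
```

Moving the file only after the job succeeds is what gives you the exactly-once behavior: anything still sitting in incoming/ hasn't been fully processed yet.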
Thanks,
Earl
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: jdbc driven store function(++)
Posted by Ian Holsman <li...@holsman.net>.
Earl Cahill wrote:
> My latest stuff looks at apache logs, aggregates to txt files, then I have a simple perl script that +='s into mysql tables. A few thoughts
>
> * Would sure be nice if I could just STORE my aggregations into any jdbc-friendly database, like mysql, instead of text files. Anyone work on such a thing? I could do the simple case(s), but would need some help with more complicated ones.
>
I think a generic 'MySQLStore()' might be a bit troublesome, especially
if you need to deal with Bags & Maps. But you could encode this kind of
thing into a properties file that gets passed as an argument to it,
similar to how a delimiter is passed into PigStorage(), mapping the
fields in the schema to columns/tables.
With that you would just create a connection in putNext() and push each
tuple through via INSERT ... ON DUPLICATE KEY UPDATE
(http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html).
MySQL connections are established quite fast, so I'd argue you wouldn't
even need pooling.
That said, it might be faster to create a load file and use LOAD DATA
(http://dev.mysql.com/doc/refman/5.0/en/load-data.html).
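To make the putNext() idea concrete, here's a sketch of building the upsert statement such a StoreFunc could prepare once and execute per tuple. The class and method names are my own, and the "+=" accumulation on the value columns matches what Earl's perl script does; the key/value column split would come from the properties file described above:

```java
import java.util.List;

// Sketch: build an INSERT ... ON DUPLICATE KEY UPDATE statement for a
// MySQL-backed store function. Not an existing Pig API.
public class UpsertSql {

    // keyCols identify the row; valueCols are accumulated with += on conflict.
    static String build(String table, List<String> keyCols, List<String> valueCols) {
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table).append(" (");
        sql.append(String.join(", ", keyCols)).append(", ").append(String.join(", ", valueCols));
        sql.append(") VALUES (");
        int n = keyCols.size() + valueCols.size();
        for (int i = 0; i < n; i++) sql.append(i == 0 ? "?" : ", ?");
        sql.append(") ON DUPLICATE KEY UPDATE ");
        // VALUES(col) refers to the value this INSERT attempted to insert,
        // so existing counts get incremented rather than overwritten.
        for (int i = 0; i < valueCols.size(); i++) {
            String c = valueCols.get(i);
            if (i > 0) sql.append(", ");
            sql.append(c).append(" = ").append(c).append(" + VALUES(").append(c).append(")");
        }
        return sql.toString();
    }
}
```

With Connector/J you'd hand this string to Connection.prepareStatement(), bind the tuple's fields, and execute (or batch) it from putNext().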
regards
Ian
(and no, I haven't written one of these yet, but I think I will need one
shortly).
> * How about a MOVE function? Would be nice to move files once done processing them.
>
> * I have yet to get into hadoop, but it would be nice to have an incoming directory, then a processed directory. Really, I would like to have a daemon that watches a directory that churns through logs exactly once. That's kind of how hadoop works, right?
> * How about a LOAD function that can read from S3, or maybe the MOVE could move from S3 to local storage, or vice versa? Thoughts?
>
> Thanks,
> Earl
>