Posted to user@pig.apache.org by Earl Cahill <ca...@yahoo.com> on 2008/10/18 07:17:21 UTC

jdbc driven store function(++)

My latest stuff looks at apache logs, aggregates them to text files, and then I have a simple perl script that +='s the results into mysql tables.  A few thoughts:

	* Would sure be nice if I could just STORE my aggregations into any jdbc-friendly database, like mysql, instead of text files.  Anyone work on such a thing?  I could do the simple case(s), but would need some help with more complicated ones.

	* How about a MOVE function?  Would be nice to move files once done processing them.

	* I have yet to get into hadoop, but it would be nice to have an incoming directory, then a processed directory.  Really, I would like to have a daemon that watches a directory that churns through logs exactly once.  That's kind of how hadoop works, right?
	* How about a LOAD function that can read from S3? Or maybe the MOVE could move from S3 to local storage, or vice versa?

Thoughts?
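For the incoming/processed directory idea, the exactly-once part can lean on the fact that a rename within one filesystem is atomic. A minimal shell sketch (directory names and the aggregation step are placeholders, not anything Pig or hadoop provides):

```shell
#!/bin/sh
# Sketch of the incoming/ -> processed/ pattern: handle each log once,
# then move it aside so a rerun never sees it again.
drain_incoming() {
  incoming=$1; processed=$2
  mkdir -p "$incoming" "$processed"
  for f in "$incoming"/*.log; do
    [ -e "$f" ] || continue        # empty glob: nothing to process
    # ... run the aggregation over "$f" here ...
    mv "$f" "$processed"/          # rename is atomic on one filesystem
  done
}
```

A daemon would just call something like this in a loop (or from cron); as long as incoming/ and processed/ sit on the same filesystem, a crash mid-run leaves each file cleanly in one directory or the other.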

Thanks,
Earl


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: jdbc driven store function(++)

Posted by Ian Holsman <li...@holsman.net>.
Earl Cahill wrote:
> My latest stuff looks at apache logs, aggregates to txt files, then I have a simple perl script that +='s into mysql tables.  A few thoughts
>
> 	* Would sure be nice if I could just STORE my aggregations into any jdbc-friendly database, like mysql, instead of text files.  Anyone work on such a thing?  I could do the simple case(s), but would need some help with more complicated ones.
>   

I think a generic 'MySQLStore()' might be a bit troublesome, especially 
if you need to deal with Bags & Maps. But you could encode this kind of 
thing into a properties file that gets passed as an argument to it, 
similar to how a delimiter is passed into PigStorage(), mapping the 
fields in the schema to columns/tables.
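Such a mapping file might look something like this (all of the key names here are invented for illustration; nothing like this exists in Pig today):

```
# Hypothetical mapping passed to a JDBC store function
jdbc.url=jdbc:mysql://localhost/logs
jdbc.table=page_hits
field.0.column=url
field.1.column=hit_count
field.1.update=ADD      # += the value on duplicate key instead of overwriting
```

The store function would read this at construction time and build its INSERT statement from it, the same way PigStorage() builds its output line from the delimiter argument.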

With that you would just create a connection in putNext() and push each 
tuple through via INSERT ... ON DUPLICATE KEY UPDATE 
(http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html).
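Concretely, for Earl's += use case the statement per tuple would look something like this (table and column names are made up; VALUES() in the UPDATE clause refers back to the value the INSERT tried to write):

```sql
-- Accumulate counts: insert a new row, or add to the existing one
-- when the url is already present (assumes a unique key on url).
INSERT INTO page_hits (url, hit_count)
VALUES ('/index.html', 42)
ON DUPLICATE KEY UPDATE hit_count = hit_count + VALUES(hit_count);
```

That replaces the perl script's += step entirely: the database does the accumulation, so the store function never has to read before it writes.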

MySQL connections are established quite fast, so I'd argue you wouldn't 
even need pooling.

That said, it might be faster to create a load file and use LOAD DATA 
(http://dev.mysql.com/doc/refman/5.0/en/load-data.html).
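Since Pig's text output is already tab-delimited (the PigStorage() default), the load-file route is mostly a matter of pointing LOAD DATA at it. A sketch, with made-up table and file names:

```sql
-- Bulk-load a tab-delimited aggregate file into a staging table,
-- then fold it into the live table with the same upsert as above.
LOAD DATA LOCAL INFILE '/tmp/aggregates.txt'
INTO TABLE page_hits_staging
FIELDS TERMINATED BY '\t'
(url, hit_count);
```

For big aggregates this will usually beat row-at-a-time INSERTs by a wide margin, at the cost of a second pass to merge the staging table.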


regards
Ian

(and no, I haven't written one of these yet, but I think I will need one 
shortly).



