Posted to common-user@hadoop.apache.org by Mark <st...@gmail.com> on 2013/04/11 17:56:47 UTC

Help me improve this InputFormat/Loader

We have logs stored in HDFS in the following format: /YEAR/MONTH/DAY. It's not guaranteed that we will have every single day, though, so there will be gaps. We have some jobs that need to retrieve the last X days of data, counting only days that actually exist and contain data.

We have something like the following: https://gist.github.com/anonymous/5364554 (The naming is a little off since it's technically not an InputFormat. Any ideas on a proper name?) Basically, it retrieves all directories under a given path, sorts them in descending order, and limits the result to the most recent X. It then delegates setInputPaths to FileInputFormat. In case you are wondering how we use it, here is an example of a custom PigStorage class: https://gist.github.com/anonymous/5364601
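
To make the selection step concrete, here is a simplified sketch of the "sort descending, keep the most recent X" logic. It is not the code from the gist: it works on plain path strings instead of HDFS, the class and method names (LatestDays, latest) are made up for illustration, and it assumes the MONTH and DAY components are zero-padded so that string order matches date order. In a real job the list of existing day directories would come from the file system (for example via FileSystem.globStatus with a pattern such as /*/*/*), and the result would be handed to FileInputFormat.setInputPaths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LatestDays {

    // Returns up to `limit` day directories, newest first.
    // Assumption: paths look like /YEAR/MONTH/DAY with zero-padded
    // month and day, so lexicographic order equals chronological order.
    public static List<String> latest(List<String> dayDirs, int limit) {
        List<String> sorted = new ArrayList<>(dayDirs);
        sorted.sort(Collections.reverseOrder());
        int n = Math.min(Math.max(limit, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, n));
    }

    public static void main(String[] args) {
        // Hypothetical listing with gaps between days.
        List<String> existing = Arrays.asList(
                "/2013/04/09", "/2013/03/28", "/2013/04/11", "/2013/04/02");
        // Prints the three most recent existing days, newest first.
        System.out.println(latest(existing, 3));
    }
}
```

Because only directories that actually exist are in the input list, gaps are skipped automatically; asking for more days than exist simply returns everything available.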

Although this works, I am thinking there may be a better/easier way to accomplish the same thing. Any ideas?

Thanks

- M