Posted to user@flume.apache.org by Dominik Hübner <co...@dhuebner.com> on 2015/07/08 10:23:42 UTC

Flume timestamp partitioning overlaps

I am using Cloudera’s example source to collect a sample of Twitter’s stream partitioned by year -> month -> day -> hour. 
https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java

The timestamp of an event is set by:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));

My agent config:
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://kronos.feeb.co:8020/user/flume/tweets/%Y/%m/%d/%H/

However, in almost every hour directory I see at least one record (more often several) from the last second of the previous hour.

Is there any way to prevent having those overlaps in data? 
Hourly aggregation without dropping data becomes unnecessarily messy due to this.
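
To make the expected behaviour concrete, here is a minimal sketch of how I would expect an epoch-millisecond "timestamp" header to map onto the %Y/%m/%d/%H part of the path. It is not Flume's actual code: the class HourBucket and the method pathFor are hypothetical names for illustration only, and the time zone is passed in explicitly because I am assuming the sink resolves the escapes in a single configured zone (hdfs.timeZone, which I believe defaults to local time).

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Flume: maps an epoch-millisecond timestamp
// to the year/month/day/hour directory it should fall into.
final class HourBucket {
    // Mirrors the %Y/%m/%d/%H layout used in hdfs.path above.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    static String pathFor(long epochMillis, ZoneId zone) {
        return FMT.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        ZoneId utc = ZoneId.of("UTC"); // assumption: the sink resolves paths in UTC
        long lastSecond = Instant.parse("2015-07-08T09:59:59Z").toEpochMilli();
        long firstSecond = Instant.parse("2015-07-08T10:00:00Z").toEpochMilli();
        System.out.println(pathFor(lastSecond, utc));  // 2015/07/08/09
        System.out.println(pathFor(firstSecond, utc)); // 2015/07/08/10
    }
}
```

Under this model, a tweet created at 09:59:59 belongs in the 09 directory, so comparing the stored records against such a helper would show whether the header value or the path resolution is off.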