Posted to user@flume.apache.org by Guy Doulberg <gu...@conduit.com> on 2011/10/16 17:36:12 UTC

Logging Large Events to S3

Hi fellow flummers,

I have been struggling with Flume for a couple of weeks now. I am trying to log 
events to Amazon S3 so that I can later use Amazon EMR to analyze them.
The architecture I am trying to build is:

The client posts bzipped data -> an endpoint decompresses the data and 
attaches extra data (like the HTTP headers) -> the endpoint writes the data to 
a file on the local file system -> a Flume agent tails that file -> the agent 
sends the events to a Flume collector -> the collector sends the events to S3, bzipped.
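For reference, this is roughly how I have the two nodes mapped today. I am 
going from memory on the exact sink syntax, and the hostname, port, path and 
bucket name below are just placeholders:

    agent     : tail("/var/log/endpoint/events.log") | agentSink("collector.example.com", 35853);
    collector : collectorSource(35853) | collectorSink("s3n://my-bucket/flume/%Y-%m-%d/", "events-");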

After some effort I got this architecture working for small events. The 
problem is that the events I need to store are large (72KB expanded) and I 
have no control over the client (the client posts large zipped XML 
documents and I can't change this behavior), so the architecture has to be 
able to handle events of this size.

So I am considering two approaches, and I wanted to share them with you 
and hear what you have to say:

1. Flume supports a 32KB event size by default, but larger events can be allowed by 
changing the "flume.event.max.size.bytes" property. I tried that (the exact snippet I 
used is below, after option 2), but:
     a. I am worried about the performance impact of such large events.
     b. It didn't work well: the events it writes appear to be trimmed, and it 
also keeps writing them endlessly.

2. Fluming the events bzipped (not decompressing them on the endpoint) to 
S3, and decompressing them in EMR later. In that case:
    a. In what format should I store the events?
    b. How would I enrich the data with the request headers?
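For the record, this is what I changed for option 1. I put it in flume-site.xml 
on both nodes; the value is just the one I picked for testing, to leave some 
headroom over the ~72KB events:

    <property>
      <name>flume.event.max.size.bytes</name>
      <value>131072</value>
    </property>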


Thanks for your time.



Guy Doulberg