Posted to user@flume.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2012/10/15 02:49:52 UTC

Flume for multi KB or MB docs?

Hi,

We're considering using Flume for transport of potentially large "documents" (think documents that can be as small as tweets or as large as PDF files).

I'm wondering if Flume is suitable for transporting potentially large documents (in the most reliable mode, too) or if there is something inherent in Flume that makes it a poor choice for this use case?

Thanks,
Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 

Re: Flume for multi KB or MB docs?

Posted by Mike Percy <mp...@apache.org>.
Otis,
Yes, those are my concerns, but 10MB might be OK. You will have to tune your
batch sizes to a lower range and watch out for GC, but if you give the
process enough RAM it should work.
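
The knobs I have in mind are roughly these (a sketch only -- the agent and
component names are made up, and the numbers are just starting points to
tune from, not recommendations):

# flume.conf fragment for a hypothetical agent "a1"
# keep fewer events buffered so multi-MB bodies don't pile up on the heap
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 10

# smaller sink batches mean fewer large events per transaction
a1.sinks.k1.hdfs.batchSize = 10

# flume-env.sh: give the JVM enough headroom for the large event bodies
JAVA_OPTS="-Xms1g -Xmx4g"

With 5MB worst-case documents, even a transaction of 10 events is ~50MB in
flight, which is why the heap setting matters as much as the batch size.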

If you go that route, please let us know how it goes!

Regards
Mike

On Mon, Oct 15, 2012 at 8:14 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hi Mike,
>
> Thanks for the info!  Our docs, however, are not quite 100MB - more like
> 5MB max and most of the time under 10KB.  Would you still say Flume is not
> the right tool for the job?  If so, what is the main concern?  Is it about
> the number of documents Flume will keep in memory at any one time and thus
> require a potentially large heap and still risk OOMing?  Or is the main
> concern that writing such "large" documents to disk will be slow?
>
> My documents need to end up in Solr or ElasticSearch and maybe also in
> HDFS, so I was hoping I could get ES and HDFS sinks from Flume for free.
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase -
> http://sematext.com/spm
>

Re: Flume for multi KB or MB docs?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Mike,

Thanks for the info!  Our docs, however, are not quite 100MB - more like 5MB max and most of the time under 10KB.  Would you still say Flume is not the right tool for the job?  If so, what is the main concern?  Is it about the number of documents Flume will keep in memory at any one time and thus require a potentially large heap and still risk OOMing?  Or is the main concern that writing such "large" documents to disk will be slow?

My documents need to end up in Solr or ElasticSearch and maybe also in HDFS, so I was hoping I could get ES and HDFS sinks from Flume for free.
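
Concretely, the kind of fan-out I'm hoping for would look roughly like this
(just a sketch -- names and paths are made up, I'm assuming an elasticsearch
sink type is available to the agent, which may need a newer Flume or an
extra jar, and the file channels are there for the "most reliable mode"
part):

# hypothetical agent fanning the same events out to HDFS and ElasticSearch
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
# the default replicating selector writes each event to both channels
a1.sources.r1.channels = c1 c2

# file channels persist events to disk, so nothing is lost on agent restart
a1.channels.c1.type = file
a1.channels.c2.type = file

# HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/docs
a1.sinks.k1.hdfs.fileType = DataStream

# ElasticSearch sink
a1.sinks.k2.type = elasticsearch
a1.sinks.k2.channel = c2
a1.sinks.k2.hostNames = es-host:9300
a1.sinks.k2.indexName = docs
a1.sinks.k2.clusterName = elasticsearch

Solr would presumably still need a custom sink on top of that.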

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 




Re: Flume for multi KB or MB docs?

Posted by Mike Percy <mp...@apache.org>.
Hi Otis,
Flume was designed as a streaming event transport system, not as a
general-purpose file transfer system. The two have quite different
characteristics, so while binary files could be transported by Flume, if
you tried to transport a 100MB PDF as a single event you may have issues
around memory allocation, GC, transfer speed, etc., since we hold at least
one event at a time in memory. However, if you want to transfer a large log
file and each line is an event, then it's a perfect use case, because you
care about the individual events more than the file itself.
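
To make the memory point concrete, a client pushing one document over Avro
RPC ends up doing roughly this (a sketch only; the class name, host, port,
and file path are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class SendDocument {
    public static void main(String[] args) throws Exception {
        // the whole document becomes a single byte[]: a 100MB PDF means a
        // 100MB event body on the client heap, and again on the agent heap
        // while the event sits in a channel
        byte[] body = Files.readAllBytes(Paths.get("/tmp/some-document.pdf"));

        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody(body);
            client.append(event);  // one event == the entire file
        } catch (EventDeliveryException e) {
            // not delivered; the caller has to retry or give up
            e.printStackTrace();
        } finally {
            client.close();
        }
    }
}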

For transferring very large binary files that are not events or records,
you may want to look for something that is good at being a single-hop
system with resume capability, like rsync, to transfer the files. Then I
suppose you could use the hadoop fs shell and a small script to store the
data in HDFS. You probably wouldn't need all the fancy tagging, routing,
and serialization features that Flume has.
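
i.e. something as small as this (hosts and paths are placeholders):

#!/bin/sh
# rsync can resume partial transfers of the big files to a staging box
rsync -av --partial remote-host:/data/docs/ /staging/docs/

# then push them into HDFS with the hadoop fs shell
hadoop fs -mkdir /archive/docs
hadoop fs -put /staging/docs/* /archive/docs/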

Hope this helps.

Regards
Mike
