Posted to user@flume.apache.org by Riccardo Carè <ri...@gmail.com> on 2015/01/15 11:40:00 UTC

Moving binary files from spooldir source to HDFS sink

Hello,

I am new to Flume and I am trying to experiment with it by moving binary
files across two agents.

 - The first agent runs on machine A and uses a spooldir source and a
thrift sink.
 - The second agent runs on machine B, which is part of a Hadoop cluster.
It has a thrift source and an HDFS sink.
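For reference, here is a minimal sketch of what I have so far (the agent
names, spool directory, hostname, port, and HDFS path below are just
placeholders):

# Agent on machine A: spooldir -> thrift
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/spool/flume
a1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.src1.channels = ch1

a1.channels.ch1.type = memory

a1.sinks.snk1.type = thrift
a1.sinks.snk1.hostname = machine-b.example.com
a1.sinks.snk1.port = 4545
a1.sinks.snk1.channel = ch1

# Agent on machine B: thrift -> HDFS
a2.sources = src2
a2.channels = ch2
a2.sinks = snk2

a2.sources.src2.type = thrift
a2.sources.src2.bind = 0.0.0.0
a2.sources.src2.port = 4545
a2.sources.src2.channels = ch2

a2.channels.ch2.type = memory

a2.sinks.snk2.type = hdfs
a2.sinks.snk2.hdfs.path = hdfs://namenode/flume/binary
a2.sinks.snk2.channel = ch2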

I have two questions for this configuration:
 - I know I have to use the BlobDeserializer$Builder for the source on A,
but what is the correct value for the maxBlobLength parameter? Should it
be smaller or larger than the expected size of the binary file?
 - I ran some tests and found that the transmitted file arrived corrupted
on HDFS. I think this is caused by the HDFS sink, which uses TEXT as its
default serializer (I assume it writes a \n character between one event
and the next). How can I fix this?

Thank you very much in advance.

Best regards,
Riccardo

Re: Moving binary files from spooldir source to HDFS sink

Posted by Joey Echeverria <jo...@cloudera.com>.
This is something of an anti-pattern for Flume, though it is possible.

You need to set the maxBlobLength to something larger than your largest file.
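For example, on the spooling agent (reusing the names from your post;
the 200 MB figure is only an illustration, pick a value comfortably
above your largest file):

a1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
a1.sources.src1.deserializer.maxBlobLength = 200000000

Keep in mind that each file becomes a single event buffered entirely in
memory, so your largest file also has to fit within the agent's heap.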

You need a custom serializer
(org.apache.flume.serialization.EventSerializer$Builder) to keep the
files as binary.
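
A minimal sketch of what that could look like (the package and class
names here are mine; the interface and its Builder are the stock Flume
ones). It writes each event body verbatim, with no newline or other
delimiter between events:

package com.example.flume;

import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

/** Writes raw event bodies with no delimiters, headers, or footers. */
public class RawBodySerializer implements EventSerializer {

  private final OutputStream out;

  private RawBodySerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException {
    // no file header to write
  }

  @Override
  public void afterReopen() throws IOException {
    // nothing to restore when appending to an existing file
  }

  @Override
  public void write(Event event) throws IOException {
    // each file is one event body; copy the bytes through untouched
    out.write(event.getBody());
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void beforeClose() throws IOException {
    // no file footer to write
  }

  @Override
  public boolean supportsReopen() {
    // safe to reopen since we keep no state and write no header
    return true;
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new RawBodySerializer(out);
    }
  }
}

Then point the HDFS sink at it, and make sure the sink writes a plain
stream instead of its default SequenceFile container:

a2.sinks.snk2.hdfs.fileType = DataStream
a2.sinks.snk2.serializer = com.example.flume.RawBodySerializer$Builder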

An easier solution would be to use Apache NiFi (incubating) which is
designed for file-based data flow and has support for writing binary
files to HDFS.

-Joey

-- 
Joey Echeverria