Posted to user@flume.apache.org by Connor Woodson <cw...@gmail.com> on 2013/05/01 03:48:47 UTC

Re: Programmatically write files into HDFS with Flume

If you just want to write data to HDFS, then Flume might not be the best
tool to use; however, there is a Flume Embedded Agent
<https://github.com/apache/flume/blob/trunk/flume-ng-doc/sphinx/FlumeDeveloperGuide.rst#embedded-agent>
that will embed Flume into your application. I don't believe it works with
the HDFS sink yet, but some tinkering can likely make it work.
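
To give a rough idea, here is a minimal, untested sketch of what the
embedded agent looks like in code (assuming a Flume version that ships the
embedded agent; the agent name, host, and port are made up, and the
downstream agent that actually runs the HDFS sink would listen on an Avro
source at that address):

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.agent.embedded.EmbeddedAgent;
    import org.apache.flume.event.EventBuilder;

    public class EmbeddedAgentExample {
        public static void main(String[] args) throws EventDeliveryException {
            // The embedded agent can only ship events to Avro sinks, so the
            // HDFS sink has to live in a separate, downstream Flume agent.
            Map<String, String> conf = new HashMap<String, String>();
            conf.put("channel.type", "memory");
            conf.put("channel.capacity", "10000");
            conf.put("sinks", "sink1");
            conf.put("sink1.type", "avro");
            conf.put("sink1.hostname", "collector-host"); // hypothetical downstream agent
            conf.put("sink1.port", "4545");
            conf.put("processor.type", "default");

            EmbeddedAgent agent = new EmbeddedAgent("myagent");
            agent.configure(conf);
            agent.start();
            try {
                Event event = EventBuilder.withBody("hello hdfs", StandardCharsets.UTF_8);
                agent.put(event); // handed off to the channel/sink transactionally
            } finally {
                agent.stop();
            }
        }
    }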

- Connor


On Tue, Apr 30, 2013 at 11:00 AM, Chen Song <ch...@gmail.com> wrote:

> I am looking at options for Java programs that can write files into HDFS
> with the following requirements.
>
> 1) Transaction support: each file, when being written, is either fully
> written successfully or fails completely, without any partial file blocks
> being written.
>
> 2) Compression support/file formats: the compression type or file format
> can be specified when writing contents.
>
> I know how to write data into a file on HDFS by opening an
> FSDataOutputStream, as shown here <http://stackoverflow.com/questions/13457934/writing-to-a-file-in-hdfs-in-hadoop>.
> I am just wondering if there are libraries or out-of-the-box solutions
> that provide the support I mentioned above.
>
> I stumbled upon Flume, whose HDFS sink supports transactions, compression,
> file rotation, etc., but it doesn't seem to provide an API that can be
> used as a library. The features Flume provides are highly coupled with the
> Flume architectural components (source, channel, and sink) and don't seem
> to be usable independently. All I need is the HDFS loading part.
>
> Does anyone have some good suggestions?
>
> --
> Chen Song
>
>
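
For the record, if you end up going the plain HDFS API route instead, the
usual trick for requirement 1 is to write to a temporary path and rename it
into place only after a successful close (a rename within a single HDFS
filesystem is atomic), and requirement 2 can be covered with Hadoop's
compression codecs. A rough, untested sketch; the paths and the choice of
gzip are only for illustration:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class AtomicHdfsWriter {

        // Write to a hidden temp file first, then rename into place, so
        // readers never see a partially written target file.
        public static void write(FileSystem fs, Path target, byte[] payload) throws IOException {
            Path tmp = new Path(target.getParent(), "." + target.getName() + ".tmp");
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, fs.getConf());
            try (OutputStream out = codec.createOutputStream(fs.create(tmp, true))) {
                out.write(payload);
            } catch (IOException e) {
                fs.delete(tmp, false); // drop the partial temp file on failure
                throw e;
            }
            if (!fs.rename(tmp, target)) {
                throw new IOException("could not rename " + tmp + " to " + target);
            }
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            write(fs, new Path("/data/example/part-0000.gz"),
                    "some content\n".getBytes(StandardCharsets.UTF_8));
        }
    }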

Re: Programmatically write files into HDFS with Flume

Posted by Roshan Naik <ro...@hortonworks.com>.
At the moment it appears that will be the case with FLUME-1734 when it
materializes. With the HDFS sink, try setting its roll count to the same
value as its batch size. You will need these counts to be large enough to
avoid creating too many small files on HDFS; you could consider
post-processing to concatenate the files into larger ones. As your batch
size increases, Flume will need to hold on to all the events in a
transaction until it is committed, so you may want to consider using the
file channel (instead of the memory channel) to alleviate capacity issues.
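
As a concrete (made-up) example, something along these lines keeps the roll
count and batch size aligned and puts a file channel behind the HDFS sink;
the names, paths, and sizes are only placeholders:

    agent1.channels.ch1.type = file
    agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
    agent1.channels.ch1.dataDirs = /var/flume/data

    agent1.sinks.k1.type = hdfs
    agent1.sinks.k1.channel = ch1
    agent1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    agent1.sinks.k1.hdfs.fileType = CompressedStream
    agent1.sinks.k1.hdfs.codeC = gzip
    agent1.sinks.k1.hdfs.rollSize = 0
    agent1.sinks.k1.hdfs.rollInterval = 0
    agent1.sinks.k1.hdfs.rollCount = 10000
    agent1.sinks.k1.hdfs.batchSize = 10000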



Re: Programmatically write files into HDFS with Flume

Posted by Chen Song <ch...@gmail.com>.
Thanks for your feedback.

We have components that do data ingestion, effectively what Flume sources
do. That component collects data and dumps it into files on boxes. The data
files are then loaded into HDFS via 'hadoop fs -put' by a simple script. We
want to build a resilient, long-lived service in Java to load files into
HDFS. That is how I came to know about Flume.

I understand that Flume manages its transactions as events, not physical
files. Is it possible to map files to logical events, and thus achieve
atomic writes?
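
Something like this rough, untested sketch is what I have in mind, one
whole file per event (assuming an already configured and started
EmbeddedAgent, and files small enough to buffer in memory):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.agent.embedded.EmbeddedAgent;
    import org.apache.flume.event.EventBuilder;

    public class FileAsEvent {
        // Send the entire file as the body of a single Flume event.
        static void sendFile(EmbeddedAgent agent, Path file)
                throws IOException, EventDeliveryException {
            byte[] body = Files.readAllBytes(file);
            Event event = EventBuilder.withBody(body);
            event.getHeaders().put("filename", file.getFileName().toString());
            agent.put(event); // one file = one event = one put
        }
    }

I realize this only controls the producer side; the HDFS sink would still
batch events into files on its own schedule, so it may not give file-level
atomicity by itself.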




-- 
Chen Song

Re: Programmatically write files into HDFS with Flume

Posted by Roshan Naik <ro...@hortonworks.com>.
Are you sure you want to write directly to HDFS from the app that is
generating the data? Often in production, apps like web servers do not have
direct access to HDFS. I am not sure that the HDFS sink guarantees 'either
fully written successfully or failed totally without any partial file
blocks written', since each transaction does not translate into a separate
file. So I think there could be some partially written transactions in case
of a transaction abort.

This level of all-or-none support at the file level is planned for what is
currently referred to as the HCatalog sink:
https://issues.apache.org/jira/browse/FLUME-1734

-roshan

