Posted to dev@flume.apache.org by "Jeff Field (JIRA)" <ji...@apache.org> on 2016/07/07 23:21:11 UTC
[jira] [Commented] (FLUME-2458) Separate hdfs tmp directory for flume hdfs sink

    [ https://issues.apache.org/jira/browse/FLUME-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366958#comment-15366958 ] 

Jeff Field commented on FLUME-2458:
-----------------------------------

We're currently running into a problem with snapshot-diff-based distcp: it doesn't correctly copy Flume's tmp files even after they've been finalized. It renames them, but they contain only the blocks they had at the moment of the copy. A separate tmp directory could solve that, along with a number of other problems we've run into related to Flume fouling its own nest. We can't use the workaround because our paths are more complex and are built from the headers we get from RabbitMQ.

We currently do the following to name our files (and Flume happily creates any directories in the path that don't exist):
{code}
tier1.sinks.bi_prod_sink.hdfs.path = /DW/App/%{routing_key}/%{x-proto-message-type}%{fileName}%{proto}%{message_id}/%Y%m%d
tier1.sinks.bi_prod_sink.hdfs.filePrefix = $H-
tier1.sinks.bi_prod_sink.hdfs.fileSuffix = .txt
{code}

We'd need to do something like this, which I think won't work for the reason Harsh mentioned:
{code}
tier1.sinks.bi_prod_sink.hdfs.path = /DW/
tier1.sinks.bi_prod_sink.hdfs.filePrefix = App/%{routing_key}/%{x-proto-message-type}%{fileName}%{proto}%{message_id}/%Y%m%d/$H-
tier1.sinks.bi_prod_sink.hdfs.fileSuffix = .txt
tier1.sinks.bi_prod_sink.hdfs.inUsePrefix = tmp/%{routing_key}/%{x-proto-message-type}%{fileName}%{proto}%{message_id}/%Y%m%d/$H-
{code}

Which is why I would prefer a temp path. If someone sees a better way to use the existing parameters to accomplish this without having to change how I land the final file, that'd work too.
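
To make the concern with the second configuration concrete, here is a minimal Python sketch of how the in-use and final paths might be composed. It assumes the sink builds the in-use path as hdfs.path + "/" + inUsePrefix + file name + inUseSuffix, and the final path as hdfs.path + "/" + file name, where the file name is filePrefix + "." + counter + fileSuffix. That composition rule, the function name, and the sample values are assumptions for illustration and are not taken from this thread.

```python
def in_use_and_final_paths(directory, file_prefix, counter, file_suffix,
                           in_use_prefix="", in_use_suffix=".tmp"):
    """Return (in_use_path, final_path) under the ASSUMED composition rule.

    Assumption: the in-use prefix is prepended to the whole file name,
    including any directory components carried inside file_prefix.
    """
    directory = directory.rstrip("/")
    full_name = f"{file_prefix}.{counter}{file_suffix}"
    in_use = f"{directory}/{in_use_prefix}{full_name}{in_use_suffix}"
    final = f"{directory}/{full_name}"
    return in_use, final


# Hypothetical, already-expanded values standing in for the escape sequences.
in_use, final = in_use_and_final_paths(
    directory="/DW/",
    file_prefix="App/rk/20160707/H-",
    counter=1,
    file_suffix=".txt",
    in_use_prefix="tmp/rk/20160707/H-",
)
print(in_use)
print(final)
```

Under that assumption the in-use file does not land in a clean tmp tree: the directory components carried in filePrefix are repeated after the inUsePrefix, so the temporary path nests the final layout beneath the tmp layout instead of mirroring it.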

> Separate hdfs tmp directory for flume hdfs sink
> -----------------------------------------------
>
>                 Key: FLUME-2458
>                 URL: https://issues.apache.org/jira/browse/FLUME-2458
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>    Affects Versions: v1.5.0.1
>            Reporter: Sverre Bakke
>            Assignee: Neerja Khattar
>            Priority: Minor
>         Attachments: FLUME-2458.patch, patch-2458.txt
>
>
> The current HDFS sink will write temporary files to the same directory as the final file will be stored. This is a problem for several reasons:
> 1) File moving
> When MapReduce fetches a list of files to be processed and some of those files are gone by the time it processes them (i.e. they were renamed from .tmp to whatever final name they are supposed to have), the MapReduce job will crash.
> 2) File type
> When MapReduce decides how to process a file, it looks at the file's extension. If the file is compressed, it will decompress it for you. If the file has a .tmp extension (in the same folder), it will treat a compressed file as an uncompressed file, thus breaking the results of the MapReduce job.
> I propose that the sink gets an optional tmp path for storing these files to avoid these issues.


