Posted to dev@flume.apache.org by "Ashish Paliwal (JIRA)" <ji...@apache.org> on 2014/04/25 15:32:15 UTC

[jira] [Resolved] (FLUME-2364) netcat source and HDFS sink. Performance problem

     [ https://issues.apache.org/jira/browse/FLUME-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Paliwal resolved FLUME-2364.
-----------------------------------

       Resolution: Invalid
    Fix Version/s: v1.5.0

User ML question

> netcat source and HDFS sink. Performance problem
> ------------------------------------------------
>
>                 Key: FLUME-2364
>                 URL: https://issues.apache.org/jira/browse/FLUME-2364
>             Project: Flume
>          Issue Type: Test
>          Components: Configuration
>            Reporter: Praveen
>             Fix For: v1.5.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> 1. We have a csv file, size ~ 1GB
> 2. We tried to store it to HDFS using hadoop fs -put. It took ~10 seconds.
> 3. We tried Flume 1.2 with a netcat source and an HDFS sink and hit a serious performance problem: it takes ~20 minutes to store the file. Also, the HDFS sink doesn't store it as a single file; it creates a lot of files, each ~2 MB in size.
> Our goal is: 
> 1. Send CSV files to HDFS: we send file a1.csv to Flume and get a1.csv in HDFS.
> 2. We send these files one by one.
> 3. We want the HDFS sink to close the file after it has been received.
> Here is our configuration:
> httpptpt.sources = httpptpt_src
> httpptpt.channels = httpptpt_channel
> httpptpt.sinks = httpptpt_sink
> # sources
> httpptpt.sources.httpptpt_src.type = netcat
> httpptpt.sources.httpptpt_src.bind = 10.66.48.23
> httpptpt.sources.httpptpt_src.port = 6969
> httpptpt.sources.httpptpt_src.ack-every-event = false
> #default size is 512B
> #httpptpt.sources.httpptpt_src.max-line-length = 4096 
> httpptpt.sources.httpptpt_src.channels = httpptpt_channel
> # channel
> httpptpt.channels.httpptpt_channel.type = memory
> #Seems like we don't understand how it works :( With default values (capacity=100, transaction capacity=100) it doesn't work: the memory channel has no room for storing incoming lines
> #httpptpt.channels.httpptpt_channel.capacity = 100000
> #httpptpt.channels.httpptpt_channel.transactionCapacity = 1000
> #Default is 3 sec
> #httpptpt.channels.httpptpt_channel.keep-alive = 1 
> # sink
> httpptpt.sinks.httpptpt_sink.channel = httpptpt_channel
> httpptpt.sinks.httpptpt_sink.type = hdfs
> httpptpt.sinks.httpptpt_sink.hdfs.path = hdfs://10.66.48.23/user/httpptpt/
> httpptpt.sinks.httpptpt_sink.hdfs.fileType = DataStream
> httpptpt.sinks.httpptpt_sink.hdfs.writeFormat = Writable
> httpptpt.sinks.httpptpt_sink.hdfs.filePrefix = httpptpt
> httpptpt.sinks.httpptpt_sink.hdfs.threadsPoolSize = 10
> #We want the HDFS sink to roll the temp file after the source stops emitting lines
> #httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 10485760000 
> httpptpt.sinks.httpptpt_sink.hdfs.rollSize = 0
> #httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 6000000
> httpptpt.sinks.httpptpt_sink.hdfs.rollCount = 0
> httpptpt.sinks.httpptpt_sink.hdfs.rollInterval = 0
> #??? If the source doesn't emit messages for 10 seconds, then roll the file
> httpptpt.sinks.httpptpt_sink.hdfs.idleTimeout = 10
> What are we doing wrong?



--
This message was sent by Atlassian JIRA
(v6.2#6252)