Posted to dev@flume.apache.org by Venkateswarlu Danda <Ve...@lntinfotech.com> on 2013/04/23 07:17:00 UTC

spool directory configuration problem

Hello

I am generating files continuously in a local folder on my machine. How can I use Flume to stream the generated files from that local folder to HDFS?

I have written a configuration, but it is running into some issues; please share a sample configuration.

This is my configuration file:

agents.sources=spooldir-source
agents.sinks=hdfs-sink
agents.channels=ch1

agents.sources.spooldir-source.type=spooldir
agents.sources.spooldir-source.spoolDir=/apache-tomcat-7.0.39/logs/MultiThreadLogs
agents.sources.spooldir-source.fileSuffix=.SPOOL
agents.sources.spooldir-source.fileHeader=true
agents.sources.spooldir-source.bufferMaxLineLength=50000

agents.sinks.hdfs-sink.type=hdfs
agents.sinks.hdfs-sink.hdfs.path=hdfs://cloudx-740-677:54300/multipleFiles/
agents.sinks.hdfs-sink.hdfs.rollSize=12553700
agents.sinks.hdfs-sink.hdfs.rollCount=12553665
agents.sinks.hdfs-sink.hdfs.rollInterval=3000
agents.sinks.hdfs-sink.hdfs.fileType=DataStream
agents.sinks.hdfs-sink.hdfs.writeFormat=Text

agents.channels.ch1.type=file

agents.sources.spooldir-source.channels=ch1
agents.sinks.hdfs-sink.channel=ch1
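For reference, an agent using a file like this is typically started with the stock flume-ng script; the -n value must match the property prefix above ("agents"), and the config file name here is only an example:

```
bin/flume-ng agent -n agents -c conf -f conf/spooldir-hdfs.conf \
    -Dflume.root.logger=INFO,console
```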



If I add a large file (10 MB), I get this error:


13/04/18 16:11:21 ERROR source.SpoolDirectorySource: Uncaught exception in Runnable
java.lang.IllegalStateException: File has been modified since being read: /apache-tomcat-7.0.39/logs/MultiThreadLogs/log_0.txt
        at org.apache.flume.client.avro.SpoolingFileLineReader.retireCurrentFile(SpoolingFileLineReader.java:237)
        at org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:185)
        at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:135)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
13/04/18 16:11:21 ERROR source.SpoolDirectorySource: Uncaught exception in Runnable
java.io.IOException: Stream closed
        at java.io.BufferedReader.ensureOpen(BufferedReader.java:115)
        at java.io.BufferedReader.readLine(BufferedReader.java:310)
        at java.io.BufferedReader.readLine(BufferedReader.java:382)
        at org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:180)
        at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:135)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)


If I increase "bufferMaxLineLength", I get:

java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedReader.<init>(BufferedReader.java:98)
        at org.apache.flume.client.avro.SpoolingFileLineReader.getNextFile(SpoolingFileLineReader.java:322)
        at org.apache.flume.client.avro.SpoolingFileLineReader.readLines(SpoolingFileLineReader.java:172)
        at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:135)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)


Thanks & Regards
Venkat.D


-----Original Message-----
From: Venkatesh Sivasubramanian (JIRA) [mailto:jira@apache.org]
Sent: Tuesday, April 23, 2013 8:57 AM
To: dev@flume.apache.org
Subject: [jira] [Comment Edited] (FLUME-1819) ExecSource don't flush the cache if there is no input entries


    [ https://issues.apache.org/jira/browse/FLUME-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638744#comment-13638744 ]

Venkatesh Sivasubramanian edited comment on FLUME-1819 at 4/23/13 3:27 AM:
---------------------------------------------------------------------------

Yes Hari, let me take a stab. Will keep you posted. thanks!


      was (Author: venkyz):
    Yes Hari, let me take a stab. Will keep you posted.


> ExecSource don't flush the cache if there is no input entries
> -------------------------------------------------------------
>
>                 Key: FLUME-1819
>                 URL: https://issues.apache.org/jira/browse/FLUME-1819
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.3.0
>            Reporter: Fengdong Yu
>            Assignee: Venkatesh Sivasubramanian
>             Fix For: v1.4.0
>
>         Attachments: FLUME-1819.patch, FLUME-1819.patch.1
>
>
> ExecSource has a default batchSize of 20: the exec source reads data from its source and puts it into a cache, and once the cache is full it pushes the batch to the channel.
> But if the exec source's cache is not full and there is no input for a long time, those entries stay in the cache indefinitely; they never reach the channel until the cache fills.
> So the patch adds a new config option for ExecSource, batchTimeout (default 3 seconds): when batchTimeout is exceeded, all cached data is pushed to the channel even if the cache is not full.
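For illustration only (this fragment is not part of the ticket, and the agent/source names and command are hypothetical), an ExecSource using the new option might be configured like this, with batchTimeout given in milliseconds:

```
agent.sources = tail-source
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/app.log
agent.sources.tail-source.batchSize = 20
# Flush whatever is cached if no full batch forms within the timeout
# (per FLUME-1819; the described default is 3 seconds).
agent.sources.tail-source.batchTimeout = 3000
agent.sources.tail-source.channels = ch1
```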

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira

The contents of this e-mail and any attachment(s) may contain confidential or privileged information for the intended recipient(s). Unintended recipients are prohibited from taking action on the basis of information in this e-mail and using or disseminating the information, and must notify the sender and delete it from their system. L&T Infotech will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail.

Re: spool directory configuration problem

Posted by Israel Ekpo <is...@aicer.org>.
Hello Venkat,

Your question is more appropriate for the user mailing list, so I have
changed the list in this reply.

Going forward, you can use the following as a guide when sending emails to
the lists:

For questions about how to use or configure Apache Flume, or if you are
experiencing issues using it, please write to the user mailing list
(user@flume.apache.org).

For questions about API internals, patches, and code reviews, the
developer mailing list (dev@flume.apache.org) is the right place.

Coming back to the issue you reported, I have had this problem before in my
early days with Flume.

The root cause of your problem is in the log files you included in your
message.

You cannot set the spooling directory to one where the files are constantly
being updated.

If a file is modified after it has been picked up by Flume from the
spooling directory, you will encounter an exception.

You can check out the user guide for more info.

http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source

From the user guide, the SpoolingDirectorySource expects that only
immutable, uniquely named files are dropped in the spooling directory. If
duplicate names are used, or files are modified while being read, the
source will fail with an error message. For some use cases this may require
adding unique identifiers (such as a timestamp) to log file names when they
are copied into the spooling directory.
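A minimal sketch of that pattern (paths are hypothetical, mktemp standing in for real directories such as /apache-tomcat-7.0.39/logs/...): the application finishes writing a file in a staging directory, then moves it into the spool directory under a unique, timestamped name.

```shell
#!/bin/sh
# Write-then-rename pattern for feeding a Flume spooldir source.
STAGING=$(mktemp -d)   # where the application writes files as they grow
SPOOL=$(mktemp -d)     # the directory Flume's spooldir source watches

# 1. Finish writing the file outside the spool directory.
printf 'log line 1\nlog line 2\n' > "$STAGING/log_0.txt"

# 2. Move it in under a unique, timestamped name. mv is atomic when both
#    directories are on the same filesystem, so Flume never sees a
#    half-written or later-modified file.
TS=$(date +%Y%m%d%H%M%S)
mv "$STAGING/log_0.txt" "$SPOOL/log_0.$TS.txt"

ls "$SPOOL"
```

Files delivered this way also must not be touched again afterwards: once ingested, the source renames them with the configured fileSuffix (.SPOOL in the config above).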


