Posted to user@flume.apache.org by Gaurav Khanna <kh...@yahoo.com> on 2012/01/19 22:08:31 UTC

Collector Sink writing to HDFS

Hi, 

A newbie question - perhaps it has already been answered, in which case I apologize in advance.

collectorSink("/tmp/bb/%H00/", "%{host}-")

I used the above collector sink to read one file (around 78 MB), and it wrote 4 files into the /tmp/bb folder. Why does the collector sink do that, and what is the rationale behind it?

Thanks
Gaurav 


Gaurav Khanna

Re: Collector Sink writing to HDFS

Posted by Gaurav Khanna <kh...@yahoo.com>.
Thanks, Zijad. I understand that now.



________________________________
 From: Zijad Purkovic <zi...@gmail.com>
To: flume-user@incubator.apache.org; Gaurav Khanna <kh...@yahoo.com> 
Sent: Thursday, January 19, 2012 1:54 PM
Subject: Re: Collector Sink writing to HDFS
 
If you're using the default flume-site.xml, it will open a new file every
30 seconds for writing to HDFS. So if your file takes longer than that
to read, send to the collector, acknowledge, and write to HDFS, you're going
to end up with more than one file on HDFS.

On Thu, Jan 19, 2012 at 10:08 PM, Gaurav Khanna <kh...@yahoo.com> wrote:
> Hi,
> A newbie question - perhaps it has already been answered and so apologize in
> advance in that case.
>
> collectorSink("/tmp/bb/%H00/", "%{host}-")
>
> Used the above collector sink to read one file (around 78 M) and it wrote 4
> files into the /tmp/bb folder. I was wondering why that was done by the
> collector sink and what is the rationale behind that?
>
> Thanks
> Gaurav
>
> Gaurav Khanna



-- 
Zijad Purković

Re: Collector Sink writing to HDFS

Posted by Gaurav Khanna <kh...@yahoo.com>.
So the file name in HDFS (for the collector sink example given earlier) is:
POC_Hadoop_Client_1-20120119-160450707-0600.604742529713813.00000021

The following makes sense: 

POC_Hadoop_Client_1-20120119

But I do not understand what comes after it (-160450707-0600.604742529713813.00000021). Also, if one file is split into multiple files in HDFS, how do I know the chronology (i.e., which part comes first, which one holds the header, and which one holds the end of that file)?

Thanks
Gaurav
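
On the chronology question, one possible approach is to sort the parts by the fields embedded in their names. The sketch below is only that: reading the suffix as date, time of day with milliseconds, UTC offset, a numeric counter, and a numeric id is a guess from the shape of the name quoted above, not something confirmed in this thread. The helper names parse_name and chronological are hypothetical, and the second file name in the usage note is invented for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout (a guess, not confirmed by the thread):
#   <prefix><yyyyMMdd>-<HHmmss><SSS><+/-zzzz>.<counter>.<id>
NAME_RE = re.compile(
    r"^(?P<prefix>.*)(?P<date>\d{8})-(?P<hms>\d{6})(?P<millis>\d{3})"
    r"(?P<tz>[+-]\d{4})\.(?P<counter>\d+)\.(?P<ident>\d+)$"
)


def parse_name(name):
    """Return (open time, counter, id) for a file name in the assumed layout."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError("unexpected file name layout: %r" % name)
    # Parse date, time of day, and UTC offset, then add the milliseconds.
    opened = datetime.strptime(
        m.group("date") + m.group("hms") + m.group("tz"), "%Y%m%d%H%M%S%z"
    )
    opened += timedelta(milliseconds=int(m.group("millis")))
    return opened, int(m.group("counter")), int(m.group("ident"))


def chronological(names):
    """Sort file names by embedded timestamp, then counter, then id."""
    return sorted(names, key=parse_name)
```

For example, chronological() called on the name quoted above together with a made-up sibling such as POC_Hadoop_Client_1-20120119-160520707-0600.604772529713813.00000021 would place the quoted name first, since its embedded time (16:04:50.707) is earlier. If that reading of the suffix holds, the earliest part would carry the start of the original file and the latest part its end.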







Re: Collector Sink writing to HDFS

Posted by Zijad Purkovic <zi...@gmail.com>.
If you're using the default flume-site.xml, it will open a new file every
30 seconds for writing to HDFS. So if your file takes longer than that
to read, send to the collector, acknowledge, and write to HDFS, you're going
to end up with more than one file on HDFS.

On Thu, Jan 19, 2012 at 10:08 PM, Gaurav Khanna <kh...@yahoo.com> wrote:
> Hi,
> A newbie question - perhaps it has already been answered and so apologize in
> advance in that case.
>
> collectorSink("/tmp/bb/%H00/", "%{host}-")
>
> Used the above collector sink to read one file (around 78 M) and it wrote 4
> files into the /tmp/bb folder. I was wondering why that was done by the
> collector sink and what is the rationale behind that?
>
> Thanks
> Gaurav
>
> Gaurav Khanna



-- 
Zijad Purković