Posted to mapreduce-user@hadoop.apache.org by Kaalu Singh <ka...@gmail.com> on 2014/01/22 23:52:01 UTC

Question about Flume

Hi,

I have the following use case:

I have data files being generated frequently on a certain machine, X. The
only way I can bring them into my Hadoop cluster is to SFTP in at certain
intervals, fetch the files, and land them in HDFS.

I am new to Hadoop and to Flume. I have read up on Flume, and the framework
seems appropriate for something like this, although I did not see an
available 'source' that does exactly what I am looking for. The lack of a
'source' plugin is not a deal breaker, since I can write one, but first I
want to make sure this is the right way to go. So, my questions are:

1. What are the pros/cons of using Flume for this use case?
2. Does anybody know of a source plugin that does what I am looking for?
3. Does anybody think I should not use Flume and instead write my own
application to achieve this use case?

Thanks
KS

Re: Question about Flume

Posted by Olivier Renault <or...@hortonworks.com>.
You could also consider using WebHDFS instead of SFTP/Flume. WebHDFS is a
REST API that lets you copy data directly into HDFS.

Regards,
Olivier
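As a sketch of the WebHDFS route: file creation is a two-step REST call, where the NameNode redirects the client to a DataNode that receives the bytes. The host name, port, and paths below are illustrative, not taken from the thread (50070 was the default NameNode HTTP port in Hadoop 1.x/2.x):

```shell
# Illustrative WebHDFS upload; NAMENODE and all paths are made-up values.
NAMENODE="${NAMENODE:-namenode.example.com:50070}"

# Build the WebHDFS CREATE URL for an absolute HDFS path.
webhdfs_create_url() {
    printf 'http://%s/webhdfs/v1%s?op=CREATE&overwrite=true' "$NAMENODE" "$1"
}

if [ "${DO_UPLOAD:-0}" = "1" ]; then
    # Step 1: ask the NameNode where to write; it answers with a
    # "307 Temporary Redirect" whose Location header names a DataNode.
    datanode_url=$(curl -s -i -X PUT "$(webhdfs_create_url /data/incoming/data.csv)" \
        | awk 'tolower($1) == "location:" {print $2}' | tr -d '\r')
    # Step 2: stream the local file to that DataNode URL.
    curl -s -X PUT -T /tmp/data.csv "$datanode_url"
fi
```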


On 23 January 2014 05:25, sudhakara st <su...@gmail.com> wrote:

> Hello Kaalu Singh,
>
> Flume is a good match for your requirement. First, define the storage
> structure of the data in HDFS and how you are going to process it once it
> is stored. If the data is very large, Flume supports multi-hop flows,
> filtering, and aggregation. I think no source plugin is required; any
> command, script, or program that converts your data into a stream of bytes
> will work with Flume.
>
>
> On Thu, Jan 23, 2014 at 6:31 AM, Dhaval Shah <pr...@yahoo.co.in> wrote:
>
>> Fair enough. I just wanted to point out that doing it via a script is
>> going to be a million times faster to implement than something like Flume
>> (and arguably more reliable too, with no maintenance overhead). Don't get
>> me wrong, we use Flume for our data collection as well, but our use case
>> is real-time/online collection and Flume does that job well, so nothing
>> against Flume per se. I was just thinking about throw-away effort: if a
>> script becomes a pain down the road, we are talking a few minutes to a
>> few hours of discarded work at most, versus a few days to a few weeks if
>> Flume becomes the pain.
>>
>> Sent from Yahoo Mail on Android<https://overview.mail.yahoo.com/mobile/?.src=Android>
>>
>> ------------------------------
>> From: Kaalu Singh <ka...@gmail.com>
>> To: <us...@hadoop.apache.org>; Dhaval Shah <prince_mithibai@yahoo.co.in>
>> Subject: Re: Question about Flume
>> Sent: Wed, Jan 22, 2014 11:20:52 PM
>>
>> The closest built-in functionality to my use case is the "Spooling
>> Directory Source". I like the idea of using/building software in
>> higher-level languages like Java for reasons of extensibility, etc., and
>> don't like the idea of scripts.
>>
>> However, I am soliciting opinions and can be swayed to change my mind.
>>
>> Thanks for your response Dhaval - appreciate it.
>>
>> Regards
>> KS
>>
>>
>> On Wed, Jan 22, 2014 at 2:58 PM, Dhaval Shah <prince_mithibai@yahoo.co.in
>> > wrote:
>>
>>> Flume is useful for online log aggregation in a streaming format. Your
>>> use case seems more like a batch format where you just need to grab the
>>> file and put it in HDFS at regular intervals, which can be much more
>>> easily achieved by a bash script running on a cron'd basis.
>>>
>>> Regards,
>>>
>>> Dhaval
>>>
>>>
>>
>>
>
>
> --
>
> Regards,
> ...Sudhakara.st
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: Question about Flume

Posted by sudhakara st <su...@gmail.com>.
Hello Kaalu Singh,

Flume is a good match for your requirement. First, define the storage
structure of the data in HDFS and how you are going to process it once it is
stored. If the data is very large, Flume supports multi-hop flows,
filtering, and aggregation. I think no source plugin is required; any
command, script, or program that converts your data into a stream of bytes
will work with Flume.
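Concretely, the point that no custom source is needed could look like the following Flume 1.x agent configuration, using the built-in spooldir source (mentioned elsewhere in the thread) and the HDFS sink. All agent, directory, and path names are illustrative:

```properties
# Illustrative Flume agent: watch a local spool directory, write to HDFS.
agent.sources = spool
agent.channels = ch
agent.sinks = hdfs-sink

# Files dropped into spoolDir are ingested, then renamed with a .COMPLETED suffix.
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/data/incoming
agent.sources.spool.channels = ch

# A file channel survives agent restarts; use type = memory for speed instead.
agent.channels.ch.type = file

# Write events to a date-bucketed HDFS path as plain bytes.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/data/incoming/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```

Started with something like `flume-ng agent -n agent -f agent.conf`, the SFTP fetch then only needs to deposit files into the spool directory.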




-- 

Regards,
...Sudhakara.st


Re: Question about Flume

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Fair enough. I just wanted to point out that doing it via a script is going to be a million times faster to implement compared to something like Flume (and arguably more reliable too, with no maintenance overhead). Don't get me wrong, we use Flume for our data collection as well, but our use case is real-time/online data collection and Flume does the job well, so nothing against Flume per se. I was just thinking: if a script becomes a pain down the road, how much throwaway effort are we talking about here? A few minutes to a few hours at most, versus a few days to a few weeks of throwaway work if Flume becomes a pain.

Sent from Yahoo Mail on Android
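
[Editor's note] The cron'd-script approach Dhaval describes could look roughly like the sketch below. The hostname, directories, and `*.dat` pattern are illustrative assumptions, not details from the thread; the `DRY_RUN` guard (on by default) only echoes each command so the sketch can be inspected before pointing it at real hosts.

```shell
#!/usr/bin/env bash
# Sketch of the cron'd-script approach: pull new files from machine X over
# SFTP, then land them in a date-partitioned HDFS directory.
# REMOTE_HOST, REMOTE_DIR, and the *.dat pattern are hypothetical.
set -euo pipefail

REMOTE_HOST="machine-x"                   # assumed SSH alias with key-based auth
REMOTE_DIR="/var/data/out"                # assumed remote drop directory
STAGING_DIR="${STAGING_DIR:-/tmp/ingest}" # local landing area before HDFS
HDFS_DIR="/landing/$(date +%Y/%m/%d)"     # date-partitioned HDFS target

# DRY_RUN=1 (default) only echoes each command; set DRY_RUN=0 under cron.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

mkdir -p "$STAGING_DIR"

# Batch-mode sftp reads its commands from stdin, so cron needs no tty.
run sftp -b - "$REMOTE_HOST" <<EOF
cd $REMOTE_DIR
lcd $STAGING_DIR
mget *.dat
EOF

run hdfs dfs -mkdir -p "$HDFS_DIR"
run hdfs dfs -put -f "$STAGING_DIR"/*.dat "$HDFS_DIR/"
```

A crontab entry like `0 * * * * DRY_RUN=0 /path/to/ingest.sh` would run the pull hourly. The main thing this sketch skips is bookkeeping for files already transferred, which is where a scripted approach starts accruing the maintenance overhead discussed above.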


Re: Question about Flume

Posted by Kaalu Singh <ka...@gmail.com>.
The closest built-in functionality to my use case is the "Spooling
Directory Source", and I like the idea of using/building software with
higher-level languages like Java for reasons of extensibility, etc. (and
don't like the idea of scripts).

However, I am soliciting opinions and can be swayed to change my mind.

Thanks for your response Dhaval - appreciate it.

Regards
KS


On Wed, Jan 22, 2014 at 2:58 PM, Dhaval Shah <pr...@yahoo.co.in>wrote:

> Flume is useful for online log aggregation in a streaming format. Your use
> case seems more like a batch format where you just need to grab the file
> and put it in HDFS at regular intervals which can be much more easily
>  achieved by a bash script running on a cron'd basis.
>
> Regards,
>
> Dhaval
>
>
> ________________________________
> From: Kaalu Singh <ka...@gmail.com>
> To: user@hadoop.apache.org
> Sent: Wednesday, 22 January 2014 5:52 PM
> Subject: Question about Flume
>
>
>
> Hi,
>
> I have the following use case:
>
> I have data files getting generated frequently on a certain machine, X.
> The only way I can bring them into my Hadoop cluster  is by SFTPing at
> certain intervals of time and getting them and landing them in HDFS.
>
> I am new to Hadoop and to Flume. I read up about Flume and it seems like
> this framework is appropriate for something like this although I did not
> see an available 'source' that can do exactly what I am looking for.
> Unavailability of a 'source' plugin is not a deal breaker for me as I can
> write one but first I want to make sure this is the right way to go. So, my
> questions are:
>
> 1. What are the pros/cons of using Flume for this use case?
> 2. Does anybody know of a source plugin that does what I am looking for?
> 3. Does anybody think I should not use Flume and instead write my own
> application to achieve this use case?
>
> Thanks
> KS
>
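
[Editor's note] A Spooling Directory Source agent along the lines Kaalu mentions would be configured roughly as below; the agent name, channel, and paths are placeholders, not a tested deployment. One caveat grounded in how this source works: it watches a *local* directory, so an SFTP pull (scripted or otherwise) still has to deposit the files into the spool directory before Flume picks them up.

```properties
# Hypothetical Flume agent: watch a local spool directory, write to HDFS.
# All names and paths are placeholders.
agent1.sources  = spool1
agent1.channels = ch1
agent1.sinks    = hdfs1

# Spooling Directory Source: the SFTP pull must fill spoolDir.
agent1.sources.spool1.type     = spooldir
agent1.sources.spool1.spoolDir = /var/flume/spool
agent1.sources.spool1.channels = ch1

# File channel: durable across agent restarts.
agent1.channels.ch1.type = file

# HDFS sink with a date-partitioned target path.
agent1.sinks.hdfs1.type                   = hdfs
agent1.sinks.hdfs1.hdfs.path              = hdfs://namenode/landing/%Y/%m/%d
agent1.sinks.hdfs1.hdfs.fileType          = DataStream
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs1.channel                = ch1
```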

Re: Question about Flume

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Flume is useful for online log aggregation in a streaming format. Your use case seems more like a batch format where you just need to grab the file and put it in HDFS at regular intervals, which can be much more easily achieved by a bash script running on a cron'd basis.

Regards,

Dhaval


________________________________
From: Kaalu Singh <ka...@gmail.com>
To: user@hadoop.apache.org 
Sent: Wednesday, 22 January 2014 5:52 PM
Subject: Question about Flume



Hi,

I have the following use case:

I have data files getting generated frequently on a certain machine, X. The only way I can bring them into my Hadoop cluster  is by SFTPing at certain intervals of time and getting them and landing them in HDFS.  

I am new to Hadoop and to Flume. I read up about Flume and it seems like this framework is appropriate for something like this although I did not see an available 'source' that can do exactly what I am looking for. Unavailability of a 'source' plugin is not a deal breaker for me as I can write one but first I want to make sure this is the right way to go. So, my questions are:

1. What are the pros/cons of using Flume for this use case? 
2. Does anybody know of a source plugin that does what I am looking for? 
3. Does anybody think I should not use Flume and instead write my own application to achieve this use case?

Thanks
KS  
