Posted to mapreduce-user@hadoop.apache.org by Panshul Whisper <ou...@gmail.com> on 2013/01/11 04:12:17 UTC

queues in hadoop

Hello,

I have a Hadoop cluster setup of 10 nodes and I am in need of implementing
queues in the cluster for receiving high volumes of data.
Please suggest which will be more efficient for receiving
24 million JSON files (approx. 5 KB each) every 24 hours:
1. Using the Capacity Scheduler
2. Implementing RabbitMQ and receiving data from it using Spring
Integration data pipelines.

I cannot afford to lose any of the JSON files received.

Thanking You,

-- 
Regards,
Ouch Whisper
010101010101

Re: queues in hadoop

Posted by Michael Segel <mi...@hotmail.com>.
He's got two different queues.

1) a queue in the capacity scheduler, so he can have a set of M/R tasks running in the background to pull data off of it...

2) a durable queue that receives the inbound JSON files to be processed.

You can have a custom-written listener that pulls data from the queue and puts it either in HDFS or HBase, depending on the access patterns and the content of the files.
Then you would write an M/R job that actually processes the data, to be used by ancillary processes not mentioned in the OP's question.

This is why he asked about RabbitMQ, which is one option; there are others like ActiveMQ or something else....
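
For illustration, a minimal sketch of such a listener, assuming the RabbitMQ
Java client and the Hadoop FileSystem API (the broker host, queue name, and
output path below are made up):

    import com.rabbitmq.client.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JsonQueueListener {
        public static void main(String[] args) throws Exception {
            final FileSystem fs = FileSystem.get(new Configuration());
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("mq-host");  // illustrative broker host
            final Channel channel = factory.newConnection().createChannel();
            // Durable queue, so messages survive a broker restart.
            channel.queueDeclare("json-ingest", true, false, false, null);
            channel.basicConsume("json-ingest", false, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String tag, Envelope env,
                        AMQP.BasicProperties props, byte[] body) throws java.io.IOException {
                    // Write to HDFS first and ack only afterwards, so a crash
                    // between the two steps cannot lose a message.
                    Path out = new Path("/ingest/" + env.getDeliveryTag() + ".json");
                    try (FSDataOutputStream os = fs.create(out)) {
                        os.write(body);
                    }
                    channel.basicAck(env.getDeliveryTag(), false);
                }
            });
        }
    }

In practice you would batch many of those 5 KB messages into each HDFS file
rather than writing one file per message, to avoid the small-files problem
on the NameNode.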

On Jan 11, 2013, at 12:04 AM, Harsh J <ha...@cloudera.com> wrote:

> Your question is unclear: HDFS has no queues for ingesting data (it is
> a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
> components have queues for processing data purposes.
> 
> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ou...@gmail.com> wrote:
>> Hello,
>> 
>> I have a Hadoop cluster setup of 10 nodes and I am in need of implementing
>> queues in the cluster for receiving high volumes of data.
>> Please suggest which will be more efficient for receiving
>> 24 million JSON files (approx. 5 KB each) every 24 hours:
>> 1. Using the Capacity Scheduler
>> 2. Implementing RabbitMQ and receiving data from it using Spring
>> Integration data pipelines.
>> 
>> I cannot afford to lose any of the JSON files received.
>> 
>> Thanking You,
>> 
>> --
>> Regards,
>> Ouch Whisper
>> 010101010101
> 
> 
> 
> -- 
> Harsh J
> 


Re: queues in hadoop

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.
You can also use Fluentd: http://fluentd.org/
"Fluentd receives logs as JSON streams, buffers them, and sends them
to other systems like Amazon S3, MongoDB, Hadoop, or other Fluentds."
It has a plugin for pushing into HDFS, fluent-plugin-webhdfs:
https://github.com/fluent/fluent-plugin-webhdfs
It also handles JSON directly, so it fits your case.
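
For illustration, a minimal config along those lines (the tag, host, port,
and path are made up; see the plugin's README for the real options):

    <source>
      type forward               # accept JSON events from clients
      port 24224
    </source>

    <match ingest.json>
      type webhdfs               # fluent-plugin-webhdfs
      host namenode.example.com  # WebHDFS-enabled NameNode
      port 50070
      path /ingest/json/%Y%m%d/data.%H.log
      flush_interval 60s         # buffer and write in larger chunks
    </match>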

Thanks,
Tsuyoshi

On Fri, Jan 11, 2013 at 10:03 PM, Bertrand Dechoux <de...@gmail.com> wrote:
> There is also Kafka: http://kafka.apache.org
> "A high-throughput, distributed, publish-subscribe messaging system."
>
> But it does not push into HDFS; you need to launch a job to pull the data in.
>
> Regards
>
> Bertrand
>
>
> On Fri, Jan 11, 2013 at 1:52 PM, Mirko Kämpf <mi...@gmail.com> wrote:
>>
>> I would suggest working with Flume, in order to collect a certain number
>> of files and store them to HDFS in larger chunks, or write them directly to HBase;
>> this allows random access later on (if needed), otherwise HBase could be
>> overkill. You can also collect data in a MySQL DB and then import it regularly via
>> Sqoop.
>>
>> Best
>> Mirko
>>
>>
>> "Every dat flow goes to Hadoop"
>> citation from an unkown source
>>
>> 2013/1/11 Hemanth Yamijala <yh...@thoughtworks.com>
>>>
>>> Queues in the capacity scheduler are logical data structures into which
>>> MapReduce jobs are placed to be picked up by the JobTracker / Scheduler
>>> framework, according to some capacity constraints that can be defined for a
>>> queue.
>>>
>>> So, given your use case, I don't think the Capacity Scheduler is going to
>>> directly help you (since you only spoke about data-in, not processing).
>>>
>>> So, yes, something like Flume or Scribe.
>>>
>>> Thanks
>>> Hemanth
>>>
>>> On Fri, Jan 11, 2013 at 11:34 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>
>>>> Your question is unclear: HDFS has no queues for ingesting data (it is
>>>> a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
>>>> components have queues for processing data purposes.
>>>>
>>>> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ou...@gmail.com>
>>>> wrote:
>>>> > Hello,
>>>> >
>>>> > I have a Hadoop cluster setup of 10 nodes and I am in need of
>>>> > implementing
>>>> > queues in the cluster for receiving high volumes of data.
>>>> > Please suggest which will be more efficient in the case of
>>>> > receiving
>>>> > 24 million JSON files (approx. 5 KB each) every 24 hours:
>>>> > 1. Using the Capacity Scheduler
>>>> > 2. Implementing RabbitMQ and receiving data from it using Spring
>>>> > Integration
>>>> > data pipelines.
>>>> >
>>>> > I cannot afford to lose any of the JSON files received.
>>>> >
>>>> > Thanking You,
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Ouch Whisper
>>>> > 010101010101
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>
>>>
>>
>
>
>
> --
> Bertrand Dechoux



-- 
OZAWA Tsuyoshi

Re: queues in hadoop

Posted by Bertrand Dechoux <de...@gmail.com>.
There is also Kafka: http://kafka.apache.org
"A high-throughput, distributed, publish-subscribe messaging system."

But it does not push into HDFS; you need to launch a job to pull the data in.
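
For illustration, a sketch of such a pull job using Kafka's newer Java
consumer (the broker address, group id, and topic are made up, and the
consumer API has changed considerably across Kafka versions, so treat this
purely as a sketch):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class KafkaToHdfs {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");  // illustrative
            props.put("group.id", "hdfs-ingest");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("json-ingest"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    // Write rec.value() to HDFS here, batching records into
                    // large files rather than one tiny file per message.
                }
            }
        }
    }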

Regards

Bertrand

On Fri, Jan 11, 2013 at 1:52 PM, Mirko Kämpf <mi...@gmail.com> wrote:

> I would suggest working with Flume, in order to collect a certain number
> of files and store them to HDFS in larger chunks, or write them directly to
> HBase; this allows random access later on (if needed), otherwise HBase could
> be overkill. You can also collect data in a MySQL DB and then import it
> regularly via Sqoop.
>
> Best
> Mirko
>
>
> "Every dat flow goes to Hadoop"
> citation from an unkown source
>
> 2013/1/11 Hemanth Yamijala <yh...@thoughtworks.com>
>
>> Queues in the capacity scheduler are logical data structures into which
>> MapReduce jobs are placed to be picked up by the JobTracker / Scheduler
>> framework, according to some capacity constraints that can be defined for a
>> queue.
>>
>> So, given your use case, I don't think the Capacity Scheduler is going to
>> directly help you (since you only spoke about data-in, not processing).
>>
>> So, yes, something like Flume or Scribe.
>>
>> Thanks
>> Hemanth
>>
>> On Fri, Jan 11, 2013 at 11:34 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Your question is unclear: HDFS has no queues for ingesting data (it is
>>> a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
>>> components have queues for processing data purposes.
>>>
>>> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ou...@gmail.com>
>>> wrote:
>>> > Hello,
>>> >
>>> > I have a Hadoop cluster setup of 10 nodes and I am in need of
>>> implementing
>>> > queues in the cluster for receiving high volumes of data.
>>> > Please suggest which will be more efficient in the case of
>>> receiving
>>> > 24 million JSON files (approx. 5 KB each) every 24 hours:
>>> > 1. Using the Capacity Scheduler
>>> > 2. Implementing RabbitMQ and receiving data from it using Spring
>>> Integration
>>> > data pipelines.
>>> >
>>> > I cannot afford to lose any of the JSON files received.
>>> >
>>> > Thanking You,
>>> >
>>> > --
>>> > Regards,
>>> > Ouch Whisper
>>> > 010101010101
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>


-- 
Bertrand Dechoux

Re: queues in hadoop

Posted by Mirko Kämpf <mi...@gmail.com>.
I would suggest working with Flume, in order to collect a certain number
of files and store them to HDFS in larger chunks, or write them directly to
HBase; this allows random access later on (if needed), otherwise HBase could
be overkill. You can also collect data in a MySQL DB and then import it
regularly via Sqoop.
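
For illustration, a minimal flume.conf sketch of that Flume-to-HDFS path
(the agent, source, channel, and sink names, the spool directory, and the
roll size are all made up):

    agent.sources = json-in
    agent.channels = ch
    agent.sinks = hdfs-out

    # Watch a local directory where the JSON files land.
    agent.sources.json-in.type = spooldir
    agent.sources.json-in.spoolDir = /var/ingest/json
    agent.sources.json-in.channels = ch

    # File channel: events survive an agent crash, so nothing is lost.
    agent.channels.ch.type = file

    # Roll large HDFS files instead of one tiny file per event.
    agent.sinks.hdfs-out.type = hdfs
    agent.sinks.hdfs-out.hdfs.path = /ingest/json/%Y-%m-%d
    agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
    agent.sinks.hdfs-out.hdfs.rollSize = 134217728
    agent.sinks.hdfs-out.hdfs.rollCount = 0
    agent.sinks.hdfs-out.hdfs.rollInterval = 0
    agent.sinks.hdfs-out.channel = ch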

Best
Mirko


"Every dat flow goes to Hadoop"
citation from an unkown source

2013/1/11 Hemanth Yamijala <yh...@thoughtworks.com>

> Queues in the capacity scheduler are logical data structures into which
> MapReduce jobs are placed to be picked up by the JobTracker / Scheduler
> framework, according to some capacity constraints that can be defined for a
> queue.
>
> So, given your use case, I don't think the Capacity Scheduler is going to
> directly help you (since you only spoke about data-in, not processing).
>
> So, yes, something like Flume or Scribe.
>
> Thanks
> Hemanth
>
> On Fri, Jan 11, 2013 at 11:34 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Your question is unclear: HDFS has no queues for ingesting data (it is
>> a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
>> components have queues for processing data purposes.
>>
>> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ou...@gmail.com>
>> wrote:
>> > Hello,
>> >
>> > I have a Hadoop cluster setup of 10 nodes and I am in need of
>> implementing
>> > queues in the cluster for receiving high volumes of data.
>> > Please suggest which will be more efficient in the case of
>> receiving
>> > 24 million JSON files (approx. 5 KB each) every 24 hours:
>> > 1. Using the Capacity Scheduler
>> > 2. Implementing RabbitMQ and receiving data from it using Spring
>> Integration
>> > data pipelines.
>> >
>> > I cannot afford to lose any of the JSON files received.
>> >
>> > Thanking You,
>> >
>> > --
>> > Regards,
>> > Ouch Whisper
>> > 010101010101
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: queues in hadoop

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Queues in the capacity scheduler are logical data structures into which
MapReduce jobs are placed to be picked up by the JobTracker / Scheduler
framework, according to some capacity constraints that can be defined for a
queue.
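
For illustration, a queue's share is configured in the MR1-era
capacity-scheduler.xml roughly like this (the queue name and percentage are
made up; the queue itself is declared via mapred.queue.names in
mapred-site.xml, and jobs are submitted to it with
-Dmapred.job.queue.name=ingest):

    <property>
      <name>mapred.capacity-scheduler.queue.ingest.capacity</name>
      <value>30</value>  <!-- guarantee this queue 30% of cluster slots -->
    </property>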

So, given your use case, I don't think the Capacity Scheduler is going to
directly help you (since you only spoke about data-in, not processing).

So, yes, something like Flume or Scribe.

Thanks
Hemanth

On Fri, Jan 11, 2013 at 11:34 AM, Harsh J <ha...@cloudera.com> wrote:

> Your question is unclear: HDFS has no queues for ingesting data (it is
> a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
> components have queues for processing data purposes.
>
> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ou...@gmail.com>
> wrote:
> > Hello,
> >
> > I have a Hadoop cluster setup of 10 nodes and I am in need of
> implementing
> > queues in the cluster for receiving high volumes of data.
> > Please suggest which will be more efficient in the case of
> receiving
> > 24 million JSON files (approx. 5 KB each) every 24 hours:
> > 1. Using the Capacity Scheduler
> > 2. Implementing RabbitMQ and receiving data from it using Spring
> Integration
> > data pipelines.
> >
> > I cannot afford to lose any of the JSON files received.
> >
> > Thanking You,
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
>
>
>
> --
> Harsh J
>

Re: queues in hadoop

Posted by Harsh J <ha...@cloudera.com>.
Your question is unclear: HDFS has no queues for ingesting data (it is
a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
components have queues for data-processing purposes.

On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ou...@gmail.com> wrote:
> Hello,
>
> I have a Hadoop cluster setup of 10 nodes and I am in need of implementing
> queues in the cluster for receiving high volumes of data.
> Please suggest which will be more efficient in the case of receiving
> 24 million JSON files (approx. 5 KB each) every 24 hours:
> 1. Using the Capacity Scheduler
> 2. Implementing RabbitMQ and receiving data from it using Spring Integration
> data pipelines.
>
> I cannot afford to lose any of the JSON files received.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101



-- 
Harsh J

Re: queues in hadoop

Posted by shashwat shriparv <dw...@gmail.com>.
The attached screenshot shows how Flume works; you can also consider
RabbitMQ, as it is persistent too.



∞
Shashwat Shriparv



On Fri, Jan 11, 2013 at 10:24 AM, Mohit Anchlia <mo...@gmail.com> wrote:

> Have you looked at flume?
>
> Sent from my iPhone
>
> On Jan 10, 2013, at 7:12 PM, Panshul Whisper <ou...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I have a Hadoop cluster setup of 10 nodes and I am in need of
> implementing queues in the cluster for receiving high volumes of data.
> > Please suggest which will be more efficient in the case of
> receiving 24 million JSON files (approx. 5 KB each) every 24 hours:
> > 1. Using the Capacity Scheduler
> > 2. Implementing RabbitMQ and receiving data from it using Spring
> Integration data pipelines.
> >
> > I cannot afford to lose any of the JSON files received.
> >
> > Thanking You,
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
>

Re: queues in hadoop

Posted by Mohit Anchlia <mo...@gmail.com>.
Have you looked at Flume?

Sent from my iPhone

On Jan 10, 2013, at 7:12 PM, Panshul Whisper <ou...@gmail.com> wrote:

> Hello,
> 
> I have a Hadoop cluster setup of 10 nodes and I am in need of implementing queues in the cluster for receiving high volumes of data.
> Please suggest which will be more efficient in the case of receiving 24 million JSON files (approx. 5 KB each) every 24 hours:
> 1. Using the Capacity Scheduler
> 2. Implementing RabbitMQ and receiving data from it using Spring Integration data pipelines.
> 
> I cannot afford to lose any of the JSON files received.
> 
> Thanking You,
> 
> -- 
> Regards,
> Ouch Whisper
> 010101010101
