Posted to user@hadoop.apache.org by Sidharth Kumar <si...@gmail.com> on 2017/07/01 14:26:01 UTC

Re: Kafka or Flume

Thanks for your suggestions. I feel Kafka will be better, but I need something
extra on top of it, either Kafka with Flume or Kafka with Spark Streaming. Can
you kindly suggest which combination will be better, and in which situations
each will perform best?

Thanks in advance for your help.

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792






On 30-Jun-2017 11:18 AM, "daemeon reiydelle" <da...@gmail.com> wrote:

> For fairly simple transformations, Flume is great, and works fine
> subscribing to some pretty high volumes of messages from Kafka (I think we
> hit 50M/second at one point). If you need to do complex transformations,
> e.g. database lookups for the Kafka-to-Hadoop ETL, then you will start
> having complexity issues which will exceed the capability of Flume. There
> are git repos that have everything you need, including the Kafka adapter,
> HDFS writer, etc. A lot of this is built into Flume. I assume this might be
> a bit off topic, so googling flume & kafka will help you?
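As a concrete reference for the "built into Flume" point above: a Kafka source
and an HDFS sink can be wired together purely in a Flume agent's properties
file. The fragment below is an illustrative sketch only; the broker address,
topic name, and HDFS path are placeholders, not values from this thread.

```properties
# Hypothetical Flume agent: Kafka source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source (placeholder broker and topic)
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092
a1.sources.r1.kafka.topics = ingest-topic
a1.sources.r1.channels = c1

# Buffer events in memory between source and sink
a1.channels.c1.type = memory

# HDFS sink (placeholder path, partitioned by day)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/ingest/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```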
>
> On Thu, Jun 29, 2017 at 10:14 PM, Mallanagouda Patil <
> mallanagouda.c.patil@gmail.com> wrote:
>
>> Kafka is capable of processing billions of events per second. You can
>> scale it horizontally with Kafka broker servers.
>>
>> You can try out these steps
>>
>> 1. Create a topic in Kafka to receive all your data. You have to use a
>> Kafka producer to ingest data into Kafka.
>> 2. If you are going to write your own HDFS client to put data into HDFS,
>> then you can read data from the topic in step 1, validate it, and store it
>> into HDFS.
>> 3. If you want to use an open-source tool (Gobblin or the Confluent Kafka
>> HDFS connector) to put data into HDFS, then write a tool to read data from
>> the topic, validate it, and store it in another topic.
>>
>> We are using a combination of these steps to process over 10 million
>> events/second.
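Step 3 above (read from one topic, validate, publish survivors to another
topic) can be sketched as follows. This is a minimal illustration only: plain
Python lists stand in for the Kafka topics, and `is_valid` is a hypothetical
validation rule; a real pipeline would use a Kafka consumer/producer client
instead.

```python
# Sketch of a read-validate-route loop. Lists stand in for Kafka
# topics; swap in a real Kafka consumer/producer for production use.

def is_valid(record):
    # Hypothetical rule: every record must carry a non-empty "id" field.
    return bool(record.get("id"))

def validate_and_route(raw_topic):
    """Split raw records into a 'valid' topic and an 'invalid' topic."""
    valid_topic, invalid_topic = [], []
    for record in raw_topic:
        (valid_topic if is_valid(record) else invalid_topic).append(record)
    return valid_topic, invalid_topic

raw = [{"id": "t1", "amount": 10}, {"amount": 5}, {"id": "t2", "amount": 7}]
good, bad = validate_and_route(raw)
# good holds the two records carrying an "id"; bad holds the one without
```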
>>
>> I hope it helps..
>>
>> Thanks
>> Mallan
>>
>> On Jun 30, 2017 10:31 AM, "Sidharth Kumar" <si...@gmail.com>
>> wrote:
>>
>>> Thanks! What about Kafka with Flume? I would also like to mention that the
>>> everyday data intake is in the millions, and we can't afford to lose even a
>>> single piece of data, which creates a need for high availability.
>>>
>>> Warm Regards
>>>
>>> Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
>>> www.linkedin.com/in/sidharthkumar2792
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 30-Jun-2017 10:04 AM, "JP gupta" <JP...@altruistindia.com> wrote:
>>>
>>>> The ideal sequence should be:
>>>>
>>>> 1. Ingress using Kafka -> validation and processing using Spark ->
>>>> write into any NoSQL DB or Hive.
>>>>
>>>> From my recent experience, writing directly to HDFS can be slow
>>>> depending on the data format.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> JP
>>>>
>>>>
>>>>
>>>> *From:* Sudeep Singh Thakur [mailto:sudeepthakur90@gmail.com]
>>>> *Sent:* 30 June 2017 09:26
>>>> *To:* Sidharth Kumar
>>>> *Cc:* Maggy; common-user@hadoop.apache.org
>>>> *Subject:* Re: Kafka or Flume
>>>>
>>>>
>>>>
>>>> In your use case, Kafka would be better because you want some
>>>> transformations and validations.
>>>>
>>>> Kind regards,
>>>> Sudeep Singh Thakur
>>>>
>>>>
>>>>
>>>> On Jun 30, 2017 8:57 AM, "Sidharth Kumar" <si...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I have a requirement where all transactional data is ingested into
>>>> Hadoop in real time, and before the data is stored into Hadoop, it is
>>>> processed to validate it. If the data fails the validation process, it
>>>> will not be stored into Hadoop. The validation process also makes use of
>>>> historical data which is stored in Hadoop. So, my question is: which
>>>> ingestion tool will be best for this, Kafka or Flume?
>>>>
>>>>
>>>>
>>>> Any suggestions will be a great help for me.
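As a rough illustration of the requirement above, a validation step that
consults historical data might look like the sketch below. The `history` set
and the `validate` rule are hypothetical stand-ins; in practice the lookup
would go against data already stored in Hadoop (e.g. an HBase table or a Hive
query result).

```python
# Sketch: validate incoming transactions against historical records
# before persisting them. "history" stands in for a lookup into data
# already stored in Hadoop (e.g. account ids seen before).

history = {"acct-1", "acct-2"}  # hypothetical set of known account ids

def validate(txn, known_accounts):
    # Reject any transaction that references an unknown account.
    return txn.get("account") in known_accounts

incoming = [{"account": "acct-1", "amount": 100},
            {"account": "acct-9", "amount": 50}]

accepted = [t for t in incoming if validate(t, history)]
rejected = [t for t in incoming if not validate(t, history)]
# accepted keeps the acct-1 transaction; rejected holds the acct-9 one
```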
>>>>
>>>>
>>>> Warm Regards
>>>>
>>>> Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
>>>> www.linkedin.com/in/sidharthkumar2792
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>

Re: Kafka or Flume

Posted by Sidharth Kumar <si...@gmail.com>.
Thank you very much for your help. What about the flow NiFi --> Kafka -->
Storm for real-time processing, and then storing into HBase?

Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792








Re: Kafka or Flume

Posted by Gagan Brahmi <ga...@gmail.com>.
NiFi can do that job as well. While using NiFi to ingest data from the
source, you can apply validation and direct the flow of the data.

Even if you need to combine the incoming flow with another source of data,
that is possible using NiFi. It is an intelligent tool to have while
designing any kind of data flow.


Regards,
Gagan Brahmi


Re: Kafka or Flume

Posted by Sidharth Kumar <si...@gmail.com>.
Great, thanks! It's a great tool, but you mentioned:

For ingestion

NiFi -> Kafka

For data verification

Kafka -> NiFi -> HDFS/Hive/HBase

Whereas I have to apply validation while ingesting data and then route the
data based on the validation output. This validation will make use of
historical data stored in Hadoop.

So can you suggest a flow in a little more detail?


Warm Regards

Sidharth Kumar | Mob: +91 8197 555 599/7892 192 367 |  LinkedIn:
www.linkedin.com/in/sidharthkumar2792







Re: Kafka or Flume

Posted by Gagan Brahmi <ga...@gmail.com>.
I'd say the data flow can stay fairly simple, since you only need some basic
verification of the data. You may want to include NiFi in the mix, which
should do the job.

It can look something like this:

For ingestion

NiFi -> Kafka

For data verification

Kafka -> NiFi -> HDFS/Hive/HBase
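Concretely, those two flows could be assembled from stock NiFi processors
roughly as outlined below. The processor choices and topic names are
illustrative assumptions, not a prescription; the processors named are
standard ones in the NiFi 1.x line.

```
Ingestion flow:
  (source processor, e.g. ListenHTTP)
      -> PublishKafka                        # topic: raw-events (placeholder)

Verification and delivery flow:
  ConsumeKafka                               # topic: raw-events
      -> ValidateRecord / RouteOnAttribute   # apply validation rules
           valid   -> PutHDFS / PutHiveStreaming / PutHBaseJSON
           invalid -> quarantine (e.g. PutFile) or a retry topic
```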


Regards,
Gagan Brahmi
