Posted to users@kafka.apache.org by Mark <st...@gmail.com> on 2011/11/06 17:45:44 UTC

Hadoop import

This is more of a general design question, but what is the preferred way 
of importing logs from Kafka to HDFS when you want your data segmented 
by hour or day? Is there any way to say "Import only this {hour|day} of 
logs", or does one need to create topics around the way they would 
like to import them, i.e. a topic like "search_logs/2011/11/06"? If it's the 
latter, is there any documentation/best practices on topic/key design?

Thanks

Re: Hadoop import

Posted by Neha Narkhede <ne...@gmail.com>.
num.partitions sets the default number of partitions per topic on a
server. topic.partition.count.map is a per-topic override for the same
setting. For more information on configuration, please see
http://incubator.apache.org/kafka/configuration.html
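
As an illustrative sketch only (the topic names, counts, and the exact override
syntax below are assumptions; the configuration page above is the authoritative
reference), a broker config that defaults to 2 partitions per topic but gives
one topic 8 might look like:

    # server.properties -- hedged sketch, not a verified example
    # default number of partitions per topic on this broker
    num.partitions=2
    # assumed per-topic override format: comma-separated topic:count pairs
    topic.partition.count.map=search_logs:8,click_logs:4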

Thanks,
Neha

On Sun, Nov 6, 2011 at 11:45 AM, Mark <st...@gmail.com> wrote:
> Ok got it. How are partitions determined? Is this something that the producer is
> responsible for or can it be automatically handled by the broker?

Re: Hadoop import

Posted by Mark <st...@gmail.com>.
Ok got it. How are partitions determined? Is this something that the 
producer is responsible for or can it be automatically handled by the 
broker?

On 11/6/11 11:13 AM, Neha Narkhede wrote:
>> Ok so the partitioning is done on the Hadoop side during importing and has
>> nothing to do with Kafka partitions.
> That's right.
>
> Kafka partitions help scale consumption by allowing multiple consumer
> processes to pull data for a topic in parallel. The parallelism factor is
> limited by the total number of Kafka partitions. For example, if a
> topic has 2 partitions, 2 Hadoop mappers can pull data for the entire
> topic in parallel. If another topic has 8 partitions, the parallelism
> factor increases by 4x: now 8 mappers can pull all the data for this
> topic at the same time.
>
> Thanks,
> Neha

Re: Hadoop import

Posted by Neha Narkhede <ne...@gmail.com>.
> Ok so the partitioning is done on the Hadoop side during importing and has
> nothing to do with Kafka partitions.

That's right.

Kafka partitions help scale consumption by allowing multiple consumer
processes to pull data for a topic in parallel. The parallelism factor is
limited by the total number of Kafka partitions. For example, if a
topic has 2 partitions, 2 Hadoop mappers can pull data for the entire
topic in parallel. If another topic has 8 partitions, the parallelism
factor increases by 4x: now 8 mappers can pull all the data for this
topic at the same time.
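
To make that concrete, here is a minimal, self-contained Java sketch (not the
Kafka consumer API; the partition and worker counts are made up). However many
workers are started, at most one per partition can do useful work:

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only: consumption parallelism is capped by the number of
    // Kafka partitions, because each partition is pulled by a single worker.
    public class ParallelismSketch {
        public static void main(String[] args) throws InterruptedException {
            final int partitions = 8;        // partitions for the topic (assumed)
            final int requestedWorkers = 16; // more workers than partitions

            // Only `partitions` workers can do useful work in parallel.
            final int effectiveWorkers = Math.min(requestedWorkers, partitions);

            List<Thread> workers = new ArrayList<Thread>();
            for (int w = 0; w < effectiveWorkers; w++) {
                final int workerId = w;
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        // Each worker owns a disjoint subset of partitions.
                        for (int p = workerId; p < partitions; p += effectiveWorkers) {
                            System.out.println("worker " + workerId + " pulls partition " + p);
                        }
                    }
                });
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }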

Thanks,
Neha

On Sun, Nov 6, 2011 at 11:00 AM, Mark <st...@gmail.com> wrote:
> Ok so the partitioning is done on the Hadoop side during importing and has
> nothing to do with Kafka partitions. Would you mind explaining what Kafka
> partitions are used for and when one should use them?

Re: Hadoop import

Posted by Mark <st...@gmail.com>.
Ok so the partitioning is done on the Hadoop side during importing and 
has nothing to do with Kafka partitions. Would you mind explaining what 
Kafka partitions are used for and when one should use them?



On 11/6/11 10:52 AM, Neha Narkhede wrote:
> We use Avro serialization for the message data and use Avro schemas to
> convert event objects into Kafka message payload on the producers. On
> the Hadoop side, we use Avro schemas to deserialize Kafka message
> payload back into an event object. Each such event object has a
> timestamp field that the Hadoop job uses to put the message into its
> hourly and daily partition. So if the Hadoop job runs every 15 mins,
> it will run 4 times to collect data into the current hour's partition.
>
> Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.
>
> Thanks,
> Neha

Re: Hadoop import

Posted by Neha Narkhede <ne...@gmail.com>.
We use Avro serialization for the message data and use Avro schemas to
convert event objects into Kafka message payload on the producers. On
the Hadoop side, we use Avro schemas to deserialize Kafka message
payload back into an event object. Each such event object has a
timestamp field that the Hadoop job uses to put the message into its
hourly and daily partition. So if the Hadoop job runs every 15 mins,
it will run 4 times to collect data into the current hour's partition.
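
As a rough illustration of that Hadoop-side step (the schema, the field name
"timestamp", and the path layout below are assumptions, not the actual LinkedIn
pipeline), the sketch decodes an Avro-encoded message payload and derives an
hourly partition path from its timestamp:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;

    // Sketch: map an Avro-encoded Kafka message payload to an hourly
    // partition path such as "search_logs/hourly/2011/11/06/17".
    public class EventPartitioner {

        // Hypothetical event schema with a millisecond timestamp field.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"timestamp\",\"type\":\"long\"},"
          + "{\"name\":\"payload\",\"type\":\"string\"}]}");

        private static final GenericDatumReader<GenericRecord> READER =
            new GenericDatumReader<GenericRecord>(SCHEMA);

        public static String hourlyPartition(String topic, byte[] messageBytes) throws Exception {
            GenericRecord event = READER.read(
                null, DecoderFactory.get().binaryDecoder(messageBytes, null));
            long tsMillis = (Long) event.get("timestamp");

            SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/HH");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            return topic + "/hourly/" + fmt.format(new Date(tsMillis));
        }
    }

A 15-minute job would then land data for the same hour into the same directory
on each of its 4 runs within that hour.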

Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.

Thanks,
Neha

On Sun, Nov 6, 2011 at 10:37 AM, Mark <st...@gmail.com> wrote:
> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer to
> load the data for
> topics in daily and hourly partitions."
>
> Sorry for my ignorance but what exactly do you mean by loading the data in
> daily and hourly partitions?

Re: Hadoop import

Posted by Mark <st...@gmail.com>.
"At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer to load the data for
topics in daily and hourly partitions."

Sorry for my ignorance but what exactly do you mean by loading the data 
in daily and hourly partitions?


On 11/6/11 10:26 AM, Neha Narkhede wrote:
> There should be no changes to the way you create topics to achieve
> this kind of HDFS load of Kafka data. At LinkedIn we use the
> InputFormat provided in contrib/hadoop-consumer to load the data for
> topics in daily and hourly partitions. These Hadoop jobs run every 10
> mins or so, so the maximum delay of data being available from
> producer->Hadoop is around 10 mins.
>
> Thanks,
> Neha

Re: Hadoop import

Posted by Neha Narkhede <ne...@gmail.com>.
There should be no changes to the way you create topics to achieve
this kind of HDFS load of Kafka data. At LinkedIn we use the
InputFormat provided in contrib/hadoop-consumer to load the data for
topics in daily and hourly partitions. These Hadoop jobs run every 10
mins or so, so the maximum delay of data being available from
producer->Hadoop is around 10 mins.
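
For a sense of what such a periodic load can look like, here is a hedged
driver-side sketch. KafkaInputFormat and EventMapper are placeholders standing
in for the contrib/hadoop-consumer classes (left commented out so the sketch
stays self-contained), and the property name and output layout are assumptions,
not the LinkedIn setup:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch of a periodic load job: each run writes under the HDFS directory
    // for the current hour, e.g. /data/search_logs/hourly/2011/11/06/17/...
    public class HourlyKafkaLoad {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("kafka.topic", "search_logs"); // assumed property name

            SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/HH");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            String hour = fmt.format(new Date());

            Job job = Job.getInstance(conf, "kafka-hourly-load");
            job.setJarByClass(HourlyKafkaLoad.class);
            // job.setInputFormatClass(KafkaInputFormat.class); // placeholder for contrib/hadoop-consumer
            // job.setMapperClass(EventMapper.class);           // placeholder: emits events for this hour
            FileOutputFormat.setOutputPath(job, new Path(
                "/data/search_logs/hourly/" + hour + "/run-" + System.currentTimeMillis()));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Scheduling a job like this every 10 minutes is what bounds the producer->Hadoop
delay at roughly 10 minutes.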

Thanks,
Neha

On Sun, Nov 6, 2011 at 8:45 AM, Mark <st...@gmail.com> wrote:
> This is more of a general design question, but what is the preferred way of
> importing logs from Kafka to HDFS when you want your data segmented by hour
> or day? Is there any way to say "Import only this {hour|day} of logs", or does
> one need to create topics around the way they would like to import
> them, i.e. a topic like "search_logs/2011/11/06"? If it's the latter, is there any
> documentation/best practices on topic/key design?
>
> Thanks
>