Posted to users@apex.apache.org by Ananth Gundabattula <ag...@gmail.com> on 2016/06/10 20:39:34 UTC

Kafka 0.9 operator to start consuming from a particular offset

Hello All,

I was wondering what the community's thoughts would be on the following?

We are using the Kafka 0.9 input operator to read from a few topics. We are
using this stream to generate a Parquet file. This approach is all good
for a beginner's use case. At a later point in time, we would like to
"merge" all of the Parquet files previously generated, and for this I would
like to reprocess data exactly from a particular offset inside each of the
partitions. Each of the partitions will have its own starting and ending
offsets that I need to process.

I was wondering if there is an easy way to extend the Kafka 0.9 operator
(perhaps along the lines of the OffsetManager in the 0.8 versions of the
Kafka operator). Thoughts please?

Regards,
Ananth
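
The bounded re-read described above could be sketched roughly as follows. This is a hypothetical illustration, not part of the Apex Kafka operator: the `BoundedReplay`, `OffsetRange`, `accept`, and `done` names are made up here. In a real consumer the start position would come from `KafkaConsumer.seek()`, while the end-offset check would have to live in the operator itself, since the 0.9 consumer has no stop condition of its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: track a [start, end) offset range per partition and
// decide which records belong to the bounded re-read. Names are illustrative
// only and do not correspond to the Apex Kafka operator's API.
public class BoundedReplay {
    // Inclusive start offset and exclusive end offset for one partition.
    static final class OffsetRange {
        final long start;
        final long end;
        OffsetRange(long start, long end) { this.start = start; this.end = end; }
    }

    private final Map<Integer, OffsetRange> ranges = new HashMap<>();
    private final Map<Integer, Long> positions = new HashMap<>();

    void addPartition(int partition, long startOffset, long endOffset) {
        ranges.put(partition, new OffsetRange(startOffset, endOffset));
        // A real consumer would call seek(partition, startOffset) here.
        positions.put(partition, startOffset);
    }

    // True if this record falls inside the partition's replay range.
    boolean accept(int partition, long offset) {
        OffsetRange r = ranges.get(partition);
        if (r == null) return false;
        positions.put(partition, offset + 1);
        return offset >= r.start && offset < r.end;
    }

    // True once every tracked partition has been read up to its end offset.
    boolean done() {
        for (Map.Entry<Integer, OffsetRange> e : ranges.entrySet()) {
            if (positions.get(e.getKey()) < e.getValue().end) return false;
        }
        return true;
    }
}
```

Once `done()` turns true, the operator could stop emitting tuples (or shut the partition's consumer down), which is the part the existing operator does not provide.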

Re: Kafka 0.9 operator to start consuming from a particular offset

Posted by Ananth <ag...@gmail.com>.
Hello Thomas,

The reason we wanted to source it from Kafka is that our tests showed a significant impact when reading from the same disk that was serving the files to the Impala engine that uses them to serve queries.

Perhaps we will have to use file-based re-reads.

Regards
Ananth

> On 11 Jun 2016, at 2:14 PM, Thomas Weise <th...@gmail.com> wrote:
> 
> Ananth,
> 
> If your goal is to merge the parquet files, then why not use these files as source vs. going back to Kafka?
> 
> Thomas
> 
> 
>  
> 
>> On Fri, Jun 10, 2016 at 4:42 PM, Ananth Gundabattula <ag...@gmail.com> wrote:
>> Thanks for the thoughts, Siyuan.
>>
>> Yes, I agree that the problem is inherently batch-oriented. We are hoping to build upon the window concepts to simulate a batch design. (The primary reason is that we do not want two different ETL processing pipeline platforms within our ecosystem.)
>>
>> We are using Kafka as the source of data that multiple data processing frameworks (ETL, M/L frameworks, etc.) run through. Hence Kafka is being used both for streaming (primarily ETL - the Apex system) and batch use cases (primarily M/L).
>> 
>> I shall create a ticket. 
>> 
>> Regards,
>> Ananth  
>> 
>> 
>> 
>>> On Sat, Jun 11, 2016 at 7:15 AM, hsy541@gmail.com <hs...@gmail.com> wrote:
>>> Hi Ananth,
>>> Unlike files, Kafka is usually for streaming cases. Correct me if I'm wrong, but your use case seems like batch processing. We didn't consider an end offset in our Kafka input operator design, but it could be a useful feature. Unfortunately there is no easy way, as far as I know, to extend the existing operator to achieve that.
>>>
>>> OffsetManager is not designed for end offsets. It's only a customizable callback to update the committed offsets, and the start offsets it loads are intended for stateful application restart.
>>> 
>>> Can you create a ticket and elaborate your use case there? Thanks!
>>> 
>>> Regards,
>>> Siyuan
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Friday, June 10, 2016, Ananth Gundabattula <ag...@gmail.com> wrote:
>>>> Hello All,
>>>> 
>>>> I was wondering what the community's thoughts would be on the following?
>>>>
>>>> We are using the Kafka 0.9 input operator to read from a few topics. We are using this stream to generate a Parquet file. This approach is all good for a beginner's use case. At a later point in time, we would like to "merge" all of the Parquet files previously generated, and for this I would like to reprocess data exactly from a particular offset inside each of the partitions. Each of the partitions will have its own starting and ending offsets that I need to process.
>>>>
>>>> I was wondering if there is an easy way to extend the Kafka 0.9 operator (perhaps along the lines of the OffsetManager in the 0.8 versions of the Kafka operator). Thoughts please?
>>>> 
>>>> Regards,
>>>> Ananth
> 

Re: Kafka 0.9 operator to start consuming from a particular offset

Posted by Thomas Weise <th...@gmail.com>.
Ananth,

If your goal is to merge the parquet files, then why not use these files as
source vs. going back to Kafka?

Thomas




On Fri, Jun 10, 2016 at 4:42 PM, Ananth Gundabattula <
agundabattula@gmail.com> wrote:

> Thanks for the thoughts, Siyuan.
>
> Yes, I agree that the problem is inherently batch-oriented. We are hoping
> to build upon the window concepts to simulate a batch design. (The primary
> reason is that we do not want two different ETL processing pipeline
> platforms within our ecosystem.)
>
> We are using Kafka as the source of data that multiple data processing
> frameworks (ETL, M/L frameworks, etc.) run through. Hence Kafka is being
> used both for streaming (primarily ETL - the Apex system) and batch use
> cases (primarily M/L).
>
> I shall create a ticket.
>
> Regards,
> Ananth
>
>
>
> On Sat, Jun 11, 2016 at 7:15 AM, hsy541@gmail.com <hs...@gmail.com>
> wrote:
>
>> Hi Ananth,
>> Unlike files, Kafka is usually for streaming cases. Correct me if I'm
>> wrong, but your use case seems like batch processing. We didn't consider
>> an end offset in our Kafka input operator design, but it could be a
>> useful feature. Unfortunately there is no easy way, as far as I know, to
>> extend the existing operator to achieve that.
>>
>> OffsetManager is not designed for end offsets. It's only a customizable
>> callback to update the committed offsets, and the start offsets it loads
>> are intended for stateful application restart.
>>
>> Can you create a ticket and elaborate your use case there? Thanks!
>>
>> Regards,
>> Siyuan
>>
>>
>>
>>
>>
>> On Friday, June 10, 2016, Ananth Gundabattula <ag...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I was wondering what the community's thoughts would be on the following?
>>>
>>> We are using the Kafka 0.9 input operator to read from a few topics. We
>>> are using this stream to generate a Parquet file. This approach is all
>>> good for a beginner's use case. At a later point in time, we would like
>>> to "merge" all of the Parquet files previously generated, and for this I
>>> would like to reprocess data exactly from a particular offset inside
>>> each of the partitions. Each of the partitions will have its own
>>> starting and ending offsets that I need to process.
>>>
>>> I was wondering if there is an easy way to extend the Kafka 0.9 operator
>>> (perhaps along the lines of the OffsetManager in the 0.8 versions of the
>>> Kafka operator). Thoughts please?
>>>
>>> Regards,
>>> Ananth
>>>
>>
>

Re: Kafka 0.9 operator to start consuming from a particular offset

Posted by Ananth Gundabattula <ag...@gmail.com>.
Thanks for the thoughts, Siyuan.

Yes, I agree that the problem is inherently batch-oriented. We are hoping
to build upon the window concepts to simulate a batch design. (The primary
reason is that we do not want two different ETL processing pipeline
platforms within our ecosystem.)

We are using Kafka as the source of data that multiple data processing
frameworks (ETL, M/L frameworks, etc.) run through. Hence Kafka is being
used both for streaming (primarily ETL - the Apex system) and batch use
cases (primarily M/L).

I shall create a ticket.

Regards,
Ananth



On Sat, Jun 11, 2016 at 7:15 AM, hsy541@gmail.com <hs...@gmail.com> wrote:

> Hi Ananth,
> Unlike files, Kafka is usually for streaming cases. Correct me if I'm
> wrong, but your use case seems like batch processing. We didn't consider
> an end offset in our Kafka input operator design, but it could be a useful
> feature. Unfortunately there is no easy way, as far as I know, to extend
> the existing operator to achieve that.
>
> OffsetManager is not designed for end offsets. It's only a customizable
> callback to update the committed offsets, and the start offsets it loads
> are intended for stateful application restart.
>
> Can you create a ticket and elaborate your use case there? Thanks!
>
> Regards,
> Siyuan
>
>
>
>
>
> On Friday, June 10, 2016, Ananth Gundabattula <ag...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> I was wondering what the community's thoughts would be on the following?
>>
>> We are using the Kafka 0.9 input operator to read from a few topics. We
>> are using this stream to generate a Parquet file. This approach is all
>> good for a beginner's use case. At a later point in time, we would like
>> to "merge" all of the Parquet files previously generated, and for this I
>> would like to reprocess data exactly from a particular offset inside each
>> of the partitions. Each of the partitions will have its own starting and
>> ending offsets that I need to process.
>>
>> I was wondering if there is an easy way to extend the Kafka 0.9 operator
>> (perhaps along the lines of the OffsetManager in the 0.8 versions of the
>> Kafka operator). Thoughts please?
>>
>> Regards,
>> Ananth
>>
>

Re: Kafka 0.9 operator to start consuming from a particular offset

Posted by "hsy541@gmail.com" <hs...@gmail.com>.
Hi Ananth,
Unlike files, Kafka is usually for streaming cases. Correct me if I'm
wrong, but your use case seems like batch processing. We didn't consider
an end offset in our Kafka input operator design, but it could be a useful
feature. Unfortunately there is no easy way, as far as I know, to extend
the existing operator to achieve that.

OffsetManager is not designed for end offsets. It's only a customizable
callback to update the committed offsets, and the start offsets it loads
are intended for stateful application restart.
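
As a rough illustration of that callback shape (the class and method names below are a simplified sketch, not the actual Apex OffsetManager interface): it is only told about committed offsets and hands back start offsets on restart, so nothing in the contract can bound consumption at an end offset.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical sketch of an OffsetManager-style callback.
// It records committed offsets per partition and supplies start offsets
// for a stateful restart. Note there is no notion of an end offset
// anywhere in this contract.
public class InMemoryOffsetStore {
    private final Map<Integer, Long> committed = new HashMap<>();

    // Callback: record the latest committed offset for each partition.
    void updateOffsets(Map<Integer, Long> offsets) {
        committed.putAll(offsets);
    }

    // On restart, the operator would resume each partition from here.
    Map<Integer, Long> loadInitialOffsets() {
        return new HashMap<>(committed);
    }
}
```

A real implementation would persist the map somewhere durable (e.g. HDFS or ZooKeeper) instead of holding it in memory.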

Can you create a ticket and elaborate your use case there? Thanks!

Regards,
Siyuan




On Friday, June 10, 2016, Ananth Gundabattula <ag...@gmail.com>
wrote:

> Hello All,
>
> I was wondering what the community's thoughts would be on the following?
>
> We are using the Kafka 0.9 input operator to read from a few topics. We
> are using this stream to generate a Parquet file. This approach is all
> good for a beginner's use case. At a later point in time, we would like to
> "merge" all of the Parquet files previously generated, and for this I
> would like to reprocess data exactly from a particular offset inside each
> of the partitions. Each of the partitions will have its own starting and
> ending offsets that I need to process.
>
> I was wondering if there is an easy way to extend the Kafka 0.9 operator
> (perhaps along the lines of the OffsetManager in the 0.8 versions of the
> Kafka operator). Thoughts please?
>
> Regards,
> Ananth
>