Posted to user@spark.apache.org by Cody Koeninger <co...@koeninger.org> on 2016/11/17 00:22:47 UTC

Re: Kafka segmentation

Moved to user list.

I'm not really clear on what you're trying to accomplish (why put the
CSV file through Kafka instead of reading it directly with Spark?).

auto.offset.reset=largest just means that when starting the job
without any defined offsets, it will start at the highest (most
recent) available offsets. That's probably not what you want if
you've already loaded the CSV lines into Kafka.
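
If the goal is to re-read CSV lines that are already in the topic, one option
is to start from the beginning of each partition instead. Below is a minimal
sketch, assuming the Kafka 0.8 direct stream API (where the value is
"smallest"; the 0.10 consumer calls it "earliest") and hypothetical broker,
topic, and app names:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("CsvFromKafka")  // illustrative app name
    val ssc = new StreamingContext(conf, Seconds(10))

    // "smallest" starts at the earliest available offset when the job has
    // no offsets of its own to resume from.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",          // hypothetical broker
      "auto.offset.reset" -> "smallest")
    val topics = Set("csv-topic")                          // hypothetical topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).print()  // placeholder action; replace with the real processing
    ssc.start()
    ssc.awaitTermination()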

On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hb...@gmail.com> wrote:
> Hi all,
>
> I would like to ask a question about the size of the Kafka stream batches. I
> want to put data (e.g., a *.csv file) into Kafka, then use Spark Streaming to
> read it back from Kafka and save it to Hive with SparkSQL. The CSV file is
> about 100MB with ~250K messages/rows (each row has about 10 integer fields).
> I see that Spark Streaming first received two large partitions/batches:
> the first had about 60K messages and the second about 50K. But from the
> third batch on, Spark received only 200 messages per batch (or partition).
> I think this problem comes from Kafka or from some configuration in
> Spark. I already tried the setting
> "auto.offset.reset=largest", but every batch still gets only 200 messages.
>
> Could you please tell me how to fix this problem?
> Thank you so much.
>
> Best regards,
> Alex
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Kafka segmentation

Posted by bo yang <bo...@gmail.com>.
I don't remember exactly what configuration I was using. That link has some
good information. Thanks, Cody!


On Wed, Nov 16, 2016 at 5:32 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Yeah, if you're reporting issues, please be clear as to whether
> backpressure is enabled, and whether maxRatePerPartition is set.
>
> I expect that there is something wrong with backpressure, see e.g.
> https://issues.apache.org/jira/browse/SPARK-18371
>
> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com> wrote:
> > I hit similar issue with Spark Streaming. The batch size seemed a little
> > random. Sometime it was large with many Kafka messages inside same batch,
> > sometimes it was very small with just a few messages. Is it possible that
> > was caused by the backpressure implementation in Spark Streaming?
> >
> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> Moved to user list.
> >>
> >> I'm not really clear on what you're trying to accomplish (why put the
> >> csv file through Kafka instead of reading it directly with spark?)
> >>
> >> auto.offset.reset=largest just means that when starting the job
> >> without any defined offsets, it will start at the highest (most
> >> recent) available offsets.  That's probably not what you want if
> >> you've already loaded csv lines into kafka.
> >>
> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hbthien0410@gmail.com
> >
> >> wrote:
> >> > Hi all,
> >> >
> >> > I would like to ask a question related to the size of Kafka stream. I
> >> > want
> >> > to put data (e.g., file *.csv) to Kafka then use Spark streaming to
> get
> >> > the
> >> > output from Kafka and then save to Hive by using SparkSQL. The file
> csv
> >> > is
> >> > about 100MB with ~250K messages/rows (Each row has about 10 fields of
> >> > integer). I see that Spark Streaming first received two
> >> > partitions/batches,
> >> > the first is of 60K messages and the second is of 50K msgs. But from
> the
> >> > third batch, Spark just received 200 messages for each batch (or
> >> > partition).
> >> > I think that this problem is coming from Kafka or some configuration
> in
> >> > Spark. I already tried to configure with the setting
> >> > "auto.offset.reset=largest", but every batch only gets 200 messages.
> >> >
> >> > Could you please tell me how to fix this problem?
> >> > Thank you so much.
> >> >
> >> > Best regards,
> >> > Alex
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >>
> >
>

Re: Kafka segmentation

Posted by Cody Koeninger <co...@koeninger.org>.
I mean I don't understand exactly what the issue is. Can you fill in
these blanks?

My settings are:

My code is:

I expected to see:

Instead, I saw:

On Thu, Nov 17, 2016 at 12:53 PM, Hoang Bao Thien <hb...@gmail.com> wrote:
> I am sorry I don't understand your idea. What do you mean exactly?
>
> On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Ok, I don't think I'm clear on the issue then.  Can you say what the
>> expected behavior is, and what the observed behavior is?
>>
>> On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien <hb...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Thanks for your comments. But in fact, I don't want to limit the size of
>> > batches, it could be any greater size as it does.
>> >
>> > Thien
>> >
>> > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> If you want a consistent limit on the size of batches, use
>> >> spark.streaming.kafka.maxRatePerPartition  (assuming you're using
>> >> createDirectStream)
>> >>
>> >> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
>> >>
>> >> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien
>> >> <hb...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I use CSV and other text files to Kafka just to test Kafka + Spark
>> >> > Streaming
>> >> > by using direct stream. That's why I don't want Spark streaming reads
>> >> > CSVs
>> >> > or text files directly.
>> >> > In addition, I don't want a giant batch of records like the link you
>> >> > sent.
>> >> > The problem is that we should receive the "similar" number of record
>> >> > of
>> >> > all
>> >> > batchs instead of the first two or three batches have so large number
>> >> > of
>> >> > records (e.g., 100K) but the last 1000 batches with only 200 records.
>> >> >
>> >> > I know that the problem is not from the auto.offset.reset=largest,
>> >> > but I
>> >> > don't know what I can do in this case.
>> >> >
>> >> > Do you and other ones could suggest me some solutions please as this
>> >> > seems
>> >> > the normal situation with Kafka+SpartStreaming.
>> >> >
>> >> > Thanks.
>> >> > Alex
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <co...@koeninger.org>
>> >> > wrote:
>> >> >>
>> >> >> Yeah, if you're reporting issues, please be clear as to whether
>> >> >> backpressure is enabled, and whether maxRatePerPartition is set.
>> >> >>
>> >> >> I expect that there is something wrong with backpressure, see e.g.
>> >> >> https://issues.apache.org/jira/browse/SPARK-18371
>> >> >>
>> >> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com>
>> >> >> wrote:
>> >> >> > I hit similar issue with Spark Streaming. The batch size seemed a
>> >> >> > little
>> >> >> > random. Sometime it was large with many Kafka messages inside same
>> >> >> > batch,
>> >> >> > sometimes it was very small with just a few messages. Is it
>> >> >> > possible
>> >> >> > that
>> >> >> > was caused by the backpressure implementation in Spark Streaming?
>> >> >> >
>> >> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger
>> >> >> > <co...@koeninger.org>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Moved to user list.
>> >> >> >>
>> >> >> >> I'm not really clear on what you're trying to accomplish (why put
>> >> >> >> the
>> >> >> >> csv file through Kafka instead of reading it directly with
>> >> >> >> spark?)
>> >> >> >>
>> >> >> >> auto.offset.reset=largest just means that when starting the job
>> >> >> >> without any defined offsets, it will start at the highest (most
>> >> >> >> recent) available offsets.  That's probably not what you want if
>> >> >> >> you've already loaded csv lines into kafka.
>> >> >> >>
>> >> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
>> >> >> >> <hb...@gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Hi all,
>> >> >> >> >
>> >> >> >> > I would like to ask a question related to the size of Kafka
>> >> >> >> > stream. I
>> >> >> >> > want
>> >> >> >> > to put data (e.g., file *.csv) to Kafka then use Spark
>> >> >> >> > streaming
>> >> >> >> > to
>> >> >> >> > get
>> >> >> >> > the
>> >> >> >> > output from Kafka and then save to Hive by using SparkSQL. The
>> >> >> >> > file
>> >> >> >> > csv
>> >> >> >> > is
>> >> >> >> > about 100MB with ~250K messages/rows (Each row has about 10
>> >> >> >> > fields
>> >> >> >> > of
>> >> >> >> > integer). I see that Spark Streaming first received two
>> >> >> >> > partitions/batches,
>> >> >> >> > the first is of 60K messages and the second is of 50K msgs. But
>> >> >> >> > from
>> >> >> >> > the
>> >> >> >> > third batch, Spark just received 200 messages for each batch
>> >> >> >> > (or
>> >> >> >> > partition).
>> >> >> >> > I think that this problem is coming from Kafka or some
>> >> >> >> > configuration
>> >> >> >> > in
>> >> >> >> > Spark. I already tried to configure with the setting
>> >> >> >> > "auto.offset.reset=largest", but every batch only gets 200
>> >> >> >> > messages.
>> >> >> >> >
>> >> >> >> > Could you please tell me how to fix this problem?
>> >> >> >> > Thank you so much.
>> >> >> >> >
>> >> >> >> > Best regards,
>> >> >> >> > Alex
>> >> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> ---------------------------------------------------------------------
>> >> >> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >> >> >>
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Kafka segmentation

Posted by Hoang Bao Thien <hb...@gmail.com>.
I am sorry I don't understand your idea. What do you mean exactly?

On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <co...@koeninger.org> wrote:

> Ok, I don't think I'm clear on the issue then.  Can you say what the
> expected behavior is, and what the observed behavior is?
>
> On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien <hb...@gmail.com>
> wrote:
> > Hi,
> >
> > Thanks for your comments. But in fact, I don't want to limit the size of
> > batches, it could be any greater size as it does.
> >
> > Thien
> >
> > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> If you want a consistent limit on the size of batches, use
> >> spark.streaming.kafka.maxRatePerPartition  (assuming you're using
> >> createDirectStream)
> >>
> >> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
> >>
> >> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <
> hbthien0410@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I use CSV and other text files to Kafka just to test Kafka + Spark
> >> > Streaming
> >> > by using direct stream. That's why I don't want Spark streaming reads
> >> > CSVs
> >> > or text files directly.
> >> > In addition, I don't want a giant batch of records like the link you
> >> > sent.
> >> > The problem is that we should receive the "similar" number of record
> of
> >> > all
> >> > batchs instead of the first two or three batches have so large number
> of
> >> > records (e.g., 100K) but the last 1000 batches with only 200 records.
> >> >
> >> > I know that the problem is not from the auto.offset.reset=largest,
> but I
> >> > don't know what I can do in this case.
> >> >
> >> > Do you and other ones could suggest me some solutions please as this
> >> > seems
> >> > the normal situation with Kafka+SpartStreaming.
> >> >
> >> > Thanks.
> >> > Alex
> >> >
> >> >
> >> >
> >> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <co...@koeninger.org>
> >> > wrote:
> >> >>
> >> >> Yeah, if you're reporting issues, please be clear as to whether
> >> >> backpressure is enabled, and whether maxRatePerPartition is set.
> >> >>
> >> >> I expect that there is something wrong with backpressure, see e.g.
> >> >> https://issues.apache.org/jira/browse/SPARK-18371
> >> >>
> >> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com>
> wrote:
> >> >> > I hit similar issue with Spark Streaming. The batch size seemed a
> >> >> > little
> >> >> > random. Sometime it was large with many Kafka messages inside same
> >> >> > batch,
> >> >> > sometimes it was very small with just a few messages. Is it
> possible
> >> >> > that
> >> >> > was caused by the backpressure implementation in Spark Streaming?
> >> >> >
> >> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <
> cody@koeninger.org>
> >> >> > wrote:
> >> >> >>
> >> >> >> Moved to user list.
> >> >> >>
> >> >> >> I'm not really clear on what you're trying to accomplish (why put
> >> >> >> the
> >> >> >> csv file through Kafka instead of reading it directly with spark?)
> >> >> >>
> >> >> >> auto.offset.reset=largest just means that when starting the job
> >> >> >> without any defined offsets, it will start at the highest (most
> >> >> >> recent) available offsets.  That's probably not what you want if
> >> >> >> you've already loaded csv lines into kafka.
> >> >> >>
> >> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
> >> >> >> <hb...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi all,
> >> >> >> >
> >> >> >> > I would like to ask a question related to the size of Kafka
> >> >> >> > stream. I
> >> >> >> > want
> >> >> >> > to put data (e.g., file *.csv) to Kafka then use Spark streaming
> >> >> >> > to
> >> >> >> > get
> >> >> >> > the
> >> >> >> > output from Kafka and then save to Hive by using SparkSQL. The
> >> >> >> > file
> >> >> >> > csv
> >> >> >> > is
> >> >> >> > about 100MB with ~250K messages/rows (Each row has about 10
> fields
> >> >> >> > of
> >> >> >> > integer). I see that Spark Streaming first received two
> >> >> >> > partitions/batches,
> >> >> >> > the first is of 60K messages and the second is of 50K msgs. But
> >> >> >> > from
> >> >> >> > the
> >> >> >> > third batch, Spark just received 200 messages for each batch (or
> >> >> >> > partition).
> >> >> >> > I think that this problem is coming from Kafka or some
> >> >> >> > configuration
> >> >> >> > in
> >> >> >> > Spark. I already tried to configure with the setting
> >> >> >> > "auto.offset.reset=largest", but every batch only gets 200
> >> >> >> > messages.
> >> >> >> >
> >> >> >> > Could you please tell me how to fix this problem?
> >> >> >> > Thank you so much.
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Alex
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> ------------------------------------------------------------
> ---------
> >> >> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >> >> >>
> >> >> >
> >> >
> >> >
> >
> >
>

Re: Kafka segmentation

Posted by Cody Koeninger <co...@koeninger.org>.
Ok, I don't think I'm clear on the issue then.  Can you say what the
expected behavior is, and what the observed behavior is?

On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien <hb...@gmail.com> wrote:
> Hi,
>
> Thanks for your comments. But in fact, I don't want to limit the size of
> batches, it could be any greater size as it does.
>
> Thien
>
> On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> If you want a consistent limit on the size of batches, use
>> spark.streaming.kafka.maxRatePerPartition  (assuming you're using
>> createDirectStream)
>>
>> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
>>
>> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <hb...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I use CSV and other text files to Kafka just to test Kafka + Spark
>> > Streaming
>> > by using direct stream. That's why I don't want Spark streaming reads
>> > CSVs
>> > or text files directly.
>> > In addition, I don't want a giant batch of records like the link you
>> > sent.
>> > The problem is that we should receive the "similar" number of record of
>> > all
>> > batchs instead of the first two or three batches have so large number of
>> > records (e.g., 100K) but the last 1000 batches with only 200 records.
>> >
>> > I know that the problem is not from the auto.offset.reset=largest, but I
>> > don't know what I can do in this case.
>> >
>> > Do you and other ones could suggest me some solutions please as this
>> > seems
>> > the normal situation with Kafka+SpartStreaming.
>> >
>> > Thanks.
>> > Alex
>> >
>> >
>> >
>> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> Yeah, if you're reporting issues, please be clear as to whether
>> >> backpressure is enabled, and whether maxRatePerPartition is set.
>> >>
>> >> I expect that there is something wrong with backpressure, see e.g.
>> >> https://issues.apache.org/jira/browse/SPARK-18371
>> >>
>> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com> wrote:
>> >> > I hit similar issue with Spark Streaming. The batch size seemed a
>> >> > little
>> >> > random. Sometime it was large with many Kafka messages inside same
>> >> > batch,
>> >> > sometimes it was very small with just a few messages. Is it possible
>> >> > that
>> >> > was caused by the backpressure implementation in Spark Streaming?
>> >> >
>> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <co...@koeninger.org>
>> >> > wrote:
>> >> >>
>> >> >> Moved to user list.
>> >> >>
>> >> >> I'm not really clear on what you're trying to accomplish (why put
>> >> >> the
>> >> >> csv file through Kafka instead of reading it directly with spark?)
>> >> >>
>> >> >> auto.offset.reset=largest just means that when starting the job
>> >> >> without any defined offsets, it will start at the highest (most
>> >> >> recent) available offsets.  That's probably not what you want if
>> >> >> you've already loaded csv lines into kafka.
>> >> >>
>> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
>> >> >> <hb...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi all,
>> >> >> >
>> >> >> > I would like to ask a question related to the size of Kafka
>> >> >> > stream. I
>> >> >> > want
>> >> >> > to put data (e.g., file *.csv) to Kafka then use Spark streaming
>> >> >> > to
>> >> >> > get
>> >> >> > the
>> >> >> > output from Kafka and then save to Hive by using SparkSQL. The
>> >> >> > file
>> >> >> > csv
>> >> >> > is
>> >> >> > about 100MB with ~250K messages/rows (Each row has about 10 fields
>> >> >> > of
>> >> >> > integer). I see that Spark Streaming first received two
>> >> >> > partitions/batches,
>> >> >> > the first is of 60K messages and the second is of 50K msgs. But
>> >> >> > from
>> >> >> > the
>> >> >> > third batch, Spark just received 200 messages for each batch (or
>> >> >> > partition).
>> >> >> > I think that this problem is coming from Kafka or some
>> >> >> > configuration
>> >> >> > in
>> >> >> > Spark. I already tried to configure with the setting
>> >> >> > "auto.offset.reset=largest", but every batch only gets 200
>> >> >> > messages.
>> >> >> >
>> >> >> > Could you please tell me how to fix this problem?
>> >> >> > Thank you so much.
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Alex
>> >> >> >
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >> >>
>> >> >
>> >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Kafka segmentation

Posted by Hoang Bao Thien <hb...@gmail.com>.
Hi,

Thanks for your comments. But in fact, I don't want to limit the size of the
batches; they can be as large as they like, just as they are now.

Thien

On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <co...@koeninger.org> wrote:

> If you want a consistent limit on the size of batches, use
> spark.streaming.kafka.maxRatePerPartition  (assuming you're using
> createDirectStream)
>
> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
>
> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <hb...@gmail.com>
> wrote:
> > Hi,
> >
> > I use CSV and other text files to Kafka just to test Kafka + Spark
> Streaming
> > by using direct stream. That's why I don't want Spark streaming reads
> CSVs
> > or text files directly.
> > In addition, I don't want a giant batch of records like the link you
> sent.
> > The problem is that we should receive the "similar" number of record of
> all
> > batchs instead of the first two or three batches have so large number of
> > records (e.g., 100K) but the last 1000 batches with only 200 records.
> >
> > I know that the problem is not from the auto.offset.reset=largest, but I
> > don't know what I can do in this case.
> >
> > Do you and other ones could suggest me some solutions please as this
> seems
> > the normal situation with Kafka+SpartStreaming.
> >
> > Thanks.
> > Alex
> >
> >
> >
> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> Yeah, if you're reporting issues, please be clear as to whether
> >> backpressure is enabled, and whether maxRatePerPartition is set.
> >>
> >> I expect that there is something wrong with backpressure, see e.g.
> >> https://issues.apache.org/jira/browse/SPARK-18371
> >>
> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com> wrote:
> >> > I hit similar issue with Spark Streaming. The batch size seemed a
> little
> >> > random. Sometime it was large with many Kafka messages inside same
> >> > batch,
> >> > sometimes it was very small with just a few messages. Is it possible
> >> > that
> >> > was caused by the backpressure implementation in Spark Streaming?
> >> >
> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <co...@koeninger.org>
> >> > wrote:
> >> >>
> >> >> Moved to user list.
> >> >>
> >> >> I'm not really clear on what you're trying to accomplish (why put the
> >> >> csv file through Kafka instead of reading it directly with spark?)
> >> >>
> >> >> auto.offset.reset=largest just means that when starting the job
> >> >> without any defined offsets, it will start at the highest (most
> >> >> recent) available offsets.  That's probably not what you want if
> >> >> you've already loaded csv lines into kafka.
> >> >>
> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
> >> >> <hb...@gmail.com>
> >> >> wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > I would like to ask a question related to the size of Kafka
> stream. I
> >> >> > want
> >> >> > to put data (e.g., file *.csv) to Kafka then use Spark streaming to
> >> >> > get
> >> >> > the
> >> >> > output from Kafka and then save to Hive by using SparkSQL. The file
> >> >> > csv
> >> >> > is
> >> >> > about 100MB with ~250K messages/rows (Each row has about 10 fields
> of
> >> >> > integer). I see that Spark Streaming first received two
> >> >> > partitions/batches,
> >> >> > the first is of 60K messages and the second is of 50K msgs. But
> from
> >> >> > the
> >> >> > third batch, Spark just received 200 messages for each batch (or
> >> >> > partition).
> >> >> > I think that this problem is coming from Kafka or some
> configuration
> >> >> > in
> >> >> > Spark. I already tried to configure with the setting
> >> >> > "auto.offset.reset=largest", but every batch only gets 200
> messages.
> >> >> >
> >> >> > Could you please tell me how to fix this problem?
> >> >> > Thank you so much.
> >> >> >
> >> >> > Best regards,
> >> >> > Alex
> >> >> >
> >> >>
> >> >> ------------------------------------------------------------
> ---------
> >> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >> >>
> >> >
> >
> >
>

Re: Kafka segmentation

Posted by Cody Koeninger <co...@koeninger.org>.
If you want a consistent limit on the size of batches, use
spark.streaming.kafka.maxRatePerPartition  (assuming you're using
createDirectStream)

http://spark.apache.org/docs/latest/configuration.html#spark-streaming
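
A hedged sketch of setting it on the SparkConf (the rate below is only an
illustration, not a recommendation):

    import org.apache.spark.SparkConf

    // At most 10,000 records per second per Kafka partition; with a
    // 10-second batch interval that caps each partition at 100,000
    // records per batch.
    val conf = new SparkConf()
      .setAppName("CsvFromKafka")
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")

The same property can also be passed to spark-submit with --conf.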

On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien <hb...@gmail.com> wrote:
> Hi,
>
> I put CSV and other text files into Kafka just to test Kafka + Spark Streaming
> with a direct stream. That's why I don't want Spark Streaming to read the CSVs
> or text files directly.
> In addition, I don't want a giant batch of records like the link you sent.
> The problem is that we should receive a similar number of records in every
> batch, instead of the first two or three batches having a very large number of
> records (e.g., 100K) while the last 1000 batches get only 200 records each.
>
> I know that the problem is not from auto.offset.reset=largest, but I
> don't know what I can do in this case.
>
> Could you or anyone else please suggest some solutions? This seems to be the
> normal situation with Kafka + Spark Streaming.
>
> Thanks.
> Alex
>
>
>
> On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Yeah, if you're reporting issues, please be clear as to whether
>> backpressure is enabled, and whether maxRatePerPartition is set.
>>
>> I expect that there is something wrong with backpressure, see e.g.
>> https://issues.apache.org/jira/browse/SPARK-18371
>>
>> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com> wrote:
>> > I hit similar issue with Spark Streaming. The batch size seemed a little
>> > random. Sometime it was large with many Kafka messages inside same
>> > batch,
>> > sometimes it was very small with just a few messages. Is it possible
>> > that
>> > was caused by the backpressure implementation in Spark Streaming?
>> >
>> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> Moved to user list.
>> >>
>> >> I'm not really clear on what you're trying to accomplish (why put the
>> >> csv file through Kafka instead of reading it directly with spark?)
>> >>
>> >> auto.offset.reset=largest just means that when starting the job
>> >> without any defined offsets, it will start at the highest (most
>> >> recent) available offsets.  That's probably not what you want if
>> >> you've already loaded csv lines into kafka.
>> >>
>> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
>> >> <hb...@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I would like to ask a question related to the size of Kafka stream. I
>> >> > want
>> >> > to put data (e.g., file *.csv) to Kafka then use Spark streaming to
>> >> > get
>> >> > the
>> >> > output from Kafka and then save to Hive by using SparkSQL. The file
>> >> > csv
>> >> > is
>> >> > about 100MB with ~250K messages/rows (Each row has about 10 fields of
>> >> > integer). I see that Spark Streaming first received two
>> >> > partitions/batches,
>> >> > the first is of 60K messages and the second is of 50K msgs. But from
>> >> > the
>> >> > third batch, Spark just received 200 messages for each batch (or
>> >> > partition).
>> >> > I think that this problem is coming from Kafka or some configuration
>> >> > in
>> >> > Spark. I already tried to configure with the setting
>> >> > "auto.offset.reset=largest", but every batch only gets 200 messages.
>> >> >
>> >> > Could you please tell me how to fix this problem?
>> >> > Thank you so much.
>> >> >
>> >> > Best regards,
>> >> > Alex
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>
>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Kafka segmentation

Posted by Cody Koeninger <co...@koeninger.org>.
Yeah, if you're reporting issues, please be clear as to whether
backpressure is enabled, and whether maxRatePerPartition is set.

I expect that there is something wrong with backpressure, see e.g.
https://issues.apache.org/jira/browse/SPARK-18371
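
For reference, a minimal sketch of how both settings might be stated when
reporting (the values here are illustrative assumptions, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("CsvFromKafka")
      // Let Spark adjust the ingestion rate from observed batch delays.
      .set("spark.streaming.backpressure.enabled", "true")
      // Upper bound per partition that applies even with backpressure on.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")

Stating both values explicitly, or noting that they are left unset, makes the
observed batch sizes much easier to reproduce.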

On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bo...@gmail.com> wrote:
> I hit similar issue with Spark Streaming. The batch size seemed a little
> random. Sometime it was large with many Kafka messages inside same batch,
> sometimes it was very small with just a few messages. Is it possible that
> was caused by the backpressure implementation in Spark Streaming?
>
> On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Moved to user list.
>>
>> I'm not really clear on what you're trying to accomplish (why put the
>> csv file through Kafka instead of reading it directly with spark?)
>>
>> auto.offset.reset=largest just means that when starting the job
>> without any defined offsets, it will start at the highest (most
>> recent) available offsets.  That's probably not what you want if
>> you've already loaded csv lines into kafka.
>>
>> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hb...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I would like to ask a question related to the size of Kafka stream. I
>> > want
>> > to put data (e.g., file *.csv) to Kafka then use Spark streaming to get
>> > the
>> > output from Kafka and then save to Hive by using SparkSQL. The file csv
>> > is
>> > about 100MB with ~250K messages/rows (Each row has about 10 fields of
>> > integer). I see that Spark Streaming first received two
>> > partitions/batches,
>> > the first is of 60K messages and the second is of 50K msgs. But from the
>> > third batch, Spark just received 200 messages for each batch (or
>> > partition).
>> > I think that this problem is coming from Kafka or some configuration in
>> > Spark. I already tried to configure with the setting
>> > "auto.offset.reset=largest", but every batch only gets 200 messages.
>> >
>> > Could you please tell me how to fix this problem?
>> > Thank you so much.
>> >
>> > Best regards,
>> > Alex
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Kafka segmentation

Posted by bo yang <bo...@gmail.com>.
I hit a similar issue with Spark Streaming. The batch size seemed a little
random: sometimes a batch was large, with many Kafka messages in it, and
sometimes it was very small, with just a few messages. Is it possible that
this was caused by the backpressure implementation in Spark Streaming?

On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Moved to user list.
>
> I'm not really clear on what you're trying to accomplish (why put the
> csv file through Kafka instead of reading it directly with spark?)
>
> auto.offset.reset=largest just means that when starting the job
> without any defined offsets, it will start at the highest (most
> recent) available offsets.  That's probably not what you want if
> you've already loaded csv lines into kafka.
>
> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hb...@gmail.com>
> wrote:
> > Hi all,
> >
> > I would like to ask a question related to the size of Kafka stream. I
> want
> > to put data (e.g., file *.csv) to Kafka then use Spark streaming to get
> the
> > output from Kafka and then save to Hive by using SparkSQL. The file csv
> is
> > about 100MB with ~250K messages/rows (Each row has about 10 fields of
> > integer). I see that Spark Streaming first received two
> partitions/batches,
> > the first is of 60K messages and the second is of 50K msgs. But from the
> > third batch, Spark just received 200 messages for each batch (or
> partition).
> > I think that this problem is coming from Kafka or some configuration in
> > Spark. I already tried to configure with the setting
> > "auto.offset.reset=largest", but every batch only gets 200 messages.
> >
> > Could you please tell me how to fix this problem?
> > Thank you so much.
> >
> > Best regards,
> > Alex
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>