Posted to user@storm.apache.org by Chen Wang <ch...@gmail.com> on 2014/07/11 20:58:43 UTC

writing huge amount of data to HDFS

Hi, Guys,
I have a Storm topology with a single-threaded bolt querying a large amount of
data (from Elasticsearch), which emits to an HBase bolt (10 threads) that does
some filtering and then emits to an Avro bolt (10 threads). The Avro bolt simply
emits the tuple to an Avro client, which is received by two Flume nodes and then
sunk into HDFS. I am testing in local mode.

In the query bolt, I am getting around 15000 entries in a batch. The query
itself takes about 4 seconds; however, the emit method in the query bolt takes
about 20 seconds. Does this mean that
the downstream bolts (HBase bolt and Avro bolt) cannot keep up with the query
bolt?

How can I tune my topology to make this process as fast as possible? I
tried increasing the HBase bolt to 20 threads, but it does not seem to help.

I use shuffleGrouping from the query bolt to the HBase bolt, and from the HBase
bolt to the Avro bolt.
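
For reference, the wiring described above looks roughly like the following
TopologyBuilder sketch; the spout and bolt class names are placeholders, not
the actual code:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class EsToHdfsTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // spout reading the user's ES query from Kafka (placeholder class)
        builder.setSpout("query-spout", new KafkaQuerySpout(), 1);
        // single-threaded bolt that runs the ES query and emits the entries
        builder.setBolt("es-query", new EsQueryBolt(), 1)
               .shuffleGrouping("query-spout");
        // 10 HBase executors doing the filtering
        builder.setBolt("hbase", new HBaseFilterBolt(), 10)
               .shuffleGrouping("es-query");
        // 10 Avro executors handing tuples to the Avro/Flume client
        builder.setBolt("avro", new AvroClientBolt(), 10)
               .shuffleGrouping("hbase");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("es-to-hdfs", new Config(), builder.createTopology());
    }
}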

Thanks for any advice.
Chen

Re: writing huge amount of data to HDFS

Posted by Harsha <st...@harsha.io>.
Hi Chen,

          I thought your bolt was the one reading from ES
and that there was no spout. I suppose it's ok since the ES queries
are flowing in from Kafka. Did you measure the HBase bolt's execute
method? It looks like it makes a read call to HBase for each
tuple emitted from the ES bolt. From what I can see, the ES bolt emits a bunch
of tuples, which go to the HBase bolt; the HBase bolt makes a call to the HBase
DB and may be blocking there waiting for the results of the HBase
query, which makes it slower to consume from the ES bolt.

Ideally, if you can batch tuples into a single HBase query it will speed things
up, instead of making a call for every tuple; or you can reduce the
batch size for the ES query and emit fewer tuples instead of 15k at
a time. Increasing the parallelism of the HBase bolt might not be
helpful, as you increase the number of connections to HBase. I would start
with measuring the HBase bolt's execute method latency, reduce the
ES batch size, and try to batch up the HBase reads.
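
A rough sketch of what batching the reads inside the HBase bolt could look
like, using the stock HBase client API (this is only an illustration, not the
code from the gist; the table name and the "rowkey" field are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Buffers incoming tuples and issues one HBase multi-get per batch
// instead of one read per tuple.
public class BatchingHBaseBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 500;
    private transient HTable table;
    private OutputCollector collector;
    private final List<Tuple> buffer = new ArrayList<Tuple>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            Configuration hbaseConf = HBaseConfiguration.create();
            table = new HTable(hbaseConf, "user_profile");  // hypothetical table name
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        buffer.add(tuple);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
        // a real bolt would also flush on a tick tuple so a partial
        // batch does not sit in the buffer indefinitely
    }

    private void flush() {
        try {
            List<Get> gets = new ArrayList<Get>(buffer.size());
            for (Tuple t : buffer) {
                gets.add(new Get(Bytes.toBytes(t.getStringByField("rowkey"))));
            }
            Result[] results = table.get(gets);  // one round trip for the whole batch
            for (int i = 0; i < results.length; i++) {
                Tuple t = buffer.get(i);
                if (!results[i].isEmpty()) {     // filtering step: keep rows found in HBase
                    collector.emit(t, new Values(t.getStringByField("rowkey")));
                }
                collector.ack(t);
            }
        } catch (Exception e) {
            for (Tuple t : buffer) {
                collector.fail(t);
            }
        } finally {
            buffer.clear();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("rowkey"));
    }
}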

-Harsha





On Sat, Jul 12, 2014, at 12:33 AM, Chen Wang wrote:

Thanks Harsha.
My spout is listening to a kafka queue which contains the es
query from user's input. Is it safe to spawn a thread in the
spout and do the ES query directly in the spout? What is the
fundamental difference in doing the query in a thread of spout
VS a thread of bolt?

The reason of using flume is that I have to split the data into
different partitions(hdfs folders) depending on the value of
the bolt: meaning I will need to modify the hdfs bolt any ways.
In the past, i tried to shift large amount of data to a
partitioned hive table using this approach(avro to flume to
hdfs), and it seems to working well. Thus i stick to this
approach without reinventing the wheel.

Thanks,
Chen


On Fri, Jul 11, 2014 at 4:51 PM, Harsha <[1...@harsha.io>
wrote:

Hi Chen,
          I looked at your code. The first part is inside a
Bolt's execute method ?  and it looks like fetching all the
data (10000 per call)  from a elastic search and emitting each
value from inside the execute method which ends when the ES
result set runs out.
It doesn't look like you followed storm's conventions here was
there any reason not use Spout here . A bolt' execute method
gets called for every tuple that's getting passed. Docs on
spout &
bolt [2]https://storm.incubator.apache.org/documentation/Concepts.html

from your comment in the code "10000 hits per shard will be
returned for each scroll" and if it taking longer  read 10000
records from ES I would suggest you to reduce this batch size
". The idea here is you are making quicker calls to ES and
pushing the data downstream and making another call to ES for
the next batch instead of acquiring one big batch in single
call.

 "i am  getting around 15000 entries in a batch, the query
itself takes about 4second, however, he emit method in the
query bolt takes about 20 seconds." Can you try reducing the
batch size here too it looks like the time is taking emitting
15k entries at one go.

          Was there any reason/utility of using flume to write
to hdfs. If not I would recommend
using [3]https://github.com/ptgoetz/storm-hdfs bolt .



On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:

Here is the output from the ES query bolt:
 "Total execution time for this batch: 179655(millisecond)" is
the call time around .emit. As you can see, to emit 14000
entries, it takes
anytime from 145231 to 180000



On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang
<[4...@gmail.com> wrote:

here you go:
[5]https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
Its actually pretty straight forward. The only thing worth of
mention is that I use another thread in the ES bolt to do the
actual query and tuple emit.
Thanks for looking.
Chen



On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin
<[6...@gmail.com> wrote:

Can you show some code? 200 seconds for 15K puts sounds like
you're not batching.



On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang
<[7...@gmail.com> wrote:

typo in previous email
The emit method in the query bolt takes about 200(instead of
20) seconds..



On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang
<[8...@gmail.com> wrote:

Hi, Guys,
I have a storm topology, with a single thread bolt querying
large amount of data (From elasticsearch), and emit to a HBase
bolt(10 threads), doing some filtering, then emit to Arvo
bolt.(10threads) The arvo bolt simply emit the tuple to arvo
client, which will be received by two flume node and then sink
into hdfs. I am testing in local mode.

In the query bolt, i am  getting around 15000 entries in a
batch, the query itself takes about 4second, however, he emit
method in the query bolt takes about 20 seconds. Does it mean
that
the downstream bolt(HBaseBolt and Avro bolt) cannot catch up
with the query bolt?

How can I tune my topology to make this process as fast as
possible? I tried to increase the HBase thread to 20 but it
does not seem to help.

I use shuffleGrouping from query bolt to hbase bolt, and from
hbase bolt to avro.

Thanks for any advice.
Chen

References

1. mailto:storm@harsha.io
2. https://storm.incubator.apache.org/documentation/Concepts.html
3. https://github.com/ptgoetz/storm-hdfs
4. mailto:chen.apache.solr@gmail.com
5. https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
6. mailto:sam.goodwin89@gmail.com
7. mailto:chen.apache.solr@gmail.com
8. mailto:chen.apache.solr@gmail.com

Re: writing huge amount of data to HDFS

Posted by Chen Wang <ch...@gmail.com>.
Thanks Harsha.
My spout is listening to a Kafka queue which contains the ES query from the
user's input. Is it safe to spawn a thread in the spout and do the ES query
directly in the spout? What is the fundamental difference between doing the
query in a thread of a spout vs. a thread of a bolt?

The reason for using Flume is that I have to split the data into different
partitions (HDFS folders) depending on the value coming out of the bolt, meaning
I would need to modify the HDFS bolt anyway. In the past, I moved a large
amount of data into a partitioned Hive table using this approach (Avro to
Flume to HDFS), and it seemed to work well, so I am sticking to this approach
rather than reinventing the wheel.

Thanks,
Chen


On Fri, Jul 11, 2014 at 4:51 PM, Harsha <st...@harsha.io> wrote:

>  Hi Chen,
>           I looked at your code. The first part is inside a Bolt's execute
> method ?  and it looks like fetching all the data (10000 per call)  from a
> elastic search and emitting each value from inside the execute method which
> ends when the ES result set runs out.
> It doesn't look like you followed storm's conventions here was there any
> reason not use Spout here . A bolt' execute method gets called for every
> tuple that's getting passed. Docs on spout & bolt
> https://storm.incubator.apache.org/documentation/Concepts.html
>
> from your comment in the code "10000 hits per shard will be returned for
> each scroll" and if it taking longer  read 10000 records from ES I would
> suggest you to reduce this batch size ". The idea here is you are making
> quicker calls to ES and pushing the data downstream and making another call
> to ES for the next batch instead of acquiring one big batch in single call.
>
>  "i am  getting around 15000 entries in a batch, the query itself takes
> about 4second, however, he emit method in the query bolt takes about 20
> seconds." Can you try reducing the batch size here too it looks like the
> time is taking emitting 15k entries at one go.
>
>           Was there any reason/utility of using flume to write to hdfs. If
> not I would recommend using https://github.com/ptgoetz/storm-hdfs bolt .
>
>
>
> On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:
>
> Here is the output from the ES query bolt:
>  "Total execution time for this batch: 179655(millisecond)" is the call
> time around .emit. As you can see, to emit 14000 entries, it takes
> anytime from 145231 to 180000
>
>
>
> On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang <ch...@gmail.com>
> wrote:
>
> here you go:
> https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
> Its actually pretty straight forward. The only thing worth of mention is
> that I use another thread in the ES bolt to do the actual query and tuple
> emit.
> Thanks for looking.
> Chen
>
>
>
> On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin <sa...@gmail.com>
> wrote:
>
> Can you show some code? 200 seconds for 15K puts sounds like you're not
> batching.
>
>
> On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang <ch...@gmail.com>
> wrote:
>
> typo in previous email
> The emit method in the query bolt takes about 200(instead of 20) seconds..
>
>
> On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <ch...@gmail.com>
> wrote:
>
> Hi, Guys,
> I have a storm topology, with a single thread bolt querying large amount
> of data (From elasticsearch), and emit to a HBase bolt(10 threads), doing
> some filtering, then emit to Arvo bolt.(10threads) The arvo bolt simply
> emit the tuple to arvo client, which will be received by two flume node and
> then sink into hdfs. I am testing in local mode.
>
> In the query bolt, i am  getting around 15000 entries in a batch, the
> query itself takes about 4second, however, he emit method in the query bolt
> takes about 20 seconds. Does it mean that
> the downstream bolt(HBaseBolt and Avro bolt) cannot catch up with the
> query bolt?
>
> How can I tune my topology to make this process as fast as possible? I
> tried to increase the HBase thread to 20 but it does not seem to help.
>
> I use shuffleGrouping from query bolt to hbase bolt, and from hbase bolt
> to avro.
>
> Thanks for any advice.
> Chen
>
>
>
>
>
>
>
>
>
>
>

Re: writing huge amount of data to HDFS

Posted by Harsha <st...@harsha.io>.
Hi Chen,

          I looked at your code. The first part is inside a
bolt's execute method? It looks like it fetches all the
data (10000 per call) from Elasticsearch and emits each
value from inside the execute method, which ends when the ES
result set runs out.

It doesn't look like you followed Storm's conventions here; was
there any reason not to use a spout here? A bolt's execute method
gets called for every tuple that's passed in. Docs on
spouts & bolts: [1]https://storm.incubator.apache.org/documentation/Concepts.html
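
To make that concrete, here is a rough, hypothetical sketch of what a
spout-driven version could look like (this is not the poster's actual code;
class and field names are made up):

import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.search.SearchHit;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical spout: holds an ES scroll cursor and emits a little at a time
// from nextTuple(), so Storm paces the reads instead of one giant emit loop.
public class EsScrollSpout extends BaseRichSpout {
    private transient Client client;          // ES client, built in open() (omitted)
    private SpoutOutputCollector collector;
    private String scrollId;                  // set by the initial scan/scroll request
    private final Queue<String> pending = new LinkedList<String>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // build the ES client and issue the initial scan/scroll search here,
        // saving its scroll id into scrollId (omitted for brevity)
    }

    @Override
    public void nextTuple() {
        if (pending.isEmpty() && scrollId != null) {
            // pull the next (small) page of the scroll
            SearchResponse resp = client.prepareSearchScroll(scrollId)
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
            scrollId = resp.getScrollId();
            for (SearchHit hit : resp.getHits().getHits()) {
                pending.add(hit.getSourceAsString());
            }
        }
        String next = pending.poll();
        if (next != null) {
            collector.emit(new Values(next));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("doc"));
    }
}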



From your comment in the code, "10000 hits per shard will be
returned for each scroll": if it is taking that long to read 10000
records from ES, I would suggest you reduce this batch size.
The idea here is that you make quicker calls to ES,
push the data downstream, and then make another call to ES for
the next batch, instead of acquiring one big batch in a single
call.
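
For example, with the ES 1.x Java client the per-shard page size is set on the
initial scan/scroll request; a minimal sketch, with a placeholder index name
and query:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;

public class EsScrollHelper {
    // Opens a scan/scroll with a smaller per-shard page size. setSize() is per
    // shard, so e.g. 1000 on a 5-shard index gives ~5000 hits per scroll page.
    public static SearchResponse openScroll(Client client, int pageSizePerShard) {
        return client.prepareSearch("events")              // placeholder index name
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchAllQuery())   // placeholder query
                .setSize(pageSizePerShard)
                .execute().actionGet();
    }
    // follow-up pages come from client.prepareSearchScroll(resp.getScrollId())
    // with the same setScroll() keep-alive, emitting each page downstream
    // before asking ES for the next one
}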



 "i am  getting around 15000 entries in a batch, the query
itself takes about 4second, however, he emit method in the
query bolt takes about 20 seconds." Can you try reducing the
batch size here too? It looks like the time is going into
emitting 15k entries in one go.



          Was there any reason/utility in using Flume to write
to HDFS? If not, I would recommend
using the [2]https://github.com/ptgoetz/storm-hdfs bolt.
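
A minimal configuration sketch for that bolt, adapted from the storm-hdfs
README (the filesystem URL, path, delimiter, and rotation size here are
placeholders; partitioning output into per-key folders would still need a
custom FileNameFormat):

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBoltConfig {
    public static HdfsBolt build() {
        // sync the filesystem after every 1000 tuples
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);
        // rotate files once they reach 64 MB
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(64.0f, Units.MB);
        // write files under /data/events/ (placeholder path)
        FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/data/events/");
        // pipe-delimited records
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")         // placeholder namenode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}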







On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:

Here is the output from the ES query bolt:
 "Total execution time for this batch: 179655(millisecond)" is
the call time around .emit. As you can see, to emit 14000
entries, it takes
anytime from 145231 to 180000


 INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=14000 hits=14000 took=26172
40813 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-13_00-00-00
40889 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 782
40890 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 4000 records
59335 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
59335 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=28000 hits=14000 took=18033
238920 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-14_00-00-00
238990 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 179655
238990 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 8000 records
257633 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
257633 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=42000 hits=14000 took=17926
260932 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-15_00-00-00
402852 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-16_00-00-00
402865 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 145231
402865 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 2000 records
417427 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
417427 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=56000 hits=14000 took=13962
417459 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-17_00-00-00
417493 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 66
417493 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 6000 records
429629 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
429629 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=70000 hits=14000 took=12009
441208 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-18_00-00-00
744276 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-19_00-00-00
744277 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 314647
744277 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 0 records
779030 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
779030 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=84000 hits=14000 took=34631
785315 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-20_00-00-00
785332 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 6302
785332 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 4000 records
811859 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
811859 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=98000 hits=14000 took=25806
945938 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-21_00-00-00
960308 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 148449
960308 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 8000 records
983611 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
983611 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=112000 hits=14000 took=22698
983627 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-22_00-00-00
1002262 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-23_00-00-00
1002272 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 18661
1002272 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 2000 records
1021226 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1021227 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=126000 hits=14000 took=18854
1110480 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-24_00-00-00
1188188 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 166961
1188188 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 6000 records
1204474 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1204474 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=140000 hits=14000 took=15422
1204495 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-25_00-00-00
1270240 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-26_00-00-00
1270240 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 65766
1270240 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 0 records
1284391 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1284391 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=145861 hits=5861 took=14084
1284414 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 23
1284414 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 5861 records
1284417 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1284417 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=145861 hits=0 took=0
1284417 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 0
1284418 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 5861 records
Total execution time: 1276946



On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang
<[3...@gmail.com> wrote:

here you go:
[4]https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
Its actually pretty straight forward. The only thing worth of
mention is that I use another thread in the ES bolt to do the
actual query and tuple emit.
Thanks for looking.
Chen



On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin
<[5...@gmail.com> wrote:

Can you show some code? 200 seconds for 15K puts sounds like
you're not batching.



On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang
<[6...@gmail.com> wrote:

typo in previous email
The emit method in the query bolt takes about 200(instead of
20) seconds..



On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang
<[7...@gmail.com> wrote:

Hi, Guys,
I have a storm topology, with a single thread bolt querying
large amount of data (From elasticsearch), and emit to a HBase
bolt(10 threads), doing some filtering, then emit to Arvo
bolt.(10threads) The arvo bolt simply emit the tuple to arvo
client, which will be received by two flume node and then sink
into hdfs. I am testing in local mode.

In the query bolt, i am  getting around 15000 entries in a
batch, the query itself takes about 4second, however, he emit
method in the query bolt takes about 20 seconds. Does it mean
that
the downstream bolt(HBaseBolt and Avro bolt) cannot catch up
with the query bolt?

How can I tune my topology to make this process as fast as
possible? I tried to increase the HBase thread to 20 but it
does not seem to help.

I use shuffleGrouping from query bolt to hbase bolt, and from
hbase bolt to avro.

Thanks for any advice.
Chen

References

1. https://storm.incubator.apache.org/documentation/Concepts.html
2. https://github.com/ptgoetz/storm-hdfs
3. mailto:chen.apache.solr@gmail.com
4. https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
5. mailto:sam.goodwin89@gmail.com
6. mailto:chen.apache.solr@gmail.com
7. mailto:chen.apache.solr@gmail.com

Re: writing huge amount of data to HDFS

Posted by Chen Wang <ch...@gmail.com>.
Here is the output from the ES query bolt. "Total execution time for this
batch: 179655" (in milliseconds) is the time measured around the .emit calls.
As you can see, emitting 14000 entries takes anywhere from 145231 to 180000 ms.


 INFO  com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=14000 hits=14000 took=26172
40813 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-13_00-00-00
40889 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 782
40890 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 4000 records
59335 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
59335 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=28000 hits=14000 took=18033
238920 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-14_00-00-00
238990 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 179655
238990 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 8000 records
257633 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
257633 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=42000 hits=14000 took=17926
260932 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-15_00-00-00
402852 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-16_00-00-00
402865 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 145231
402865 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 2000 records
417427 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
417427 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=56000 hits=14000 took=13962
417459 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-17_00-00-00
417493 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 66
417493 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 6000 records
429629 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
429629 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=70000 hits=14000 took=12009
441208 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-18_00-00-00
744276 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-19_00-00-00
744277 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 314647
744277 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 0 records
779030 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
779030 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=84000 hits=14000 took=34631
785315 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-20_00-00-00
785332 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 6302
785332 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 4000 records
811859 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
811859 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=98000 hits=14000 took=25806
945938 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-21_00-00-00
960308 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 148449
960308 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 8000 records
983611 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
983611 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=112000 hits=14000 took=22698
983627 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-22_00-00-00
1002262 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-23_00-00-00
1002272 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 18661
1002272 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 2000 records
1021226 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
1021227 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=126000 hits=14000 took=18854
1110480 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-24_00-00-00
1188188 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 166961
1188188 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 6000 records
1204474 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
1204474 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=140000 hits=14000 took=15422
1204495 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-25_00-00-00
1270240 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the new
key(hdfs folder) is 2014-07-26_00-00-00
1270240 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 65766
1270240 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 0 records
1284391 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
1284391 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=145861 hits=5861 took=14084
1284414 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 23
1284414 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 5861 records
1284417 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the total
hits are 145861
1284417 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  -
total=145861 hits=0 took=0
1284417 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - Total
execution time for this batch: 0
1284418 [pool-1-thread-1] INFO
 com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner  - the
current batch has 5861 records
Total execution time: 1276946


On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang <ch...@gmail.com>
wrote:

> here you go:
> https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
> Its actually pretty straight forward. The only thing worth of mention is
> that I use another thread in the ES bolt to do the actual query and tuple
> emit.
> Thanks for looking.
> Chen
>
>
>
> On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin <sa...@gmail.com>
> wrote:
>
>> Can you show some code? 200 seconds for 15K puts sounds like you're not
>> batching.
>>
>>
>> On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang <ch...@gmail.com>
>> wrote:
>>
>>> typo in previous email
>>> The emit method in the query bolt takes about 200(instead of 20)
>>> seconds..
>>>
>>>
>>> On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <ch...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Guys,
>>>> I have a storm topology, with a single thread bolt querying large
>>>> amount of data (From elasticsearch), and emit to a HBase bolt(10 threads),
>>>> doing some filtering, then emit to Arvo bolt.(10threads) The arvo bolt
>>>> simply emit the tuple to arvo client, which will be received by two flume
>>>> node and then sink into hdfs. I am testing in local mode.
>>>>
>>>> In the query bolt, i am  getting around 15000 entries in a batch, the
>>>> query itself takes about 4second, however, he emit method in the query bolt
>>>> takes about 20 seconds. Does it mean that
>>>> the downstream bolt(HBaseBolt and Avro bolt) cannot catch up with the
>>>> query bolt?
>>>>
>>>> How can I tune my topology to make this process as fast as possible? I
>>>> tried to increase the HBase thread to 20 but it does not seem to help.
>>>>
>>>> I use shuffleGrouping from query bolt to hbase bolt, and from hbase
>>>> bolt to avro.
>>>>
>>>> Thanks for any advice.
>>>> Chen
>>>>
>>>
>>>
>>
>

Re: writing huge amount of data to HDFS

Posted by Chen Wang <ch...@gmail.com>.
here you go:
https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
It's actually pretty straightforward. The only thing worth mentioning is that
I use another thread in the ES bolt to do the actual query and the tuple
emitting.
Thanks for looking.
Chen



On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin <sa...@gmail.com>
wrote:

> Can you show some code? 200 seconds for 15K puts sounds like you're not
> batching.
>
>
> On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang <ch...@gmail.com>
> wrote:
>
>> typo in previous email
>> The emit method in the query bolt takes about 200(instead of 20) seconds..
>>
>>
>> On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <ch...@gmail.com>
>> wrote:
>>
>>> Hi, Guys,
>>> I have a storm topology, with a single thread bolt querying large amount
>>> of data (From elasticsearch), and emit to a HBase bolt(10 threads), doing
>>> some filtering, then emit to Arvo bolt.(10threads) The arvo bolt simply
>>> emit the tuple to arvo client, which will be received by two flume node and
>>> then sink into hdfs. I am testing in local mode.
>>>
>>> In the query bolt, i am  getting around 15000 entries in a batch, the
>>> query itself takes about 4second, however, he emit method in the query bolt
>>> takes about 20 seconds. Does it mean that
>>> the downstream bolt(HBaseBolt and Avro bolt) cannot catch up with the
>>> query bolt?
>>>
>>> How can I tune my topology to make this process as fast as possible? I
>>> tried to increase the HBase thread to 20 but it does not seem to help.
>>>
>>> I use shuffleGrouping from query bolt to hbase bolt, and from hbase bolt
>>> to avro.
>>>
>>> Thanks for any advice.
>>> Chen
>>>
>>
>>
>

Re: writing huge amount of data to HDFS

Posted by Sam Goodwin <sa...@gmail.com>.
Can you show some code? 200 seconds for 15K puts sounds like you're not
batching.


On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang <ch...@gmail.com>
wrote:

> typo in previous email
> The emit method in the query bolt takes about 200(instead of 20) seconds..
>
>
> On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <ch...@gmail.com>
> wrote:
>
>> Hi, Guys,
>> I have a storm topology, with a single thread bolt querying large amount
>> of data (From elasticsearch), and emit to a HBase bolt(10 threads), doing
>> some filtering, then emit to Arvo bolt.(10threads) The arvo bolt simply
>> emit the tuple to arvo client, which will be received by two flume node and
>> then sink into hdfs. I am testing in local mode.
>>
>> In the query bolt, i am  getting around 15000 entries in a batch, the
>> query itself takes about 4second, however, he emit method in the query bolt
>> takes about 20 seconds. Does it mean that
>> the downstream bolt(HBaseBolt and Avro bolt) cannot catch up with the
>> query bolt?
>>
>> How can I tune my topology to make this process as fast as possible? I
>> tried to increase the HBase thread to 20 but it does not seem to help.
>>
>> I use shuffleGrouping from query bolt to hbase bolt, and from hbase bolt
>> to avro.
>>
>> Thanks for any advice.
>> Chen
>>
>
>

Re: writing huge amount of data to HDFS

Posted by Chen Wang <ch...@gmail.com>.
Typo in previous email: the emit method in the query bolt takes about 200
(instead of 20) seconds.


On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <ch...@gmail.com>
wrote:

> Hi, Guys,
> I have a storm topology, with a single thread bolt querying large amount
> of data (From elasticsearch), and emit to a HBase bolt(10 threads), doing
> some filtering, then emit to Arvo bolt.(10threads) The arvo bolt simply
> emit the tuple to arvo client, which will be received by two flume node and
> then sink into hdfs. I am testing in local mode.
>
> In the query bolt, i am  getting around 15000 entries in a batch, the
> query itself takes about 4second, however, he emit method in the query bolt
> takes about 20 seconds. Does it mean that
> the downstream bolt(HBaseBolt and Avro bolt) cannot catch up with the
> query bolt?
>
> How can I tune my topology to make this process as fast as possible? I
> tried to increase the HBase thread to 20 but it does not seem to help.
>
> I use shuffleGrouping from query bolt to hbase bolt, and from hbase bolt
> to avro.
>
> Thanks for any advice.
> Chen
>