
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/09/03 16:07:17 UTC

[GitHub] [incubator-druid] quenlang opened a new issue #8456: ingesting with high cardinality dimension have low performance

URL: https://github.com/apache/incubator-druid/issues/8456

Hello, @himanshug

I hit a large performance drop when ingesting data with high-cardinality dimensions. The ```uri_id``` and ```host_ip``` dimensions have a cardinality ranging from 2000000000 to 5000000000.

With these two dimensions, 1 task consumed 20,000 messages/s (2w/s, where 1w = 10,000). When I increased the task count to 18, the datasource consumed only 200,000 messages/s (20w/s). Even after I increased the task count to 36, ingestion throughput stayed at 200,000 messages/s (20w/s).
Log from one of the tasks:
[index_kafka_APP_NETWORK_DATA_MIN_JSON1_f25e9db0956e624_bebmfiod.txt](https://github.com/apache/incubator-druid/files/3570313/index_kafka_APP_NETWORK_DATA_MIN_JSON1_f25e9db0956e624_bebmfiod.txt)

Then I removed ```uri_id``` and ```host_ip``` from the dimension list. 18 tasks consumed 370,000 messages/s (37w/s), which roughly matches 2w * 18 = 36w. When I increased the task count to 36, throughput was 600,000 messages/s (60w/s).
Log from one of the tasks:
[index_kafka_APP_NETWORK_DATA_MIN_JSON1_361e588173127a7_peommncl.txt](https://github.com/apache/incubator-druid/files/3570326/index_kafka_APP_NETWORK_DATA_MIN_JSON1_361e588173127a7_peommncl.txt)
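
To make the scaling gap concrete, here is a rough sketch that compares each observed throughput with what perfectly linear scaling would give. It takes 1w = 10,000 and uses the single-task figure of 2w/s as the per-task baseline for both cases, which is an assumption: the post only measures a single task with the two dimensions present. The labels and the `scaling_efficiency` helper are illustrative names, not anything from Druid.

```python
# Throughput figures from the post, in messages per second (1w = 10,000).
# Assumption: the 1-task figure (2w/s) is used as the baseline in both cases.
BASELINE_PER_TASK = 20_000

# (case, task count) -> observed throughput
observed = {
    ("with uri_id/host_ip", 18): 200_000,
    ("with uri_id/host_ip", 36): 200_000,
    ("without uri_id/host_ip", 18): 370_000,
    ("without uri_id/host_ip", 36): 600_000,
}


def scaling_efficiency(tasks, throughput, per_task=BASELINE_PER_TASK):
    """Observed throughput divided by the perfectly linear expectation."""
    return throughput / (tasks * per_task)


efficiencies = {
    key: scaling_efficiency(key[1], throughput)
    for key, throughput in observed.items()
}

for (label, tasks), eff in efficiencies.items():
    print(f"{label:24s} tasks={tasks:2d} efficiency={eff:.2f}")
```

An efficiency near 1.0 means the tasks scale linearly. With the two dimensions, the ratio falls as tasks are added because throughput stays flat at 20w/s.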

I don't understand why increasing the task count does not improve throughput when the high-cardinality dimensions are present.

The Kafka topic has 36 partitions. If I partition the data into topic partitions by ```hash(all dimensions)```, so that each task has to roll up a lower dimension cardinality, would that improve performance?
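
As a minimal sketch of what ```hash(all dimensions)``` partitioning could look like on the producer side: the function below derives a partition number from a stable hash of an event's dimension values. The MD5-based hash, the `partition_for` helper, and the `app_id` dimension are assumptions for illustration only. This is not Kafka's built-in partitioner, and whether it helps ingestion throughput is exactly the open question here.

```python
import hashlib
import json

NUM_PARTITIONS = 36  # partition count of the topic, from the post


def partition_for(event, dimensions, num_partitions=NUM_PARTITIONS):
    """Map an event to a partition via a stable hash of its dimension values."""
    # Serialize the dimension values in a fixed order so equal tuples hash equally.
    key = json.dumps([event.get(d) for d in dimensions], separators=(",", ":"))
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# uri_id and host_ip come from the post; app_id is a hypothetical extra dimension.
DIMENSIONS = ["uri_id", "host_ip", "app_id"]

events = [
    {"uri_id": 1001, "host_ip": "10.0.0.1", "app_id": 7, "count": 1},
    {"uri_id": 1001, "host_ip": "10.0.0.1", "app_id": 7, "count": 3},
    {"uri_id": 2002, "host_ip": "10.0.0.2", "app_id": 7, "count": 5},
]

partitions = [partition_for(e, DIMENSIONS) for e in events]
print(partitions)
```

Events with identical dimension values always map to the same partition, so every row that could be rolled up together is read by the same task.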

Is there any way to resolve this problem? Can you give me some advice?

