Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/06/20 00:48:42 UTC

[GitHub] [druid] JackDavidson opened a new issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

JackDavidson opened a new issue #10057:
URL: https://github.com/apache/druid/issues/10057


   ### Affected Version
   Druid built from source as of Fri, Jun 5, 2020
   
   ### Description
   
   We are trying to create new Druid ingestion specs that pull from S3 directly via index_parallel rather than via Hadoop. The output simply doesn't seem to be partitioned, though.
   
   To make it easy to reproduce, here is a simple config that shows the issue:
   
   ```
   {
     "spec": {
       "type": "index_parallel",
       "ioConfig": {
         "type": "index_parallel",
         "inputSource": {
           "type": "http",
           "uris": [
             "https://druid.apache.org/data/wikipedia.json.gz",
             "https://druid.apache.org/data/wikipedia.json.gz"
           ]
         },
         "inputFormat": {
           "type": "json"
         }
       },
       "tuningConfig": {
         "type": "index_parallel",
         "partitionsSpec": {
           "type": "single_dim",
           "partitionDimension": "channel",
           "maxRowsPerSegment": 2000
         },
         "forceGuaranteedRollup": true,
         "maxNumConcurrentSubTasks": 4
       },
       "dataSchema": {
         "dataSource": "wikipedia-test-partitioned-2",
         "granularitySpec": {
           "type": "uniform",
           "segmentGranularity": "DAY",
           "queryGranularity": "HOUR",
           "rollup": true,
           "intervals": [
             "2000-01-01/2030-01-01"
           ]
         },
         "timestampSpec": {
           "column": "timestamp",
           "format": "iso"
         },
         "dimensionsSpec": {
           "dimensions": [
             "channel",
             "cityName",
             "comment",
             "countryIsoCode",
             "countryName",
             "diffUrl",
             "flags",
             "isAnonymous",
             "isMinor",
             "isNew",
             "isRobot",
             "isUnpatrolled",
             "metroCode",
             "namespace",
             "page",
             "regionIsoCode",
             "regionName",
             "user"
           ]
         },
         "metricsSpec": [
           {
             "name": "count",
             "type": "count"
           },
           {
             "name": "sum_added",
             "type": "longSum",
             "fieldName": "added"
           },
           {
             "name": "sum_commentLength",
             "type": "longSum",
             "fieldName": "commentLength"
           },
           {
             "name": "sum_deleted",
             "type": "longSum",
             "fieldName": "deleted"
           },
           {
             "name": "sum_delta",
             "type": "longSum",
             "fieldName": "delta"
           },
           {
             "name": "sum_deltaBucket",
             "type": "longSum",
             "fieldName": "deltaBucket"
           }
         ]
       }
     },
     "type": "index_parallel"
   }
   ```
   
   Since maxRowsPerSegment is 2,000 and there are 24,000 rows in the dataset, I was expecting many partitions.
   
   I made sure to set two files so that it could be parallelized, since I saw some comments about needing to set 
   
   Of course, the data I actually have is much larger, coming out to a few GB, and I'm seeing the exact same issue with it.
   
   I have tried setting both targetRowsPerSegment and maxRowsPerSegment, and neither works.
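   To sanity-check how many segments a run actually produced, the Coordinator's segment-metadata endpoint can be queried and the results grouped by interval. This is only a sketch: the Coordinator host/port default and the `fetch_segments` helper are assumptions, and the example below runs against stubbed metadata rather than a live cluster.

```python
import json
from collections import Counter
from urllib.request import urlopen

def count_segments_per_interval(segments):
    # Each entry from the Coordinator's segment-metadata endpoint carries an
    # "interval" field; grouping on it shows how many partitions (segments)
    # were created per DAY bucket. With single_dim and maxRowsPerSegment=2000
    # working correctly, each interval should map to several segments, not one.
    return Counter(seg["interval"] for seg in segments)

def fetch_segments(coordinator="http://localhost:8081",
                   datasource="wikipedia-test-partitioned-2"):
    # Not called below; hits a live Coordinator if one is running.
    url = (coordinator + "/druid/coordinator/v1/metadata/datasources/"
           + datasource + "/segments?full")
    with urlopen(url) as resp:
        return json.load(resp)

# Stubbed metadata: two segments for one day, one for the next.
sample = [
    {"interval": "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"},
    {"interval": "2016-06-27T00:00:00.000Z/2016-06-28T00:00:00.000Z"},
    {"interval": "2016-06-28T00:00:00.000Z/2016-06-29T00:00:00.000Z"},
]
for interval, n in sorted(count_segments_per_interval(sample).items()):
    print(interval, n)
```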
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jimj edited a comment on issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

Posted by GitBox <gi...@apache.org>.
jimj edited a comment on issue #10057:
URL: https://github.com/apache/druid/issues/10057#issuecomment-1039529443


   I see the same broken behavior in Druid 0.21.1 today. When I specify `targetRowsPerSegment`, it appears to be completely ignored and I get one giant segment containing about 50 million rows (my entire dataset for the segment). If I use the same ingestion spec with `index_hadoop` instead of `index_parallel`, I get one segment with ~10 partitions, all hovering near 5 million rows apiece.



[GitHub] [druid] JackDavidson commented on issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

Posted by GitBox <gi...@apache.org>.
JackDavidson commented on issue #10057:
URL: https://github.com/apache/druid/issues/10057#issuecomment-646911735


   Hash partitioning seems to work fine, by the way.
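   For reference, the change that behaves as expected here swaps the single_dim block in the tuningConfig for a hashed one. A minimal sketch, reusing the row limit from the spec above (field names per the native batch ingestion docs; hashed takes targetRowsPerSegment and an optional partitionDimensions list):

   ```
   "partitionsSpec": {
     "type": "hashed",
     "targetRowsPerSegment": 2000,
     "partitionDimensions": ["channel"]
   }
   ```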



[GitHub] [druid] ccaominh closed issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

Posted by GitBox <gi...@apache.org>.
ccaominh closed issue #10057:
URL: https://github.com/apache/druid/issues/10057


   




[GitHub] [druid] avalanchy commented on issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

Posted by GitBox <gi...@apache.org>.
avalanchy commented on issue #10057:
URL: https://github.com/apache/druid/issues/10057#issuecomment-945649261


   The issue is not fixed. I just tested single_dim and hashed; in both cases, maxRowsPerSegment is ignored. Hashed at least loads different values into separate segments, whereas single_dim loads everything together. Tested on 0.21.1.




[GitHub] [druid] jimj edited a comment on issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

Posted by GitBox <gi...@apache.org>.
jimj edited a comment on issue #10057:
URL: https://github.com/apache/druid/issues/10057#issuecomment-1039529443


   I see the same broken behavior in Druid 0.21.1 today for `hashed` partitioning. When I specify `targetRowsPerSegment`, it appears to be completely ignored and I get one giant segment containing about 50 million rows (my entire dataset for the segment). If I use the same ingestion spec with `index_hadoop` instead of `index_parallel`, I get one segment with ~10 partitions, all hovering near 5 million rows apiece.
   
   I have not been able to reproduce this regression using the test input originally provided in this ticket, however. Anonymizing my dataset will be challenging, unfortunately.




[GitHub] [druid] jimj edited a comment on issue #10057: index_parallel with single_dim partitionSpec type generating just one file/segment

Posted by GitBox <gi...@apache.org>.
jimj edited a comment on issue #10057:
URL: https://github.com/apache/druid/issues/10057#issuecomment-1039529443


   I see the same broken behavior in Druid 0.20.1 today for `hashed` partitioning. When I specify `targetRowsPerSegment`, it appears to be completely ignored and I get one giant segment containing about 50 million rows (my entire dataset for the segment). If I use the same ingestion spec with `index_hadoop` instead of `index_parallel`, I get one segment with ~10 partitions, all hovering near 5 million rows apiece.
   
   I have not been able to reproduce this regression using the test input originally provided in this ticket, however. Anonymizing my dataset will be challenging, unfortunately.

