Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/01/13 19:19:08 UTC

[GitHub] [incubator-pinot] amitchopraait opened a new issue #6437: Getting numGroupsLimitReached for medium cardinality column (10000 distinct values)

amitchopraait opened a new issue #6437:
URL: https://github.com/apache/incubator-pinot/issues/6437


   I have a column with 10000 distinct values.
   
   select count(distinct(device)) from metrics
   
   {
     "resultTable": {
       "dataSchema": {
         "columnNames": [
           "distinctcount(device)"
         ],
         "columnDataTypes": [
           "INT"
         ]
       },
       "rows": [
         [
           10000
         ]
       ]
     },
     "exceptions": [],
     "numServersQueried": 1,
     "numServersResponded": 1,
     "numSegmentsQueried": 1,
     "numSegmentsProcessed": 1,
     "numSegmentsMatched": 1,
     "numConsumingSegmentsQueried": 0,
     "numDocsScanned": 3999999,
     "numEntriesScannedInFilter": 0,
     "numEntriesScannedPostFilter": 0,
     "numGroupsLimitReached": false,
     "totalDocs": 3999999,
     "timeUsedMs": 21,
     "segmentStatistics": [],
     "traceInfo": {},
     "minConsumingFreshnessTimeMs": 0
   }
   
   
   When I do a GROUP BY on this column, I get numGroupsLimitReached = true in the response, even though the documentation states the default limit is 100k.
   
   select device, count(device) as aggreg from metrics group by device order by aggreg desc limit 10
   
   {
     "resultTable": {
       "dataSchema": {
         "columnNames": [
           "device",
           "aggreg"
         ],
         "columnDataTypes": [
           "STRING",
           "LONG"
         ]
       },
       "rows": [
         [
           "device-6230",
           475
         ],
         [
           "device-3277",
           470
         ],
         [
           "device-2311",
           469
         ],
         [
           "device-3933",
           469
         ],
         [
           "device-4059",
           468
         ],
         [
           "device-6002",
           468
         ],
         [
           "device-621",
           466
         ],
         [
           "device-2903",
           465
         ],
         [
           "device-3900",
           463
         ],
         [
           "device-9324",
           463
         ]
       ]
     },
     "exceptions": [],
     "numServersQueried": 1,
     "numServersResponded": 1,
     "numSegmentsQueried": 1,
     "numSegmentsProcessed": 1,
     "numSegmentsMatched": 1,
     "numConsumingSegmentsQueried": 0,
     "numDocsScanned": 3999999,
     "numEntriesScannedInFilter": 0,
     "numEntriesScannedPostFilter": 3999999,
     "numGroupsLimitReached": true,
     "totalDocs": 3999999,
     "timeUsedMs": 87,
     "segmentStatistics": [],
     "traceInfo": {},
     "minConsumingFreshnessTimeMs": 0
   }
   
   
   
   As per the conversation in Slack (see the sketches after the transcript):
   
   Jackie
   Do you have it configured explicitly? The config key is pinot.server.query.executor.num.groups.limit
   
   Amit Chopra  23 minutes ago
   @Jackie - I believe you mean in pinot-server.conf? No, I haven't set it.
   
   Jackie  21 minutes ago
   Hmm.. That is unexpected
   
   Jackie  20 minutes ago
   Do you run the query in PQL mode or SQL mode?
   
   Amit Chopra  20 minutes ago
   SQL mode
   
   Jackie  16 minutes ago
   I just checked the code and we don't set it in SQL mode..
   
   Jackie  15 minutes ago
   Could you please file a github issue and put the details?
   
   Amit Chopra  14 minutes ago
   Sure, let me file an issue. Just so that I understand, you mean the 100k limit is not set? But what is the default limit in SQL mode today then?
   
   Jackie  13 minutes ago
   We don't put the numGroupsLimitReached in SQL mode. I don't know how it shows up in the response
   
   Amit Chopra  12 minutes ago
   got it
   
   Amit Chopra  11 minutes ago
   Regarding the second part of the question: given there was a limit of 10 on the query, shouldn't this be handled by the engine (even if it were a column with more than 100k distinct values)?
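
   For reference, if the limit did need to be raised explicitly, the key from the conversation above would go into the server config. A minimal sketch, assuming pinot-server.conf as discussed (the 200000 value is only an illustration, not the documented 100k default):

   # pinot-server.conf
   # Example override; pick a value above the expected number of groups.
   pinot.server.query.executor.num.groups.limit=200000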
   
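   To make the LIMIT question from the transcript concrete: with an ORDER BY on an aggregate, the LIMIT can only be applied after the per-group counts are complete, so the engine still has to track up to the configured number of groups no matter how small the LIMIT is. A toy sketch of that shape in plain Java (not Pinot's actual implementation; the cap of 5 and the input rows are made up for illustration):

   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   public class GroupLimitSketch {
       public static void main(String[] args) {
           // Stand-in for pinot.server.query.executor.num.groups.limit (tiny here on purpose).
           final int numGroupsLimit = 5;
           // Toy input: one element per scanned row's "device" value.
           List<String> rows = List.of("a", "b", "c", "d", "e", "f", "g", "a", "b", "a");

           Map<String, Long> counts = new HashMap<>();
           boolean numGroupsLimitReached = false;
           for (String device : rows) {
               if (!counts.containsKey(device) && counts.size() >= numGroupsLimit) {
                   // A brand-new group past the cap is dropped and the flag is raised.
                   numGroupsLimitReached = true;
                   continue;
               }
               counts.merge(device, 1L, Long::sum);
           }

           // ORDER BY count DESC LIMIT 2 is applied only after all kept groups are counted,
           // so a small LIMIT does not reduce how many groups had to be tracked above.
           counts.entrySet().stream()
                   .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                   .limit(2)
                   .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
           System.out.println("numGroupsLimitReached=" + numGroupsLimitReached);
       }
   }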


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org