Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/01/12 23:45:09 UTC

[GitHub] [pinot] bowen-stripe opened a new issue #8011: Pinot offline table refresh impacts query latency

bowen-stripe opened a new issue #8011:
URL: https://github.com/apache/pinot/issues/8011


   Whenever I refresh an offline table (via a Spark job of type `SegmentCreationAndMetadataPush`), Pinot server CPU usage goes up (from 15% to 70%) and query latency degrades noticeably (from 20ms to 2s) for the duration of the refresh.
   Reducing `pushParallelism` (to 2) helps, but the latency impact is still noticeable (from 20ms to 1s).
   
   We need a way to offload this table preparation work from the Pinot servers to minimize the impact on query performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #8011:
URL: https://github.com/apache/pinot/issues/8011#issuecomment-1013522025


   If possible, can you share the table config for this table? Currently, several index types are not generated at segment creation by default, in which case the servers have to build them when loading the segments, and that might cause the issue.




[GitHub] [pinot] klsince commented on issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
klsince commented on issue #8011:
URL: https://github.com/apache/pinot/issues/8011#issuecomment-1011816959


   Hey @bowen-stripe, does the Spark job generate segments with the latest table config and schema? If it does, the servers don't need to preprocess the segments (for example, adding or removing indices) to make them consistent with the latest table config/schema before serving queries.
   
   Also, when CPU utilization goes high, you might take some stack traces or flamegraphs to help identify the hot methods.
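One rough way to act on the stack-trace suggestion is to take several thread dumps while the refresh is running and count which frames sit on top of RUNNABLE threads. The sketch below is only an illustration: it assumes the JDK's `jstack` tool is on the PATH of the server host, that the Pinot server PID is passed as the first argument, and the function names `top_runnable_frames` and `sample` are made up for this example.

```python
import collections
import subprocess
import sys
import time


def top_runnable_frames(dump: str) -> collections.Counter:
    """Count the top stack frame of every RUNNABLE thread in one thread dump."""
    counts = collections.Counter()
    want_frame = False
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith('"'):
            # A quoted thread name starts a new thread entry.
            want_frame = False
        elif line.startswith("java.lang.Thread.State:"):
            # Only RUNNABLE threads are of interest for CPU usage.
            want_frame = "RUNNABLE" in line
        elif want_frame and line.startswith("at "):
            # The first "at ..." line after the state is the top frame.
            counts[line[3:]] += 1
            want_frame = False
    return counts


def sample(pid: int, samples: int = 10, interval_s: float = 1.0) -> collections.Counter:
    """Take several dumps with jstack (assumed to be installed) and merge the counts."""
    total = collections.Counter()
    for _ in range(samples):
        dump = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        ).stdout
        total.update(top_runnable_frames(dump))
        time.sleep(interval_s)
    return total


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python hot_frames.py <pinot-server-pid>
    for frame, hits in sample(int(sys.argv[1])).most_common(15):
        print(f"{hits:5d}  {frame}")
```

Frames that keep showing up across samples are candidates for the hot methods. A real profiler or flamegraph gives a more accurate picture; this is only a quick first look.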




[GitHub] [pinot] Jackie-Jiang commented on issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #8011:
URL: https://github.com/apache/pinot/issues/8011#issuecomment-1013551986


   Based on the table config, no index should need to be generated on the server side.
   Cold start after refreshing the segments could be another factor in the high latency.




[GitHub] [pinot] bowen-stripe commented on issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
bowen-stripe commented on issue #8011:
URL: https://github.com/apache/pinot/issues/8011#issuecomment-1013530304


   Here you go:
   ```
   {
     "OFFLINE": {
       "tableName": "mytable",
       "tableType": "OFFLINE",
       "segmentsConfig": {
         "timeType": "MILLISECONDS",
         "schemaName": "myschema",
         "replication": "2",
         "timeColumnName": "created_at__hour",
         "allowNullTimeValue": false
       },
       "tenants": {
         "broker": "DefaultTenant",
         "server": "DefaultTenant"
       },
       "tableIndexConfig": {
         "loadMode": "MMAP",
         "rangeIndexVersion": 1,
         "autoGeneratedInvertedIndex": false,
         "createInvertedIndexDuringSegmentGeneration": false,
         "enableDefaultStarTree": false,
         "starTreeIndexConfigs": [
           {
             "dimensionsSplitOrder": [
               "merchant",
               "created_at__hour",
               "country",
               "currency"
             ],
             "functionColumnPairs": [
               "SUMPRECISION__amount"
             ],
             "maxLeafRecords": 0
           }
         ],
         "enableDynamicStarTreeCreation": false,
         "aggregateMetrics": false,
         "nullHandlingEnabled": false
       },
       "metadata": {},
       "ingestionConfig": {
         "batchIngestionConfig": {
           "segmentIngestionType": "REFRESH"
         }
       },
       "isDimTable": false
     }
   }
   ```




[GitHub] [pinot] bowen-stripe commented on issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
bowen-stripe commented on issue #8011:
URL: https://github.com/apache/pinot/issues/8011#issuecomment-1012584594


   I used `LaunchDataIngestionJobCommand` with a job config like this:
   
   ```
   executionFrameworkSpec:
     name: spark
     segmentMetadataPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentMetadataPushJobRunner
     segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
     segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner
   jobType: 'SegmentCreationAndMetadataPush'
   inputDirURI: '$inputS3Prefix'
   outputDirURI: '$outputS3Prefix'
   includeFileNamePattern: 'glob:**/*.parquet'
   overwriteOutput: true
   pinotFSSpecs:
   - scheme: s3
     className: org.apache.pinot.plugin.filesystem.S3PinotFS
     configs:
       region: $s3Region
   recordReaderSpec:
     dataFormat: 'parquet'
     className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
   tableSpec:
     tableName: $tableName
   pinotClusterSpecs:
   - controllerURI: $clusterString
   segmentNameGeneratorSpec:
     type: normalizedDate
   segmentCreationJobParallelism: 1000
   pushJobSpec:
     segmentUriPrefix: 's3://$s3Bucket'
     segmentUriSuffix: ''
     pushParallelism: 5
     pushAttempts: 5
     pushRetryIntervalMillis: 3000
   ```
   
   I assume it reads the table config from the cluster when generating segment data, and no change was made to the table config before or after the job.
   
   I can try to grab some stack traces / a flamegraph.




[GitHub] [pinot] bowen-stripe commented on issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
bowen-stripe commented on issue #8011:
URL: https://github.com/apache/pinot/issues/8011#issuecomment-1024696420


   Update: this appears to have been a memory shortage. After scaling up the cluster by 2x, the number of page swaps went down significantly. Closing this for now.
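For anyone wanting to check for the same symptom, paging activity during a refresh can be watched from the kernel's counters. The sketch below assumes a Linux server host and reads `pswpin`, `pswpout` and `pgmajfault` from `/proc/vmstat`. The last counter is included because the table uses `loadMode: MMAP`, so memory pressure may show up as major page faults on the mapped segment files rather than as swap traffic. The helper names (`parse_vmstat`, `delta`, `watch`) are illustrative only and are not part of Pinot.

```python
import time

# Cumulative kernel counters to track (names as they appear in /proc/vmstat).
WATCHED = ("pswpin", "pswpout", "pgmajfault")


def parse_vmstat(text: str) -> dict:
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def delta(before: dict, after: dict, keys=WATCHED) -> dict:
    """Return how much each watched counter grew between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def watch(interval_s: float = 5.0, path: str = "/proc/vmstat") -> None:
    """Print the per-interval growth of the watched counters until interrupted."""
    with open(path) as f:
        prev = parse_vmstat(f.read())
    while True:
        time.sleep(interval_s)
        with open(path) as f:
            cur = parse_vmstat(f.read())
        print(delta(prev, cur))
        prev = cur
```

Calling `watch()` on a server before starting the refresh and comparing the deltas with a quiet period gives a rough indication of whether the host is short on memory.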




[GitHub] [pinot] bowen-stripe closed issue #8011: Pinot offline table refresh impacts query latency

Posted by GitBox <gi...@apache.org>.
bowen-stripe closed issue #8011:
URL: https://github.com/apache/pinot/issues/8011


   

