You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "alexch2000 (via GitHub)" <gi...@apache.org> on 2023/11/28 14:15:40 UTC

[I] Performance Degradation in Queries with DateTime Field Filtering [pinot]

alexch2000 opened a new issue, #12060:
URL: https://github.com/apache/pinot/issues/12060

   We are experiencing a significant performance issue in Pinot, specifically related to how DateTime field filtering is applied in our queries.
   
   **Dataset Description**: Our dataset consists of a large table containing event records. Each record includes fields like `name`, `stage`, `status`, and a `timestamp`.
   
   **Query Performance Issue**: We've observed that the location of the DateTime filter in our query significantly impacts the query's performance. There is a noticeable performance degradation – up to 90% slower – depending on how and where the DateTime filter is applied.
   
   **Query Example**:  Initially, when the timestamp filter is not the first condition in the query, we notice that almost 80% of the documents are scanned, and the query completion time is around 8.5 seconds. Below is an example of the query generated by our custom connector to Pinot
   
   ```
   SELECT
     name,
     sum(CAST(CASE WHEN ((status = 'success') AND (stage = 'start')) THEN 1 ELSE 0 END AS LONG))
   FROM
     events_table
   WHERE
     ((status = 'success') AND (stage IN ('complete', 'start'))
       AND (((timestamp BETWEEN 1699920000000 AND 1701129600000) AND (stage = 'start')) OR (stage = 'complete')
       )
     )
   GROUP BY
     name
   LIMIT
     10000
   ```
   As you can see that almost 80% of docs were scanned + time to finish the query is  around 8.5s. 
   <img width="685" alt="Screenshot 2023-11-28 at 14 56 39" src="https://github.com/apache/pinot/assets/994452/7f9764fb-c8e5-4bf5-baac-797a381f4358">
   
   **Improved Performance with Adjusted Query**: However, when we move the timestamp check to the first condition in the query, there is a significant improvement. The query completion time drops to approximately 5 seconds, and far fewer documents are scanned.
   
   ```
   SELECT
     name,
     sum(CAST(CASE WHEN ((status = 'success') AND (stage = 'start')) THEN 1 ELSE 0 END AS LONG))
   FROM
     events_table
   WHERE
     timestamp BETWEEN 1699920000000 AND 1701129600000
     AND (
       (
         (status = 'success')
         AND (stage IN ('complete', 'start'))
       )
       AND (
         (stage = 'start')
         OR (stage = 'complete')
       )
     )
   GROUP BY
     name
   LIMIT
     10000
   ```
   
   <img width="677" alt="Screenshot 2023-11-28 at 14 36 49" src="https://github.com/apache/pinot/assets/994452/b9126c9c-e885-4129-b04d-282b53a8a9b4">
   
   We expect the query performance to be consistent regardless of the positioning of the DateTime filter. However, the current behavior shows a critical variance in execution times.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Performance Degradation in Queries with DateTime Field Filtering [pinot]

Posted by "alexch2000 (via GitHub)" <gi...@apache.org>.
alexch2000 commented on issue #12060:
URL: https://github.com/apache/pinot/issues/12060#issuecomment-1842945709

   @Jackie-Jiang you are correct. I didn't notice that we scan all rows for `stage = 'complete'`. I will retest with the correct query and close the ticket 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Performance Degradation in Queries with DateTime Field Filtering [pinot]

Posted by "dario-liberman (via GitHub)" <gi...@apache.org>.
dario-liberman commented on issue #12060:
URL: https://github.com/apache/pinot/issues/12060#issuecomment-1830058437

   It would seem that time based segment pruning is not kicking in unless it is the first filter in the where clause?
   https://github.com/apache/pinot/issues/6189


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Performance Degradation in Queries with DateTime Field Filtering [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12060:
URL: https://github.com/apache/pinot/issues/12060#issuecomment-1839248221

   If I read correctly, these 2 queries are not equivalent. With `OR` in the first query after the time filter, there is no way for the engine to prune the segment based on the time.
   `numDocsScanned` reflects how many records matching the filter. If you see them different in 2 queries, that means the 2 queries are not equivalent. To optimize this query, you may add range filter to `timestamp` and inverted index on `status` and `stage`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Performance Degradation in Queries with DateTime Field Filtering [pinot]

Posted by "alexch2000 (via GitHub)" <gi...@apache.org>.
alexch2000 closed issue #12060: Performance Degradation in Queries with DateTime Field Filtering
URL: https://github.com/apache/pinot/issues/12060


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Performance Degradation in Queries with DateTime Field Filtering [pinot]

Posted by "dario-liberman (via GitHub)" <gi...@apache.org>.
dario-liberman commented on issue #12060:
URL: https://github.com/apache/pinot/issues/12060#issuecomment-1830107606

   Ah, the problem is the parenthesis nesting, since the time filter is not at the top of the tree, then the pruner does not realize somehow that conjunctions are associative.
   This is the logic in `TimeSegmentPruner::getFilterTimeIntervals`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org