Posted to issues@calcite.apache.org by "Vladimir Sitnikov (JIRA)" <ji...@apache.org> on 2019/01/09 19:09:00 UTC
[jira] [Comment Edited] (CALCITE-2635) getMonotonocity is slow on wide tables

    [ https://issues.apache.org/jira/browse/CALCITE-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738551#comment-16738551 ] 

Vladimir Sitnikov edited comment on CALCITE-2635 at 1/9/19 7:08 PM:
--------------------------------------------------------------------

{quote}@PerformanceTest(expectedDuration = "2s", variance = "5%"){quote}

Expected duration depends on the hardware. For instance, a notebook, a virtual machine, a desktop, and a VPS could all have very different raw performance.

I think it is much better to invest time in something like https://arewefastyet.com instead.
In other words, we could have a set of "standard" benchmarks, a consistent machine to run them on, and scheduled runs, so that we can track regressions.

**I'm inclined to merge this fix with no extra tests.**


Note: the change is a clear win.
An alternative option is to add a HashMap to speed up {{org.apache.calcite.rel.type.RelDataType#getField(String fieldName, boolean caseSensitive, boolean elideRecord)}}. We do have {{org.apache.calcite.rel.type.RelDataTypeFactoryImpl#canonize(org.apache.calcite.rel.type.RelDataType)}}, so a lazily initialized cache of field positions might help.
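
To make the HashMap alternative concrete, here is a minimal sketch of a lazily initialized cache of field positions. It is an assumption-laden illustration, not Calcite code: {{WideRowType}}, {{indexOfByScan}} and {{indexOfByMap}} are hypothetical names, and only the case-sensitive lookup is covered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a record type; this is NOT Calcite's RelDataType. */
final class WideRowType {
  private final List<String> fieldNames;
  // Built on the first map lookup and reused afterwards; null until then.
  private volatile Map<String, Integer> positions;

  WideRowType(List<String> fieldNames) {
    this.fieldNames = new ArrayList<>(fieldNames);
  }

  /** O(N) baseline: scans the list of names on every call. */
  int indexOfByScan(String name) {
    return fieldNames.indexOf(name);
  }

  /** Expected O(1) after the first call: consults the lazily built map. */
  int indexOfByMap(String name) {
    Map<String, Integer> map = positions;
    if (map == null) {
      map = new HashMap<>();
      for (int i = 0; i < fieldNames.size(); i++) {
        // putIfAbsent keeps the first occurrence, which matches List.indexOf.
        map.putIfAbsent(fieldNames.get(i), i);
      }
      // Racing threads may each build the map once; the results are equal.
      positions = map;
    }
    Integer index = map.get(name);
    return index == null ? -1 : index;
  }
}
```

The map costs one O(N) pass per type, so it only pays off if the type instance is shared and looked up many times, which is what canonization would give us.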


However, we don't really expect a single table to have lots of collations, so we could just go with PR#891.
On top of that, we might add a hard limit like "try no more than the first 50 collations of the table", so even a table with an extreme number of collations won't create a problem for {{getMonotonicity}}.
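
A rough sketch of what such a hard limit could look like follows. Everything in it is hypothetical ({{CollationLimitSketch}}, {{Collation}}, {{MAX_COLLATIONS}} and the local {{Monotonicity}} enum are simplified stand-ins, not Calcite types), and it is not meant to reproduce the code in PR#891. It assumes only the leading field of each collation matters and compares the column name by position instead of searching the list of names.

```java
import java.util.List;

final class CollationLimitSketch {
  /** Hypothetical cap, taken from the "first 50 collations" suggestion. */
  static final int MAX_COLLATIONS = 50;

  enum Monotonicity { INCREASING, DECREASING, NOT_MONOTONIC }

  /** Hypothetical simplified collation: leading sort column plus direction. */
  static final class Collation {
    final int leadingFieldIndex;
    final Monotonicity direction;

    Collation(int leadingFieldIndex, Monotonicity direction) {
      this.leadingFieldIndex = leadingFieldIndex;
      this.direction = direction;
    }
  }

  static Monotonicity monotonicityOf(
      String columnName, List<String> fieldNames, List<Collation> collations) {
    // Never inspect more than MAX_COLLATIONS entries, however many exist.
    int limit = Math.min(collations.size(), MAX_COLLATIONS);
    for (int i = 0; i < limit; i++) {
      Collation collation = collations.get(i);
      int index = collation.leadingFieldIndex;
      // Positional get plus equals: no scan over all field names.
      if (index >= 0
          && index < fieldNames.size()
          && fieldNames.get(index).equals(columnName)) {
        return collation.direction;
      }
    }
    return Monotonicity.NOT_MONOTONIC;
  }
}
```

With this shape the work per call is bounded by the cap, regardless of the number of fields or collations. The trade-off is that a collation beyond the cap is silently ignored.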



> getMonotonocity is slow on wide tables
> --------------------------------------
>
>                 Key: CALCITE-2635
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2635
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Gian Merlino
>            Assignee: Gian Merlino
>            Priority: Major
>              Labels: performance
>
> RelOptTableImpl's getMonotonicity does an indexOf on {{rowType.getFieldNames()}}, which is O(N) in the number of fields. IdentifierNamespace calls getMonotonicity once for every field in the table namespace, so the total cost becomes O(N^2) in the number of fields. We observed 2-4 second query planning times with a table that had 18,000 columns, reduced to about 150ms after patching getMonotonicity to be O(1) in the number of fields.


