Posted to issues@spark.apache.org by "Bruce Robbins (JIRA)" <ji...@apache.org> on 2018/04/11 16:25:00 UTC

[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increases

     [ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-23963:
----------------------------------
    Description: 
TableReader gets disproportionately slower as the number of columns in the query increases.

For example, reading a table with 6000 columns is 4 times more expensive per record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to three Lists (fieldRefs, fieldOrdinals, and unwrappers), each of which the reader indexes by column number for every column of every record. Because indexing into a List takes O(n) time, the lookup cost per record grows quadratically with the column count, which is why doubling the columns quadruples, rather than doubles, the per-record cost.

When I patched the code to change those three Lists to Arrays, query times scaled linearly with the number of columns.
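For illustration, here is a minimal, self-contained sketch of the effect outside of Spark. This is not the actual TableReader code; the object name, sizes, and stand-in collections are made up to mirror the description above:

{code:scala}
// Standalone sketch (illustrative only): the two stand-in collections play
// the role of per-column metadata such as fieldRefs/fieldOrdinals/unwrappers,
// which the reader looks up once per column for every record.
object ListVsArrayLookup {
  def main(args: Array[String]): Unit = {
    val numColumns = 6000
    val numRecords = 100

    val asList: List[Int]   = (0 until numColumns).toList
    val asArray: Array[Int] = (0 until numColumns).toArray

    def time(label: String)(body: => Long): Unit = {
      val start = System.nanoTime()
      val checksum = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s (checksum $checksum)")
    }

    // List.apply(i) walks i nodes from the head: O(i) per lookup,
    // so O(numColumns^2) per simulated record.
    time("List lookups ") {
      var sum = 0L
      for (_ <- 0 until numRecords; i <- 0 until numColumns) sum += asList(i)
      sum
    }

    // Array.apply(i) is a constant-time indexed read: O(numColumns) per record.
    time("Array lookups") {
      var sum = 0L
      for (_ <- 0 until numRecords; i <- 0 until numColumns) sum += asArray(i)
      sum
    }
  }
}
{code}

With this sketch, doubling numColumns should roughly quadruple the List time while only doubling the Array time, matching the 6000-vs-3000-column observation above.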


> Queries on text-based Hive tables grow disproportionately slower as the number of columns increases
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23963
>                 URL: https://issues.apache.org/jira/browse/SPARK-23963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bruce Robbins
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org