Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2009/05/07 13:05:30 UTC

[jira] Created: (HIVE-477) Some optimization thoughts for Hive

Some optimization thoughts for Hive
-----------------------------------

                 Key: HIVE-477
                 URL: https://issues.apache.org/jira/browse/HIVE-477
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: He Yongqiang


Before we can start working on HIVE-461, I am doing some profiling of Hive. Here are some thoughts on improvements:

Minor:
1) Add a new HiveText class to replace Text. It can avoid a byte copy when initializing LazyString. I have done a draft; it shows a ~1% performance gain.
2) Change StructObjectInspector's
    {noformat}
     public List<Object> getStructFieldsDataAsList(Object data);
    {noformat}
to
    {noformat}
     public Object[] getStructFieldsDataAsArray(Object data);
    {noformat}

In my profiling, it showed some performance gain, but in actual execution it did not. In any case, returning a Java array will reduce the GC burden of collecting ArrayLists (see the sketch below).
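A sketch of why the array variant is friendlier to the GC: the inspector can hand back one reused array instead of allocating a fresh ArrayList per row. Names below are illustrative, not the actual ObjectInspector code:
{noformat}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not the real ObjectInspector implementation.
// The Object[] row parameter stands in for the deserialized row data.
public class StructFieldsSketch {
  private Object[] reused = new Object[0];        // reused across rows

  // List variant: allocates a new List per call, which the GC must collect.
  public List<Object> getStructFieldsDataAsList(Object[] row) {
    List<Object> fields = new ArrayList<Object>(row.length);
    for (Object f : row) {
      fields.add(f);
    }
    return fields;
  }

  // Array variant: fills and returns the same array every call, so there is
  // no per-row allocation. Callers must consume the result before the next
  // call overwrites it.
  public Object[] getStructFieldsDataAsArray(Object[] row) {
    if (reused.length < row.length) {
      reused = new Object[row.length];            // grow once, then reuse
    }
    System.arraycopy(row, 0, reused, 0, row.length);
    return reused;
  }
}
{noformat}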

Not so minor:
3) Split FileSinkOperator's Writer into a separate thread, adding a producer-consumer buffer as the bridge between the operator thread and the writer thread (see the sketch after this list).
4) The operator stack is quite deep. To avoid instruction cache pressure and increase data cache efficiency, I suggest letting Hive's operators process an array of rows instead of only one row at a time.
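For 3), here is a minimal sketch of what such a producer-consumer bridge could look like: a bounded blocking queue between the operator thread and a dedicated writer thread. All names here (AsyncWriterBridge, RecordSink) are illustrative stand-ins, not Hive's actual API:
{noformat}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only: a bounded queue decouples operator processing
// from the (often I/O-bound) record writer.
public class AsyncWriterBridge {
  private static final Object EOF = new Object();               // end-of-stream marker
  private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024);
  private final Thread writerThread;

  public AsyncWriterBridge(final RecordSink sink) {
    writerThread = new Thread(() -> {
      try {
        Object rec;
        while ((rec = queue.take()) != EOF) {
          sink.write(rec);                                      // consumer: drain queue, write record
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "file-sink-writer");
    writerThread.start();
  }

  // Called from the operator thread; blocks when the writer falls behind.
  public void emit(Object record) throws InterruptedException {
    queue.put(record);
  }

  // Signals end of input and waits for all buffered records to be flushed.
  public void close() throws InterruptedException {
    queue.put(EOF);
    writerThread.join();
  }

  public interface RecordSink { void write(Object record); }
}
{noformat}
The bounded queue matters: it provides back-pressure, so a slow writer throttles the operators instead of letting rows pile up in memory.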

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708796#action_12708796 ] 

He Yongqiang commented on HIVE-477:
-----------------------------------

One comment on 1):
Avoiding the byte copy when initializing LazyString does not seem to save CPU time.
In my test I used two tables with 30 columns of roughly 1K each and inserted one from the other. Each table is about 140M.
Two runs, one with the byte copy and one without, took the same time.

So it seems Java's array-copy time can be ignored here.
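For reference, a minimal sketch of the two initialization strategies being compared; the names are illustrative, not LazyString's actual internals:
{noformat}
// Sketch of the two strategies compared above (names are illustrative).
public class LazyStringInitSketch {
  // Strategy 1: copy the field's bytes into a private buffer.
  static byte[] initWithCopy(byte[] src, int start, int length) {
    byte[] data = new byte[length];
    System.arraycopy(src, start, data, 0, length);
    return data;
  }

  // Strategy 2: skip the copy and keep a (buffer, offset, length) view;
  // valid only while the shared buffer is not reused or mutated.
  static final class ByteView {
    final byte[] data; final int start; final int length;
    ByteView(byte[] data, int start, int length) {
      this.data = data; this.start = start; this.length = length;
    }
  }

  static ByteView initWithoutCopy(byte[] src, int start, int length) {
    return new ByteView(src, start, length);
  }
}
{noformat}
That the two runs tie would be consistent with the arraycopy cost being dwarfed by deserialization and write costs.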



[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708732#action_12708732 ] 

He Yongqiang commented on HIVE-477:
-----------------------------------

I did the same test without compression.
It turns out both insert overwrite commands finished in about one minute (60 +/- 9 seconds).



[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708053#action_12708053 ] 

He Yongqiang commented on HIVE-477:
-----------------------------------

New test results for understanding how much time is spent in the RecordWriter versus in operator processing.

The test involves four tables: tablerc1, tablerc2, tableseq1, and tableseq2. All have 30 string columns.
tablerc1 and tablerc2 are stored as RCFile and are about 134M each; tableseq1 and tableseq2 are stored as SequenceFile and are about 178M each.
All four store the same original data.

Here are the results:
|| Command || Normal execution, seconds (whole job / mapper 1 / mapper 2) || No RecordWriter.write in FileSinkOperator (whole job / mapper 1 / mapper 2) || Empty ExecMapper map() body (whole job / mapper 1 / mapper 2) ||
| insert overwrite tablerc2 select * from tablerc1 | 131 / 115 / 117 | 45 / 34 / 34 | 26 / 16 / 15 |
| insert overwrite tablerc2 select * from tablerc1 | 121 / 114 / 116 | 42 / 34 / 33 | 20 / 16 / 15 |
| insert overwrite tableseq2 select * from tableseq1 | 129 / 120 / 122 | 37 / 35 / 34 | 18 / 12 / 12 |
| insert overwrite tableseq2 select * from tableseq1 | 130 / 127 / 123 | 38 / 35 / 35 | 17 / 13 / 12 |



[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707218#action_12707218 ] 

Zheng Shao commented on HIVE-477:
---------------------------------

For 3), adding another thread means we need to buffer the data between the two threads. It would be great to have some data beforehand on what percentage of time this can save us. At the least, we should know how much time is spent in the operator stack and how much in the writer.

For 4), there are some difficulties. We currently use a single object to pass each row; doing 4) means we would need multiple objects. Also, given the larger caches of modern CPUs, I am not sure whether our operator stack actually falls out of cache.




[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707872#action_12707872 ] 

He Yongqiang commented on HIVE-477:
-----------------------------------

I did a test to see how much time is spent in the RecordWriter and how much in operator processing.

Use: insert overwrite table tableRC2 select * from tableRC1;

tableRC1 is about 132M and uses 2 maps (block size is 64M).

Normal:
{noformat} 
It costs about 126s, and each map costs about 114s.
{noformat}

Commenting out outWriter.write(recordValue) in FileSinkOperator's process method:
{noformat}
The whole job costs about 80s, but one mapper finishes in about 32s while the other runs very slowly (about 70s). The slow mapper pushes the whole job's time up to 80s (I think the slowness has other causes).
{noformat}

Commenting out the whole body of ExecMapper's map(), which amounts to only reading the input and doing nothing:
{noformat}
It costs about 27s, and each map costs about 15s.
{noformat}

Using hadoop-streaming.jar 
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar  -input /user/hive/warehouse/tablerc1/HiveStackTestData  -output testHiveWriter  -mapper org.apache.hadoop.mapred.lib.IdentityMapper  -numReduceTasks 0
{noformat}
It costs about 55s; one mapper costs about 5s, the other about 50s.
{noformat}
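Instead of commenting code out, one could also measure the writer's share with a timing decorator around the writer. A sketch, assuming Hive's FileSinkOperator.RecordWriter interface exposes write(Writable) and close(boolean):
{noformat}
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter;

// Sketch: wrap the real writer and accumulate the time spent in write(),
// assuming the RecordWriter interface shown in the import above.
public class TimedRecordWriter implements RecordWriter {
  private final RecordWriter delegate;
  private long nanosInWrite = 0;

  public TimedRecordWriter(RecordWriter delegate) {
    this.delegate = delegate;
  }

  public void write(Writable w) throws IOException {
    long t0 = System.nanoTime();
    delegate.write(w);
    nanosInWrite += System.nanoTime() - t0;
  }

  public void close(boolean abort) throws IOException {
    delegate.close(abort);
    System.err.println("time in RecordWriter.write: " + nanosInWrite / 1e9 + " s");
  }
}
{noformat}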




[jira] Issue Comment Edited: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708056#action_12708056 ] 

He Yongqiang edited comment on HIVE-477 at 5/11/09 10:06 PM:
-------------------------------------------------------------

Using hadoop-streaming.jar 

RCFile:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar  -input /user/hive/warehouse/tablerc1  -output testHiveWriter  -inputformat org.apache.hadoop.hive.ql.io.RCFileInputFormat -outputformat org.apache.hadoop.hive.ql.io.RCFileOutputFormat -mapper org.apache.hadoop.mapred.lib.IdentityMapper  -jobconf mapred.work.output.dir=.  -jobconf hive.io.rcfile.column.number.conf=32 -jobconf mapred.output.compress=true -numReduceTasks 0

It costs 100+3 seconds.

And in order to execute this command successfully, we need to change RCFile's generic signature to <WritableComparable, ...>.

      was (Author: he yongqiang):
    Using hadoop-streaming.jar 

RCFile:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar  -input /user/hive/warehouse/tablerc1  -output testHiveWriter  -mapper org.apache.hadoop.mapred.lib.IdentityMapper  -numReduceTasks 0

SequenceFile:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar  -input /user/hive/warehouse/tableseq1  -output testHiveWriter  -mapper org.apache.hadoop.mapred.lib.IdentityMapper  -numReduceTasks 0

Both commands cost less than 10 seconds.
  


[jira] Updated: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-477:
------------------------------

    Description: 
Before we can start working on HIVE-461, I am doing some profiling of Hive. Here are some thoughts on improvements:

Minor:
1) Add a new HiveText class to replace Text. It can avoid a byte copy when initializing LazyString. I have done a draft; it shows a ~1% performance gain.
2) Change StructObjectInspector's 
    {noformat}
     public List<Object> getStructFieldsDataAsList(Object data);
    {noformat}
to 
    {noformat}
     public Object[] getStructFieldsDataAsArray(Object data);
    {noformat}

In my profiling test, it showed some performance gain, but in actual execution it did not. In any case, returning a Java array will reduce the GC burden of collecting ArrayLists.

Not so minor:
3) Split FileSinkOperator's Writer into a separate thread, adding a producer-consumer buffer as the bridge between the operator thread and the writer thread.
4) The operator stack is quite deep. To reduce instruction cache misses and improve data cache efficiency, I suggest letting Hive's operators process an array of rows instead of only one row at a time.

  was:
Before we can start working on HIVE-461, I am doing some profiling of Hive. Here are some thoughts on improvements:

Minor:
1) Add a new HiveText class to replace Text. It can avoid a byte copy when initializing LazyString. I have done a draft; it shows a ~1% performance gain.
2) Change StructObjectInspector's 
    {noformat}
     public List<Object> getStructFieldsDataAsList(Object data);
    {noformat}
to 
    {noformat}
     public Object[] getStructFieldsDataAsArray(Object data);
    {noformat}

In my profiling, it showed some performance gain, but in actual execution it did not. In any case, returning a Java array will reduce the GC burden of collecting ArrayLists.

Not so minor:
3) Split FileSinkOperator's Writer into a separate thread, adding a producer-consumer buffer as the bridge between the operator thread and the writer thread.
4) The operator stack is quite deep. To avoid instruction cache pressure and increase data cache efficiency, I suggest letting Hive's operators process an array of rows instead of only one row at a time.




[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708056#action_12708056 ] 

He Yongqiang commented on HIVE-477:
-----------------------------------

Using hadoop-streaming.jar 

RCFile:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar  -input /user/hive/warehouse/tablerc1  -output testHiveWriter  -mapper org.apache.hadoop.mapred.lib.IdentityMapper  -numReduceTasks 0

SequenceFile:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar  -input /user/hive/warehouse/tableseq1  -output testHiveWriter  -mapper org.apache.hadoop.mapred.lib.IdentityMapper  -numReduceTasks 0

Both commands cost less than 10 seconds.



[jira] Updated: (HIVE-477) Some optimization thoughts for Hive

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-477:
----------------------------

    Comment: was deleted

(was: the RecordWriter / operator-processing timing comment, identical to comment 12707872 reproduced in full earlier in this thread)



[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707194#action_12707194 ] 

He Yongqiang commented on HIVE-477:
-----------------------------------

One reference for 4):
"Breaking the Memory Wall in MonetDB" (Boncz, Kersten, and Manegold, CACM 2008).
There are also many other references on array-based (vectorized) execution.
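To make 4) concrete, here is a rough sketch of a batch-at-a-time operator interface. It is hypothetical, not Hive's actual Operator API:
{noformat}
import java.util.function.Predicate;

// Hypothetical batch-at-a-time operator interface; Hive's real Operator
// processes one row per call.
interface BatchedOperator {
  // Process rows[0..count) in one call so the loop body stays hot in the
  // instruction cache and the batch stays hot in the data cache.
  void processBatch(Object[] rows, int count);
}

// Example: a filter that forwards the surviving rows of each batch.
class FilterBatchedOperator implements BatchedOperator {
  private static final int BATCH_SIZE = 1024;          // assumes incoming batches of at most 1024 rows
  private final BatchedOperator child;
  private final Predicate<Object> pred;
  private final Object[] out = new Object[BATCH_SIZE]; // reused output batch, no per-row allocation

  FilterBatchedOperator(BatchedOperator child, Predicate<Object> pred) {
    this.child = child;
    this.pred = pred;
  }

  public void processBatch(Object[] rows, int count) {
    int n = 0;
    for (int i = 0; i < count; i++) {                  // tight loop, one virtual call per batch
      if (pred.test(rows[i])) {
        out[n++] = rows[i];
      }
    }
    if (n > 0) {
      child.processBatch(out, n);                      // forward once per batch, not once per row
    }
  }
}
{noformat}
The point of the batch is to amortize the per-row virtual-call overhead of a deep operator stack across many rows, which is the effect the MonetDB work describes.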
