Posted to dev@carbondata.apache.org by 岑玉海 <ce...@163.com> on 2017/10/15 05:25:56 UTC

[DISCUSSION] About the future of hive integration

Hi, community:
    My purpose in creating the "hive integration" module is not just to make CarbonData queryable from Hive. I want CarbonData to become a common file format in the Hadoop ecosystem, like ORC and Parquet.
    In the Hadoop ecosystem, the common file formats for data warehouses are ORC and Parquet. At my company, we have thousands of ORC tables.
    Can CarbonData become a common file format? I think so!
    Hive is slow, but it is widely used and stable, and the Hive metastore has become a de facto industry standard.
    Following Hive's behavior will make it easier for us to convert existing tables to CarbonData without keeping two copies of the data, just like the usage described in this JIRA: https://issues.apache.org/jira/browse/CARBONDATA-1377.
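
    To make the intent concrete, here is a rough Spark/Scala sketch of the kind of migration I have in mind. The table and column names are placeholders, and the exact DDL may differ from what CARBONDATA-1377 finally delivers:

        import org.apache.spark.sql.SparkSession

        // Copy an existing Hive ORC table into a CarbonData table that keeps
        // the same schema and partition layout, so the data only has to be
        // stored once after the switch.
        val spark = SparkSession.builder()
          .appName("orc-to-carbondata-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Assumes CarbonData's existing STORED BY table syntax plus the
        // standard Hive partition clause proposed in CARBONDATA-1377.
        spark.sql("""
          CREATE TABLE IF NOT EXISTS sales_carbon (
            order_id BIGINT,
            amount   DOUBLE
          )
          PARTITIONED BY (dt STRING)
          STORED BY 'carbondata'
        """)

        spark.sql("""
          INSERT OVERWRITE TABLE sales_carbon PARTITION (dt)
          SELECT order_id, amount, dt FROM sales_orc
        """)

    The key point is that the converted table stays fully described in the Hive metastore, so the existing tooling around our tables keeps working.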
    The detailed things we could do in CarbonData 1.3.0 are as follows:
    1. File-level OutputFormat: https://issues.apache.org/jira/browse/CARBONDATA-729 (see the sketch after this list)
    2. Hive partition support: https://issues.apache.org/jira/browse/CARBONDATA-1377
    3. Improve batch query performance: a query without the index should be as fast as Parquet (in my tests, querying without the index is slower than Parquet; it is blocked by task initialization)
    4. Improve the time taken to load the index; it is too slow the first time.
    If anything is missing, please add it.
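
    For point 1, what I imagine is a plain Hadoop OutputFormat that any engine can plug into its write path. A minimal driver-side sketch follows; the class name CarbonFileOutputFormat is only a placeholder for whatever CARBONDATA-729 ends up providing:

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.Path
        import org.apache.hadoop.io.{NullWritable, Text}
        import org.apache.hadoop.mapreduce.Job
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

        // Standard MapReduce job setup; only the output format class would be
        // CarbonData-specific, and its name here is hypothetical.
        val conf = new Configuration()
        val job = Job.getInstance(conf, "write-carbon-files")
        job.setOutputKeyClass(classOf[NullWritable])
        job.setOutputValueClass(classOf[Text])
        // job.setOutputFormatClass(classOf[CarbonFileOutputFormat]) // hypothetical API from CARBONDATA-729
        FileOutputFormat.setOutputPath(job, new Path("/warehouse/sales_carbon"))

    With something like this, engines other than Spark could write carbon files directly, which is what being a common file format means in practice.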




    That is my vision. What do you think?

Best regards!
Yuhai Cen


