Posted to dev@carbondata.apache.org by Rana Faisal Munir <fm...@essi.upc.edu> on 2017/04/12 19:48:42 UTC

CarbonData performance benchmarking

Dear all,

I am running some experiments to benchmark the performance of Parquet
and CarbonData. I am using the TPC-H lineitem table, 8 GB in size. It has 16
columns, and I am running projection queries that read different numbers
of columns (3, 6, 9, 12, 15). CarbonData becomes very slow when I select
more than 8 columns: it takes on the order of hours to process my request,
whereas Parquet is very quick. Could anybody please help me understand this
behavior?

This is my cluster configuration:

 

3 machines:
1 driver machine (128 GB, 24 cores)
2 worker machines (128 GB, 24 cores)

My configuration settings for Spark are:

spark.executor.instances        12
spark.executor.memory           18g
spark.driver.memory             57g
spark.executor.cores            3
spark.driver.cores              5
spark.default.parallelism       72

carbon.sort.file.buffer.size=20
carbon.graph.rowset.size=100000
carbon.number.of.cores.while.loading=6
carbon.sort.size=500000
carbon.enableXXHash=true
carbon.number.of.cores.while.compacting=2
carbon.compaction.level.threshold=4,3
carbon.major.compaction.size=1024
carbon.number.of.cores=4
carbon.inmemory.record.size=120000
carbon.enable.quick.filter=false
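
As a rough sanity check on the settings above, the following Scala sketch works out how many task slots they provide and how much of each worker they occupy. It only does arithmetic on the listed values. The object name ClusterBudget is made up for illustration, and the sketch assumes that the 12 executors are spread evenly over the 2 worker machines and ignores any memory overhead beyond spark.executor.memory.

```scala
object ClusterBudget {
  // Values copied from the configuration listed in the message.
  val executorInstances = 12
  val executorCores = 3
  val executorMemoryGb = 18
  val defaultParallelism = 72
  val workerMachines = 2
  val workerMemoryGb = 128
  val workerCores = 24

  // Number of tasks that can run at the same time across all executors.
  val taskSlots: Int = executorInstances * executorCores

  // Tasks queued per slot when a stage uses spark.default.parallelism partitions.
  val tasksPerSlot: Double = defaultParallelism.toDouble / taskSlots

  // Assumption: executors are distributed evenly over the worker machines.
  val executorsPerWorker: Int = executorInstances / workerMachines
  val heapPerWorkerGb: Int = executorsPerWorker * executorMemoryGb
  val coresPerWorker: Int = executorsPerWorker * executorCores

  def main(args: Array[String]): Unit = {
    println(s"task slots: $taskSlots, tasks per slot: $tasksPerSlot")
    println(s"per worker: $heapPerWorkerGb of $workerMemoryGb GB heap, " +
      s"$coresPerWorker of $workerCores cores")
  }
}
```

Under these assumptions the settings fit within the two workers, so the configuration alone does not obviously explain the slowdown.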

 

 

My Queries:

 

// Create the CarbonData table for the TPC-H lineitem data.
carbon.sql("""CREATE TABLE IF NOT EXISTS lineitem_4 (orderkey BIGINT, partkey
BIGINT, suppkey BIGINT, linenumber BIGINT, quantity DOUBLE, extendedprice
DOUBLE, discount DOUBLE, tax DOUBLE, returnflag STRING, linestatus STRING,
shipdate DATE, commitdate DATE, receiptdate DATE, shipinstruct STRING,
shipmode STRING, comment STRING) STORED BY 'carbondata'
TBLPROPERTIES('TABLE_BLOCKSIZE'='128 MB')""")

// Load the pipe-delimited input files; the FILEHEADER value must stay on one line.
carbon.sql("""LOAD DATA INPATH 'hdfs://hdfsmaster/input/lineitem/' INTO TABLE
lineitem_4 OPTIONS ('FILEHEADER' =
'orderkey,partkey,suppkey,linenumber,quantity,extendedprice,discount,tax,returnflag,linestatus,shipdate,commitdate,receiptdate,shipinstruct,shipmode,comment',
'USE_KETTLE' = 'false', 'DELIMITER'='|')""")

// Projection of 3 columns.
val proj1 = carbon.sql("SELECT orderkey,partkey,linenumber FROM lineitem_4")
proj1.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

// Projection of 6 columns.
val proj2 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag FROM lineitem_4")
proj2.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj2/")

// Projection of 9 columns.
val proj3 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate FROM lineitem_4")
proj3.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj3/")

// Projection of 12 columns.
val proj4 = carbon.sql("SELECT orderkey,partkey,linenumber,quantity,discount,returnflag,linestatus,commitdate,receiptdate,shipinstruct,shipmode,comment FROM lineitem_4")
proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/")
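
The four projection queries in this message each extend the previous column list by three columns, so they can also be generated rather than typed out. The sketch below is one possible way to do that in plain Scala. The object and method names (ProjectionQueries, projection) are hypothetical, and because the 15-column query mentioned earlier is not shown in the message, the sketch stops at 12 columns.

```scala
object ProjectionQueries {
  // Column order taken from the proj1..proj4 queries in the message.
  val columns: Seq[String] = Seq(
    "orderkey", "partkey", "linenumber",
    "quantity", "discount", "returnflag",
    "linestatus", "commitdate", "receiptdate",
    "shipinstruct", "shipmode", "comment")

  val table = "lineitem_4"

  // Builds a SELECT over the first n columns of the list.
  def projection(n: Int): String =
    s"SELECT ${columns.take(n).mkString(",")} FROM $table"

  // One query per projection width used in the message.
  val queries: Seq[String] = Seq(3, 6, 9, 12).map(projection)

  def main(args: Array[String]): Unit = queries.foreach(println)
}
```

Each generated string could then be passed to carbon.sql(...) in the same way as the hand-written queries.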

 

Thank you

Regards
Faisal

Re: CarbonData performance benchmarking

Posted by Rana Faisal <fm...@essi.upc.edu>.

Hi,

It would be perfect. Thank you very much.


Regards

Faisal


On 13.04.2017 12:29, ayushmantri wrote:
> Hi Faisal,
>
> Thanks for your interest in CarbonData TPC-H benchmarking. I started
> this work a few days ago. I'll share the DDLs and configurations we are
> using separately by private message. Meanwhile, please download the latest
> code from master and compile it.
>
> Looking forward to your support in benchmarking.

RE: CarbonData performance benchmarking

Posted by ayushmantri <aa...@gmail.com>.

Hi Faisal,

Thanks for your interest in CarbonData TPC-H benchmarking. I started
this work a few days ago. I'll share the DDLs and configurations we are
using separately by private message. Meanwhile, please download the latest
code from master and compile it.

Looking forward to your support in benchmarking.





--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/CarbonData-performance-benchmkaring-tp10902p10952.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

Re: CarbonData performance benchmarking

Posted by Rana Faisal <fm...@essi.upc.edu>.

Hi Liang,

Thank you very much. I am now running my tests with the updated version and
will share the results with you. I have one question: can I adjust the table
block size when writing through a DataFrame?

Regards

Faisal


On 13.04.2017 07:07, Liang Chen wrote:
> Hi Rana,
>
> That would be very nice; you could join us in testing TPC-H. One
> contributor will contact you and share the scripts and DDL for
> TPC-H.
>
> Actually, you are using an old version (the January version). The current
> master includes many optimizations for TPC-H, so you may need to clone the
> master version.
>
> Regards
> Liang

Re: CarbonData performance benchmarking

Posted by Liang Chen <ch...@gmail.com>.

Hi Rana,

That would be very nice; you could join us in testing TPC-H. One
contributor will contact you and share the scripts and DDL for
TPC-H.

Actually, you are using an old version (the January version). The current
master includes many optimizations for TPC-H, so you may need to clone the
master version.

Regards
Liang


RE: CarbonData performance benchmarking

Posted by Rana Faisal Munir <fm...@essi.upc.edu>.

Hi Liang,

Thank you very much for your reply. My answers are inline below each of your questions.

> 1. Did you use the latest master version, or 1.0? I suggest you use master to test.

I downloaded the latest version from Git and compiled it. It is carbondata_2.11-1.0.0-incubating-shade-hadoop2.2.0.

> 2. Have you tested other TPC-H queries that include a WHERE clause/filter?

I started only recently, and my plan is to move on to the full set of TPC-H queries to see CarbonData's performance improvements over Parquet. Right now, though, I am only running my own queries with different projected columns to see how well CarbonData can push down the projection.

> 3. In your case, is the query slow, or is the "write.format" call below slow?
> write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

I have the same line for Parquet, and Parquet works perfectly fine. I don't think this write is causing the problem.

> 4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to true.
> import org.apache.carbondata.core.util.CarbonProperties
> import org.apache.carbondata.core.constants.CarbonCommonConstants
> CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")

Thanks for this suggestion. I will enable it and share the updated results with you.

> The community is also running TPC-H tests at the moment. Do you want to take part in the testing with us?

It would be nice to be part of this. Could you please guide me on how I can contribute?

Thank you

Regards
Faisal

Re: CarbonData performance benchmarking

Posted by Liang Chen <ch...@gmail.com>.

Hi

1. Did you use the latest master version, or 1.0? I suggest you use master
to test.
2. Have you tested other TPC-H queries that include a WHERE clause/filter?
3. In your case, is the query slow, or is the "write.format" call below slow?
write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj1/")

4. Use the master version for the query test, and set "ENABLE_VECTOR_READER" to
true.
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.constants.CarbonCommonConstants
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_VECTOR_READER, "true")
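
Question 3 asks whether the time goes into the query or into the CSV write. One way to tell the two apart might be to time each step separately with a small helper like the one below. This is only a sketch: the Timing object and its timed method are made-up names, not part of Spark or CarbonData, and the commented usage lines assume a DataFrame named proj4 like the one in the original message.

```scala
object Timing {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload so the helper can be run without a cluster.
    val (sum, seconds) = timed((1L to 1000000L).sum)
    println(f"sum=$sum took $seconds%.3f s")

    // Possible use in the Spark shell (assumes `proj4` from the message):
    //   val (_, scanSec)  = Timing.timed(proj4.rdd.foreach(_ => ()))
    //   val (_, writeSec) = Timing.timed(
    //     proj4.write.format("csv").save("hdfs://hdfsmaster/output/carbon/proj4/"))
  }
}
```

If the scan-only step is already slow, the CSV write can be ruled out as the main cost. Note that the write step re-runs the scan as well, so its time includes both.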

The community is also running TPC-H tests at the moment. Do you want to
take part in the testing with us?

Regards
Liang
