You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by vandana7 <gi...@git.apache.org> on 2018/07/27 06:06:50 UTC

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

GitHub user vandana7 opened a pull request:

    https://github.com/apache/carbondata/pull/2568

    [Presto-integration-Technical-note] created documentation for presto integration

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vandana7/incubator-carbondata presto-documentation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2568.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2568
    
----
commit 1dede8a7d4fe6ce29c9b11f124e236901b0e8814
Author: vandana7 <va...@...>
Date:   2018-07-26T15:48:18Z

    created documetation for presto integration

----


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207436115
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    +
    +**Note :** If you still see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    +
    +```
    +PrestoCli> -- set session distributed_join = 'true'
    +SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
    +```
    +
    +**Performance optimizations by using certain Configuration properties:**
    --- End diff --
    
    put lower values as default if needs to be specified. better not to give values as it will be directly copy pasted


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6081/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1990/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    LGTM


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6312/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207456457
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    +
    +**Note :** If you still see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    +
    +```
    +PrestoCli> -- set session distributed_join = 'true'
    +SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
    +```
    +
    +**Performance optimizations by using certain Configuration properties:**
    --- End diff --
    
    @sraghunandan can you please provide some more clarity on this point, as I am not able to understand this point, I have not provided any values it is only columns I have used in this.


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207435782
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    --- End diff --
    
    Change the sentence.Probability word need not be used


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207433311
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    + * #### _Definitions and Acronyms_
    +  **CarbonData :** CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).
    +
    + **Presto :** Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
    +
    +## Requirements addressed
    +This integration of Presto mainly serves two purpose:
    + * Support of Apache CarbonData as Data Source in Presto.
    + * Execution of Apache CarbonData Queries on Presto.
    +
    +## Design Considerations
    +Following are the design considerations for the Presto Integration with the Carbondata.
    +
    +#### Row Iterator Implementation
    +
    +   Presto provides a way to iterate the records through a RecordSetProvider which creates a RecordCursor so we have to extend this class to create a CarbondataRecordSetProvider and CarbondataRecordCursor to read data from Carbondata core module. The CarbondataRecordCursor will utilize the DictionaryBasedResultCollector class of Core module to read data row by row. This approach has two drawbacks.
    +   * The Presto converts this row data into columnar data again since carbondata itself store data in columnar format we are adding an additional conversion to row to column instead of directly using the column.
    +   * The cursor reads the data row by row instead of a batch of data , so this is a costly operation as we are already storing the data in pages or batches we can directly read the batches of data.
    +
    +#### ColumnarReaders or StreamReaders approach
    +
    +   In this design we can create StreamReaders that can read data from the Carbondata Column based on DataType and directly convert it into Presto Block. This approach saves us the row by row processing as well as reduce the transition and conversion of data . By this approach we can achieve the fastest read from Presto and create a Presto Page by extending PageSourceProvider and PageSource class. This design will be discussed in detail in the next sections of this document.
    +
    +## Module Structure
    +
    +
    +![module structure](../presto/images/module-structure.jpg?raw=true)
    +
    +
    +
    +## Detailed design
    +#### Modules
    +
    +Based on the above functionality, Presto Integration is implemented as following module:
    +
    +1. **Presto**
    +
    +Integration of Presto with CarbonData includes implementation of connector Api of the Presto.
    --- End diff --
    
    carbondata with presto


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by ajantha-bhat <gi...@git.apache.org>.
Github user ajantha-bhat commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207440925
  
    --- Diff: integration/presto/performance-report-of-presto-with-carbon.md ---
    @@ -0,0 +1,27 @@
    +<!--
    --- End diff --
    
    **Please remove this section right now as the report is not fair.**
    
    reasons:
    1. Spark-carbon TPCH results mentioned in website is not matching with this spark-carbon results. Few queries have huge difference due to machine problem [these machines are not in same rack]
    2. Also comparison report should have machine details [RAM, VM/Bare metal] . This was not mentioned.



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6413/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207457064
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    --- End diff --
    
    actual implementation represents the installation and configuration of carbondata which is provided on carbondata repo's doc.
    to make it more clear I have removed the actual implementation to carbondata.


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6519/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207456639
  
    --- Diff: integration/presto/performance-report-of-presto-with-carbon.md ---
    @@ -0,0 +1,27 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Performance Report Of Presto combined with Carbondata
    +Presto is a MPP (Massively Parallel Processing) tool designed to efficiently query vast amounts of data using distributed queries. Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra. It is capable of handling data warehousing and analytics: data analysis, aggregating large amounts of data and producing reports. These workloads are often classified as Online Analytical Processing (OLAP).
    +
    +On the other side, Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
    +
    +While dealing with Carbondata, both of them have their own advantage but presto is far better than spark while executing 90% of the queries. As the Presto-carbon vector readers are much optimized and reduces the table scan time dealing with large table. Even in case of dictionary aggregation and multiple table join, presto performs much better due to its own optimised way of dealing with properties.
    --- End diff --
    
    done, removed the word far better.


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/719/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207474334
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    --- End diff --
    
    done.


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207519570
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    +
    +**Note :** If you still see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    +
    +```
    +PrestoCli> -- set session distributed_join = 'true'
    +SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
    +```
    +
    +**Performance optimizations by using certain Configuration properties:**
    +- **Presto Properties (location: presto/etc/config.properties)**
    +
    +```
    +query.max-memory=210GB
    +```
    +This property value should be given according to available RAM as per all cluster worker nodes.
    --- End diff --
    
    Done


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207430412
  
    --- Diff: integration/presto/performance-report-of-presto-with-carbon.md ---
    @@ -0,0 +1,27 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Performance Report Of Presto combined with Carbondata
    +Presto is a MPP (Massively Parallel Processing) tool designed to efficiently query vast amounts of data using distributed queries. Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra. It is capable of handling data warehousing and analytics: data analysis, aggregating large amounts of data and producing reports. These workloads are often classified as Online Analytical Processing (OLAP).
    +
    +On the other side, Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
    +
    +While dealing with Carbondata, both of them have their own advantage but presto is far better than spark while executing 90% of the queries. As the Presto-carbon vector readers are much optimized and reduces the table scan time dealing with large table. Even in case of dictionary aggregation and multiple table join, presto performs much better due to its own optimised way of dealing with properties.
    --- End diff --
    
    remove the work far better.


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7550/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7682/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207833427
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    --- End diff --
    
    done.


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r206490628
  
    --- Diff: integration/presto/Presto-integration-in-carbondata.md ---
    @@ -0,0 +1,132 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    + * #### _Definitions and Acronyms_
    +  **CarbonData :** CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).
    +
    + **Presto :** Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
    +
    +## Requirements addressed
    +This integration of Presto mainly serves two purpose:
    + * Support of Apache CarbonData as Data Source in Presto.
    + * Execution of Apache CarbonData Queries on Presto.
    +
    +## Design Considerations
    +Following are the design considerations for the Presto Integration with the Carbondata.
    +
    +#### Row Iterator Implementation
    +
    +   Presto provides a way to iterate the records through a RecordSetProvider which creates a RecordCursor so we have to extend this class to create a CarbondataRecordSetProvider and CarbondataRecordCursor to read data from Carbondata core module. The CarbondataRecordCursor will utilize the DictionaryBasedResultCollector class of Core module to read data row by row. This approach has two drawbacks.
    +   * The Presto converts this row data into columnar data again since carbondata itself store data in columnar format we are adding an additional conversion to row to column instead of directly using the column.
    +   * The cursor reads the data row by row instead of a batch of data , so this is a costly operation as we are already storing the data in pages or batches we can directly read the batches of data.
    +
    +#### ColumnarReaders or StreamReaders approach
    +
    +   In this design we can create StreamReaders that can read data from the Carbondata Column based on DataType and directly convert it into Presto Block. This approach saves us the row by row processing as well as reduce the transition and conversion of data . By this approach we can achieve the fastest read from Presto and create a Presto Page by extending PageSourceProvider and PageSource class. This design will be discussed in detail in the next sections of this document.
    +
    +## Module Structure
    +
    +
    +![module structure](../presto/images/module-structure.jpg?raw=true)
    +
    +
    +
    +## Detailed design
    +#### Modules
    +
    +Based on the above functionality, Presto Integration is implemented as following module:
    +
    +1. **Presto**
    +
    +Integration of Presto with CarbonData includes implementation of connector Api of the Presto.
    +#### Functions developed
    +
    +![functionas developed](../presto/images/functions-developed-diagram.png?raw=true)
    +
    +1. **CarbonDataPlugin :** It implements the Plugin Interface of the Presto.
    +1. CarbonDataConnectorFactory : It implements the ConnectorFactory Interface of the Presto. The connector factory is a simple interface responsible for creating an instance of a Connector object that returns instances of the following services:
    +    * ConnectorMetadata
    +
    +    * ConnectorSplitManager
    +
    +    * ConnectorHandleResolver
    +
    +1. **CarbonDataConnector :** It implements the Connector Interface of the Presto.
    +1. **CarbonDataMetadata  :** It implements the ConnectorMetadata Interface of the Presto.  The connector metadata interface has a large number of important methods that are responsible for allowing Presto to look at lists of schemas, lists of tables, lists of columns, and other metadata about a particular data source.
    --- End diff --
    
    it would be better that add these description to code as annotation


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7557/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6175/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207433045
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    --- End diff --
    
    what do you mean by installation and configuration details of actual implementation?


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207481959
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    --- End diff --
    
    done


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8786/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7795/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207833200
  
    --- Diff: integration/presto/performance-report-of-presto-with-carbon.md ---
    @@ -0,0 +1,27 @@
    +<!--
    --- End diff --
    
    done


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207517977
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    +
    +**Note :** If you still see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    --- End diff --
    
    done


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207479703
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    --- End diff --
    
    done


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207475132
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    + * #### _Definitions and Acronyms_
    +  **CarbonData :** CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).
    +
    + **Presto :** Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
    +
    +## Requirements addressed
    +This integration of Presto mainly serves two purpose:
    + * Support of Apache CarbonData as Data Source in Presto.
    + * Execution of Apache CarbonData Queries on Presto.
    +
    +## Design Considerations
    +Following are the design considerations for the Presto Integration with the Carbondata.
    +
    +#### Row Iterator Implementation
    +
    +   Presto provides a way to iterate the records through a RecordSetProvider which creates a RecordCursor so we have to extend this class to create a CarbondataRecordSetProvider and CarbondataRecordCursor to read data from Carbondata core module. The CarbondataRecordCursor will utilize the DictionaryBasedResultCollector class of Core module to read data row by row. This approach has two drawbacks.
    +   * The Presto converts this row data into columnar data again since carbondata itself store data in columnar format we are adding an additional conversion to row to column instead of directly using the column.
    +   * The cursor reads the data row by row instead of a batch of data , so this is a costly operation as we are already storing the data in pages or batches we can directly read the batches of data.
    +
    +#### ColumnarReaders or StreamReaders approach
    +
    +   In this design we can create StreamReaders that can read data from the Carbondata Column based on DataType and directly convert it into Presto Block. This approach saves us the row by row processing as well as reduce the transition and conversion of data . By this approach we can achieve the fastest read from Presto and create a Presto Page by extending PageSourceProvider and PageSource class. This design will be discussed in detail in the next sections of this document.
    +
    +## Module Structure
    +
    +
    +![module structure](../presto/images/module-structure.jpg?raw=true)
    +
    +
    +
    +## Detailed design
    +#### Modules
    +
    +Based on the above functionality, Presto Integration is implemented as following module:
    +
    +1. **Presto**
    +
    +Integration of Presto with CarbonData includes implementation of connector Api of the Presto.
    --- End diff --
    
    done


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207435928
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    +
    +**Note :** If you still see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    --- End diff --
    
    can we specify how to use distributed has join


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207436035
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    +
    +**Note :** If you still see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
    +
    +```
    +PrestoCli> -- set session distributed_join = 'true'
    +SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
    +```
    +
    +**Performance optimizations by using certain Configuration properties:**
    +- **Presto Properties (location: presto/etc/config.properties)**
    +
    +```
    +query.max-memory=210GB
    +```
    +This property value should be given according to available RAM as per all cluster worker nodes.
    --- End diff --
    
    Total RAM available in the cluster(Sum of all nodes RAM memory)


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207434322
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    --- End diff --
    
    change the sentence to specify the suggestion first


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6174/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Failed  with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10242/



---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6304/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207433161
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    --- End diff --
    
    Carbondata integration with presto. carbondata is not an execution engine


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2167/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207435876
  
    --- Diff: integration/presto/presto-integration-technical-note.md ---
    @@ -0,0 +1,253 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# Presto Integration Technical Note
    +Presto Integration with Carbon data include the below steps:
    +
    +* Setting up Presto Cluster
    +
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +
    +
    +* ### Installing Presto
    +
    + 1. Download the 0.187 version of Presto using:
    +  `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
    +
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    +
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +
    +  ```
    +    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
    +
    +    mv presto-cli-0.187-executable.jar presto
    +
    +    chmod +x presto
    +  ```
    +
    +### Create Configuration Files
    +
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
    +  3. Install uuid to generate a node.id.
    +
    +      ```
    +      sudo apt-get install uuid
    +
    +      uuid
    +      ```
    +
    +
    +##### Contents of your node.properties file
    +
    +  ```
    +  node.environment=production
    +  node.id=<generated uuid>
    +  node.data-dir=/home/ubuntu/data
    +  ```
    +
    +##### Contents of your jvm.config file
    +
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +
    +##### Contents of your log.properties file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    +
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +
    +### Coordinator Configurations
    +
    +##### Contents of your config.properties
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
    +
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +
    +Also relation between below two configuration-properties should be like:
    +If, `query.max-memory-per-node=30GB`
    +Then, `query.max-memory=<30GB * number of nodes>`.
    +
    +### Worker Configurations
    +
    +##### Contents of your config.properties
    +
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +
    +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command).
    +
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +
    +### Catalog Configurations
    +
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +
    +##### Configuring Carbondata in Presto
    +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
    +
    +### Add Plugins
    +
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +
    +### Start Presto Server on all nodes
    +
    +```
    +./presto-server-0.187/bin/launcher start
    +```
    +To run it as a background process.
    +
    +```
    +./presto-server-0.187/bin/launcher run
    +```
    +To run it in foreground.
    +
    +### Start Presto CLI
    +```
    +./presto
    +```
    +To connect to carbondata catalog use the following command:
    +
    +```
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +```
    +Execute the following command to ensure the workers are connected.
    +
    +```
    +select * from system.runtime.nodes;
    +```
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +
    +**Note :** Create Tables and data loads should be done before executing queries as we can not create carbon table from this interface.
    +
    +## **Presto Performance Tuning**
    +
    +**Performance Optimizations according to data types and schema:**
    +
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    +
    +- Use Double instead of Decimal if required precision is low.
    +
    +- Columns having low-cardinality should be created as dictionary columns. This will improve query performance to a great extent.
    +
    +**Performance Optimization by changing Queries:**
    +
    +- There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
    +
    +- Aggregating a series of LIKE clauses in one single regexp_like clause.
    +
    +**For example :**
    +
    +```
    +SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  method LIKE '%GET%' OR
    +  method LIKE '%POST%' OR
    +  method LIKE '%PUT%' OR
    +  method LIKE '%DELETE%'
    + ```
    +
    + can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
    +
    + ```
    + SELECT
    +  ...
    +FROM
    +  access
    +WHERE
    +  regexp_like(method, 'GET|POST|PUT|DELETE')
    + ```
    +
    +- Specifying large tables first in join clause
    +
    +
    +The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
    --- End diff --
    
    check whether we can use the word like fact table to left and dimension table to right


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6308/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207832179
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    + * #### _Definitions and Acronyms_
    +  **CarbonData :** CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).
    +
    + **Presto :** Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
    +
    +## Requirements addressed
    +This integration of Presto mainly serves two purpose:
    + * Support of Apache CarbonData as Data Source in Presto.
    + * Execution of Apache CarbonData Queries on Presto.
    +
    +## Design Considerations
    --- End diff --
    
    Done


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/37/



---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by vandana7 <gi...@git.apache.org>.
Github user vandana7 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207462575
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    --- End diff --
    
    To make it more clear I have linked the installation and configuration for integrating Carbondata with presto to this document. If anyone wants to know about installation and configuration they can easily visit that document page.


---

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Posted by sraghunandan <gi...@git.apache.org>.
Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2568#discussion_r207433636
  
    --- Diff: integration/presto/presto-integration-in-carbondata.md ---
    @@ -0,0 +1,134 @@
    +<!--
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +
    +      http://www.apache.org/licenses/LICENSE-2.0
    +
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +-->
    +
    +# PRESTO INTEGRATION IN CARBONDATA
    +
    +1. [Document Purpose](#document-purpose)
    +    1. [Purpose](#purpose)
    +    1. [Scope](#scope)
    +    1. [Definitions and Acronyms](#definitions-and-acronyms)
    +1. [Requirements addressed](#requirements-addressed)
    +1. [Design Considerations](#design-considerations)
    +    1. [Row Iterator Implementation](#row-iterator-implementation)
    +    1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach)
    +1. [Module Structure](#module-structure)
    +1. [Detailed design](#detailed-design)
    +    1. [Modules](#modules)
    +    1. [Functions Developed](#functions-developed)
    +1. [Integration Tests](#integration-tests)
    +1. [Tools and languages used](#tools-and-languages-used)
    +1. [References](#references)
    +
    +## Document Purpose
    +
    + * #### _Purpose_
    + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData.
    +
    + Its main purpose is to -
    +   *  Provide the link between the Functional Requirement and the detailed Technical Design documents.
    +   *  Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.
    +
    + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page.As is true with any high level design, this document will be updated and refined based on changing requirements.
    + * #### _Scope_
    + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI.  CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto.
    + * #### _Definitions and Acronyms_
    +  **CarbonData :** CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).
    +
    + **Presto :** Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
    +
    +## Requirements addressed
    +This integration of Presto mainly serves two purpose:
    + * Support of Apache CarbonData as Data Source in Presto.
    + * Execution of Apache CarbonData Queries on Presto.
    +
    +## Design Considerations
    --- End diff --
    
    Can we add a design from presto which talks about integration of data sources


---

[GitHub] carbondata issue #2568: [Presto-integration-Technical-note] created document...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2568
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/541/



---