Posted to by sraghunandan <> on 2018/08/03 04:11:34 UTC

[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...

Github user sraghunandan commented on a diff in the pull request:
    --- Diff: integration/presto/ ---
    @@ -0,0 +1,253 @@
    +    Licensed to the Apache Software Foundation (ASF) under one or more
    +    contributor license agreements.  See the NOTICE file distributed with
    +    this work for additional information regarding copyright ownership.
    +    The ASF licenses this file to you under the Apache License, Version 2.0
    +    (the "License"); you may not use this file except in compliance with
    +    the License.  You may obtain a copy of the License at
    +    Unless required by applicable law or agreed to in writing, software
    +    distributed under the License is distributed on an "AS IS" BASIS,
    +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +    See the License for the specific language governing permissions and
    +    limitations under the License.
    +# Presto Integration Technical Note
    +Integrating Presto with CarbonData includes the following steps:
    +* Setting up Presto Cluster
    +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
    +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
    +## **Let us begin with the first step of Presto Cluster Setup:**
    +* ### Installing Presto
    + 1. Download the 0.187 version of Presto using:
    +  `wget`
    + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
    + 3. Download the Presto CLI for the coordinator and name it presto.
    +  ```
    +    wget
    +    mv presto-cli-0.187-executable.jar presto
    +    chmod +x presto
    +  ```
    +### Create Configuration Files
    +  1. Create `etc` folder in presto-server-0.187 directory.
    +  2. Create ``, `jvm.config`, ``, and `` files.
    +  3. Install uuid to generate a
    +      ```
    +      sudo apt-get install uuid
    +      uuid
    +      ```
    +##### Contents of your file
    +  ```
    +  node.environment=production
    +  <generated uuid>
    +  ```
    +##### Contents of your jvm.config file
    +  ```
    +  -server
    +  -Xmx16G
    +  -XX:+UseG1GC
    +  -XX:G1HeapRegionSize=32M
    +  -XX:+UseGCOverheadLimit
    +  -XX:+ExplicitGCInvokesConcurrent
    +  -XX:+HeapDumpOnOutOfMemoryError
    +  -XX:OnOutOfMemoryError=kill -9 %p
    +  ```
    +##### Contents of your file
    +  ```
    +  com.facebook.presto=INFO
    +  ```
    + The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
    +### Coordinator Configurations
    +##### Contents of your
    +  ```
    +  coordinator=true
    +  node-scheduler.include-coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery-server.enabled=true
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +The options `coordinator=true` and `node-scheduler.include-coordinator=false` indicate that the node is the coordinator and tell it not to do any of the computation work itself but to use the workers instead.
    +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
    +Also, the two memory properties should be related as follows:
    +If `query.max-memory-per-node=30GB`,
    +then `query.max-memory=<30GB * number of nodes>`.
    +### Worker Configurations
    +##### Contents of your
    +  ```
    +  coordinator=false
    +  http-server.http.port=8086
    +  query.max-memory=50GB
    +  query.max-memory-per-node=2GB
    +  discovery.uri=<coordinator_ip>:8086
    +  ```
    +**Note**: The `jvm.config` and `` files are the same for all nodes (workers and coordinator). Each node should have a different `` (generated by the `uuid` command).
    +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
    +### Catalog Configurations
    +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
    +##### Configuring Carbondata in Presto
    +1. Create a file named `` in the `catalog` folder and set the required properties on all the nodes.
    +### Add Plugins
    +1. Create a directory named `carbondata` in plugin directory of presto.
    +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
    +### Start Presto Server on all nodes
    +./presto-server-0.187/bin/launcher start
    +To run it as a background process.
    +./presto-server-0.187/bin/launcher run
    +To run it in foreground.
    +### Start Presto CLI
    +To connect to carbondata catalog use the following command:
    +./presto --server <coordinator_ip>:8086 --catalog carbondata --schema <schema_name>
    +Execute the following command to ensure the workers are connected.
    +select * from system.runtime.nodes;
    +Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
    +**Note:** Tables must be created and data loaded before executing queries, because carbon tables cannot be created from this interface.
    +## **Presto Performance Tuning**
    +**Performance Optimizations according to data types and schema:**
    +- When the data could be stored in Int as well as String. Example: keys for a table then using Int gives a better performance
    --- End diff --
    change the sentence to specify the suggestion first
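
Regarding the memory notes earlier in the diff (set `query.max-memory-per-node` to about half of the JVM max heap, and `query.max-memory` to the per-node value times the number of nodes), a small worked example might make the relationship easier to follow. The sketch below is only an illustration: the function names `recommended_per_node_gb` and `cluster_query_memory` are hypothetical helpers, not part of Presto or CarbonData, and the node count of 4 is an assumed value.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Presto or of this PR) that apply the
# two rules of thumb stated in the technical note.

# Rule 1: query.max-memory-per-node is roughly half of the JVM -Xmx value.
recommended_per_node_gb() {
  xmx_gb=$1                      # JVM max heap in GB, e.g. 16 for -Xmx16G
  echo $((xmx_gb / 2))
}

# Rule 2: query.max-memory = query.max-memory-per-node * number of nodes.
cluster_query_memory() {
  per_node_gb=$1                 # query.max-memory-per-node, in GB
  node_count=$2                  # number of nodes (assumed by the caller)
  echo "query.max-memory=$((per_node_gb * node_count))GB"
}

# Example: the 30GB per-node figure from the note, with an assumed 4 nodes.
cluster_query_memory 30 4
```

With the `-Xmx16G` setting shown in `jvm.config`, `recommended_per_node_gb 16` yields 8, which is higher than the `2GB` used in the sample configuration; the note itself allows a lower value for highly concurrent workloads.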
