Posted to commits@tajo.apache.org by Apache Wiki <wi...@apache.org> on 2013/10/19 19:43:36 UTC
[Tajo Wiki] Update of "GettingStarted" by HyunsikChoi

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tajo Wiki" for change notification.

The "GettingStarted" page has been changed by HyunsikChoi:
https://wiki.apache.org/tajo/GettingStarted?action=diff&rev1=14&rev2=15

Comment:
updated by TAJO-261

- == Prerequisites ==
+ = Prerequisites =
-  * Hadoop 2.0.3-alpha
-  * Java 1.6
+  * Hadoop 2.0.3-alpha or 2.0.5-alpha
+  * Java 1.6 or higher
+  * Protocol buffer 2.4.1
  
- == Build Tajo from Source Code ==
+ = Build Tajo from Source Code =
   
  Download the source code and build Tajo as follows:
  
@@ -15, +16 @@

  $ ls tajo-dist/target/tajo-x.y.z-SNAPSHOT.tar.gz
  }}}
  
- If you meet some errors or you want to know the build instruction in more detail, please read Build Instruction.
+ If you encounter errors or want more detail about the build, please read the [[BuildInstruction|Build Instruction]].
  
- == Unpack tarball ==
+ = Unpack tarball =
  
  Unpack the tarball produced by the build (refer to the build instructions).
  
@@ -25, +26 @@

  $ tar xzvf tajo-0.2.0-SNAPSHOT.tar.gz
  }}}
  
- This will result in the creation of subdirectory named tajo-x.y.z-SNAPSHOT. You MUST copy this directory into the same directory on all yarn cluster nodes.
+ This creates a subdirectory named tajo-x.y.z-SNAPSHOT. You MUST copy this directory to the same path on all cluster nodes.
  
- == Configuration ==
+ = Configuration =
  First of all, you need to set the environment variables for your Hadoop cluster and Tajo.
  
  {{{
  export JAVA_HOME=/usr/lib/jvm/openjdk-1.6.x
  export HADOOP_HOME=/usr/local/hadoop-2.0.x
- export HADOOP_YARN_HOME=/usr/local/hadoop-2.0.x
  export TAJO_HOME=<tajo-install-dir>
  }}}
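Before going further, it can help to confirm that these variables are set and point to existing directories. The following is a minimal sketch and not part of Tajo. The function name `check_tajo_env` is hypothetical, and it assumes a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Tajo): verify that each required
# environment variable is set and names an existing directory.
check_tajo_env() {
  status=0
  for name in JAVA_HOME HADOOP_HOME TAJO_HOME; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name does not point to a directory: $value" >&2
      status=1
    fi
  done
  return $status
}
```

For example, `check_tajo_env && $TAJO_HOME/bin/start-tajo.sh` would start Tajo only when all three variables pass the check.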
  
- Tajo provides two cluster running modes: On-demand mode using Yarn and Standby mode where Tajo works with its own resource manager. You should choose one mode of them.
- 
- === On-demand Mode ===
- On-demand mode employs Hadoop Yarn as a primary cluster resource manager. In the on-demand mode, TajoMaster and QueryMaster ask Yarn resource manager to allocate container resources for each query. So, it is needed to add some configs to yarn-site.xml.
- 
- First of all, in on-demand mode, Tajo requires an auxiliary service called PullServer for data repartitioning. For this, you must add or modify the following configuration parameters in $HADOOP_HOME/etc/hadoop/yarn-site.xml.
- 
- {{{
- <property>
-   <name>yarn.nodemanager.aux-services</name>
-   <value>mapreduce.shuffle,tajo.pullserver</value>
- </property>
- 
- <property>
-   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
-   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- 
- <property>
-   <name>yarn.nodemanager.aux-services.tajo.pullserver.class</name>
-   <value>org.apache.tajo.pullserver.PullServerAuxService</value>
- </property>
- 
- <property>
-   <name>tajo.task.localdir</name>
-   <value>/tmp/tajo-localdir</value>
- </property>
- }}}
- 
- For the auxiliary, you should copy some jar files to the Hadoop Yarn library dir.
- 
- {{{
- $ cp $TAJO_HOME/tajo-common-x.y.z.jar $HADOOP_HOME/share/hadoop/yarn/lib
- $ cp $TAJO_HOME/tajo-catalog-common-x.y.z.jar $HADOOP_HOME/share/hadoop/yarn/lib
- $ cp $TAJO_HOME/tajo-core-pullserver-x.y.z.jar $HADOOP_HOME/share/hadoop/yarn/lib
- $ cp $TAJO_HOME/tajo-core-storage-x.y.z.jar $HADOOP_HOME/share/hadoop/yarn/lib
- }}}
- 
- Please copy $TAJO_HOME/conf/tajo-site.xml.template to tajo-site.xml. You must add the following configs to your tajo-site.xml and then change hostname and port to your namenode address.
- {{{
-   <property>
-     <name>tajo.rootdir</name>
-     <value>hdfs://hostname:port/tajo</value>
-   </property>
- 
-   <property>
-     <name>tajo.task.localdir</name>
-     <value>/tmp/tajo-localdir</value>
-   </property>
- }}}
- 
- If you want know configuration in more detail, read Configuration Guide.
- 
- === Standby Mode ===
- In the standby mode, TajoMaster preempts the cluster resource and uses its own cluster resource manager called TajoWorkerResourceManager. TajoWorkerResourceManager coordinates and allocates cluster resources including CPU, memory, and disk to a query.
- 
- {{{
-   <property>
-     <name>tajo.rootdir</name>
-     <value>hdfs://hostname:port/tajo</value>
-   </property>
- 
-   <property>
-     <name>tajo.master.manager.addr</name>
-     <value>hostname:port</value>
-     <description>the default port is 9005</description>
-   </property>
- 
-   <property>
-     <name>tajo.task.localdir</name>
-     <value>/tmp/tajo-localdir</value>
-   </property>
- 
-   <property>
-     <name>tajo.resource.manager</name>
-     <value>org.apache.tajo.master.rm.TajoWorkerResourceManager</value>
-   </property>
- 
-   <property>
-     <name>tajo.worker.slots.use.os.info</name>
-     <value>false</value>
-     <description>If true, Tajo system obtains the physical resource information from OS. If false, the physical resource information is obtained from the below configs.
-     </description>
-   </property>
- 
-   <property>
-     <name>tajo.worker.slots.memoryMB</name>
-     <value>5000</value>
-   </property>
- 
-   <property>
-     <name>tajo.worker.slots.disk</name>
-     <value>4</value>
-     <description>The number of disks on a worker</description>
-   </property>
- 
-   <property>
-     <name>tajo.worker.slots.disk.concurrency</name>
-     <value>4</value>
-     <description>the maximum concurrency number per disk slot</description>
-   </property>
- 
-   <property>
-     <name>tajo.worker.slots.cpu.core</name>
-     <value>4</value>
-     <description>The number of CPU cores on a worker</description>
-   </property> 
- }}}
- 
- In addition, you need to add 'TAJO_WORKERS_STANDBY_MODE' variable to conf/tajo-env.sh as follows:
- {{{
- export TAJO_WORKER_STANDBY_MODE=true
- }}}
- 
- == Running Tajo ==
+ = Running Tajo =
- Before launching the tajo, you should create the tajo root dir and set the permission as follows:
- {{{
- $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tajo
- $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tajo
- }}}
- 
  To launch the Tajo master, execute start-tajo.sh.
  {{{
  $ $TAJO_HOME/bin/start-tajo.sh
  }}}
  
- After then, you can use tajo-cli to access the command line interface of Tajo.
+ After that, you can use tsql to access the command line interface of Tajo. To learn how to use tsql, read the [[https://wiki.apache.org/tajo/tsql|Tajo Interactive Shell]] document.
  {{{
  $ $TAJO_HOME/bin/tsql
  }}}
  
+ Typing \? in tsql displays the help documentation.
- == Query Execution ==
- First of all, we need to prepare some data for query execution.
  
+ = First Query Execution =
+ First of all, we need to prepare some data for query execution. For example, you can make a simple text-based table as follows:
  {{{
  $ mkdir /home/x/table1
  $ cd /home/x/table1
@@ -186, +67 @@

  The schema of this table is (int, text, float, text).
  
  {{{
- $ $TAJO_HOME/bin/tajo cli
+ $ $TAJO_HOME/bin/tsql
  
- tajo> create external table table1 (id int, name text, score float, type text) using csv with ('csvfile.delimiter'='|') location 'file:/home/x/table1';;
+ tajo> create external table table1 (id int, name text, score float, type text) using csv with ('csvfile.delimiter'='|') location 'file:/home/x/table1';
  }}}
  
  To load an external table, use the 'create external table' statement. In the location clause, use the absolute directory path with an appropriate scheme. If the table resides in HDFS, use 'hdfs' instead of 'file'.
  
- If you want to know DDL statements in more detail, please see Query Language.
+ For more detail on DDL statements, please see [[QueryLanguage|Query Language]].
  {{{
  tajo> \d
  table1
@@ -218, +99 @@

  
  The '\d [table name]' command shows the description of a given table.
  
- Now, you can execute SQL queries as follows:
+ Also, you can execute SQL queries as follows:
  
  {{{
  tajo> select * from table1 where id > 2;
- final state: QUERY_SUCCEEDED, init time: 4.118 sec, execution time: 4.334 sec, total response time: 8.452 sec
- result: hdfs://x.x.x.x:8020/user/x/tajo/q_1363768615503_0001_000001
+ final state: QUERY_SUCCEEDED, init time: 0.069 sec, response time: 0.397 sec
+ result: file:/tmp/tajo-hadoop/staging/q_1363768615503_0001_000001/RESULT, 3 rows ( 35B)
  
  id,  name,  score,  type
  - - - - - - - - - -  - - -
  3,  ghi,  3.4,  c
  4,  jkl,  4.5,  d
  5,  mno,  5.6,  e
+ 
  tajo>
  }}}
  
- (In the current implementation, for each query, Tajo has some initial overhead to launch containers on node managers. However, we will reduce this overhead soon.)
+ = Distributed mode on HDFS cluster =
+ Add the following properties to the tajo-site.xml file, replacing hostname and port with the values for your cluster.
+ 
+ {{{
+   <property>
+     <name>tajo.rootdir</name>
+     <value>hdfs://hostname:port/tajo</value>
+   </property>
+ 
+   <property>
+     <name>tajo.master.umbilical-rpc.address</name>
+     <value>hostname:26001</value>
+   </property>
+ 
+   <property>
+     <name>tajo.catalog.client-rpc.address</name>
+     <value>hostname:26005</value>
+   </property>
+ }}}
+ 
+ Before launching Tajo, create the Tajo root directory and set its permissions as follows:
+ {{{
+ $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tajo
+ $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tajo
+ }}}
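The two commands above can be wrapped so that they are safe to rerun. This is only a sketch under stated assumptions: the function name `ensure_tajo_root` is hypothetical, and it assumes your Hadoop version supports `hadoop fs -test -d` for checking whether a directory exists.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Tajo): create the Tajo root
# directory on HDFS only if it is missing, then make it group-writable.
ensure_tajo_root() {
  root="${1:-/tajo}"
  hadoop_cmd="$HADOOP_HOME/bin/hadoop"
  # Assumption: 'fs -test -d' exits 0 only if the path is an existing directory.
  if ! "$hadoop_cmd" fs -test -d "$root"; then
    "$hadoop_cmd" fs -mkdir "$root" || return 1
  fi
  "$hadoop_cmd" fs -chmod g+w "$root"
}
```

Calling `ensure_tajo_root` with no argument uses /tajo, matching the path in tajo.rootdir above. Pass a different path if your tajo.rootdir differs.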
+ 
+ Then, execute start-tajo.sh:
+ {{{
+ $ $TAJO_HOME/bin/start-tajo.sh
+ }}}
  
  Enjoy Apache Tajo!