Posted to issues@carbondata.apache.org by chenliang613 <gi...@git.apache.org> on 2018/05/03 07:15:50 UTC

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r185711684
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,213 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene DataMap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export SPARK_HOME pointing to the unzipped directory.
    +
    +2. Build the CarbonData jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in a new terminal, type `:paste`, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.SparkSession
    +   import org.apache.spark.sql.CarbonSession._
    +   
    +   val warehouse = new File("./warehouse").getCanonicalPath
    +   val metastore = new File("./metastore").getCanonicalPath
    + 
    +   val spark = SparkSession
    +     .builder()
    +     .master("local")
    +     .appName("luceneDatamapExample")
    +     .config("spark.sql.warehouse.dir", warehouse)
    +     .getOrCreateCarbonSession(warehouse, metastore)
    +
    +   spark.sparkContext.setLogLevel("ERROR")
    +
    +   // Drop the table if it exists from a previous run
    +   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    +   
    +   // Create main table
    +   spark.sql(
    +     s"""
    +        |CREATE TABLE datamap_test (
    +        |name string,
    +        |age int,
    +        |city string,
    +        |country string)
    +        |STORED BY 'carbondata'
    +      """.stripMargin)
    + 
    +   // Create a lucene datamap on the main table,
    +   // indexing the 'name' and 'country' columns
    +   spark.sql(
    +     s"""
    +        |CREATE DATAMAP dm
    +        |ON TABLE datamap_test
    +        |USING "lucene"
    +        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
    +      """.stripMargin)
    +
    +   import spark.implicits._
    +   import org.apache.spark.sql.SaveMode
    +
    +   // Load data to the main table; if lucene index writing
    +   // fails, the datamap will be disabled in query
    +   spark.sparkContext.parallelize(1 to 10)
    +     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +     .toDF("name", "age", "city", "country")
    +     .write
    +     .format("carbondata")
    +     .option("tableName", "datamap_test")
    +     .option("compress", "true")
    +     .mode(SaveMode.Append)
    +     .save()
    +
    +   // Query using the lucene index; all matches are returned
    +   spark.sql(
    +     s"""
    +        |SELECT *
    +        |FROM datamap_test WHERE
    +        |TEXT_MATCH('name:c10')
    +      """.stripMargin).show
    +
    +   // Same query, limited to the first 10 results
    +   spark.sql(
    +     s"""
    +        |SELECT *
    +        |FROM datamap_test WHERE
    +        |TEXT_MATCH('name:c10', 10)
    +      """.stripMargin).show
    +
    +   spark.stop()
    +   ```
    +
    +## DataMap Management
    +A Lucene DataMap can be created using the following DDL:
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'city, name', ...)
    +  ```
    +
    +A DataMap can be dropped using the following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on the main table.
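    +
    +For instance, the full lifecycle can be driven from spark-shell. Below is a minimal
    +sketch that reuses the **datamap_test** table from the quick example; the DDL is the
    +same as above, only wrapped in `spark.sql`:
    +  ```scala
    +  // Create the datamap only if it is not already present
    +  spark.sql(
    +    s"""
    +       |CREATE DATAMAP IF NOT EXISTS dm
    +       |ON TABLE datamap_test
    +       |USING "lucene"
    +       |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
    +     """.stripMargin)
    +
    +  // List all datamaps created on the table
    +  spark.sql("SHOW DATAMAP ON TABLE datamap_test").show(false)
    +
    +  // Drop the datamap when it is no longer needed
    +  spark.sql("DROP DATAMAP IF EXISTS dm ON TABLE datamap_test")
    +  ```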
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  index datamap and managed along with main tables by CarbonData.User can create lucene datamaps 
    +  to improve query performance on string columns.
    +  
    +  For instance, consider a main table called **datamap_test**, which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  Users can create a Lucene DataMap on this table using the CREATE DATAMAP DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +text_columns(String Columns) given in DMProperties which contains information about the data
    +location of text_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
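    +
    +For instance, any regular load triggers index generation. The following is a minimal
    +sketch using the DataFrame writer from the quick example; it assumes the spark-shell
    +session and the **datamap_test** table created there:
    +```scala
    +import spark.implicits._
    +import org.apache.spark.sql.SaveMode
    +
    +// A regular write to the main table; lucene index files for the
    +// TEXT_COLUMNS ('name' and 'country') are generated during the load
    +// and written to a folder named after the datamap (here 'dm')
    +// inside the new segment folder
    +Seq(("bob", 25, "shanghai", "china"))
    +  .toDF("name", "age", "city", "country")
    +  .write
    +  .format("carbondata")
    +  .option("tableName", "datamap_test")
    +  .mode(SaveMode.Append)
    +  .save()
    +```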
    +
    +A system-level configuration, carbon.lucene.compression.mode, can be set to control the
    +compression of Lucene index files. The default value is speed, which favors index writing
    +speed. If the value is compression, the index files are compressed to reduce their size.
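    +
    +For instance, the property can be set programmatically before loading data (a sketch;
    +as a system-level setting it can equally be placed in the carbon.properties file):
    +```scala
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// Favor smaller lucene index files over index writing speed
    +// (the default mode is "speed")
    +CarbonProperties.getInstance()
    +  .addProperty("carbon.lucene.compression.mode", "compression")
    +```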
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH('name:n10',10)[the second parameter represents the number of result to be returned, if 
    +user does not specify this value, all results will be returned without any limit] is fired, two jobs 
    +are fired.The first job writes the temporary files in folder created at table level which contains 
    +lucene's seach results and these files will be read in second job to give faster results. These 
    +temporary files will be cleared once the query finishes.
    +
    +Users can verify whether a query leverages the Lucene DataMap by executing the `EXPLAIN`
    +command, which shows the transformed logical plan; from it, the user can check whether
    +the TEXT_MATCH() filter is applied to the query.
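    +
    +For instance (a sketch; the exact plan text depends on the Spark and CarbonData versions):
    +```scala
    +// If the lucene datamap can serve the filter, the plan that is
    +// shown contains the TEXT_MATCH('name:n10') predicate
    +spark.sql("EXPLAIN SELECT * FROM datamap_test WHERE TEXT_MATCH('name:n10')")
    +  .show(false)
    +```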
    +
    +The LIKE queries below can be converted to TEXT_MATCH queries as follows:
    +```
    +select * from datamap_test where name='n10'
    +
    +select * from datamap_test where name like 'n1%'
    +
    +select * from datamap_test where name like '%10'
    +
    +select * from datamap_test where name like '%n%'
    +
    +select * from datamap_test where name like '%10' and name not like '%n%'
    +```
    +Lucene TEXT_MATCH Queries:
    +```
    +select * from datamap_test where TEXT_MATCH('name:n10')
    +
    +select * from datamap_test where TEXT_MATCH('name:n1*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10')
    +
    +select * from datamap_test where TEXT_MATCH('name:*n*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10 and -name:*n*')
    --- End diff --
    
    the syntax is wrong: "and" is not needed, it should be TEXT_MATCH('name:*10 -name:*n*')


---