You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@carbondata.apache.org by akashrn5 <gi...@git.apache.org> on 2018/04/23 14:03:36 UTC

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

GitHub user akashrn5 opened a pull request:

    https://github.com/apache/carbondata/pull/2215

    [wip]add documentation for lucene datamap

    added documentation for lucene datamap
    
    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
    
     - [ ] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/akashrn5/incubator-carbondata doc_lucene

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2215.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2215
    
----
commit 5403c832ca98569f60acf42a95c42ae21d8d3be5
Author: akashrn5 <ak...@...>
Date:   2018-04-23T13:57:56Z

    add documentation for lucene datamap

----


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189421956
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +#### DataMap Management 
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap 
    +  to improve query performance on string columns which has content of more length.
    --- End diff --
    
    Please rephrase to describe: this datamap is intended for text content, and you want to search the tokenized word or pattern of it.



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by KanakaKumar <gi...@git.apache.org>.

Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r184316398
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
    +then lucene index files will be generated for all the text_columns (String Columns) given in
    +DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
    +data of text_columns. These index files will be written inside a folder named as datamap name inside
    +each segment directories.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
    +when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
    --- End diff --
    
    Please add the details to mention supported syntax is lucene  query. And list few example queries which can cover tokenezier based search and like queries


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183617996
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
    +then lucene index files will be generated for all the text_columns (String Columns) given in
    +DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
    +data of text_columns. These index files will be written inside a folder named as datamap name inside
    +each segment directories.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
    +when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
    +pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
    +blocklet a temporary file will be generated which has information till row level, but prune will
    +return blocklets finally.
    +
    +When query reaches executor side, the temporary files written will be read and bitset groups are
    +formed to return the query result.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +
    +## Data Management with pre-aggregate tables
    +Once there is lucene datamap is created on the main table, following command on the main
    +table
    +is not supported:
    +1. Data management command: `UPDATE/DELETE`.
    +2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`, 
    +`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
    --- End diff --
    
    **Note:**
    
    Use this format for Note and start in a new line


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4436/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    @chenliang613 please review and merge


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189227705
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,213 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
    +
    +2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.{CarbonEnv, SparkSession}
    +   import org.apache.spark.sql.CarbonSession._
    +   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    +   import org.apache.carbondata.core.util.path.CarbonStorePath
    +   
    --- End diff --
    
    ok


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183617239
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    --- End diff --
    
    User can create Lucene datamap using the Create DataMap DDL:


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5964/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4990/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183617096
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    --- End diff --
    
    end all sentence with a period (.)


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183616698
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    --- End diff --
    
    Lucene DataMap can be created using following DDL:


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183616213
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    --- End diff --
    
    The below is a procedure, so put it in a numbered list: 
    Step 1:
    Step 2:



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183616653
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    --- End diff --
    
    Why a red background. Please check once


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4809/



---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4577/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4806/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by KanakaKumar <gi...@git.apache.org>.

Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r184317255
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
    +then lucene index files will be generated for all the text_columns (String Columns) given in
    +DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
    +data of text_columns. These index files will be written inside a folder named as datamap name inside
    +each segment directories.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
    +when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
    +pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
    +blocklet a temporary file will be generated which has information till row level, but prune will
    +return blocklets finally.
    +
    +When query reaches executor side, the temporary files written will be read and bitset groups are
    +formed to return the query result.
    --- End diff --
    
    please mention the cleanup procedure for temp files


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4447/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189421502
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +#### DataMap Management 
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap 
    +  to improve query performance on string columns which has content of more length.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +index_columns(String Columns) given in DMProperties which contains information about the data
    +location of index_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be 
    +returned, if user does not specify this value, all results will be returned without any limit] is 
    +fired, two jobs are fired.The first job writes the temporary files in folder created at table level 
    +which contains lucene's seach results and these files will be read in second job to give faster 
    +results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and 
    +filter condition like 'AND','OR' must be in upper case.
    +
    --- End diff --
    
    Add a limitation description here: In this version, we support one TEXT_MATCH UDF for one relation only and user should put AND/OR logic inside this UDF, instead of writing separate UDF. For example
    `select * from T where TEXT_MATCH('col1:a AND col2:b')` is supported
    `select * from T where TEXT_MATCH('col1:a') and TEXT_MATCH('col2:b')` is not supported


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4539/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by chenliang613 <gi...@git.apache.org>.

Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r186898007
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,213 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
    +
    +2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.{CarbonEnv, SparkSession}
    +   import org.apache.spark.sql.CarbonSession._
    +   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    +   import org.apache.carbondata.core.util.path.CarbonStorePath
    +   
    --- End diff --
    
    Please remove the example code, because pr2268 already provided the executable example.  
    Example code should be maintained under examples module, not inside document.


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189421589
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    --- End diff --
    
    Can you make a section to describe `REBUILD DATAMAP` and `WITH DEFERRED REBUILD` feature when creating datamap


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189421639
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +#### DataMap Management 
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap 
    +  to improve query performance on string columns which has content of more length.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +index_columns(String Columns) given in DMProperties which contains information about the data
    +location of index_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be 
    +returned, if user does not specify this value, all results will be returned without any limit] is 
    +fired, two jobs are fired.The first job writes the temporary files in folder created at table level 
    +which contains lucene's seach results and these files will be read in second job to give faster 
    +results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and 
    +filter condition like 'AND','OR' must be in upper case.
    +
    +Ex:  ```
    +     select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
    +     ```
    +
    +Below like queries can be converted to text_match queries as following:
    +```
    +select * from datamap_test where name='n10'
    +
    +select * from datamap_test where name like 'n1%'
    +
    +select * from datamap_test where name like '%10'
    +
    +select * from datamap_test where name like '%n%'
    +
    +select * from datamap_test where name like '%10' and name not like '%n%'
    +```
    +Lucene TEXT_MATCH Queries:
    +```
    +select * from datamap_test where TEXT_MATCH('name:n10')
    +
    +select * from datamap_test where TEXT_MATCH('name:n1*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10')
    +
    +select * from datamap_test where TEXT_MATCH('name:*n*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
    --- End diff --
    
    For all these queries, please describe what is the effect of it, since user maybe not familiar with lucene syntax. And provide a link for user to refer to lucene syntax


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4846/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5699/



---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4256/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6005/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by chenliang613 <gi...@git.apache.org>.

Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r185711684
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,213 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
    +
    +2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.{CarbonEnv, SparkSession}
    +   import org.apache.spark.sql.CarbonSession._
    +   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    +   import org.apache.carbondata.core.util.path.CarbonStorePath
    +   
    +   val warehouse = new File("./warehouse").getCanonicalPath
    +   val metastore = new File("./metastore").getCanonicalPath
    + 
    +   val spark = SparkSession
    +     .builder()
    +     .master("local")
    +     .appName("luceneDatamapExample")
    +     .config("spark.sql.warehouse.dir", warehouse)
    +     .getOrCreateCarbonSession(warehouse, metastore)
    +
    +   spark.sparkContext.setLogLevel("ERROR")
    +
    +   // drop table if exists previously
    +   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    +   
    +   // Create main table
    +   spark.sql(
    +     s"""
    +        |CREATE TABLE datamap_test (
    +        |name string,
    +        |age int,
    +        |city string,
    +        |country string)
    +        |STORED BY 'carbondata'
    +      """.stripMargin)
    + 
    +    // Create lucene datamap on the main table
    +   spark.sql(
    +     s"""
    +        |CREATE DATAMAP dm
    +        |ON TABLE datamap_test
    +        |USING "lucene"
    +        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
    +      """.stripMargin)
    +   
    +      import spark.implicits._
    +      import org.apache.spark.sql.SaveMode
    +      import scala.util.Random
    +   
    +      // Load data to the main table, if
    +      // lucene index writing fails, the datamap
    +      // will be disabled in query
    +    val r = new Random()
    +    spark.sparkContext.parallelize(1 to 10)
    +     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +     .toDF("name", "age", "city", "country")
    +     .write
    +     .format("carbondata")
    +     .option("tableName", "datamap_test")
    +     .option("compress", "true")
    +     .mode(SaveMode.Append)
    +     .save()
    +        
    +    spark.sql(
    +      s"""
    +         |SELECT *
    +         |from datamap_test where
    +         |TEXT_MATCH('name:c10')
    +       """.stripMargin).show
    +    
    +    spark.sql(
    +          s"""
    +             |SELECT *
    +             |from datamap_test where
    +             |TEXT_MATCH('name:c10', 10)
    +           """.stripMargin).show
    +  
    +    spark.stop
    +   ```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  index datamap and managed along with main tables by CarbonData.User can create lucene datamaps 
    +  to improve query performance on string columns.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +text_columns(String Columns) given in DMProperties which contains information about the data
    +location of text_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH('name:n10',10)[the second parameter represents the number of result to be returned, if 
    +user does not specify this value, all results will be returned without any limit] is fired, two jobs 
    +are fired.The first job writes the temporary files in folder created at table level which contains 
    +lucene's seach results and these files will be read in second job to give faster results. These 
    +temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +Below like queries can be converted to text_match queries as following:
    +```
    +select * from datamap_test where name='n10'
    +
    +select * from datamap_test where name like 'n1%'
    +
    +select * from datamap_test where name like '%10'
    +
    +select * from datamap_test where name like '%n%'
    +
    +select * from datamap_test where name like '%10' and name not like '%n%'
    +```
    +Lucene TEXT_MATCH Queries:
    +```
    +select * from datamap_test where TEXT_MATCH('name:n10')
    +
    +select * from datamap_test where TEXT_MATCH('name:n1*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10')
    +
    +select * from datamap_test where TEXT_MATCH('name:*n*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10 and -name:*n*')
    --- End diff --
    
    the syntax is wrong, don't need "and", should be TEXT_MATCH('name:*10 -name:*n*')


---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by xuchuanyin <gi...@git.apache.org>.

Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    The word wrap is strange, better to write a paragraph and let the editor do the rest.


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5608/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183618083
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
    +then lucene index files will be generated for all the text_columns (String Columns) given in
    +DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
    +data of text_columns. These index files will be written inside a folder named as datamap name inside
    +each segment directories.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
    +when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
    +pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
    +blocklet a temporary file will be generated which has information till row level, but prune will
    +return blocklets finally.
    +
    +When query reaches executor side, the temporary files written will be read and bitset groups are
    +formed to return the query result.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +
    +## Data Management with pre-aggregate tables
    +Once there is lucene datamap is created on the main table, following command on the main
    +table
    +is not supported:
    +1. Data management command: `UPDATE/DELETE`.
    +2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`, 
    +`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
    +change datatype command, CarbonData will check whether it will impact the lucene datamap, if
    + not, the operation is allowed, otherwise operation will be rejected by throwing exception.   
    +3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`
    +
    +However, there is still way to support these operations on main table, in current CarbonData 
    +release, user can do as following:
    +1. Remove the lucene datamap by `DROP DATAMAP` command
    --- End diff --
    
    End all sentences with a period (.)


---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5431/



---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5442/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by chenliang613 <gi...@git.apache.org>.

Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r186900225
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,213 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
    +
    +2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.{CarbonEnv, SparkSession}
    +   import org.apache.spark.sql.CarbonSession._
    +   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    +   import org.apache.carbondata.core.util.path.CarbonStorePath
    +   
    +   val warehouse = new File("./warehouse").getCanonicalPath
    +   val metastore = new File("./metastore").getCanonicalPath
    + 
    +   val spark = SparkSession
    +     .builder()
    +     .master("local")
    +     .appName("luceneDatamapExample")
    +     .config("spark.sql.warehouse.dir", warehouse)
    +     .getOrCreateCarbonSession(warehouse, metastore)
    +
    +   spark.sparkContext.setLogLevel("ERROR")
    +
    +   // drop table if exists previously
    +   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    +   
    +   // Create main table
    +   spark.sql(
    +     s"""
    +        |CREATE TABLE datamap_test (
    +        |name string,
    +        |age int,
    +        |city string,
    +        |country string)
    +        |STORED BY 'carbondata'
    +      """.stripMargin)
    + 
    +    // Create lucene datamap on the main table
    +   spark.sql(
    +     s"""
    +        |CREATE DATAMAP dm
    +        |ON TABLE datamap_test
    +        |USING "lucene"
    +        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
    +      """.stripMargin)
    +   
    +      import spark.implicits._
    +      import org.apache.spark.sql.SaveMode
    +      import scala.util.Random
    +   
    +      // Load data to the main table, if
    +      // lucene index writing fails, the datamap
    +      // will be disabled in query
    +    val r = new Random()
    +    spark.sparkContext.parallelize(1 to 10)
    +     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +     .toDF("name", "age", "city", "country")
    +     .write
    +     .format("carbondata")
    +     .option("tableName", "datamap_test")
    +     .option("compress", "true")
    +     .mode(SaveMode.Append)
    +     .save()
    +        
    +    spark.sql(
    +      s"""
    +         |SELECT *
    +         |from datamap_test where
    +         |TEXT_MATCH('name:c10')
    +       """.stripMargin).show
    +    
    +    spark.sql(
    +          s"""
    +             |SELECT *
    +             |from datamap_test where
    +             |TEXT_MATCH_WITH_LIMIT('name:c10', 10)
    +           """.stripMargin).show
    +  
    +    spark.stop
    +   ```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  index datamap and managed along with main tables by CarbonData.User can create lucene datamaps 
    +  to improve query performance on string columns.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +index_columns(String Columns) given in DMProperties which contains information about the data
    +location of index_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be 
    +returned, if user does not specify this value, all results will be returned without any limit] is 
    +fired, two jobs are fired.The first job writes the temporary files in folder created at table level 
    +which contains lucene's seach results and these files will be read in second job to give faster 
    +results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +Below like queries can be converted to text_match queries as following:
    +```
    +select * from datamap_test where name='n10'
    +
    +select * from datamap_test where name like 'n1%'
    +
    +select * from datamap_test where name like '%10'
    --- End diff --
    
    I tested,  the result is different for the below two query, please double check:
    
    select * from datamap_test where name like '%10'
    select * from datamap_test where TEXT_MATCH('name:*10')


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183616769
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    --- End diff --
    
    DataMap can be dropped using following DDL:


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2215


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183617201
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    --- End diff --
    
    For instance, main table called **sales** which is defined as:


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189228188
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,213 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
    +
    +2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.{CarbonEnv, SparkSession}
    +   import org.apache.spark.sql.CarbonSession._
    +   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    +   import org.apache.carbondata.core.util.path.CarbonStorePath
    +   
    +   val warehouse = new File("./warehouse").getCanonicalPath
    +   val metastore = new File("./metastore").getCanonicalPath
    + 
    +   val spark = SparkSession
    +     .builder()
    +     .master("local")
    +     .appName("luceneDatamapExample")
    +     .config("spark.sql.warehouse.dir", warehouse)
    +     .getOrCreateCarbonSession(warehouse, metastore)
    +
    +   spark.sparkContext.setLogLevel("ERROR")
    +
    +   // drop table if exists previously
    +   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    +   
    +   // Create main table
    +   spark.sql(
    +     s"""
    +        |CREATE TABLE datamap_test (
    +        |name string,
    +        |age int,
    +        |city string,
    +        |country string)
    +        |STORED BY 'carbondata'
    +      """.stripMargin)
    + 
    +    // Create lucene datamap on the main table
    +   spark.sql(
    +     s"""
    +        |CREATE DATAMAP dm
    +        |ON TABLE datamap_test
    +        |USING "lucene"
    +        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
    +      """.stripMargin)
    +   
    +      import spark.implicits._
    +      import org.apache.spark.sql.SaveMode
    +      import scala.util.Random
    +   
    +      // Load data to the main table, if
    +      // lucene index writing fails, the datamap
    +      // will be disabled in query
    +    val r = new Random()
    +    spark.sparkContext.parallelize(1 to 10)
    +     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +     .toDF("name", "age", "city", "country")
    +     .write
    +     .format("carbondata")
    +     .option("tableName", "datamap_test")
    +     .option("compress", "true")
    +     .mode(SaveMode.Append)
    +     .save()
    +        
    +    spark.sql(
    +      s"""
    +         |SELECT *
    +         |from datamap_test where
    +         |TEXT_MATCH('name:c10')
    +       """.stripMargin).show
    +    
    +    spark.sql(
    +          s"""
    +             |SELECT *
    +             |from datamap_test where
    +             |TEXT_MATCH_WITH_LIMIT('name:c10', 10)
    +           """.stripMargin).show
    +  
    +    spark.stop
    +   ```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  index datamap and managed along with main tables by CarbonData.User can create lucene datamaps 
    +  to improve query performance on string columns.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +index_columns(String Columns) given in DMProperties which contains information about the data
    +location of index_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be 
    +returned, if user does not specify this value, all results will be returned without any limit] is 
    +fired, two jobs are fired.The first job writes the temporary files in folder created at table level 
    +which contains lucene's seach results and these files will be read in second job to give faster 
    +results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +Below like queries can be converted to text_match queries as following:
    +```
    +select * from datamap_test where name='n10'
    +
    +select * from datamap_test where name like 'n1%'
    +
    +select * from datamap_test where name like '%10'
    --- End diff --
    
    i have tested this, we have a UT also for this, it is working fine, and we cannot compare all the like query results with text_match, as lucene search way is different compare to like query search


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5967/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183615908
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    --- End diff --
    
    These are procedure steps, so we can have numbered list


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by xuchuanyin <gi...@git.apache.org>.

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189422398
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +#### DataMap Management 
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    --- End diff --
    
    Is it `"` or `'` ?


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by KanakaKumar <gi...@git.apache.org>.

Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r184316832
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
    --- End diff --
    
    What new configurations added and how it can impact data load can be added. Example:- compression types


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by chenliang613 <gi...@git.apache.org>.

Github user chenliang613 commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    retest this please


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5599/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by KanakaKumar <gi...@git.apache.org>.

Github user KanakaKumar commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189232106
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap
    --- End diff --
    
    Please mark Lucene feature as Alpha for 1.4.0


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    @xuchuanyin and @jackylk please review


---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4175/



---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5343/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4776/



---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4564/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189507570
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +#### DataMap Management 
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap 
    +  to improve query performance on string columns which has content of more length.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +index_columns(String Columns) given in DMProperties which contains information about the data
    +location of index_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or 
    +TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be 
    +returned, if user does not specify this value, all results will be returned without any limit] is 
    +fired, two jobs are fired.The first job writes the temporary files in folder created at table level 
    +which contains lucene's seach results and these files will be read in second job to give faster 
    +results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and 
    +filter condition like 'AND','OR' must be in upper case.
    +
    +Ex:  ```
    +     select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
    +     ```
    +
    +Below like queries can be converted to text_match queries as following:
    +```
    +select * from datamap_test where name='n10'
    +
    +select * from datamap_test where name like 'n1%'
    +
    +select * from datamap_test where name like '%10'
    +
    +select * from datamap_test where name like '%n%'
    +
    +select * from datamap_test where name like '%10' and name not like '%n%'
    +```
    +Lucene TEXT_MATCH Queries:
    +```
    +select * from datamap_test where TEXT_MATCH('name:n10')
    +
    +select * from datamap_test where TEXT_MATCH('name:n1*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10')
    +
    +select * from datamap_test where TEXT_MATCH('name:*n*')
    +
    +select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
    --- End diff --
    
    added a link, which will provide details of all these queries


---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4490/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4438/



---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by KanakaKumar <gi...@git.apache.org>.

Github user KanakaKumar commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    LGTM


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4993/



---

[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4264/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189507649
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    --- End diff --
    
    yes, i think the same, and about refresh im also not sure about how it works, so this PR will be specific to lucene, 


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by CarbonDataQA <gi...@git.apache.org>.

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5597/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183617702
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +      
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data to the main table, if
    +  // lucene index writing fails, the datamap
    +  // will be disabled in query
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
    +  User can create as many lucene datamaps required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
    +  row level for the filter query by launching a spark datamap job. This pruned data will be read to
    +  give the proper and faster result
    +    
    +  For instance, main table called **sales** which is defined as 
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
    +then lucene index files will be generated for all the text_columns (String Columns) given in
    +DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
    +data of text_columns. These index files will be written inside a folder named as datamap name inside
    +each segment directories.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
    +when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
    +pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
    +blocklet a temporary file will be generated which has information till row level, but prune will
    +return blocklets finally.
    +
    +When query reaches executor side, the temporary files written will be read and bitset groups are
    +formed to return the query result.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    +command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    +filter is applied on query or not.
    +
    +
    +## Data Management with pre-aggregate tables
    +Once there is lucene datamap is created on the main table, following command on the main
    --- End diff --
    
    Once lucene datamap is created on the main table, following command on the main table is not supported:


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4701/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r184665011
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,204 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
    +
    +2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
    +   ```shell
    +   mvn clean package -DskipTests -Pspark-2.2
    +   ```
    +
    +3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
    +   ```scala
    +   import java.io.File
    +   import org.apache.spark.sql.{CarbonEnv, SparkSession}
    +   import org.apache.spark.sql.CarbonSession._
    +   import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    +   import org.apache.carbondata.core.util.path.CarbonStorePath
    +   
    +   val warehouse = new File("./warehouse").getCanonicalPath
    +   val metastore = new File("./metastore").getCanonicalPath
    + 
    +   val spark = SparkSession
    +     .builder()
    +     .master("local")
    +     .appName("luceneDatamapExample")
    +     .config("spark.sql.warehouse.dir", warehouse)
    +     .getOrCreateCarbonSession(warehouse, metastore)
    +
    +   spark.sparkContext.setLogLevel("ERROR")
    +
    +   // drop table if exists previously
    +   spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    +   
    +   // Create main table
    +   spark.sql(
    +     s"""
    +        |CREATE TABLE datamap_test (
    +        |name string,
    +        |age int,
    +        |city string,
    +        |country string)
    +        |STORED BY 'carbondata'
    +      """.stripMargin)
    + 
    +    // Create lucene datamap on the main table
    +   spark.sql(
    +     s"""
    +        |CREATE DATAMAP dm
    +        |ON TABLE datamap_test
    +        |USING "lucene"
    +        |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
    +      """.stripMargin)
    +   
    +      import spark.implicits._
    +      import org.apache.spark.sql.SaveMode
    +      import scala.util.Random
    +   
    +      // Load data to the main table, if
    +      // lucene index writing fails, the datamap
    +      // will be disabled in query
    +    val r = new Random()
    +    spark.sparkContext.parallelize(1 to 10)
    +     .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +     .toDF("name", "age", "city", "country")
    +     .write
    +     .format("carbondata")
    +     .option("tableName", "datamap_test")
    +     .option("compress", "true")
    +     .mode(SaveMode.Append)
    +     .save()
    +        
    +    spark.sql(
    +      s"""
    +         |SELECT *
    +         |from datamap_test where
    +         |TEXT_MATCH('name:c10')
    +       """.stripMargin).show
    +  
    +    spark.stop
    +   ```
    +
    +#### DataMap Management
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  index datamap and managed along with main tables by CarbonData.User can create lucene datamaps 
    +  to improve query performance on string columns.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When loading data to main table, lucene index files will be generated for all the
    +text_columns(String Columns) given in DMProperties which contains information about the data
    +location of text_columns. These index files will be written inside a folder named with datamap name
    +inside each segment folders.
    +
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of
    +lucene index files. The default value is speed, where the index writing speed will be more. If the
    +value is compression, the index file size will be compressed.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are to be made on main table. when a query with TEXT_MATCH() is fired, two jobs are fired.
    --- End diff --
    
    Now, there is one more UDF added (TEXT_MATCH_WITH_LIMIT), please add it also


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by jackylk <gi...@git.apache.org>.

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189421681
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +#### DataMap Management 
    +Lucene DataMap can be created using following DDL
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('index_columns'='city, name', ...)
    +  ```
    +
    +DataMap can be dropped using following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +It will show all DataMaps created on main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
    +  an index datamap and managed along with main tables by CarbonData.User can create lucene datamap 
    +  to improve query performance on string columns which has content of more length.
    +  
    +  For instance, main table called **datamap_test** which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  User can create Lucene datamap using the Create DataMap DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
    +  ```
    +
    --- End diff --
    
    There is more DMPROPERTY introduced in PR2275, please add


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4692/



---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by xuchuanyin <gi...@git.apache.org>.

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189435815
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    --- End diff --
    
    @jackylk I think it's better to add another document to describe the common operations for index datamap, since the descriptions for `Data Management`, `REBUILD DATAMAP`, `WITH DEFERRED REBUILD` are the same for `BloomFilterDataMap` and `LuceneDataMap`.


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by ravipesala <gi...@git.apache.org>.

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4989/



---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by akashrn5 <gi...@git.apache.org>.

Github user akashrn5 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r184359005
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    --- End diff --
    
    added numbering for steps and sub headings are kept as it is


---

[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...

Posted by chenliang613 <gi...@git.apache.org>.

Github user chenliang613 commented on the issue:

    https://github.com/apache/carbondata/pull/2215
  
    LGTM


---

[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...

Posted by xuchuanyin <gi...@git.apache.org>.

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r189435741
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,133 @@
    +# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
    +  
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    --- End diff --
    
    It's incorrect here:
    `data-management-with-pre-aggregate-tables`
    It should be
    `data-management-with-lucene-datamap`


---

[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

Posted by sgururajshetty <gi...@git.apache.org>.

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183616317
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    --- End diff --
    
    Close all the sentence with a period (.). This is applicable for all the sentences in this topics.


---