You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by akashrn5 <gi...@git.apache.org> on 2018/04/23 14:03:36 UTC
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
GitHub user akashrn5 opened a pull request:
https://github.com/apache/carbondata/pull/2215
[wip]add documentation for lucene datamap
added documentation for lucene datamap
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests are required?
- How it is tested? Please attach test report.
- Is it a performance related change? Please attach the performance test report.
- Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/akashrn5/incubator-carbondata doc_lucene
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/2215.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2215
----
commit 5403c832ca98569f60acf42a95c42ae21d8d3be5
Author: akashrn5 <ak...@...>
Date: 2018-04-23T13:57:56Z
add documentation for lucene datamap
----
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189421956
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
+ to improve query performance on string columns which has content of more length.
--- End diff --
Please rephrase to describe: this datamap is intended for text content, and you want to search the tokenized word or pattern of it.
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by KanakaKumar <gi...@git.apache.org>.
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184316398
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
+then lucene index files will be generated for all the text_columns (String Columns) given in
+DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
+data of text_columns. These index files will be written inside a folder named as datamap name inside
+each segment directories.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
--- End diff --
Please add the details to mention supported syntax is lucene query. And list few example queries which can cover tokenezier based search and like queries
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617996
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
+then lucene index files will be generated for all the text_columns (String Columns) given in
+DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
+data of text_columns. These index files will be written inside a folder named as datamap name inside
+each segment directories.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
+pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
+blocklet a temporary file will be generated which has information till row level, but prune will
+return blocklets finally.
+
+When query reaches executor side, the temporary files written will be read and bitset groups are
+formed to return the query result.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+
+## Data Management with pre-aggregate tables
+Once there is lucene datamap is created on the main table, following command on the main
+table
+is not supported:
+1. Data management command: `UPDATE/DELETE`.
+2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
--- End diff --
**Note:**
Use this format for Note and start in a new line
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4436/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on the issue:
https://github.com/apache/carbondata/pull/2215
@chenliang613 please review and merge
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189227705
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,213 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+ ```shell
+ mvn clean package -DskipTests -Pspark-2.2
+ ```
+
+3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
+ ```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
--- End diff --
ok
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617239
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
--- End diff --
User can create Lucene datamap using the Create DataMap DDL:
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5964/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4990/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617096
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
--- End diff --
end all sentence with a period (.)
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616698
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
--- End diff --
Lucene DataMap can be created using following DDL:
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616213
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
--- End diff --
The below is a procedure, so put it in a numbered list:
Step 1:
Step 2:
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616653
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
--- End diff --
Why a red background. Please check once
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4809/
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4577/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4806/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by KanakaKumar <gi...@git.apache.org>.
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184317255
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
+then lucene index files will be generated for all the text_columns (String Columns) given in
+DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
+data of text_columns. These index files will be written inside a folder named as datamap name inside
+each segment directories.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
+pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
+blocklet a temporary file will be generated which has information till row level, but prune will
+return blocklets finally.
+
+When query reaches executor side, the temporary files written will be read and bitset groups are
+formed to return the query result.
--- End diff --
please mention the cleanup procedure for temp files
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4447/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189421502
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
+ to improve query performance on string columns which has content of more length.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+index_columns(String Columns) given in DMProperties which contains information about the data
+location of index_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
+returned, if user does not specify this value, all results will be returned without any limit] is
+fired, two jobs are fired.The first job writes the temporary files in folder created at table level
+which contains lucene's seach results and these files will be read in second job to give faster
+results. These temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and
+filter condition like 'AND','OR' must be in upper case.
+
--- End diff --
Add a limitation description here: In this version, we support one TEXT_MATCH UDF for one relation only and user should put AND/OR logic inside this UDF, instead of writing separate UDF. For example
`select * from T where TEXT_MATCH('col1:a AND col2:b')` is supported
`select * from T where TEXT_MATCH('col1:a') and TEXT_MATCH('col2:b')` is not supported
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4539/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r186898007
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,213 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+ ```shell
+ mvn clean package -DskipTests -Pspark-2.2
+ ```
+
+3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
+ ```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
--- End diff --
Please remove the example code, because pr2268 already provided the executable example.
Example code should be maintained under examples module, not inside document.
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189421589
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --
Can you make a section to describe `REBUILD DATAMAP` and `WITH DEFERRED REBUILD` feature when creating datamap
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189421639
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
+ to improve query performance on string columns which has content of more length.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+index_columns(String Columns) given in DMProperties which contains information about the data
+location of index_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
+returned, if user does not specify this value, all results will be returned without any limit] is
+fired, two jobs are fired.The first job writes the temporary files in folder created at table level
+which contains lucene's seach results and these files will be read in second job to give faster
+results. These temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and
+filter condition like 'AND','OR' must be in upper case.
+
+Ex: ```
+ select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
+ ```
+
+Below like queries can be converted to text_match queries as following:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
+
+select * from datamap_test where name like '%n%'
+
+select * from datamap_test where name like '%10' and name not like '%n%'
+```
+Lucene TEXT_MATCH Queries:
+```
+select * from datamap_test where TEXT_MATCH('name:n10')
+
+select * from datamap_test where TEXT_MATCH('name:n1*')
+
+select * from datamap_test where TEXT_MATCH('name:*10')
+
+select * from datamap_test where TEXT_MATCH('name:*n*')
+
+select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
--- End diff --
For all these queries, please describe what is the effect of it, since user maybe not familiar with lucene syntax. And provide a link for user to refer to lucene syntax
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4846/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5699/
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4256/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6005/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r185711684
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,213 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+ ```shell
+ mvn clean package -DskipTests -Pspark-2.2
+ ```
+
+3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
+ ```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("luceneDatamapExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
+ """.stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10', 10)
+ """.stripMargin).show
+
+ spark.stop
+ ```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ index datamap and managed along with main tables by CarbonData.User can create lucene datamaps
+ to improve query performance on string columns.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+text_columns(String Columns) given in DMProperties which contains information about the data
+location of text_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH('name:n10',10)[the second parameter represents the number of result to be returned, if
+user does not specify this value, all results will be returned without any limit] is fired, two jobs
+are fired.The first job writes the temporary files in folder created at table level which contains
+lucene's seach results and these files will be read in second job to give faster results. These
+temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Below like queries can be converted to text_match queries as following:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
+
+select * from datamap_test where name like '%n%'
+
+select * from datamap_test where name like '%10' and name not like '%n%'
+```
+Lucene TEXT_MATCH Queries:
+```
+select * from datamap_test where TEXT_MATCH('name:n10')
+
+select * from datamap_test where TEXT_MATCH('name:n1*')
+
+select * from datamap_test where TEXT_MATCH('name:*10')
+
+select * from datamap_test where TEXT_MATCH('name:*n*')
+
+select * from datamap_test where TEXT_MATCH('name:*10 and -name:*n*')
--- End diff --
the syntax is wrong, don't need "and", should be TEXT_MATCH('name:*10 -name:*n*')
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2215
The word wrap is strange, better to write a paragraph and let the editor do the rest.
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5608/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183618083
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
+then lucene index files will be generated for all the text_columns (String Columns) given in
+DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
+data of text_columns. These index files will be written inside a folder named as datamap name inside
+each segment directories.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
+pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
+blocklet a temporary file will be generated which has information till row level, but prune will
+return blocklets finally.
+
+When query reaches executor side, the temporary files written will be read and bitset groups are
+formed to return the query result.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+
+## Data Management with pre-aggregate tables
+Once there is lucene datamap is created on the main table, following command on the main
+table
+is not supported:
+1. Data management command: `UPDATE/DELETE`.
+2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
+change datatype command, CarbonData will check whether it will impact the lucene datamap, if
+ not, the operation is allowed, otherwise operation will be rejected by throwing exception.
+3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`
+
+However, there is still way to support these operations on main table, in current CarbonData
+release, user can do as following:
+1. Remove the lucene datamap by `DROP DATAMAP` command
--- End diff --
End all sentences with a period (.)
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5431/
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5442/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r186900225
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,213 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+ ```shell
+ mvn clean package -DskipTests -Pspark-2.2
+ ```
+
+3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
+ ```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("luceneDatamapExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
+ """.stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH_WITH_LIMIT('name:c10', 10)
+ """.stripMargin).show
+
+ spark.stop
+ ```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ index datamap and managed along with main tables by CarbonData.User can create lucene datamaps
+ to improve query performance on string columns.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+index_columns(String Columns) given in DMProperties which contains information about the data
+location of index_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
+returned, if user does not specify this value, all results will be returned without any limit] is
+fired, two jobs are fired.The first job writes the temporary files in folder created at table level
+which contains lucene's seach results and these files will be read in second job to give faster
+results. These temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Below like queries can be converted to text_match queries as following:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
--- End diff --
I tested, the result is different for the below two query, please double check:
select * from datamap_test where name like '%10'
select * from datamap_test where TEXT_MATCH('name:*10')
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616769
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
--- End diff --
DataMap can be dropped using following DDL:
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/carbondata/pull/2215
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617201
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
--- End diff --
For instance, main table called **sales** which is defined as:
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189228188
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,213 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+ ```shell
+ mvn clean package -DskipTests -Pspark-2.2
+ ```
+
+3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
+ ```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("luceneDatamapExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
+ """.stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH_WITH_LIMIT('name:c10', 10)
+ """.stripMargin).show
+
+ spark.stop
+ ```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ index datamap and managed along with main tables by CarbonData.User can create lucene datamaps
+ to improve query performance on string columns.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+index_columns(String Columns) given in DMProperties which contains information about the data
+location of index_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
+returned, if user does not specify this value, all results will be returned without any limit] is
+fired, two jobs are fired.The first job writes the temporary files in folder created at table level
+which contains lucene's seach results and these files will be read in second job to give faster
+results. These temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Below like queries can be converted to text_match queries as following:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
--- End diff --
i have tested this, we have a UT also for this, it is working fine, and we cannot compare all the like query results with text_match, as lucene search way is different compare to like query search
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5967/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183615908
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
--- End diff --
These are procedure steps, so we can have numbered list
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189422398
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
--- End diff --
Is it `"` or `'` ?
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by KanakaKumar <gi...@git.apache.org>.
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184316832
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
--- End diff --
What new configurations added and how it can impact data load can be added. Example:- compression types
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/2215
retest this please
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5599/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by KanakaKumar <gi...@git.apache.org>.
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189232106
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap
--- End diff --
Please mark Lucene feature as Alpha for 1.4.0
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on the issue:
https://github.com/apache/carbondata/pull/2215
@xuchuanyin and @jackylk please review
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4175/
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5343/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4776/
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4564/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189507570
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
+ to improve query performance on string columns which has content of more length.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+index_columns(String Columns) given in DMProperties which contains information about the data
+location of index_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
+TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
+returned, if user does not specify this value, all results will be returned without any limit] is
+fired, two jobs are fired.The first job writes the temporary files in folder created at table level
+which contains lucene's seach results and these files will be read in second job to give faster
+results. These temporary files will be cleared once the query finishes.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+Note: The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lower case and
+filter condition like 'AND','OR' must be in upper case.
+
+Ex: ```
+ select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
+ ```
+
+Below like queries can be converted to text_match queries as following:
+```
+select * from datamap_test where name='n10'
+
+select * from datamap_test where name like 'n1%'
+
+select * from datamap_test where name like '%10'
+
+select * from datamap_test where name like '%n%'
+
+select * from datamap_test where name like '%10' and name not like '%n%'
+```
+Lucene TEXT_MATCH Queries:
+```
+select * from datamap_test where TEXT_MATCH('name:n10')
+
+select * from datamap_test where TEXT_MATCH('name:n1*')
+
+select * from datamap_test where TEXT_MATCH('name:*10')
+
+select * from datamap_test where TEXT_MATCH('name:*n*')
+
+select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
--- End diff --
added a link, which will provide details of all these queries
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4490/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4438/
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by KanakaKumar <gi...@git.apache.org>.
Github user KanakaKumar commented on the issue:
https://github.com/apache/carbondata/pull/2215
LGTM
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4993/
---
[GitHub] carbondata issue #2215: [wip]add documentation for lucene datamap
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4264/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189507649
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --
yes, i think the same, and about refresh im also not sure about how it works, so this PR will be specific to lucene,
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2215
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5597/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617702
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene datamap are created as index DataMaps and managed along with main tables by CarbonData.
+ User can create as many lucene datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene will be read for pruning till
+ row level for the filter query by launching a spark datamap job. This pruned data will be read to
+ give the proper and faster result
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, it checks whether any lucene datamaps are present or not, if it is,
+then lucene index files will be generated for all the text_columns (String Columns) given in
+DMProperties which contains information about the blocklet_id, page_id and row_id and for all the
+data of text_columns. These index files will be written inside a folder named as datamap name inside
+each segment directories.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. An UDF called TEXT_MATCH is registered in spark session, so
+when a query with TEXT_MATCH() is fired, While doing query planning, TEXT_MATCH will be treated as
+pushed filters. It checks for all the lucene datamaps, and a job is fired for pruning and for each
+blocklet a temporary file will be generated which has information till row level, but prune will
+return blocklets finally.
+
+When query reaches executor side, the temporary files written will be read and bitset groups are
+formed to return the query result.
+
+User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
+filter is applied on query or not.
+
+
+## Data Management with pre-aggregate tables
+Once there is lucene datamap is created on the main table, following command on the main
--- End diff --
Once lucene datamap is created on the main table, following command on the main table is not supported:
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4701/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184665011
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,204 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+1. Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME.
+
+2. Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars.
+ ```shell
+ mvn clean package -DskipTests -Pspark-2.2
+ ```
+
+3. Start spark-shell in new terminal, type :paste, then copy and run the following code.
+ ```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("luceneDatamapExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name,country')
+ """.stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, if
+ // lucene index writing fails, the datamap
+ // will be disabled in query
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |from datamap_test where
+ |TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+ ```
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ index datamap and managed along with main tables by CarbonData.User can create lucene datamaps
+ to improve query performance on string columns.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When loading data to main table, lucene index files will be generated for all the
+text_columns(String Columns) given in DMProperties which contains information about the data
+location of text_columns. These index files will be written inside a folder named with datamap name
+inside each segment folders.
+
+A system level configuration carbon.lucene.compression.mode can be added for best compression of
+lucene index files. The default value is speed, where the index writing speed will be more. If the
+value is compression, the index file size will be compressed.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on main table. when a query with TEXT_MATCH() is fired, two jobs are fired.
--- End diff --
Now, there is one more UDF added (TEXT_MATCH_WITH_LIMIT), please add it also
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189421681
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+#### DataMap Management
+Lucene DataMap can be created using following DDL
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('index_columns'='city, name', ...)
+ ```
+
+DataMap can be dropped using following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on main table.
+
+
+## Lucene DataMap Introduction
+ Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as
+ an index datamap and managed along with main tables by CarbonData.User can create lucene datamap
+ to improve query performance on string columns which has content of more length.
+
+ For instance, main table called **datamap_test** which is defined as:
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ User can create Lucene datamap using the Create DataMap DDL:
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('INDEX_COLUMNS' = 'name, country')
+ ```
+
--- End diff --
There is more DMPROPERTY introduced in PR2275, please add
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4692/
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189435815
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --
@jackylk I think it's better to add another document to describe the common operations for index datamap, since the descriptions for `Data Management`, `REBUILD DATAMAP`, `WITH DEFERRED REBUILD` are the same for `BloomFilterDataMap` and `LuceneDataMap`.
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2215
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4989/
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by akashrn5 <gi...@git.apache.org>.
Github user akashrn5 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r184359005
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
--- End diff --
added numbering for steps and sub headings are kept as it is
---
[GitHub] carbondata issue #2215: [CARBONDATA-2206]add documentation for lucene datama...
Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/2215
LGTM
---
[GitHub] carbondata pull request #2215: [CARBONDATA-2206]add documentation for lucene...
Posted by xuchuanyin <gi...@git.apache.org>.
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r189435741
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,133 @@
+# CarbonData Lucene DataMap (Alpha feature in 1.4.0)
+
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
--- End diff --
It's incorrect here:
`data-management-with-pre-aggregate-tables`
It should be
`data-management-with-lucene-datamap`
---
[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap
Posted by sgururajshetty <gi...@git.apache.org>.
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183616317
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene Datamap](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
--- End diff --
Close all the sentence with a period (.). This is applicable for all the sentences in this topics.
---