Posted to commits@ignite.apache.org by ab...@apache.org on 2020/09/07 08:57:34 UTC

[ignite] branch IGNITE-7595 updated: copy-paste ignte-for-spark pages from readme.io. The pages are hopelessly outdated. most code samples probably don't work. the text should be updated

This is an automated email from the ASF dual-hosted git repository.

abudnikov pushed a commit to branch IGNITE-7595
in repository https://gitbox.apache.org/repos/asf/ignite.git


The following commit(s) were added to refs/heads/IGNITE-7595 by this push:
     new 5cc040e  copy-paste ignte-for-spark pages from readme.io. The pages are hopelessly outdated. most code samples probably don't work. the text should be updated
5cc040e is described below

commit 5cc040e9d55ceeaa4a729db2274bd9459bacc4fd
Author: abudnikov <ab...@gridgain.com>
AuthorDate: Mon Sep 7 11:56:22 2020 +0300

    copy-paste ignte-for-spark pages from readme.io. The pages are hopelessly outdated. most code samples probably don't work. the text should be updated
---
 docs/_data/toc.yaml                                |  37 +--
 docs/_docs/ignite-for-spark/ignite-dataframe.adoc  | 366 +++++++++++++++++++++
 .../ignite-for-spark/ignitecontext-and-rdd.adoc    |  92 ++++++
 docs/_docs/ignite-for-spark/installation.adoc      | 157 +++++++++
 docs/_docs/ignite-for-spark/overview.adoc          |  35 ++
 docs/_docs/ignite-for-spark/spark-shell.adoc       | 188 +++++++++++
 docs/_docs/ignite-for-spark/troubleshooting.adoc   |   9 +
 docs/_docs/images/spark_integration.png            | Bin 0 -> 115826 bytes
 8 files changed, 863 insertions(+), 21 deletions(-)

diff --git a/docs/_data/toc.yaml b/docs/_data/toc.yaml
index afc5036..1e81314 100644
--- a/docs/_data/toc.yaml
+++ b/docs/_data/toc.yaml
@@ -273,30 +273,25 @@
   url: /plugins
 - title: SQLLine 
   url: /sqlline
-
-#    - title: Capacity Planning
-#      url: /capacity-planning
-#    - title: Performance and Troubleshooting Guide
-#      url: /perf-troubleshooting-guide/general-perf-tips
-#      items:
-#        - title: General Performance Tips
-#          url: /perf-troubleshooting-guide/general-perf-tips
-#        - title: Memory and JVM Tuning
-#          url: /perf-troubleshooting-guide/memory-tuning
-#        - title: Persistence Tuning
-#          url: /perf-troubleshooting-guide/persistence-tuning
-#        - title: SQL Tuning
-#          url: /perf-troubleshooting-guide/sql-tuning
-#        - title: Thread Pools Tuning
-#          url: /perf-troubleshooting-guide/thread-pools-tuning
-#        - title: Troubleshooting and Debugging
-#          url: /perf-troubleshooting-guide/troubleshooting
-#
+- title: Ignite for Spark
+  items: 
+    - title: Overview
+      url: /ignite-for-spark/overview
+    - title: IgniteContext and IgniteRDD 
+      url:  /ignite-for-spark/ignitecontext-and-rdd
+    - title: Ignite DataFrame  
+      url: /ignite-for-spark/ignite-dataframe 
+    - title: Installation 
+      url: /ignite-for-spark/installation 
+    - title: Test Ignite with Spark-shell  
+      url: /ignite-for-spark/spark-shell
+    - title: Troubleshooting 
+      url: /ignite-for-spark/troubleshooting
 - title: SQL Reference
   url: /sql-reference/sql-reference-overview
   items:
-#    - title: SQL Conformance
-#      url: /sql-reference/sql-conformance
+    - title: SQL Conformance
+      url: /sql-reference/sql-conformance
     - title: Data Definition Language (DDL)
       url: /sql-reference/ddl
     - title: Data Manipulation Language (DML)
diff --git a/docs/_docs/ignite-for-spark/ignite-dataframe.adoc b/docs/_docs/ignite-for-spark/ignite-dataframe.adoc
new file mode 100644
index 0000000..a2e7c5a
--- /dev/null
+++ b/docs/_docs/ignite-for-spark/ignite-dataframe.adoc
@@ -0,0 +1,366 @@
+= Ignite DataFrame
+
+== Overview
+
+The Apache Spark DataFrame API introduced the concept of a schema to describe the data, allowing Spark to manage the schema and organize the data into a tabular format. To put it simply, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database and allows Spark to leverage the Catalyst query optimizer to produce much more efficient query execution plans in comparison to RDDs, which are just collections of elements partitioned across the nodes of the cluster.
+
+Ignite extends the DataFrame API, simplifying development and improving data access times whenever Ignite is used as memory-centric storage for Spark. Benefits include:
+
+* Ability to share data and state across Spark jobs by writing and reading DataFrames to/from Ignite.
+* Faster SparkSQL queries, achieved by optimizing Spark query execution plans with the Ignite SQL engine, which includes advanced indexing and avoids data movement across the network from Ignite to Spark.
+
+== Integration
+
+`IgniteRelationProvider` is an implementation of the Spark `RelationProvider` and `CreatableRelationProvider` interfaces. The `IgniteRelationProvider` can talk directly to Ignite tables through the Spark SQL interface. The data is loaded and exchanged via `IgniteSQLRelation`, which executes filtering operations on the Ignite side. For now, grouping, joining, and ordering operations are performed on the Spark side; optimizing these operations on the Ignite side is planned for a future release.
+
+== Spark Session
+
+To use the Apache Spark DataFrame API, it is necessary to create an entry point for programming with Spark. This is achieved through the use of a `SparkSession` object, as shown in the following example:
+
+[tabs]
+--
+tab:Java[]
+[source, java]
+----
+// Creating spark session.
+SparkSession spark = SparkSession.builder()
+  .appName("Example Program")
+  .master("local")
+  .config("spark.executor.instances", "2")
+  .getOrCreate();
+----
+
+tab:Scala[]
+[source, scala]
+----
+// Creating spark session.
+implicit val spark = SparkSession.builder()
+  .appName("Example Program")
+  .master("local")
+  .config("spark.executor.instances", "2")
+  .getOrCreate()
+----
+--
+
+== Reading DataFrames
+
+In order to read data from Ignite, you need to specify its format and the path to the Ignite configuration file. For example, assume an Ignite table named `person` has been created and deployed in Ignite, as follows:
+
+
+[source, sql]
+----
+CREATE TABLE person (
+    id LONG,
+    name VARCHAR,
+    city_id LONG,
+    PRIMARY KEY (id, city_id)
+) WITH "backups=1, affinityKey=city_id”;
+----
+
+The following Spark code can find all the rows from the `person` table where the name is 'Mary Major':
+
+[tabs]
+--
+
+tab:Java[]
+
+[source, java]
+----
+SparkSession spark = ...
+String cfgPath = "path/to/config/file";
+
+Dataset<Row> df = spark.read()
+  .format(IgniteDataFrameSettings.FORMAT_IGNITE())               // Data source type.
+  .option(IgniteDataFrameSettings.OPTION_TABLE(), "person")      // Table to read.
+  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath) // Ignite config.
+  .load();
+
+df.createOrReplaceTempView("person");
+
+Dataset<Row> igniteDF = spark.sql(
+  "SELECT * FROM person WHERE name = 'Mary Major'");
+----
+
+
+tab:Scala[]
+
+[source, scala]
+----
+val spark: SparkSession = …
+val cfgPath: String = "path/to/config/file"
+
+val df = spark.read
+  .format(FORMAT_IGNITE)               // Data source type.
+  .option(OPTION_TABLE, "person")      // Table to read.
+  .option(OPTION_CONFIG_FILE, cfgPath) // Ignite config.
+  .load()
+
+df.createOrReplaceTempView("person")
+
+val igniteDF = spark.sql("SELECT * FROM person WHERE name = 'Mary Major'")
+----
+--
+
+
+
+== Saving DataFrames
+
+[NOTE]
+====
+[discrete]
+=== Implementation notes
+Internally, all inserts are done through `IgniteDataStreamer`. Several optional parameters exist to configure the internal streamer. See the <<Ignite DataFrame Options>> section for the list of available options.
+====
+
+
+Ignite can serve as a storage for DataFrames created or updated in Spark. The following save modes determine how a DataFrame is processed in Ignite:
+
+* `Append` - the DataFrame will be appended to an existing table. Set `OPTION_STREAMER_ALLOW_OVERWRITE=true` if you want to update existing entries with the data of the DataFrame.
+* `Overwrite` - the following steps will be executed:
+** If the table already exists in Ignite, it will be dropped.
+** A new table will be created using the schema of the DataFrame and the provided options.
+** The DataFrame content will be inserted into the new table.
+* `ErrorIfExists` (default) - an exception is thrown if the table already exists in Ignite. If the table does not exist:
+** A new table will be created using the schema of the DataFrame and the provided options.
+** The DataFrame content will be inserted into the new table.
+* `Ignore` - the operation is ignored if the table already exists in Ignite. If the table does not exist:
+** A new table will be created using the schema of the DataFrame and the provided options.
+** The DataFrame content will be inserted into the new table.
+
+Save mode can be specified using the `mode(SaveMode mode)` method. For more information, see the link:https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter@mode&lpar;saveMode:org.apache.spark.sql.SaveMode&rpar;:org.apache.spark.sql.DataFrameWriter%5BT%5D[Spark Documentation^]. Here is a code example that shows this method:
+
+
+[tabs]
+--
+tab:Java[]
+
+[source, java]
+----
+SparkSession spark = ...
+
+String cfgPath = "path/to/config/file";
+
+Dataset<Row> jsonDataFrame = spark.read().json("path/to/file.json");
+
+jsonDataFrame.write()
+  .format(IgniteDataFrameSettings.FORMAT_IGNITE())
+  .mode(SaveMode.Append) // SaveMode.
+  // ... other options
+  .save();
+----
+
+tab:Scala[]
+
+[source, scala]
+----
+val spark: SparkSession = …
+
+val cfgPath: String = "path/to/config/file"
+
+val jsonDataFrame = spark.read.json("path/to/file.json")
+
+jsonDataFrame.write
+  .format(FORMAT_IGNITE)
+  .mode(SaveMode.Append) // SaveMode.
+  // ... other options
+  .save()
+----
+--
+
+You must define the following Ignite-specific options if a new table is to be created by the DataFrame save routine:
+
+* `OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS` - a primary key is required for every Ignite table. This option has to contain a comma-separated list of fields/columns that represent a primary key.
+* `OPTION_CREATE_TABLE_PARAMETERS` - additional parameters to use upon Ignite table creation. The parameters are those that are supported by the link:sql-reference/ddl#create-table[CREATE TABLE] command.
+
+The following example shows how to write the content of a JSON file into Ignite:
+
+[tabs]
+--
+tab:Java[]
+
+[source, java]
+----
+SparkSession spark = ...
+
+String cfgPath = "path/to/config/file";
+
+Dataset<Row> jsonDataFrame = spark.read().json("path/to/file.json");
+
+jsonDataFrame.write()
+  .format(IgniteDataFrameSettings.FORMAT_IGNITE())
+  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), cfgPath)
+  .option(IgniteDataFrameSettings.OPTION_TABLE(), "json_table")
+  .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
+  .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "template=replicated")
+  .save();
+----
+
+tab:Scala[]
+
+[source, scala]
+----
+val spark: SparkSession = …
+
+val cfgPath: String = "path/to/config/file"
+
+val jsonDataFrame = spark.read.json("path/to/file.json")
+
+jsonDataFrame.write
+  .format(FORMAT_IGNITE)
+  .option(OPTION_CONFIG_FILE, cfgPath)
+  .option(OPTION_TABLE, "json_table")
+  .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
+  .option(OPTION_CREATE_TABLE_PARAMETERS, "template=replicated")
+  .save()
+----
+
+--
+
+== IgniteSparkSession and IgniteExternalCatalog
+
+Spark introduces an entity called a catalog to read and store meta-information about known data sources, such as tables and views. Ignite provides its own implementation of this catalog, called `IgniteExternalCatalog`.
+
+`IgniteExternalCatalog` can read information about all existing SQL tables deployed in the Ignite cluster. `IgniteExternalCatalog` is also required to build an `IgniteSparkSession` object.
+
+`IgniteSparkSession` is an extension of the regular `SparkSession` that stores `IgniteContext` and injects the `IgniteExternalCatalog` instance into Spark objects.
+
+`IgniteSparkSession.builder()` must be used to create `IgniteSparkSession`. For example, if the following two tables are created in Ignite:
+
+
+
+[source, sql]
+----
+CREATE TABLE city (
+    id LONG PRIMARY KEY,
+    name VARCHAR
+) WITH "template=replicated";
+
+CREATE TABLE person (
+    id LONG,
+    name VARCHAR,
+    city_id LONG,
+    PRIMARY KEY (id, city_id)
+) WITH "backups=1, affinityKey=city_id";
+----
+
+
+Then executing the following code provides table meta-information:
+
+
+[tabs]
+--
+tab:Java[]
+
+[source, java]
+----
+// Using SparkBuilder provided by Ignite.
+IgniteSparkSession igniteSession = IgniteSparkSession.builder()
+  .appName("Spark Ignite catalog example")
+  .master("local")
+  .config("spark.executor.instances", "2")
+  //Only additional option to refer to Ignite cluster.
+  .igniteConfig("/path/to/ignite/config.xml")
+  .getOrCreate();
+
+// This will print out info about all SQL tables existing in Ignite.
+igniteSession.catalog().listTables().show();
+
+// This will print out schema of PERSON table.
+igniteSession.catalog().listColumns("person").show();
+
+// This will print out schema of CITY table.
+igniteSession.catalog().listColumns("city").show();
+----
+
+
+tab:Scala[]
+
+[source, scala]
+----
+// Using SparkBuilder provided by Ignite.
+val igniteSession = IgniteSparkSession.builder()
+  .appName("Spark Ignite catalog example")
+  .master("local")
+  .config("spark.executor.instances", "2")
+  //Only additional option to refer to Ignite cluster.
+  .igniteConfig("/path/to/ignite/config.xml")
+  .getOrCreate()
+
+// This will print out info about all SQL tables existing in Ignite.
+igniteSession.catalog.listTables().show()
+
+// This will print out schema of PERSON table.
+igniteSession.catalog.listColumns("person").show()
+
+// This will print out schema of CITY table.
+igniteSession.catalog.listColumns("city").show()
+----
+--
+
+The output should be similar to the following:
+
+
+
+[source, text]
+----
++------+--------+-----------+---------+-----------+
+|  name|database|description|tableType|isTemporary|
++------+--------+-----------+---------+-----------+
+|  CITY|        |       null| EXTERNAL|      false|
+|PERSON|        |       null| EXTERNAL|      false|
++------+--------+-----------+---------+-----------+
+
+PERSON table description:
+
++-------+-----------+--------+--------+-----------+--------+
+|   name|description|dataType|nullable|isPartition|isBucket|
++-------+-----------+--------+--------+-----------+--------+
+|   NAME|       null|  string|    true|      false|   false|
+|     ID|       null|  bigint|   false|       true|   false|
+|CITY_ID|       null|  bigint|   false|       true|   false|
++-------+-----------+--------+--------+-----------+--------+
+
+CITY table description:
+
++----+-----------+--------+--------+-----------+--------+
+|name|description|dataType|nullable|isPartition|isBucket|
++----+-----------+--------+--------+-----------+--------+
+|NAME|       null|  string|    true|      false|   false|
+|  ID|       null|  bigint|   false|       true|   false|
++----+-----------+--------+--------+-----------+--------+
+----
+
+
+
+
+
+
+
+== Ignite DataFrame Options
+
+
+[cols="1,2",opts="header"]
+|===
+| Name  | Description
+| `FORMAT_IGNITE`|   Name of the Ignite Data Source
+|`OPTION_CONFIG_FILE` | Path to the config file
+|`OPTION_TABLE`   | Table name
+|`OPTION_CREATE_TABLE_PARAMETERS` | Additional parameters for a newly created table. The value of this option is used for the `WITH` part of a `CREATE TABLE` query.
+|`OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS`|  Comma separated list of primary key fields.
+|`OPTION_STREAMER_ALLOW_OVERWRITE` |If `true`, then an existing row will be overwritten with DataFrame content. If `false`, then the row will be skipped if the primary key already exists in the table.
+|`OPTION_STREAMER_FLUSH_FREQUENCY`| Automatic flush frequency. This is the time after which the streamer will make an attempt to submit all data added so far to remote nodes. See link:data-streaming[Data Streaming].
+|`OPTION_STREAMER_PER_NODE_BUFFER_SIZE`| Per-node buffer size, i.e. the size of the per-node key-value pairs buffer.
+|`OPTION_STREAMER_PER_NODE_PARALLEL_OPERATIONS`| The maximum number of parallel stream operations for a single node.
+|`OPTION_SCHEMA`|   The Ignite SQL schema name in which the specified table exists. When `OPTION_SCHEMA` is not specified, all schemas will be scanned to find a table with a matching name. This option can be used to differentiate two tables of the same name in different Ignite SQL schemas.
+
+When creating new tables, `OPTION_SCHEMA` must be specified as `PUBLIC`, otherwise an exception will be thrown because currently Ignite SQL can issue `CREATE TABLE` statements within the `PUBLIC` schema only.
+
+|===
+
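+A minimal Scala sketch of combining several of these options on a write (the file paths and table name are placeholders; the constants are assumed to be imported from `IgniteDataFrameSettings`, as in the Scala examples above):
+
+[source, scala]
+----
+import org.apache.spark.sql.{SaveMode, SparkSession}
+import org.apache.ignite.spark.IgniteDataFrameSettings._
+
+// Placeholder session and input data; adjust to your environment.
+val spark = SparkSession.builder()
+  .appName("Streamer options sketch")
+  .master("local")
+  .getOrCreate()
+
+val df = spark.read.json("path/to/file.json")
+
+df.write
+  .format(FORMAT_IGNITE)
+  .option(OPTION_CONFIG_FILE, "path/to/config/file")
+  .option(OPTION_TABLE, "person")
+  .option(OPTION_SCHEMA, "PUBLIC")                      // Schema in which to look up the table.
+  .option(OPTION_STREAMER_ALLOW_OVERWRITE, "true")      // Update rows whose primary key already exists.
+  .option(OPTION_STREAMER_PER_NODE_BUFFER_SIZE, "1024") // Per-node key-value buffer size.
+  .option(OPTION_STREAMER_FLUSH_FREQUENCY, "10000")     // Flush buffered data every 10 seconds.
+  .mode(SaveMode.Append)
+  .save()
+----
+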
+== Examples
+
+There are several examples available on GitHub that demonstrate how to use Spark DataFrames with Ignite:
+
+* link:{githubUrl}/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameExample.scala[DataFrame]
+* link:{githubUrl}/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteDataFrameWriteExample.scala[Saving DataFrame]
+* link:{githubUrl}/examples/src/main/spark/org/apache/ignite/examples/spark/IgniteCatalogExample.scala[Catalog]
diff --git a/docs/_docs/ignite-for-spark/ignitecontext-and-rdd.adoc b/docs/_docs/ignite-for-spark/ignitecontext-and-rdd.adoc
new file mode 100644
index 0000000..c97950c
--- /dev/null
+++ b/docs/_docs/ignite-for-spark/ignitecontext-and-rdd.adoc
@@ -0,0 +1,92 @@
+= IgniteContext and IgniteRDD
+
+== IgniteContext
+
+`IgniteContext` is the main entry point to the Spark-Ignite integration. To create an instance of the Ignite context, the user must provide an instance of `SparkContext` and a closure creating an `IgniteConfiguration` (a configuration factory). The Ignite context makes sure that server or client Ignite nodes exist in all involved job instances. Alternatively, a path to an XML configuration file can be passed to the `IgniteContext` constructor; it will be used to configure the nodes being started.
+
+When creating an `IgniteContext` instance, an optional boolean `client` argument (defaulting to `true`) can be passed to the context constructor. This is typically used in a Shared Deployment installation. When `client` is set to `false`, the context will operate in embedded mode and will start server nodes on all workers during the context construction. This is required in an Embedded Deployment installation. See link:ignite-for-spark/installation[Installation] for information on deployment configurations.
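+
+The boolean argument can also be passed explicitly as the third constructor parameter. A minimal sketch, assuming an existing `sparkContext`:
+
+[source, scala]
+----
+import org.apache.ignite.configuration.IgniteConfiguration
+import org.apache.ignite.spark.IgniteContext
+
+// Standalone (client) mode, the default: only client nodes are started,
+// and they connect to an already running Ignite cluster.
+val standaloneContext = new IgniteContext(sparkContext, () => new IgniteConfiguration(), true)
+
+// Embedded mode (deprecated, see the caution below): server nodes are
+// started inside the Spark workers during context construction.
+val embeddedContext = new IgniteContext(sparkContext, () => new IgniteConfiguration(), false)
+----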
+
+[CAUTION]
+====
+[discrete]
+=== Embedded Mode Deprecation
+Embedded mode implies starting Ignite server nodes within Spark executors which can cause unexpected rebalancing or even data loss. Therefore this mode is currently deprecated and will be eventually discontinued. Consider starting a separate Ignite cluster and using standalone mode to avoid data consistency and performance issues.
+====
+
+Once `IgniteContext` is created, instances of `IgniteRDD` can be obtained using the `fromCache` methods. The requested cache does not have to exist in the Ignite cluster when the RDD is created: if a cache with the given name does not exist, it will be created using the provided configuration or a template configuration.
+
+For example, the following code will create an Ignite context with the default Ignite configuration:
+
+
+[source, scala]
+----
+val igniteContext = new IgniteContext(sparkContext,
+    () => new IgniteConfiguration())
+----
+
+The following code will create an Ignite context configured from a file `example-shared-rdd.xml`:
+
+
+[source, scala]
+----
+val igniteContext = new IgniteContext(sparkContext,
+    "examples/config/spark/example-shared-rdd.xml")
+----
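+
+One of the `fromCache` overloads accepts a `CacheConfiguration`, which is used to create the cache if it does not exist yet. A short sketch, assuming the `igniteContext` created above (the cache name and indexed types are illustrative):
+
+[source, scala]
+----
+import org.apache.ignite.configuration.CacheConfiguration
+
+// If the "partitioned" cache does not exist yet, it is created from this configuration.
+val cacheCfg = new CacheConfiguration[Integer, Integer]("partitioned")
+  .setIndexedTypes(classOf[Integer], classOf[Integer]) // Enable SQL queries over the stored pairs.
+
+val sharedRdd = igniteContext.fromCache[Integer, Integer](cacheCfg)
+----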
+
+
+== IgniteRDD
+
+`IgniteRDD` is an implementation of the Spark RDD abstraction that represents a live view of an Ignite cache. `IgniteRDD` is not immutable: all changes in the Ignite cache (regardless of whether they were caused by another RDD or by external changes to the cache) are visible to RDD users immediately.
+
+`IgniteRDD` utilizes the partitioned nature of Ignite caches and provides partitioning information to the Spark executors. The number of partitions in an `IgniteRDD` equals the number of partitions in the underlying Ignite cache. `IgniteRDD` also provides affinity information to Spark via the `getPreferredLocations` method so that RDD computations use data locality.
+
+== Reading values from Ignite
+Since `IgniteRDD` is a live view of an Ignite cache, there is no need to explicitly load data from Ignite into the Spark application. All RDD methods are available to use right away after an instance of `IgniteRDD` is created.
+
+For example, assuming an Ignite cache named "partitioned" contains string values, the following code will find all values that contain the word "Ignite":
+
+
+[source, scala]
+----
+val cache = igniteContext.fromCache("partitioned")
+val result = cache.filter(_._2.contains("Ignite")).collect()
+----
+
+
+== Saving values to Ignite
+
+Since Ignite caches operate on key-value pairs, the most straightforward way to save values to an Ignite cache is to use a Spark tuple RDD and the `savePairs` method. This method takes advantage of the RDD partitioning and stores values to the cache in parallel where possible.
+
+It is also possible to save a value-only RDD into an Ignite cache using the `saveValues` method (sketched after the example below). In this case, `IgniteRDD` generates a unique affinity-local key for each value stored into the cache.
+
+For example, the following code will store pairs of integers from 1 to 10000 into cache named "partitioned" using 10 parallel store operations:
+
+
+[source, scala]
+----
+val cacheRdd = igniteContext.fromCache("partitioned")
+
+cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i)))
+----
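+
+Similarly, a value-only RDD can be stored with `saveValues`; a minimal sketch, reusing the `igniteContext` and `sparkContext` from above:
+
+[source, scala]
+----
+val cacheRdd = igniteContext.fromCache[Integer, Integer]("partitioned")
+
+// Store 10000 values; IgniteRDD generates a unique affinity-local key for each value.
+cacheRdd.saveValues(sparkContext.parallelize(1 to 10000, 10).map(i => Integer.valueOf(i)))
+----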
+
+
+== Running SQL queries against Ignite cache
+
+When an Ignite cache is configured with the indexing subsystem enabled, you can run SQL queries against the cache using the `objectSql` and `sql` methods (the `objectSql` variant is sketched after the example below). See link:SQL/sql-introduction[Working with SQL] for more information about Ignite SQL queries.
+
+For example, assuming the "partitioned" cache is configured to index pairs of integers, the following code will get all integers in the range (10, 100):
+
+
+[source, scala]
+----
+val cacheRdd = igniteContext.fromCache("partitioned")
+
+val result = cacheRdd.sql("select _val from Integer where _val > ? and _val < ?", 10, 100)
+----
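+
+While `sql` returns a DataFrame, `objectSql` returns the matching key-value pairs. A hedged sketch, assuming the same indexed Integer pairs (the query string is the condition part passed to Ignite's `SqlQuery`):
+
+[source, scala]
+----
+val cacheRdd = igniteContext.fromCache[Integer, Integer]("partitioned")
+
+// An RDD of (key, value) pairs whose values fall into the range (10, 100).
+val entries = cacheRdd.objectSql("Integer", "_val > ? and _val < ?", 10, 100)
+----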
+
+== Example
+
+There are a couple of examples available on GitHub that demonstrate the usage of `IgniteRDD`:
+
+* link:{githubUrl}/examples/src/main/scala/org/apache/ignite/scalar/examples/spark/ScalarSharedRDDExample.scala[Scala Example^]
+* link:{githubUrl}/examples/src/main/spark/org/apache/ignite/examples/spark/SharedRDDExample.java[Java Example^]
diff --git a/docs/_docs/ignite-for-spark/installation.adoc b/docs/_docs/ignite-for-spark/installation.adoc
new file mode 100644
index 0000000..ac28c61
--- /dev/null
+++ b/docs/_docs/ignite-for-spark/installation.adoc
@@ -0,0 +1,157 @@
+= Installation
+
+== Shared Deployment
+
+Shared deployment means that Apache Ignite nodes run independently of Apache Spark applications and keep their state even after Spark jobs die. As with Apache Spark, there are several ways to deploy Apache Ignite to the cluster.
+
+=== Standalone Deployment
+
+In the Standalone deployment mode, Ignite nodes should be deployed together with Spark Worker nodes. Instructions on Ignite installation can be found link:installation[here]. After you install Ignite on all worker nodes, start an Ignite node on each Spark worker with your configuration using the `ignite.sh` script.
+
+
+=== Adding Ignite libraries to Spark classpath by default
+
+The Spark application deployment model allows dynamic jar distribution during application start. This model, however, has some drawbacks:
+
+* The Spark dynamic class loader does not implement `getResource` methods, so you will not be able to access resources located in jar files.
+* The Java logger uses the application class loader (not the context class loader) to load log handlers, which results in `ClassNotFoundException` when using Java logging in Ignite.
+
+There is a way to alter the default Spark classpath for each launched application (this should be done on each machine of the Spark cluster, including master, worker and driver nodes).
+
+. Locate the `$SPARK_HOME/conf/spark-env.sh` file. If this file does not exist, create it from the `$SPARK_HOME/conf/spark-env.sh.template` template.
+. Add the following lines to the end of the `spark-env.sh` file (uncomment the line setting `IGNITE_HOME` if you do not have it globally set):
+
+
+
+[source, shell]
+----
+# Optionally set IGNITE_HOME here.
+# IGNITE_HOME=/path/to/ignite
+
+IGNITE_LIBS="${IGNITE_HOME}/libs/*"
+
+for file in ${IGNITE_HOME}/libs/*
+do
+    if [ -d ${file} ] && [ "${file}" != "${IGNITE_HOME}"/libs/optional ]; then
+        IGNITE_LIBS=${IGNITE_LIBS}:${file}/*
+    fi
+done
+
+export SPARK_CLASSPATH=$IGNITE_LIBS
+----
+
+
+Copy any folders required from the `$IGNITE_HOME/libs/optional` folder, such as `ignite-log4j`, to the `$IGNITE_HOME/libs` folder.
+
+You can verify that the Spark classpath is changed by running `bin/spark-shell` and typing a simple import statement:
+
+
+
+[source, shell]
+----
+scala> import org.apache.ignite.configuration._
+import org.apache.ignite.configuration._
+----
+
+== Embedded Deployment
+
+[CAUTION]
+====
+[discrete]
+=== Embedded Mode Deprecation
+Embedded mode implies starting Ignite server nodes within Spark executors which can cause unexpected rebalancing or even data loss. Therefore this mode is currently deprecated and will be eventually discontinued. Consider starting a separate Ignite cluster and using standalone mode to avoid data consistency and performance issues.
+====
+
+
+Embedded deployment means that Apache Ignite nodes are started inside the Apache Spark job processes and are stopped when the job dies. There is no need for additional deployment steps in this case. Apache Ignite code will be distributed to worker machines using the Apache Spark deployment mechanism and nodes will be started on all workers as part of the `IgniteContext` initialization.
+
+
+== Maven
+
+Ignite's Spark artifact is link:http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.ignite%22[hosted in Maven Central^]. Depending on the Scala version you use, include the artifact using one of the dependencies shown below.
+
+.Scala 2.11
+[source, xml]
+----
+<dependency>
+  <groupId>org.apache.ignite</groupId>
+  <artifactId>ignite-spark</artifactId>
+  <version>${ignite.version}</version>
+</dependency>
+----
+
+.Scala 2.10
+[source, xml]
+----
+<dependency>
+  <groupId>org.apache.ignite</groupId>
+  <artifactId>ignite-spark_2.10</artifactId>
+  <version>${ignite.version}</version>
+</dependency>
+----
+
+== SBT
+
+If SBT is used as a build tool for a Scala application, then Ignite's Spark artifact can be added into `build.sbt` with one of the commands below:
+
+.Scala 2.11
+[source, scala]
+----
+libraryDependencies += "org.apache.ignite" % "ignite-spark" % "ignite.version"
+----
+
+
+.Scala 2.10
+[source, scala]
+----
+libraryDependencies += "org.apache.ignite" % "ignite-spark_2.10" % "ignite.version"
+----
+
+
+== Classpath Configuration
+
+When the IgniteRDD or Ignite DataFrames APIs are used, make sure that the Spark executors and drivers have all the required Ignite jars available on their classpath. Spark provides several ways to modify the classpath of both the driver and the executor process.
+
+
+=== Parameters Configuration
+
+Ignite jars can be added to Spark using configuration parameters such as
+`spark.driver.extraClassPath` and `spark.executor.extraClassPath`. Refer to the link:https://spark.apache.org/docs/latest/configuration.html#runtime-environment[Spark official documentation] for all available options.
+
+The following shows how to set the `spark.executor.extraClassPath` parameter; `spark.driver.extraClassPath` is set in the same way:
+
+
+[source, shell]
+----
+spark.executor.extraClassPath /opt/ignite/libs/*:/opt/ignite/libs/optional/ignite-spark/*:/opt/ignite/libs/optional/ignite-log4j/*:/opt/ignite/libs/optional/ignite-yarn/*:/opt/ignite/libs/ignite-spring/*
+----
+
+=== Source Code Configuration
+
+Spark provides APIs to set up extra libraries from the application code. You can provide Ignite jars in the following way:
+
+
+
+[source, scala]
+----
+private val MAVEN_HOME = "/home/user/.m2/repository"
+
+val spark = SparkSession.builder()
+       .appName("Spark Ignite data sources example")
+       .master("spark://172.17.0.2:7077")
+       .getOrCreate()
+
+spark.sparkContext.addJar(MAVEN_HOME + "/org/apache/ignite/ignite-core/2.4.0/ignite-core-2.4.0.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/apache/ignite/ignite-spring/2.4.0/ignite-spring-2.4.0.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/apache/ignite/ignite-log4j/2.4.0/ignite-log4j-2.4.0.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/apache/ignite/ignite-spark/2.4.0/ignite-spark-2.4.0.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/apache/ignite/ignite-indexing/2.4.0/ignite-indexing-2.4.0.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/springframework/spring-beans/4.3.7.RELEASE/spring-beans-4.3.7.RELEASE.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/springframework/spring-core/4.3.7.RELEASE/spring-core-4.3.7.RELEASE.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/springframework/spring-context/4.3.7.RELEASE/spring-context-4.3.7.RELEASE.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/org/springframework/spring-expression/4.3.7.RELEASE/spring-expression-4.3.7.RELEASE.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/javax/cache/cache-api/1.0.0/cache-api-1.0.0.jar")
+spark.sparkContext.addJar(MAVEN_HOME + "/com/h2database/h2/1.4.195/h2-1.4.195.jar")
+----
+
+
diff --git a/docs/_docs/ignite-for-spark/overview.adoc b/docs/_docs/ignite-for-spark/overview.adoc
new file mode 100644
index 0000000..4312809
--- /dev/null
+++ b/docs/_docs/ignite-for-spark/overview.adoc
@@ -0,0 +1,35 @@
+= Ignite for Spark
+
+Apache Ignite is a distributed memory-centric database and caching platform that is used by Apache Spark users to:
+
+* Achieve true in-memory performance at scale and avoid data movement from a data source to Spark workers and applications.
+* Boost DataFrame and SQL performance.
+* More easily share state and data among Spark jobs.
+
+image::images/spark_integration.png[Spark Integration]
+
+
+== Ignite RDDs
+
+Apache Ignite provides an implementation of the Spark RDD which allows any data and state to be shared in memory as RDDs across Spark jobs. The Ignite RDD provides a shared, mutable view of the same data in-memory in Ignite across different Spark jobs, workers, or applications. Native Spark RDDs cannot be shared across Spark jobs or applications.
+
+An link:ignite-for-spark/ignitecontext-and-rdd[IgniteRDD] is implemented as a view over a distributed Ignite table (a.k.a. cache). It can be deployed with an Ignite node either within the Spark job executing process, on a Spark worker, or in a separate Ignite cluster. Depending on the chosen deployment mode, the shared state either exists only during the lifespan of a Spark application (embedded mode) or outlives the Spark application (standalone mode).
+
+While Apache SparkSQL supports a fairly rich SQL syntax, it doesn't implement any indexing. As a result, Spark queries may take minutes even on moderately small data sets because they have to do full data scans. With Ignite, Spark users can configure primary and secondary indexes that can bring up to 1000x performance gains.
+
+
+== Ignite DataFrames
+
+The Apache Spark DataFrame API introduced the concept of a schema to describe the data, allowing Spark to manage the schema and organize the data into a tabular format. To put it simply, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database and allows Spark to leverage the Catalyst query optimizer to produce much more efficient query execution plans in comparison to RDDs, which are just collections of elements partitioned across the nodes of the cluster.
+
+Ignite extends the link:ignite-for-spark/ignite-dataframe[DataFrame] API, simplifying development and improving data access times whenever Ignite is used as memory-centric storage for Spark. Benefits include:
+
+* Ability to share data and state across Spark jobs by writing and reading DataFrames to/from Ignite.
+* Faster SparkSQL queries, achieved by optimizing Spark query execution plans with the Ignite SQL engine, which includes advanced indexing and avoids data movement across the network from Ignite to Spark.
+
+== Supported Spark Version
+
+Apache Ignite comes with two modules that support different versions of Apache Spark (see the dependency sketch below):
+
+* `ignite-spark` — integration with Spark 2.3
+* `ignite-spark-2.4` — integration with Spark 2.4
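+
+An sbt sketch of declaring the matching module (the artifact names are assumed to follow the module names above; replace `ignite.version` with the Ignite version deployed in your cluster, as in the link:ignite-for-spark/installation[Installation] instructions):
+
+[source, scala]
+----
+// build.sbt (sketch): pick the module that matches your Spark version.
+libraryDependencies += "org.apache.ignite" % "ignite-spark-2.4" % "ignite.version" // Spark 2.4
+// For Spark 2.3, use the ignite-spark artifact instead:
+// libraryDependencies += "org.apache.ignite" % "ignite-spark" % "ignite.version"
+----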
diff --git a/docs/_docs/ignite-for-spark/spark-shell.adoc b/docs/_docs/ignite-for-spark/spark-shell.adoc
new file mode 100644
index 0000000..86a2e3a
--- /dev/null
+++ b/docs/_docs/ignite-for-spark/spark-shell.adoc
@@ -0,0 +1,188 @@
+= Testing Ignite with Spark-shell
+
+== Starting up the cluster
+
+Here we will briefly cover the process of Spark and Ignite cluster startup. Refer to link:https://spark.apache.org/docs/latest/[Spark documentation] for more details.
+
+For this test you will need a Spark master process and at least one Spark worker. Usually the Spark master and workers run on separate machines, but for testing purposes you can start a worker on the same machine that runs the master.
+
+. Download and unpack Spark binary distribution to the same location (let it be `SPARK_HOME`) on all nodes.
+. Download and unpack Ignite binary distribution to the same location (let it be `IGNITE_HOME`) on all nodes.
+. On master node `cd` to `$SPARK_HOME` and run the following command:
++
+--
+[source, shell]
+----
+sbin/start-master.sh
+----
+
+The script should output the path to the log file of the started process. Check the log file for the master URL, which has the following format: `spark://master_host:master_port`. Also check the log file for the web UI URL (usually `http://master_host:8080`).
+--
+. On each of the worker nodes `cd` to `$SPARK_HOME` and run the following command:
++
+[source, shell]
+----
+bin/spark-class org.apache.spark.deploy.worker.Worker spark://master_host:master_port
+----
+where `spark://master_host:master_port` is the master URL you grabbed from the master log file. After the workers have started, check the master web UI; it should show all of your workers registered with the status `ALIVE`.
+. On each of the worker nodes `cd` to `$IGNITE_HOME` and start an Ignite node by running the following command:
++
+[source, shell]
+----
+bin/ignite.sh
+----
+
+
+You should see Ignite nodes discover each other with the default configuration. If your network does not allow multicast traffic, you will need to change the default configuration file and configure TCP discovery.
+
+
+== Working with Spark-Shell
+
+Now that you have your cluster up and running, you can run `spark-shell` and check the integration.
+
+1. Start the Spark shell:
++
+--
+* Either by providing Maven coordinates of the Ignite artifacts (you can use `--repositories` if needed, but it may be omitted):
++
+[source, shell]
+----
+./bin/spark-shell \
+  --packages org.apache.ignite:ignite-spark:1.8.0 \
+  --master spark://master_host:master_port \
+  --repositories http://repo.maven.apache.org/maven2/org/apache/ignite
+----
+* Or by providing paths to the Ignite jar files using the `--jars` parameter:
++
+[source, shell]
+----
+./bin/spark-shell --jars path/to/ignite-core.jar,path/to/ignite-spark.jar,path/to/cache-api.jar,path/to/ignite-log4j.jar,path/to/log4j.jar --master spark://master_host:master_port
+----
+
+You should see the Spark shell start up.
+
+Note that if you are planning to use Spring configuration loading, you will need to add the `ignite-spring` dependency as well:
+
+[source, shell]
+----
+./bin/spark-shell \
+  --packages org.apache.ignite:ignite-spark:1.8.0,org.apache.ignite:ignite-spring:1.8.0 \
+  --master spark://master_host:master_port
+----
+--
+2. Let's create an instance of the Ignite context using the default configuration:
++
+--
+
+[source, scala]
+----
+import org.apache.ignite.spark._
+import org.apache.ignite.configuration._
+
+val ic = new IgniteContext(sc, () => new IgniteConfiguration())
+----
+
+You should see something like this:
+
+
+[source, text]
+----
+ic: org.apache.ignite.spark.IgniteContext = org.apache.ignite.spark.IgniteContext@62be2836
+----
+
+An alternative way to create an instance of `IgniteContext` is to use a configuration file. Note that if the path to the configuration is specified in a relative form, then the `IGNITE_HOME` environment variable should be globally set in the system, as the path is resolved relative to `IGNITE_HOME`.
+
+
+[source, scala]
+----
+import org.apache.ignite.spark._
+import org.apache.ignite.configuration._
+
+val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
+----
+--
+3. Let's now create an instance of `IgniteRDD` using the "partitioned" cache with the default configuration:
++
+--
+
+[source, scala]
+----
+val sharedRDD = ic.fromCache[Integer, Integer]("partitioned")
+----
+
+
+You should see an instance of RDD created for partitioned cache:
+
+
+[source, text]
+----
+sharedRDD: org.apache.ignite.spark.IgniteRDD[Integer,Integer] = IgniteRDD[0] at RDD at IgniteAbstractRDD.scala:27
+----
+
+
+Note that the creation of the RDD is a local operation and will not create a cache in the Ignite cluster.
+--
+4. Let's now actually ask Spark to do something with our RDD, for example, get all pairs where the value is less than 10:
++
+--
+
+[source, scala]
+----
+sharedRDD.filter(_._2 < 10).collect()
+----
+
+
+As our cache has not been filled yet, the result will be an empty array:
+
+
+[source, text]
+----
+res0: Array[(Integer, Integer)] = Array()
+----
+
+
+Check the logs of the remote Spark workers to see how the Ignite context starts client nodes on all remote workers in the cluster. You can also start the command-line Visor and check that the "partitioned" cache has been created.
+
+--
+5. Let's now save some values into Ignite:
++
+--
+
+[source, scala]
+----
+sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
+----
+
+After running this command, you can check with the command-line Visor that the cache contains 100000 elements.
+
+--
+6. We can now check how the state we created survives a job restart. Shut down the Spark shell and repeat steps 1-3. You should again have an instance of the Ignite context and an RDD for the "partitioned" cache. We can now check how many keys in our RDD have a value greater than 50000:
++
+--
+
+[source, scala]
+----
+sharedRDD.filter(_._2 > 50000).count
+----
+
+Since we filled the cache with a sequence of numbers from 1 to 100000 inclusive, we should see `50000` as the result:
+
+
+[source, text]
+----
+res0: Long = 50000
+----
+--
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/docs/_docs/ignite-for-spark/troubleshooting.adoc b/docs/_docs/ignite-for-spark/troubleshooting.adoc
new file mode 100644
index 0000000..a72919d
--- /dev/null
+++ b/docs/_docs/ignite-for-spark/troubleshooting.adoc
@@ -0,0 +1,9 @@
+= Troubleshooting
+
+* My Spark application or Spark shell hangs when I invoke any action on `IgniteRDD`
+
+This happens if you have created `IgniteContext` in client mode (which is the default mode) and you do not have any Ignite server nodes started. In this case, the Ignite client will wait until server nodes are started, or fail after the cluster join timeout has elapsed. You should start at least one Ignite server node when using `IgniteContext` in client mode.
+
+* I am getting `java.lang.ClassNotFoundException: org.apache.ignite.logger.java.JavaLoggerFileHandler` when using `IgniteContext`
+
+This issue appears when you do not have any loggers included in the classpath and Ignite tries to use standard Java logging. By default, Spark loads all user jar files using a separate class loader. The Java logging framework, on the other hand, uses the application class loader to initialize log handlers. To resolve this, you can either add the `ignite-log4j` module to the list of used jars so that Ignite uses Log4j as the logging subsystem, or alter the default Spark classpath as described in the link:ignite-for-spark/installation[Installation] section.
\ No newline at end of file
diff --git a/docs/_docs/images/spark_integration.png b/docs/_docs/images/spark_integration.png
new file mode 100644
index 0000000..466c6a3
Binary files /dev/null and b/docs/_docs/images/spark_integration.png differ