Posted to user@spark.apache.org by sixers <bu...@gmail.com> on 2017/10/09 13:36:27 UTC

[Spark SQL] Missing data in Elasticsearch when writing data with elasticsearch-spark connector

### Issue description

We have a data consistency issue when writing to Elasticsearch from Spark
using the elasticsearch-spark connector. The job finishes successfully, but
when we compare the original data (stored in S3) with the data stored in ES,
some documents are missing from Elasticsearch.

### Steps to reproduce

This issue does not happen every time, and unfortunately we cannot reproduce
it on demand. The only indicator we have found that correlates with
occurrences of this bug is a failed stage while saving data to Elasticsearch.
Jobs with such a stage failure eventually complete successfully, but the data
is inconsistent.

We use the following configuration:

- Elasticsearch:
  - "es.write.operation": "index"
  - "es.nodes.discovery": "false"
  - "es.nodes.wan.only": "true"
- Spark:
  - write mode: "append"
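
For reference, a minimal PySpark sketch of a write using the configuration above. This is an assumption about the shape of our write path, not the exact job: the endpoint, index name, and `df` are hypothetical placeholders.

```python
# Hypothetical sketch of the write path; endpoint, index, and df are placeholders.
df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "search-example.us-east-1.es.amazonaws.com") \
    .option("es.port", "443") \
    .option("es.nodes.wan.only", "true") \
    .option("es.nodes.discovery", "false") \
    .option("es.write.operation", "index") \
    .mode("append") \
    .save("my-index/my-type")
```

One connector knob worth noting here: `es.mapping.id` tells the connector to use a document field as the ES document ID, so that a retried task overwrites its own earlier writes instead of producing a different set of documents.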

### Version Info

- OS          :  Amazon Linux
- JVM         :  1.8
- Hadoop/Spark:  Hadoop 2.7.3 (Amazon), Spark 2.2.0
- ES-Hadoop   :  elasticsearch-spark-20_2.11:5.5.2
- ES          :  5.3 (Amazon Elasticsearch Service)

### Questions

I'm looking for guidance on debugging this issue.

1. Why does Elasticsearch not have all the data, even though Spark reports
that the job finished and the data was saved?
2. What can we do to ensure that we write data to ES in a consistent manner?
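
As a first step on question 1, it may help to pin down exactly which documents are missing by diffing document IDs between the S3 source and ES. A minimal helper sketch (the two ID collections would come from reading each side; the function name is our own):

```python
def missing_ids(source_ids, es_ids):
    """Return sorted IDs present in the source data but absent from ES."""
    return sorted(set(source_ids) - set(es_ids))

# Example: documents "b" and "d" never made it into Elasticsearch.
missing = missing_ids(["a", "b", "c", "d"], ["a", "c"])
# missing == ["b", "d"]
```

If the missing IDs cluster by partition, that would point at the failed-and-retried tasks rather than at random drops.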



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [Spark SQL] Missing data in Elasticsearch when writing data with elasticsearch-spark connector

Posted by ayan guha <gu...@gmail.com>.
Have you raised this as an issue on the ES connector's GitHub? In my past
experience (with the Hadoop connector for Pig), they respond pretty quickly.

On Tue, Oct 10, 2017 at 12:36 AM, sixers <bu...@gmail.com> wrote:

> [quoted original message trimmed]
>


-- 
Best Regards,
Ayan Guha