Posted to dev@griffin.apache.org by "Chitral Verma (Jira)" <ji...@apache.org> on 2020/06/17 06:22:00 UTC

[jira] [Resolved] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)

     [ https://issues.apache.org/jira/browse/GRIFFIN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chitral Verma resolved GRIFFIN-326.
-----------------------------------
      Assignee: Chitral Verma
    Resolution: Fixed

> New implementation for Elasticsearch Data Connector (Batch)
> -----------------------------------------------------------
>
>                 Key: GRIFFIN-326
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-326
>             Project: Griffin
>          Issue Type: Sub-task
>            Reporter: Chitral Verma
>            Assignee: Chitral Verma
>            Priority: Major
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch connector relies on sending POST requests from the driver, using either SQL or search mode for query filtering.
> This implementation has the following potential issues:
>  * Data is fetched for indices (database scopes in ES) in bulk via a single call on the driver. If the index holds a lot of data, the large response payload creates a bottleneck on the driver.
>  * Further, the driver must parse this response payload and then parallelize it. This is again a driver-side bottleneck, as each JSON record needs to be mapped to a fixed schema in a type-safe manner.
>  * Only _host_, _port_ and _version_ are available as options to configure the connection to the ES node or cluster.
>  * Source partitioning logic is not carried forward when parallelizing records; the records are shuffled arbitrarily by Spark's default partitioning.
>  * Even though this implementation is a first-class member of Apache Griffin, it is based on the _custom_ connector trait.
> The proposed implementation aims to:
>  * Deprecate the current implementation in favor of the official [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20] library.
>  * This library is built on the DataSource API available in Spark 2.2.x+ and thus brings support for filter pushdown, column pruning, unified reads and writes, and additional optimizations.
>  * Many configuration options are available for ES connectivity; see [ConfigurationOptions|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
>  * Filters can be applied as expressions directly on the DataFrame and are pushed down automatically to the source.
> The new implementation will look something like this:
> {code:scala}
> sparkSession.read.format("es").options( ??? ).load("<resource_name>")
> {code}
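> A more complete sketch of such a batch read is shown below. The option keys are real elasticsearch-hadoop settings (see the ConfigurationOptions link above); the host, port, index name and filter column are placeholders for illustration only:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // local master is used here only to keep the sketch self-contained
> val spark = SparkSession.builder().appName("griffin-es-batch").master("local[*]").getOrCreate()
>
> // Connectivity is configured via elasticsearch-hadoop options; values are placeholders.
> val df = spark.read
>   .format("es")
>   .option("es.nodes", "localhost")      // comma-separated list of ES nodes
>   .option("es.port", "9200")
>   .option("es.nodes.wan.only", "true")  // useful when nodes sit behind a proxy/cloud
>   .load("<resource_name>")              // index (and type, for ES versions before 7)
>
> // Filters and projections are expressed on the DataFrame; the connector
> // pushes them down to Elasticsearch (filter pushdown + column pruning).
> val result = df.filter(df("status") === "active").select("id", "status")
> result.show()
>
> // Writes go through the same unified API ("<target_resource>" is a placeholder).
> result.write.format("es").save("<target_resource>")
> {code}
> Since the read is planned through the DataSource API, partitions are created per index shard and each Spark task pulls its own slice of the data, avoiding the driver-side bottlenecks listed above.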



--
This message was sent by Atlassian Jira
(v8.3.4#803005)