Posted to dev@griffin.apache.org by "Chitral Verma (Jira)" <ji...@apache.org> on 2020/06/17 06:22:00 UTC
[jira] [Resolved] (GRIFFIN-326) New implementation for Elasticsearch Data Connector (Batch)
[ https://issues.apache.org/jira/browse/GRIFFIN-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chitral Verma resolved GRIFFIN-326.
-----------------------------------
Assignee: Chitral Verma
Resolution: Fixed
> New implementation for Elasticsearch Data Connector (Batch)
> -----------------------------------------------------------
>
> Key: GRIFFIN-326
> URL: https://issues.apache.org/jira/browse/GRIFFIN-326
> Project: Griffin
> Issue Type: Sub-task
> Reporter: Chitral Verma
> Assignee: Chitral Verma
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> The current implementation of the Elasticsearch connector relies on sending POST requests from the driver, using either SQL or search mode for query filtering.
> This implementation has the following potential issues:
> * Data is fetched for indexes (the database-like scope in ES) in bulk via a single call on the driver. If the index holds a lot of data, the large response payload creates a bottleneck on the driver.
> * The driver must then parse this response payload and parallelize it. This is again a driver-side bottleneck, as each JSON record must be mapped to a fixed schema in a type-safe manner.
> * Only _host_, _port_ and _version_ are available as options for configuring the connection to the ES node or cluster.
> * Source partitioning logic is not carried forward when parallelizing records; the records get shuffled by Spark's default partitioning.
> * Even though this implementation is a first-class member of Apache Griffin, it is based on the _custom_ connector trait.
> The proposed implementation aims to:
> * Deprecate the current implementation in favor of the official [elasticsearch-hadoop|https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20] library.
> * This library is built on the DataSource API (Spark 2.2.x+) and thus brings support for filter pushdown, column pruning, unified reads and writes, and additional optimizations.
> * Many more configuration options are available for ES connectivity; [check here|https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java].
> * Filters can be applied as expressions directly on the DataFrame and are pushed down automatically to the source.
> The new implementation will look something like,
> {code:java}
> sparkSession.read.format("es").options( ??? ).load("<resource_name>")
> {code}
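> The options map (the {{???}} above) would carry elasticsearch-hadoop connection settings. A minimal sketch, assuming the standard option keys defined in {{ConfigurationOptions}} ({{es.nodes}}, {{es.port}}); the host, port and index name below are purely illustrative:
> {code:scala}
> // Hypothetical connection options for the elasticsearch-hadoop connector;
> // keys follow org.elasticsearch.hadoop.cfg.ConfigurationOptions.
> val esOptions = Map(
>   "es.nodes" -> "localhost", // comma-separated list of ES nodes (illustrative)
>   "es.port"  -> "9200"       // ES REST port (illustrative)
> )
>
> // With a live SparkSession "spark" and an index "my_index" (illustrative):
> // val df = spark.read.format("es").options(esOptions).load("my_index")
> {code}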
--
This message was sent by Atlassian Jira
(v8.3.4#803005)