Posted to user@spark.apache.org by Vibhor Gupta <Vi...@walmart.com.INVALID> on 2022/11/09 18:13:04 UTC

Offline elastic index creation

Hi Spark Community,

Is there a way to create Elasticsearch indexes offline and then import them into an Elasticsearch cluster?
We are trying to load an Elasticsearch index with around 10B documents (~1.5 to 2 TB of data) daily using Spark.
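
For a sense of scale, the figures above can be turned into rough per-document and per-second numbers. This is only a back-of-envelope sketch: it assumes decimal terabytes and that the load may use the full 24 hours, neither of which is stated in the post. The `sizing` helper is a made-up name for illustration.

```python
# Back-of-envelope sizing for the daily load described above.
# Figures from the post: ~10B documents, ~1.5 to 2 TB of data.
# Assumptions (not from the post): decimal terabytes, and the
# load is allowed to take the full 24 hours.

DOCS = 10_000_000_000
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sizing(total_bytes: int, docs: int = DOCS, window_s: int = SECONDS_PER_DAY) -> dict:
    """Return the average document size and the sustained rates needed."""
    return {
        "avg_doc_bytes": total_bytes / docs,
        "docs_per_second": docs / window_s,
        "mb_per_second": total_bytes / window_s / 10**6,
    }


low = sizing(int(1.5 * TB))   # lower bound of the stated data size
high = sizing(2 * TB)         # upper bound of the stated data size

if __name__ == "__main__":
    for label, s in (("1.5 TB", low), ("2 TB", high)):
        print(label, {k: round(v, 1) for k, v in s.items()})
```

Under those assumptions the documents average roughly 150 to 200 bytes each, and the load has to sustain on the order of 115,000 documents per second. A shorter load window raises that rate proportionally.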

I know Elasticsearch provides snapshot and restore functionality through GCS/S3/Azure, but is there a way to generate such a snapshot offline using Spark?

Thanks,
Vibhor Gupta

Re: Offline elastic index creation

Posted by Debasish Das <de...@gmail.com>.
Hi Vibhor,

We worked on a project to create Lucene indexes using Spark, but the
project has not been maintained for some time now. If there is interest,
we can resurrect it:

https://github.com/vsumanth10/trapezium/blob/master/dal/src/test/scala/com/verizon/bda/trapezium/dal/lucene/LuceneIndexerSuite.scala
https://www.databricks.com/session/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building

After the Lucene indexes were created, we uploaded them to Solr for the
search UI. We did not ingest them into Elasticsearch, though.

Our scale was 100M+ rows and 100K+ columns; Spark + Lucene worked fine.
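
The project code is linked above and is not reproduced here. As a rough picture of the general idea of building index shards independently, one per partition, and merging hits at query time, here is a toy sketch. It is an assumption about how such pipelines are commonly laid out, not a description of trapezium: a plain dict-based inverted index stands in for Lucene, a list of lists stands in for Spark partitions, and every name in it (`partition_docs`, `build_shard_index`, `search`) is hypothetical.

```python
# Toy illustration only: a dict-based inverted index stands in for
# Lucene, and a list of lists stands in for Spark partitions.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Doc = Tuple[str, str]  # (doc_id, text); hypothetical record shape


def partition_docs(docs: List[Doc], num_shards: int) -> List[List[Doc]]:
    """Assign each document to a shard by position (round-robin)."""
    shards: List[List[Doc]] = [[] for _ in range(num_shards)]
    for i, doc in enumerate(docs):
        shards[i % num_shards].append(doc)
    return shards


def build_shard_index(shard: List[Doc]) -> Dict[str, Set[str]]:
    """Build one independent term -> doc_ids index for a single shard."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in shard:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(indexes: List[Dict[str, Set[str]]], term: str) -> Set[str]:
    """Query every shard index and merge the hits."""
    hits: Set[str] = set()
    for index in indexes:
        hits |= index.get(term.lower(), set())
    return hits


docs = [
    ("1", "spark builds indexes"),
    ("2", "lucene indexes text"),
    ("3", "spark and lucene"),
]
# Each shard is indexed without looking at any other shard, which is
# what lets the work run in parallel across partitions.
indexes = [build_shard_index(s) for s in partition_docs(docs, num_shards=2)]

if __name__ == "__main__":
    print(sorted(search(indexes, "spark")))
```

Because each shard index is built in isolation, the per-shard step is the part that would map onto a per-partition task in a real job.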

Thank you.
Deb


On Wed, Nov 9, 2022, 10:13 AM Vibhor Gupta <Vi...@walmart.com.invalid>
wrote:

> Hi Spark Community,
>
> Is there a way to create elastic indexes offline and then import them to
> an elastic cluster ?
> We are trying to load an elastic index with around 10B documents (~1.5 to
> 2 TB data) using spark daily.
>
> I know elastic provides a snapshot restore functionality through
> GCS/S3/Azure, but is there a way to generate this snapshot offline using
> spark ?
>
> Thanks,
> Vibhor Gupta
>