Posted to issues@solr.apache.org by "Indumathy Rajagopalan (Jira)" <ji...@apache.org> on 2023/03/13 23:17:00 UTC

[jira] [Created] (SOLR-16697) New API support to import index files generated by Embedded SOLR into SOLR Cloud

Indumathy Rajagopalan created SOLR-16697:
--------------------------------------------

             Summary: New API support to import index files generated by Embedded SOLR into SOLR Cloud
                 Key: SOLR-16697
                 URL: https://issues.apache.org/jira/browse/SOLR-16697
             Project: Solr
          Issue Type: New Feature
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Backup/Restore
            Reporter: Indumathy Rajagopalan


Offline indexing is a popular option when very large data sets need to be indexed into SOLR.
Data is loaded from a data source (e.g. Cassandra), and index creation pipelines produce index files per shard using embedded SOLR.
 
With older versions of SOLR, we would copy these index files into the SOLR Cloud data directories using custom tools and then reload the collection so that the newly uploaded index could be searched and updated.
Ideally, we should use the Restore API to import the index files from a backup repository. However, the file structure that the Restore API expects is complex enough that massaging the index files for every shard into a Restore-compatible format is infeasible.
 
It would be good for SOLR to support a 'Restore'-like API that allows us to import index files generated by embedded SOLR into SOLR Cloud. This API should operate at the shard level and import the index files into a single shard per invocation.
 
*With the new API, offline indexing could look like this:*
 
1. Generate index files per shard using embedded SOLR as part of Hadoop MapReduce/Spark jobs, and copy all index files for every shard into the backup repository.
 
2. The new API should be able to import the index from the backup repository location into each shard on SOLR Cloud. The API would handle things like marking the collection as read-only, triggering replication, etc., along the lines of what the 'RESTORE' API currently does.
 
The new API should support the relevant parameters from the Restore API (location and repository).
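
To make the request concrete, here is a minimal sketch of how the calls might look. The first function builds a request for the existing Collections API RESTORE action. The second is purely hypothetical: the action name `IMPORTSHARDINDEX`, the per-shard `location` layout, the host, and all paths and names below are assumptions made for illustration and do not exist in SOLR today.

```python
from urllib.parse import urlencode

# Assumed base URL of a SOLR Cloud node; adjust for a real cluster.
SOLR_BASE = "http://localhost:8983/solr"


def restore_url(collection, backup_name, location, repository):
    """Build a URL for the existing Collections API RESTORE action."""
    params = {
        "action": "RESTORE",
        "name": backup_name,
        "collection": collection,
        "location": location,
        "repository": repository,
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def shard_import_urls(collection, shards, location, repository):
    """HYPOTHETICAL: build one import request per shard.

    'IMPORTSHARDINDEX' is an invented action name standing in for the
    API proposed in this issue. It is assumed that each shard's index
    files sit under <location>/<shard> in the backup repository.
    """
    urls = []
    for shard in shards:
        params = {
            "action": "IMPORTSHARDINDEX",  # hypothetical, not a real SOLR action
            "collection": collection,
            "shard": shard,
            "location": f"{location}/{shard}",
            "repository": repository,
        }
        urls.append(f"{SOLR_BASE}/admin/collections?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # One invocation per shard, as requested in the description above.
    for url in shard_import_urls(
        "products", ["shard1", "shard2"], "/backups/offline", "hdfs"
    ):
        print(url)
```

Issuing one request per shard keeps each import independent, so a failed shard could be retried without re-importing the others.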



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
