Posted to issues@solr.apache.org by "Jason Gerlowski (Jira)" <ji...@apache.org> on 2023/03/14 15:26:00 UTC
[jira] [Commented] (SOLR-16697) New API support to import index files generated by Embedded SOLR into SOLR Cloud

    [ https://issues.apache.org/jira/browse/SOLR-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700262#comment-17700262 ] 

Jason Gerlowski commented on SOLR-16697:
----------------------------------------

(Disclaimer: I work with Indumathy, and we've talked a good bit about this problem offline.)

I took a crack at implementing this via a Shard "Install" ("Import"?  Not sure about the naming here...) collection-admin API.  Ended up being pretty straightforward.  The API is able to use the same restore fundamentals Solr already has, but exposes them through an end-user interface that's much more natural for offline-indexing use cases.

Check out the draft PR [here|https://github.com/apache/solr/pull/1458].  Definitely not ready to commit - needs some input validation, tests, docs, etc.  But should be enough to showcase the general approach for anyone interested.

> New API support to import index files generated by Embedded SOLR into SOLR Cloud
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-16697
>                 URL: https://issues.apache.org/jira/browse/SOLR-16697
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore
>            Reporter: Indumathy Rajagopalan
>            Priority: Blocker
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Offline indexing is a popular option when really large data sets need to be indexed into SOLR. 
> Data is loaded from a data source (e.g. c*), and index creation pipelines produce index files per shard using embedded SOLR.
>  
> With older versions of SOLR, we would copy these index files into SOLR Cloud data directories using custom tools and reload the collection to be able to search/update on the newly uploaded collection.
> Ideally, we should use the Restore API to import the index files from a backup repository. However, the file structure expected for the Restore API to work is complex enough that massaging the index files in every shard into a Restore-compatible format is infeasible.
>  
> It would be good for SOLR to support a 'Restore'-like API that would allow us to import index files generated by embedded SOLR into SOLR Cloud. This API should operate at the shard level and import the index files into a single shard per invocation.
>  
> *With the new API, offline indexing could look like this:* 
>  
> 1. Generate index files per shard using embedded SOLR as part of Hadoop MR/Spark jobs, and copy all index files for every shard into the backup repository.
>  
> 2. The new API should be able to import the index from the backup repository location into each shard on SOLR Cloud. The API would handle things like marking the collection as read-only, triggering replication, etc., along the lines of what the 'RESTORE' API currently does.
>  
> The new API should be able to support the relevant parameters from the Restore API (location & repository).
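
To make the per-shard workflow in the description concrete, here is a minimal sketch of how a client might issue one call per shard. Everything about the request shape is an assumption: the action name below is a placeholder (the naming is still undecided in this thread), and reusing the Collections API parameters "collection" and "shard" alongside the "repository" and "location" parameters mentioned above is a guess, not the interface from the draft PR. The code only builds URLs and sends nothing.

```python
from urllib.parse import urlencode

# Placeholder only: the thread has not settled on a name ("Install" vs. "Import").
HYPOTHETICAL_ACTION = "INSTALLSHARD"


def build_shard_import_urls(solr_base, collection, shard_locations, repository,
                            action=HYPOTHETICAL_ACTION):
    """Build one Collections API URL per shard (one invocation per shard).

    shard_locations maps a shard name to the backup-repository path that
    holds the index files generated offline for that shard.
    """
    base = solr_base.rstrip("/")
    urls = []
    for shard, location in sorted(shard_locations.items()):
        params = {
            "action": action,          # assumed action name
            "collection": collection,  # assumed parameter
            "shard": shard,            # assumed parameter
            "repository": repository,  # from the issue description
            "location": location,      # from the issue description
        }
        urls.append(f"{base}/admin/collections?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # Hypothetical host, collection, repository, and paths for illustration.
    for url in build_shard_import_urls(
        "http://localhost:8983/solr",
        "products",
        {"shard1": "/backups/products/shard1",
         "shard2": "/backups/products/shard2"},
        "hdfs_repo",
    ):
        print(url)
```

Keeping the call scoped to a single shard means the offline pipeline can import each shard as soon as its index files land in the repository, rather than waiting for the whole collection.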



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
