You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2021/06/09 01:05:40 UTC

[GitHub] [solr] athrog commented on pull request #120: SOLR-15089: Allow backup/restoration to Amazon's S3 blobstore

athrog commented on pull request #120:
URL: https://github.com/apache/solr/pull/120#issuecomment-857294030


   @gerlowskija I'll address the cons list first, that seems to me the most salient part to discuss:
   
   > Classpath Bloat: Having all blob-store implementations in the same config means that a user using S3 for backups will also be loading a bunch of the GCS (and eventually ABS) related dependencies that they have no intention of ever using. How much it'll matter in practice, idk, but in theory this expanded but unused surface area is a liability in terms of triggering vulnerability scans or true security issues whenever one of these deps has a CVE.
   
   In my mind, if we go the path of using the BlobRepository abstraction, the ultimate goal would be to have it be a separate unit, apart from the S3 or GCS stuff. That could mean...
   1. We have three modules in solr-contrib (e.g., BlobRepository, S3Repo, GCSRepo) [this seems weird though, b/c BlobRepository is an "abstract" layer and that doesn't seem to belong in solr-contrib]
   1. We commit the BlobRepository code directly into the Solr core module, and then have S3Repo and GCSRepo pull from that (this sounds similar to the approach you mentioned "One side note on this question")
   1. We push the BlobRepository code into a separate Git repo, publish a jar that S3Repo and GCSRepo
   
   There may be other options, but for the bloat reasons you mentioned, I never really considered putting all the blob stores into one multi-cloud repo. 
   
   > Limited Code Reuse: While BlobRepository does allow some pieces to be reused, it's not as much as I was hoping might be possible. IMO the actual code interfacing with the S3/GCS client comprises such a large chunk of the logic here that it dwarfs the smaller bits that are able to be reused. As an "example" of this: GCSBackupRepository and S3storageClient both come in at roughly the same LOC count despite S3StorageClient taking advantage of all the shared logic up in BlobRepository. (Admittedly this is a real shaky comparison: the two code samples are using different client libraries, implement different interfaces, are written in different coding styles with different use of whitespace, comments, etc. And LOC is a dubious stand-in for complexity in any case. But the point that S3StorageClient's code-reuse still leaves the two in the same ballpark still stands.)
   
   Very fair criticism. For me, it's right on the edge of "are we over-architecting this?". It would of course help to have a third blob client (Azure?); perhaps with just two we're underestimating the value that middle layer could provide? 
   
   > Documentation Complexity: Combining multiple blob-store implementations under the same BackupRepository impl will add some documentation challenges: in calling out which config props affect which blob store "types", in describing the differing expectations on for identifiers like bucket names and paths, etc.
   
   Is this con still relevant if we pick one of the three options I listed above (and not have them all in one)? If each is separate, their docs would also be separate. Or do you mean internal docs (e.g., Javadoc)?
   
   
   You also asked
   
   > What distinguishes a few of the I/O classes in this PR (e.g. BlobIndexInput) from existing alternatives (e.g. BufferedChecksumIndexInput) that do similar things at a quick glance?
   
   In regards to BlobIndexInput, I'll have to defer the 'why' question to @psalagnac. That impl was written when we first wrote this plugin -- it's possible it was to work around a bug, or possible it was oversight that we built it ourselves and didn't use a pre-built impl. If I get some time, I can also try swapping our use for `BufferedChecksumIndexInput` and see if tests pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org