Posted to dev@solr.apache.org by Ishan Chattopadhyaya <ic...@gmail.com> on 2023/04/10 17:08:54 UTC

Cloud storage modules for backup/restore

Hi all,

For backup/restore, we have out-of-the-box support for GCS and S3, but not
Azure. I think we should deprecate both the modules for S3 and GCS, and
adopt the Apache jclouds project, which supports all three. For testing, we
could try MinIO (unless we are already happy with S3Mock that we use today).
Any thoughts or concerns?

One of my colleagues at SearchScale has a solution for this (which
pre-dates the introduction of S3 and GCS repositories). The solution is
based on Apache jclouds, and I found that the integration was pretty clean.
If there's interest, we can consider open-sourcing it.

Regards,
Ishan

Re: Cloud storage modules for backup/restore

Posted by David Smiley <ds...@apache.org>.
Big per-file overhead on writing suggests it'd be beneficial to set
useCompoundFile to true (the default is false).
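
As a rough illustration of why fewer files per segment helps, the sketch
below multiplies a fixed per-file write overhead by the number of files
written. The 300 ms overhead, the 12 files per non-compound segment, and the
class name PerFileOverhead are assumptions for illustration only; actual file
counts depend on the codec and the fields in use. A compound segment is
modelled here as three files (.cfs, .cfe, .si).

```java
/** Back-of-the-envelope model of fixed per-file write overhead (illustrative only). */
class PerFileOverhead {

    /** Total fixed overhead in milliseconds for writing the given number of segments. */
    static long writeOverheadMillis(int segments, int filesPerSegment, long perFileMillis) {
        return (long) segments * filesPerSegment * perFileMillis;
    }

    public static void main(String[] args) {
        long perFileMillis = 300;   // assumed: "hundreds of millis" per file
        int segments = 100;         // assumed number of flushed segments
        int plainFiles = 12;        // assumed file count without compound files
        int compoundFiles = 3;      // .cfs, .cfe, .si

        System.out.println("non-compound ms: "
            + writeOverheadMillis(segments, plainFiles, perFileMillis));
        System.out.println("compound ms:     "
            + writeOverheadMillis(segments, compoundFiles, perFileMillis));
    }
}
```

Under these assumed numbers, 100 flushed segments cost about 360 seconds of
fixed overhead without compound files versus about 90 seconds with them.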

I think unlocking more write performance requires some sort of write-level
cache to enable segment merges to use local segment files if they have been
written recently.  It could be layered as a Directory wrapper (i.e. extends
FilterDirectory).  I've done some thinking about this lately.

The read side demands a cache for reasonable read performance.  Solr's HDFS
module includes not just the underlying HdfsDirectory but also
BlockDirectory -- a read cache.  I like that it has no entanglements with
HDFS, thus it could be used creatively with, say, NIOFSDirectory with cloud
storage NIO impls Joel mentions.
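
To make the read-cache idea concrete, here is a minimal sketch of an LRU
block cache sitting in front of a slow remote read. It is not how Solr's
BlockDirectory is implemented; the class name BlockReadCache, the
LongFunction loader, and the block sizes are all hypothetical, and it uses
only the JDK so that it stays independent of any storage API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical LRU cache of fixed-size blocks in front of a slow remote reader. */
class BlockReadCache {
    private final int blockSize;
    private final LongFunction<byte[]> remoteReader; // loads one block by block index
    private final Map<Long, byte[]> blocks;
    private int remoteReads = 0;

    BlockReadCache(int blockSize, int maxBlocks, LongFunction<byte[]> remoteReader) {
        this.blockSize = blockSize;
        this.remoteReader = remoteReader;
        // An access-ordered LinkedHashMap drops the least recently used block.
        this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Reads one byte at an absolute offset, fetching its block only on a cache miss. */
    byte readByte(long offset) {
        long blockIndex = offset / blockSize;
        byte[] block = blocks.get(blockIndex);
        if (block == null) {
            block = remoteReader.apply(blockIndex);
            remoteReads++;
            blocks.put(blockIndex, block);
        }
        return block[(int) (offset % blockSize)];
    }

    /** Number of times the remote reader was actually invoked. */
    int remoteReads() {
        return remoteReads;
    }

    public static void main(String[] args) {
        byte[] file = new byte[64];
        for (int i = 0; i < file.length; i++) file[i] = (byte) i;
        BlockReadCache cache = new BlockReadCache(16, 2,
            idx -> java.util.Arrays.copyOfRange(file, (int) idx * 16, (int) idx * 16 + 16));
        cache.readByte(5);
        cache.readByte(10); // same block as offset 5, served from the cache
        System.out.println("remote reads: " + cache.remoteReads());
    }
}
```

Because the cache only sees block indexes and a loader function, the same
shape could in principle sit in front of HDFS, an NIO provider, or anything
else that can return a block of bytes.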

It's not clear to me if the HDFS API is better suited than NIO.  There are
certainly a ton of dependencies to deal with for the Hadoop
ecosystem, which is a negative.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Apr 24, 2023 at 5:23 PM Kevin Risden <kr...@apache.org> wrote:

> Solr already supports today reading and indexing on cloud storage - ABFS,
> GCS, and S3 - using the Hadoop HDFS module. I assume the same works with
> HDFS backup/restore as well. I haven't checked if all the supporting
> libraries are included in the shipped Solr distribution, but the HDFS
> filesystem support includes cloud storage. I can't attest to the
> performance, but last I heard it works.
>
>
> https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
> https://hadoop.apache.org/docs/stable/hadoop-azure/index.html
> https://github.com/GoogleCloudDataproc/hadoop-connectors
>
>
> Kevin Risden
>
>
> On Mon, Apr 24, 2023 at 5:15 PM Joel Bernstein <jo...@gmail.com> wrote:
>
> > As far as a Lucene/Solr directory on cloud storage. Performance on the
> > write has a lot of overhead per file, hundreds of millis. The read
> overhead
> > is about half as much. I believe the write is so expensive due to the
> > strong consistency of both gcs and s3. So I think the main bottleneck
> would
> > be indexing and merging lots of small segments etc ...
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Fri, Apr 21, 2023 at 3:27 AM Ishan Chattopadhyaya <
> > ichattopadhyaya@gmail.com> wrote:
> >
> > > My colleague at SearchScale has tried S3FS, and running Solr indexes
> off
> > > S3. We can chat about it, if you're interested.
> > >
> > > On Fri, 21 Apr, 2023, 10:38 am David Smiley, <ds...@apache.org>
> wrote:
> > >
> > > > Cool!
> > > > I wonder if anyone has tried such things for a Lucene/Solr
> "Directory"
> > as
> > > > well?
> > > >
> > > > ~ David Smiley
> > > > Apache Lucene/Solr Search Developer
> > > > http://www.linkedin.com/in/davidwsmiley
> > > >
> > > >
> > > > On Mon, Apr 17, 2023 at 1:14 PM Joel Bernstein <jo...@gmail.com>
> > > wrote:
> > > >
> > > > > I've been testing Java NIO providers for cloud storage. These two
> in
> > > > > particular worked for our use cases:
> > > > >
> > > > > https://github.com/googleapis/java-storage-nio
> > > > > https://github.com/carlspring/s3fs-nio
> > > > >
> > > > > I believe an Azure provider is available.
> > > > >
> > > > > We've been working on sponsoring getting the s3 provider into a
> > public
> > > > > maven repo and I can update this thread when that's done.
> > > > >
> > > > >
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > >
> > > > > On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
> > > > > ichattopadhyaya@gmail.com> wrote:
> > > > >
> > > > > > Oh thanks, Jan. I had missed it. It is a shame because it looks
> > like
> > > a
> > > > > very
> > > > > > neat project.
> > > > > >
> > > > > > On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com>
> > > > wrote:
> > > > > >
> > > > > > > Looks like a nice project. With the promise of low-hanging
> > support
> > > > for
> > > > > > > more providers than those three for free.
> > > > > > >
> > > > > > > However,
> > > > > > https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > > > > > > does not look promising - they plan to move the project to the
> > > Attic,
> > > > > and
> > > > > > > no new releases has happened during the 6 months since the
> > > > proposal...
> > > > > > >
> > > > > > > Jan
> > > > > > >
> > > > > > > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > > > > > > ichattopadhyaya@gmail.com>:
> > > > > > > >
> > > > > > > > I think we should deprecate both the modules for S3 and GCS,
> > and
> > > > > > > > adopt Apache JCloud project that supports all three.
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Cloud storage modules for backup/restore

Posted by Kevin Risden <kr...@apache.org>.
Solr already supports reading and indexing on cloud storage today - ABFS,
GCS, and S3 - using the Hadoop HDFS module. I assume the same works with
HDFS backup/restore as well. I haven't checked if all the supporting
libraries are included in the shipped Solr distribution, but the HDFS
filesystem support includes cloud storage. I can't attest to the
performance, but last I heard it works.

https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
https://hadoop.apache.org/docs/stable/hadoop-azure/index.html
https://github.com/GoogleCloudDataproc/hadoop-connectors


Kevin Risden


On Mon, Apr 24, 2023 at 5:15 PM Joel Bernstein <jo...@gmail.com> wrote:

> As far as a Lucene/Solr directory on cloud storage. Performance on the
> write has a lot of overhead per file, hundreds of millis. The read overhead
> is about half as much. I believe the write is so expensive due to the
> strong consistency of both gcs and s3. So I think the main bottleneck would
> be indexing and merging lots of small segments etc ...
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Apr 21, 2023 at 3:27 AM Ishan Chattopadhyaya <
> ichattopadhyaya@gmail.com> wrote:
>
> > My colleague at SearchScale has tried S3FS, and running Solr indexes off
> > S3. We can chat about it, if you're interested.
> >
> > On Fri, 21 Apr, 2023, 10:38 am David Smiley, <ds...@apache.org> wrote:
> >
> > > Cool!
> > > I wonder if anyone has tried such things for a Lucene/Solr "Directory"
> as
> > > well?
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
> > >
> > >
> > > On Mon, Apr 17, 2023 at 1:14 PM Joel Bernstein <jo...@gmail.com>
> > wrote:
> > >
> > > > I've been testing Java NIO providers for cloud storage. These two in
> > > > particular worked for our use cases:
> > > >
> > > > https://github.com/googleapis/java-storage-nio
> > > > https://github.com/carlspring/s3fs-nio
> > > >
> > > > I believe an Azure provider is available.
> > > >
> > > > We've been working on sponsoring getting the s3 provider into a
> public
> > > > maven repo and I can update this thread when that's done.
> > > >
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > >
> > > > On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
> > > > ichattopadhyaya@gmail.com> wrote:
> > > >
> > > > > Oh thanks, Jan. I had missed it. It is a shame because it looks
> like
> > a
> > > > very
> > > > > neat project.
> > > > >
> > > > > On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com>
> > > wrote:
> > > > >
> > > > > > Looks like a nice project. With the promise of low-hanging
> support
> > > for
> > > > > > more providers than those three for free.
> > > > > >
> > > > > > However,
> > > > > https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > > > > > does not look promising - they plan to move the project to the
> > Attic,
> > > > and
> > > > > > no new releases has happened during the 6 months since the
> > > proposal...
> > > > > >
> > > > > > Jan
> > > > > >
> > > > > > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > > > > > ichattopadhyaya@gmail.com>:
> > > > > > >
> > > > > > > I think we should deprecate both the modules for S3 and GCS,
> and
> > > > > > > adopt Apache JCloud project that supports all three.
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Cloud storage modules for backup/restore

Posted by Joel Bernstein <jo...@gmail.com>.
As for a Lucene/Solr directory on cloud storage: writes carry a lot of
overhead per file, hundreds of milliseconds. The read overhead is about
half as much. I believe the write is so expensive due to the strong
consistency of both GCS and S3. So I think the main bottleneck would be
indexing and merging lots of small segments, etc.


Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Apr 21, 2023 at 3:27 AM Ishan Chattopadhyaya <
ichattopadhyaya@gmail.com> wrote:

> My colleague at SearchScale has tried S3FS, and running Solr indexes off
> S3. We can chat about it, if you're interested.
>
> On Fri, 21 Apr, 2023, 10:38 am David Smiley, <ds...@apache.org> wrote:
>
> > Cool!
> > I wonder if anyone has tried such things for a Lucene/Solr "Directory" as
> > well?
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Mon, Apr 17, 2023 at 1:14 PM Joel Bernstein <jo...@gmail.com>
> wrote:
> >
> > > I've been testing Java NIO providers for cloud storage. These two in
> > > particular worked for our use cases:
> > >
> > > https://github.com/googleapis/java-storage-nio
> > > https://github.com/carlspring/s3fs-nio
> > >
> > > I believe an Azure provider is available.
> > >
> > > We've been working on sponsoring getting the s3 provider into a public
> > > maven repo and I can update this thread when that's done.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > >
> > > On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
> > > ichattopadhyaya@gmail.com> wrote:
> > >
> > > > Oh thanks, Jan. I had missed it. It is a shame because it looks like
> a
> > > very
> > > > neat project.
> > > >
> > > > On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com>
> > wrote:
> > > >
> > > > > Looks like a nice project. With the promise of low-hanging support
> > for
> > > > > more providers than those three for free.
> > > > >
> > > > > However,
> > > > https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > > > > does not look promising - they plan to move the project to the
> Attic,
> > > and
> > > > > no new releases has happened during the 6 months since the
> > proposal...
> > > > >
> > > > > Jan
> > > > >
> > > > > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > > > > ichattopadhyaya@gmail.com>:
> > > > > >
> > > > > > I think we should deprecate both the modules for S3 and GCS, and
> > > > > > adopt Apache JCloud project that supports all three.
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Cloud storage modules for backup/restore

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
My colleague at SearchScale has tried S3FS and running Solr indexes off
S3. We can chat about it if you're interested.

On Fri, 21 Apr, 2023, 10:38 am David Smiley, <ds...@apache.org> wrote:

> Cool!
> I wonder if anyone has tried such things for a Lucene/Solr "Directory" as
> well?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Apr 17, 2023 at 1:14 PM Joel Bernstein <jo...@gmail.com> wrote:
>
> > I've been testing Java NIO providers for cloud storage. These two in
> > particular worked for our use cases:
> >
> > https://github.com/googleapis/java-storage-nio
> > https://github.com/carlspring/s3fs-nio
> >
> > I believe an Azure provider is available.
> >
> > We've been working on sponsoring getting the s3 provider into a public
> > maven repo and I can update this thread when that's done.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
> > ichattopadhyaya@gmail.com> wrote:
> >
> > > Oh thanks, Jan. I had missed it. It is a shame because it looks like a
> > very
> > > neat project.
> > >
> > > On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com>
> wrote:
> > >
> > > > Looks like a nice project. With the promise of low-hanging support
> for
> > > > more providers than those three for free.
> > > >
> > > > However,
> > > https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > > > does not look promising - they plan to move the project to the Attic,
> > and
> > > > no new releases has happened during the 6 months since the
> proposal...
> > > >
> > > > Jan
> > > >
> > > > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > > > ichattopadhyaya@gmail.com>:
> > > > >
> > > > > I think we should deprecate both the modules for S3 and GCS, and
> > > > > adopt Apache JCloud project that supports all three.
> > > >
> > > >
> > >
> >
>

Re: Cloud storage modules for backup/restore

Posted by David Smiley <ds...@apache.org>.
Cool!
I wonder if anyone has tried such things for a Lucene/Solr "Directory" as
well?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Apr 17, 2023 at 1:14 PM Joel Bernstein <jo...@gmail.com> wrote:

> I've been testing Java NIO providers for cloud storage. These two in
> particular worked for our use cases:
>
> https://github.com/googleapis/java-storage-nio
> https://github.com/carlspring/s3fs-nio
>
> I believe an Azure provider is available.
>
> We've been working on sponsoring getting the s3 provider into a public
> maven repo and I can update this thread when that's done.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
> ichattopadhyaya@gmail.com> wrote:
>
> > Oh thanks, Jan. I had missed it. It is a shame because it looks like a
> very
> > neat project.
> >
> > On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com> wrote:
> >
> > > Looks like a nice project. With the promise of low-hanging support for
> > > more providers than those three for free.
> > >
> > > However,
> > https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > > does not look promising - they plan to move the project to the Attic,
> and
> > > no new releases has happened during the 6 months since the proposal...
> > >
> > > Jan
> > >
> > > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > > ichattopadhyaya@gmail.com>:
> > > >
> > > > I think we should deprecate both the modules for S3 and GCS, and
> > > > adopt Apache JCloud project that supports all three.
> > >
> > >
> >
>

Re: Cloud storage modules for backup/restore

Posted by Joel Bernstein <jo...@gmail.com>.
I've been testing Java NIO providers for cloud storage. These two in
particular worked for our use cases:

https://github.com/googleapis/java-storage-nio
https://github.com/carlspring/s3fs-nio

I believe an Azure provider is available.

We've been working on sponsoring getting the S3 provider into a public
Maven repo, and I can update this thread when that's done.
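
The appeal of the NIO route is that code written against java.nio.file.Path
does not need to know which provider backs the path. The sketch below copies
the files of a directory using only the standard Files API; it runs against
the default local filesystem, and the assumption (not verified here) is that
a Path obtained from one of the cloud providers above could be passed in
instead. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical provider-agnostic copy of a flat directory of index files. */
class NioBackupSketch {

    /** Copies every regular file directly under sourceDir into targetDir; returns the count. */
    static int copyFiles(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int copied = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    // Resolve by name so source and target may belong to different providers.
                    Path target = targetDir.resolve(entry.getFileName().toString());
                    Files.copy(entry, target, StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempDirectory("index");
        Path target = Files.createTempDirectory("backup").resolve("snapshot");
        Files.write(source.resolve("segments_1"), "demo".getBytes(StandardCharsets.UTF_8));
        System.out.println("copied files: " + copyFiles(source, target));
    }
}
```

Each Files.copy call here is one object write on a cloud provider, so the
per-file cost still applies to every file copied.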



Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, Apr 10, 2023 at 6:51 PM Ishan Chattopadhyaya <
ichattopadhyaya@gmail.com> wrote:

> Oh thanks, Jan. I had missed it. It is a shame because it looks like a very
> neat project.
>
> On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com> wrote:
>
> > Looks like a nice project. With the promise of low-hanging support for
> > more providers than those three for free.
> >
> > However,
> https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> > does not look promising - they plan to move the project to the Attic, and
> > no new releases has happened during the 6 months since the proposal...
> >
> > Jan
> >
> > > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> > ichattopadhyaya@gmail.com>:
> > >
> > > I think we should deprecate both the modules for S3 and GCS, and
> > > adopt Apache JCloud project that supports all three.
> >
> >
>

Re: Cloud storage modules for backup/restore

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
Oh thanks, Jan. I had missed it. It is a shame because it looks like a very
neat project.

On Mon, 10 Apr, 2023, 23:53 Jan Høydahl, <ja...@cominvent.com> wrote:

> Looks like a nice project. With the promise of low-hanging support for
> more providers than those three for free.
>
> However, https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x
> does not look promising - they plan to move the project to the Attic, and
> no new releases has happened during the 6 months since the proposal...
>
> Jan
>
> > 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <
> ichattopadhyaya@gmail.com>:
> >
> > I think we should deprecate both the modules for S3 and GCS, and
> > adopt Apache JCloud project that supports all three.
>
>

Re: Cloud storage modules for backup/restore

Posted by Jan Høydahl <ja...@cominvent.com>.
Looks like a nice project, with the promise of low-hanging support for more providers than those three for free.

However, https://lists.apache.org/thread/w61gzk2ohjtshbwcb5gy6wb2htv7fo0x does not look promising - they plan to move the project to the Attic, and no new releases have happened during the 6 months since the proposal...

Jan

> 10. apr. 2023 kl. 19:08 skrev Ishan Chattopadhyaya <ic...@gmail.com>:
> 
> I think we should deprecate both the modules for S3 and GCS, and
> adopt Apache JCloud project that supports all three.


Re: Cloud storage modules for backup/restore

Posted by Gus Heck <gu...@gmail.com>.
Sounds interesting. I don't really know anything about jclouds, and a quick
glance at your link didn't tell me much, but if they ship libraries that
can plug in (or otherwise be leveraged without need for any external
software) and handle connectivity, that sounds like a win. I'm not as keen
if it requires an additional running service or exposes significant
complexity to the user. We need to make it easier to use Solr (IMHO).

On Mon, Apr 10, 2023 at 1:22 PM Ishan Chattopadhyaya <
ichattopadhyaya@gmail.com> wrote:

> Supported storage providers, FYI:
> https://jclouds.apache.org/reference/providers/#blobstore-providers
>
> On Mon, 10 Apr 2023 at 22:49, Ishan Chattopadhyaya <
> ichattopadhyaya@gmail.com> wrote:
>
> > TBH, I haven't personally used either of them extensively, but just
> synced
> > up with my colleague who built that solution. So, I thought of bringing
> it
> > up here for any additional points of consideration (in case JClouds
> wasn't
> > considered earlier).
> > I'm not invested into this effort much either way as yet.
> >
> > On Mon, 10 Apr 2023 at 22:38, Ishan Chattopadhyaya <
> > ichattopadhyaya@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> For backup/restore, we have out of the box support for GCS and S3, but
> >> not Azure. I think we should deprecate both the modules for S3 and GCS,
> and
> >> adopt Apache JCloud project that supports all three. For testing, we
> could
> >> try Minio (unless we are already happy with S3Mock that we use today).
> Any
> >> thoughts or concerns?
> >>
> >> One of my colleagues at SearchScale has a solution for this (which
> >> pre-dates the introduction of S3 and GCS repositories). The solution is
> >> based on Apache JCloud, and I found that the integration was pretty
> clean.
> >> If there's interest, we can consider open sourcing it.
> >>
> >> Regards,
> >> Ishan
> >>
> >>
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Cloud storage modules for backup/restore

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
Supported storage providers, FYI:
https://jclouds.apache.org/reference/providers/#blobstore-providers

On Mon, 10 Apr 2023 at 22:49, Ishan Chattopadhyaya <
ichattopadhyaya@gmail.com> wrote:

> TBH, I haven't personally used either of them extensively, but just synced
> up with my colleague who built that solution. So, I thought of bringing it
> up here for any additional points of consideration (in case JClouds wasn't
> considered earlier).
> I'm not invested into this effort much either way as yet.
>
> On Mon, 10 Apr 2023 at 22:38, Ishan Chattopadhyaya <
> ichattopadhyaya@gmail.com> wrote:
>
>> Hi all,
>>
>> For backup/restore, we have out of the box support for GCS and S3, but
>> not Azure. I think we should deprecate both the modules for S3 and GCS, and
>> adopt Apache JCloud project that supports all three. For testing, we could
>> try Minio (unless we are already happy with S3Mock that we use today). Any
>> thoughts or concerns?
>>
>> One of my colleagues at SearchScale has a solution for this (which
>> pre-dates the introduction of S3 and GCS repositories). The solution is
>> based on Apache JCloud, and I found that the integration was pretty clean.
>> If there's interest, we can consider open sourcing it.
>>
>> Regards,
>> Ishan
>>
>>

Re: Cloud storage modules for backup/restore

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
TBH, I haven't personally used either of them extensively, but just synced
up with my colleague who built that solution. So, I thought of bringing it
up here for any additional points of consideration (in case jclouds wasn't
considered earlier).
I'm not much invested in this effort either way as yet.

On Mon, 10 Apr 2023 at 22:38, Ishan Chattopadhyaya <
ichattopadhyaya@gmail.com> wrote:

> Hi all,
>
> For backup/restore, we have out of the box support for GCS and S3, but not
> Azure. I think we should deprecate both the modules for S3 and GCS, and
> adopt Apache JCloud project that supports all three. For testing, we could
> try Minio (unless we are already happy with S3Mock that we use today). Any
> thoughts or concerns?
>
> One of my colleagues at SearchScale has a solution for this (which
> pre-dates the introduction of S3 and GCS repositories). The solution is
> based on Apache JCloud, and I found that the integration was pretty clean.
> If there's interest, we can consider open sourcing it.
>
> Regards,
> Ishan
>
>