Posted to users@solr.apache.org by Zara Parst <ed...@gmail.com> on 2023/02/23 01:05:31 UTC

About Using Hadoop in SolrCloud

Hi,

I have read in many places about using Hadoop with SolrCloud, and I am trying
to find the reason to use Hadoop in place of a local file system. Can someone
briefly explain why to use Hadoop with SolrCloud, when Solr just uses Hadoop
for indexing and for storing logs? Is there any compelling reason to do that?

Does Hadoop have any advantage over the local file system with Solr? I can
already run in cloud mode with the index stored on the local file system and
still use shards and replicas. So my question is what advantage Hadoop will
give me: does it index faster, does it take less space to store the index, is
its distributed file system better at things like sharding and replication,
or does it take backups automatically?

Please answer this question in as much detail as possible.

Re: About Using Hadoop in SolrCloud

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

please ask this question on Solr's mailing list: dev@solr.apache.org

-- 
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: About Using Hadoop in SolrCloud

Posted by Eric Pugh <ep...@opensourceconnections.com>.

;-).


> On Feb 23, 2023, at 9:34 AM, Zara Parst <ed...@gmail.com> wrote:
> 
> I think I was looking for someone to tell me: hey, don't do something fancy
> unless you must. I am somewhat at ease now. I will leave Hadoop for
> some other project.
> 

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.

Re: About Using Hadoop in SolrCloud

Posted by Zara Parst <ed...@gmail.com>.

I think I was looking for someone to tell me: hey, don't do something fancy
unless you must. I am somewhat at ease now. I will leave Hadoop for
some other project.

On Thu, Feb 23, 2023 at 6:16 PM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> I am replying, but just to the users mailing list, as it’s not appropriate
> for dev@.
>
> I think the short answer is that if you are already super into the Hadoop
> ecosystem, then you already have strong reasons why, and you can answer all
> of your questions listed already ;-).  You then look at Solr on Hadoop as
> “hey, it works with what I am already doing” at my enterprise.
>
> If you aren’t already in the Hadoop ecosystem, then there isn’t any
> special Solr specific reason to go this way, and indeed many reasons NOT
> to.   Hadoop isn’t for the faint of heart….
>
> Not an answer per se….
>

Re: About Using Hadoop in SolrCloud

Posted by David Smiley <ds...@apache.org>.

Yes; this was shocking to me at first because the implications are big and
it's almost a secret.  Ideally the ref guide would scream this loudly;
users today care *way* more about S3 than HDFS.  The "HDFS" Solr module
uses the HDFS client API which has a pluggable back-end, and thus you can
have it talk to S3.  You can search the user list for this; maybe JIRA.
I've briefly dabbled with it (got stuck with incompatible versions) but I
know others have done this (presumably at earlier versions than what I used
at the time).  It's a simple matter of adding the correct JAR files and
some trivial configuration.  The main problem is that such a home-brew
concoction of theoretically compatible things is on your shoulders to
debug/support.  Solr isn't testing its support for this; it will fail for
some versions as it did for me.  Maybe Solr *should* test/support this.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Feb 23, 2023 at 10:59 PM Zara Parst <ed...@gmail.com> wrote:

> David, you made a point. Is it true that we can keep indexes in S3? I mean
> the index in use, not the backup?
>

Re: About Using Hadoop in SolrCloud

Posted by Zara Parst <ed...@gmail.com>.

David, you made a point. Is it true that we can keep indexes in S3? I mean
the index in use, not the backup?

On Fri, Feb 24, 2023 at 1:11 AM David Smiley <ds...@apache.org> wrote:

> I agree with Eric, but wish to add one point:  Separation of compute from
> storage to get: better redundancy (HDFS or S3 will do it better, maybe
> cheaper), better elasticity (since Solr nodes become stateless; easy to add
> more nodes), better cost?  Sacrifice indexing performance and a bit of
> query.  Admittedly I don't have real experience here but this is my
> thinking.  The most annoying thing about Solr's HDFS support is that
> SolrCloud's replication is quite redundant/wasteful with that at the
> storage layer, thus adding cost inefficiency. There is potential for
> improvements there.
>

Re: About Using Hadoop in SolrCloud

Posted by David Smiley <ds...@apache.org>.

I agree with Eric, but wish to add one point:  Separation of compute from
storage to get: better redundancy (HDFS or S3 will do it better, maybe
cheaper), better elasticity (since Solr nodes become stateless; easy to add
more nodes), better cost?  Sacrifice indexing performance and a bit of
query.  Admittedly I don't have real experience here but this is my
thinking.  The most annoying thing about Solr's HDFS support is that
SolrCloud's replication is quite redundant/wasteful with that at the
storage layer, thus adding cost inefficiency. There is potential for
improvements there.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
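
The cost point about redundant replication can be made concrete with a rough
back-of-the-envelope sketch. It is not from the original message: the function
name and the numbers are hypothetical, and it assumes each SolrCloud replica
keeps its own full copy of the index on the shared store, with the storage
layer then replicating every copy again (3 is the default HDFS `dfs.replication`).

```python
def raw_storage_gb(index_gb, solr_replicas, fs_replication):
    """Raw storage consumed when both layers replicate independently.

    index_gb       -- size of one copy of the index (illustrative value)
    solr_replicas  -- SolrCloud replicas per shard, each with a full copy
    fs_replication -- copies made by the storage layer (1 for a plain local disk)
    """
    return index_gb * solr_replicas * fs_replication


# Hypothetical numbers, for illustration only.
index_gb = 100
on_local_disk = raw_storage_gb(index_gb, solr_replicas=2, fs_replication=1)
on_hdfs = raw_storage_gb(index_gb, solr_replicas=2, fs_replication=3)

print(on_local_disk, on_hdfs)  # 200 600
```

Under these assumptions the same two-replica collection takes three times the
raw space on HDFS, which is the kind of waste the message above describes.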


On Thu, Feb 23, 2023 at 7:45 AM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> I am replying, but just to the users mailing list, as it’s not appropriate
> for dev@.
>
> I think the short answer is that if you are already super into the Hadoop
> ecosystem, then you already have strong reasons why, and you can answer all
> of your questions listed already ;-).  You then look at Solr on Hadoop as
> “hey, it works with what I am already doing” at my enterprise.
>
> If you aren’t already in the Hadoop ecosystem, then there isn’t any
> special Solr specific reason to go this way, and indeed many reasons NOT
> to.   Hadoop isn’t for the faint of heart….
>
> Not an answer per se….
>

Re: About Using Hadoop in SolrCloud

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I am replying, but just to the users mailing list, as it’s not appropriate for dev@.

I think the short answer is that if you are already super into the Hadoop ecosystem, then you already have strong reasons why, and you can answer all of your questions listed already ;-).  You then look at Solr on Hadoop as “hey, it works with what I am already doing” at my enterprise.  

If you aren’t already in the Hadoop ecosystem, then there isn’t any special Solr specific reason to go this way, and indeed many reasons NOT to.   Hadoop isn’t for the faint of heart….  

Not an answer per se…. 

