Posted to solr-user@lucene.apache.org by Alireza Salimi <al...@gmail.com> on 2011/12/20 21:28:21 UTC

Solr Distributed Search vs Hadoop

Hi,

I have a basic question. Let's say we're going to have a very, very large
set of data, large enough that we will certainly need many servers (tens
or hundreds of them). We will also need failover. The question is: should
we use Hadoop, or would Solr Distributed Search with shards be enough?

I've read lots of articles like:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
http://wiki.apache.org/solr/DistributedSearch

But I'm still confused. Solr's distributed search seems to be able to handle
splitting the queries across shards and merging the results, so what's the
point of using Hadoop?

I'm pretty sure I'm missing something here. Can anyone suggest
some links regarding this issue?

Regards

-- 
Alireza Salimi
Java EE Developer

Re: Solr Distributed Search vs Hadoop

Posted by Ted Dunning <te...@gmail.com>.
This copying is a bit overstated here because of the way that small
segments are merged into larger segments.  Those larger segments are then
copied much less often than the smaller ones.

While you can wind up with lots of copying in certain extreme cases, it is
quite rare.  In particular, if you have one of the following cases, you
won't see very many copies for any particular document:

- you don't delete documents one at a time (i.e. you only index, with no
updates or deletions)

or

- most documents that are going to be deleted are deleted as young documents

or

- the probability that any particular document will be deleted in a fixed
period of time decreases exponentially with the age of the document

Any of these characteristics (and many others) will keep a document from
being copied very many times. As a document ages, it keeps company with
similarly aged documents, which are accordingly unlikely to have enough
of their companions deleted to leave their segment with only a small
number of live documents. Put another way, the intervals between the
merges that a particular document undergoes grow longer and longer as it
ages, so the total number of copies it can undergo cannot grow very fast.
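
A rough back-of-the-envelope sketch of that argument (this is not
Lucene's actual merge policy, just the geometric-growth intuition): with
a merge factor of 10, a long-lived document is copied roughly once per
merge level, so copies grow with the logarithm of the index size rather
than linearly:

    public class MergeCopyEstimate {
        public static void main(String[] args) {
            int mergeFactor = 10;        // segments merged together at a time
            long docsPerFlush = 1000;    // docs in a freshly flushed segment
            for (long total = 10000L; total <= 1000000000L; total *= 10) {
                // merge levels a long-lived document climbs through
                int copies = (int) Math.ceil(
                        Math.log((double) total / docsPerFlush)
                                / Math.log(mergeFactor));
                System.out.printf("%,13d docs -> ~%d copies per document%n",
                        total, copies);
            }
        }
    }

Even at a billion documents that works out to only about six copies per
document, which is why the 50x figure quoted below is the extreme case
rather than the norm.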

On Wed, Dec 28, 2011 at 7:53 PM, Lance Norskog <go...@gmail.com> wrote:

> ...
> One problem with indexing is that Solr continually copies data into
> "segments" (index parts) while you index, so each 5MB PDF might get
> copied 50 times during a full index job. If you can strip the index
> down to what you really want to search on, terabytes become gigabytes.
> Solr seems to handle 100-200GB fine on modern hardware.
>
>

Re: Solr Distributed Search vs Hadoop

Posted by Lance Norskog <go...@gmail.com>.
Here is an example of schema design: a PDF file of 5MB might have maybe
50KB of actual text. The Solr ExtractingRequestHandler will find that
text and index only that. If you set the field to stored=true, the 5MB
will be saved; if stored=false, the PDF is not saved, and you would
store a link to it instead.
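
A minimal SolrJ sketch of that setup, assuming the 3.x-era API
(ContentStreamUpdateRequest) and hypothetical field names "id" and "url"
defined in the schema, with the extracted-text field set to stored=false:

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class IndexPdf {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Send the PDF to the ExtractingRequestHandler (Tika does the parsing).
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("report.pdf"));
            req.setParam("literal.id", "doc-1");
            // Store only a link to the original; the extracted text is
            // indexed but (with stored=false in the schema) not kept.
            req.setParam("literal.url", "http://files.example.com/report.pdf");
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            solr.request(req);
        }
    }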

One problem with indexing is that Solr continually copies data into
"segments" (index parts) while you index, so each 5MB PDF might get
copied 50 times during a full index job. If you can strip the index
down to what you really want to search on, terabytes become gigabytes.
Solr seems to handle 100-200GB fine on modern hardware.

Lance

On Fri, Dec 23, 2011 at 1:54 AM, Nick Vincent <ni...@vtype.com> wrote:
> For data of this size you may want to look at something like Apache
> Cassandra, which is made specifically to handle data at this kind of
> scale across many machines.
>
> You can still use Hadoop to analyse and transform the data in a
> performant manner, however it's probably best to do some research on
> this on the relevant technical forums for those technologies.
>
> Nick



-- 
Lance Norskog
goksron@gmail.com

Re: Solr Distributed Search vs Hadoop

Posted by Nick Vincent <ni...@vtype.com>.
For data of this size you may want to look at something like Apache
Cassandra, which is made specifically to handle data at this kind of
scale across many machines.

You can still use Hadoop to analyse and transform the data in a
performant manner, however it's probably best to do some research on
this on the relevant technical forums for those technologies.

Nick

Re: Solr Distributed Search vs Hadoop

Posted by Ted Dunning <te...@gmail.com>.
Well, that begins to not look so much like a Solr/Lucene problem. The
overall data is moderately large for Lucene (terabytes to tens of
terabytes), and the individual user profiles are on the large side for
storing in Lucene.

If there is a part of the profile that you might want to search, that
part would be appropriate for Lucene. If you can split the user data into
several components that are updated independently, then HBase might be
appropriate, with the different components in different column families.
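
A minimal sketch of that layout, using the 0.90-era HBase client API and
hypothetical table and family names (which components you split out is
entirely application-specific):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateUserTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // One column family per independently updated component,
            // so a frequent update only rewrites data in its own family.
            HTableDescriptor users = new HTableDescriptor("users");
            users.addFamily(new HColumnDescriptor("basics"));      // rarely changes
            users.addFamily(new HColumnDescriptor("activity"));    // changes often
            users.addFamily(new HColumnDescriptor("preferences"));
            admin.createTable(users);
        }
    }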

You aren't going to get a definitive answer on a mailing list, however.
You are going to need somebody with a bit of experience to advise you
directly, and/or you are going to need to prototype test cases.

On Tue, Dec 20, 2011 at 1:07 PM, Alireza Salimi <al...@gmail.com> wrote:

> Well, actually we haven't started the actual project yet, but it will
> probably have to handle data for millions of users, and a rough estimate
> for each user's data would be around 5 MB (millions of users at 5 MB
> each works out to several terabytes).
>
> The other problem is that the data will change very often.
>
> I hope I answered your question.
>
> Thanks
>
> On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning <te...@gmail.com> wrote:
> > ...

Re: Solr Distributed Search vs Hadoop

Posted by Alireza Salimi <al...@gmail.com>.
Well, actually we haven't started the actual project yet, but it will
probably have to handle data for millions of users, and a rough estimate
for each user's data would be around 5 MB (millions of users at 5 MB
each works out to several terabytes).

The other problem is that the data will change very often.

I hope I answered your question.

Thanks

On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning <te...@gmail.com> wrote:

> You didn't mention how big your data is or how you create it.
>
> Hadoop would mostly be used in the preparation of the data or the off-line
> creation of indexes.
>
> On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi <al...@gmail.com> wrote:
> > ...



-- 
Alireza Salimi
Java EE Developer

Re: Solr Distributed Search vs Hadoop

Posted by Ted Dunning <te...@gmail.com>.
You didn't mention how big your data is or how you create it.

Hadoop would mostly be used in the preparation of the data or the off-line
creation of indexes.
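
A minimal sketch of what that preparation step can look like, assuming a
hypothetical tab-separated input of "id, url, raw text" records; the
mapper keeps only the fields worth searching on, which is how terabytes
of raw data become gigabytes of index input:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Strips each record down to the searchable fields before a
    // separate, off-line index-building job runs over the output.
    public class PrepareForIndexMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t", 3);
            if (cols.length == 3) {
                // emit id -> "url <TAB> text", dropping everything else
                context.write(new Text(cols[0]),
                        new Text(cols[1] + "\t" + cols[2]));
            }
        }
    }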

On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi <al...@gmail.com> wrote:

> ...