Posted to solr-user@lucene.apache.org by Jens Mueller <su...@googlemail.com> on 2011/04/05 03:25:02 UTC

Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Experts,



I am a Solr newbie but have read quite a lot of docs. I still do not understand
the best way to set up very large scale deployments:



Goal (theoretical):

 A) Index size: 1 petabyte (one document is about 5 KB in size)

 B) Queries: 100,000 queries per second

 C) Updates: 100,000 updates per second
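
To make these goals concrete, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 10^15 bytes, 5 KB = 5,000 bytes) and treats the index size as simply the sum of the document sizes. The figure of 100 million documents per shard is a made-up assumption for illustration, not a Solr limit or a number from this thread.

```python
# Back-of-the-envelope sizing for the goals above.
# Assumptions (not from the thread): decimal units, and a
# hypothetical capacity of 100 million documents per shard.

PETABYTE = 10**15        # bytes in a decimal petabyte
DOC_SIZE = 5 * 10**3     # about 5 KB per document


def sizing(index_bytes, doc_bytes, docs_per_shard, updates_per_sec):
    """Return document count, shard count, and per-shard update rate."""
    total_docs = index_bytes // doc_bytes
    shards = -(-total_docs // docs_per_shard)  # ceiling division
    return {
        "total_docs": total_docs,
        "shards": shards,
        "updates_per_shard_per_sec": updates_per_sec / shards,
    }


result = sizing(PETABYTE, DOC_SIZE,
                docs_per_shard=100_000_000,
                updates_per_sec=100_000)
print(result)
```

Under these assumptions the update load per shard is modest. The hard parts are the number of machines and the query fan-out across all shards.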




Solr offers:

1.)    Replication => scales well for B), BUT A) and C) are not satisfied


2.)    Sharding => scales well for A), BUT B) and C) are not satisfied (=> As
I understand the sharding approach, everything goes through a central server
that dispatches the updates and assembles the query results retrieved from the
different shards. But this central server also has capacity limits...)
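
One way to picture combining the two options is sketched below. It assumes the client routes each update itself to one indexer per shard by hashing the document ID, and that each shard's index is replicated to several read-only searchers. All host names and the topology are hypothetical. The only Solr-specific piece used is the `shards` request parameter of distributed search, which lists the cores to query.

```python
import hashlib
import random
from urllib.parse import urlencode

# Hypothetical topology: one indexer ("master") per shard,
# plus read-only replicas that serve queries for that shard.
TOPOLOGY = [
    {"master": "idx0:8983/solr", "replicas": ["s0a:8983/solr", "s0b:8983/solr"]},
    {"master": "idx1:8983/solr", "replicas": ["s1a:8983/solr", "s1b:8983/solr"]},
    {"master": "idx2:8983/solr", "replicas": ["s2a:8983/solr", "s2b:8983/solr"]},
]


def shard_for(doc_id):
    """Map a document ID to a shard index with a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % len(TOPOLOGY)


def update_target(doc_id):
    """Updates go straight to the indexer of the owning shard."""
    return TOPOLOGY[shard_for(doc_id)]["master"]


def query_url(aggregator, q, rng=random):
    """Build a distributed query that hits one replica per shard."""
    shards = ",".join(rng.choice(s["replicas"]) for s in TOPOLOGY)
    return f"http://{aggregator}/select?" + urlencode({"q": q, "shards": shards})


print(update_target("doc-42"))
print(query_url("agg0:8983/solr", "title:solr"))
```

In this sketch no single server sees every update, and the aggregating node can itself be one of several behind a load balancer, so neither path depends on one central machine.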




What is the right approach to handle such large deployments? I would be
thankful for just a rough sketch of the concepts so I can experiment/search
further…


Maybe I am missing something very trivial, as I think some of the “Solr
Users/Use Cases” on the homepage are that kind of large deployment. How are
they implemented?



Thank you very much!!!

Jens

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Andy <an...@yahoo.com>.

Perfect. Thank you very much.

Andy

--- On Fri, 4/8/11, Pascal Coupet <pc...@gmail.com> wrote:

> From: Pascal Coupet <pc...@gmail.com>
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 10:20 AM
> I did put a PDF version here:
> https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B02DHBZQYYT_MmRkZTY0YjQtODJmZS00Mzg0LWJiNTEtOWJjNzViNmNjZjdh&hl=en&authkey=CL2Fq_QG
> 
> Zoom it to get a better view.
> 
> Pascal
> 
> 2011/4/8 Andy <an...@yahoo.com>
> 
> > Could anyone please post a version of the document in PDF or OpenOffice
> > format? I'm on Linux so there's no way for me to use MS Word.
> >
> > Thanks.
> >
> >
> > --- On Fri, 4/8/11, Albert Vila <av...@imente.com> wrote:
> >
> > > From: Albert Vila <av...@imente.com>
> > > Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> > > To: solr-user@lucene.apache.org
> > > Date: Friday, April 8, 2011, 9:25 AM
> > > Yes, it won't work if you are using OpenOffice. However, it works fine
> > > with Microsoft Word.
> > >
> > > Hope it helps.
> > >
> > > Albert
> > >
> > > On 8 April 2011 14:55, Andy <an...@yahoo.com> wrote:
> > > > I can't view the document either -- it showed up empty.
> > > >
> > > > Has anyone succeeded in viewing it?
> > > >
> > > > Andy
> > > >
> > > > --- On Fri, 4/8/11, Albert Vila <av...@imente.com> wrote:
> > > >
> > > >> From: Albert Vila <av...@imente.com>
> > > >> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> > > >> To: solr-user@lucene.apache.org
> > > >> Date: Friday, April 8, 2011, 3:43 AM
> > > >> Ephraim, I still can't view the document.
> > > >>
> > > >> Don't know if I'm doing something wrong, but I downloaded it and it
> > > >> appears to be empty.
> > > >>
> > > >> Albert
> > > >>
> > > >> On 7 April 2011 09:32, Ephraim Ofir <Ep...@icq.com> wrote:
> > > >> > You can't view it online, but you should be able to download it from:
> > > >> > https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP
> > > >> >
> > > >> > Enjoy,
> > > >> > Ephraim Ofir
> > > >> >
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Jens Mueller [mailto:supidupi007@googlemail.com]
> > > >> > Sent: Thursday, April 07, 2011 8:30 AM
> > > >> > To: solr-user@lucene.apache.org
> > > >> > Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> > > >> >
> > > >> > Hello Ephraim, hello Lance, hello Walter,
> > > >> >
> > > >> > thanks for your replies:
> > > >> >
> > > >> > Ephraim, thanks very much for the further detailed explanation. I will
> > > >> > try to set up a demo system in the next few days and use your advice.
> > > >> > Load balancers are an important aspect of your design. Can you
> > > >> > recommend one LB specifically? (I would be using haproxy.1wt.eu.) I
> > > >> > think the idea of uploading your document is very good. However,
> > > >> > Google Docs did not seem to work (at least for me with the docx
> > > >> > format), but maybe you can simply output the document as PDF; I think
> > > >> > Google Docs handles that, so all the others can also have a look at
> > > >> > your concept. The best approach would be to upload your advice
> > > >> > directly somewhere on the Solr wiki, as it is really helpful. I found
> > > >> > some other documents meanwhile, but yours is much clearer and more
> > > >> > complete, with the LBs and the aggregators
> > > >> > (http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf).
> > > >> >
> > > >> > Lance, thanks, I will have a look at what LinkedIn is doing.
> > > >> >
> > > >> > Walter, thanks for the advice. You are right to mention Google. My
> > > >> > question was also meant to understand how such large systems as
> > > >> > Google/Facebook actually work. So my numbers are just theoretical and
> > > >> > made up. My system will be smaller, but I would be very happy to
> > > >> > understand how such large systems are built, and I think the approach
> > > >> > Ephraim showed should work quite well at large scale. If you know a
> > > >> > good document (besides the Bigtable research paper, which I already
> > > >> > know) that describes in technical detail how Google works, that would
> > > >> > be of great interest. You seem to be working for a company that
> > > >> > handles large datasets. Does Google use this approach, splitting the
> > > >> > index across N writers, with the produced index then replicated to N
> > > >> > "read only searchers"?
> > > >> >
> > > >> > thank you all.
> > > >> > best regards
> > > >> > jens
> > > >> >
> > > >> >
> > > >> >
> > > >> > 2011/4/7 Walter Underwood <wu...@wunderwood.org>
> > > >> >
> > > >> >> The bigger answer is that you cannot get to this size by just
> > > >> >> configuring Solr. You may have to invent a lot of stuff. Like all of
> > > >> >> Google.
> > > >> >>
> > > >> >> Where did you get these numbers? The proposed query rate is more than
> > > >> >> twice as big as Google's (Feb 2010 estimate, 34K qps).
> > > >> >>
> > > >> >> I work at MarkLogic, and we scale to hundreds of terabytes, with fast
> > > >> >> update and query rates. If you want a real system that handles that,
> > > >> >> you might want to look at our product.
> > > >> >>
> > > >> >> wunder
> > > >> >>
> > > >> >> On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
> > > >> >>
> > > >> >> > I would not use replication. LinkedIn consumer search is a flat
> > > >> >> > system where one process indexes new entries and does queries
> > > >> >> > simultaneously. It's a custom Lucene app called Zoie. Their stuff
> > > >> >> > is on GitHub.
> > > >> >> >
> > > >> >> > I would get documents to indexers via a multicast IP-based queueing
> > > >> >> > system. This scales very well and there's a lot of hardware support.
> > > >> >> >
> > > >> >> > The problem with distributed search is that it is a) inherently
> > > >> >> > slower and b) has inherently more and longer jitter. The "airplane
> > > >> >> > wing" distribution of query times becomes longer and flatter.
> > > >> >> >
> > > >> >> > This is going to have to be a "federated" system, where the
> > > >> >> > front-end app aggregates results rather than Solr.
> > > >> >> >

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Pascal Coupet <pc...@gmail.com>.

I did put a PDF version here:
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B02DHBZQYYT_MmRkZTY0YjQtODJmZS00Mzg0LWJiNTEtOWJjNzViNmNjZjdh&hl=en&authkey=CL2Fq_QG

Zoom it to get a better view.

Pascal

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Andy <an...@yahoo.com>.

Could anyone please post a version of the document in pdf or openoffice format? I'm on Linux so there's no way for me to use MS Word.

Thanks.


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Albert Vila <av...@imente.com>.

Yes, it won't work if you are using OpenOffice. However, it works fine
with Microsoft Word.

Hope it helps.

Albert

On 8 April 2011 14:55, Andy <an...@yahoo.com> wrote:
> I can't view the document either -- it showed up empty.
>
> Has anyone succeeded in viewing it?
>
> Andy
>
> --- On Fri, 4/8/11, Albert Vila <av...@imente.com> wrote:
>
>> From: Albert Vila <av...@imente.com>
>> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
>> To: solr-user@lucene.apache.org
>> Date: Friday, April 8, 2011, 3:43 AM
>> Ephraim, I still can't view the
>> document.
>>
>> Don't know if I'm doing something wrong, but I downloaded
>> it and It
>> appears to be empty.
>>
>> Albert
>>
>> On 7 April 2011 09:32, Ephraim Ofir <Ep...@icq.com>
>> wrote:
>> > You can't view it online, but you should be able to
>> download it from:
>> > https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI
>> > 2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP
>> >
>> > Enjoy,
>> > Ephraim Ofir
>> >
>> >
>> > -----Original Message-----
>> > From: Jens Mueller [mailto:supidupi007@googlemail.com]
>> > Sent: Thursday, April 07, 2011 8:30 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Very very large scale Solr Deployment =
>> how to do (Expert
>> > Question)?
>> >
>> > Hello Ephraim, hello Lance, hello Walter,
>> >
>> > thanks for your replies:
>> >
>> > Ephraim, thanks very much for the further detailed
>> explanation. I will
>> > try
>> > to setup a demo system in the next few days and use
>> your advice.



-- 
Albert Vila Puig
<av...@imente.com>
iMente.com <http://www.imente.com>

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Andy <an...@yahoo.com>.

I can't view the document either -- it showed up empty.

Has anyone succeeded in viewing it?

Andy

--- On Fri, 4/8/11, Albert Vila <av...@imente.com> wrote:

> From: Albert Vila <av...@imente.com>
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 3:43 AM
> Ephraim, I still can't view the document.
> 
> Don't know if I'm doing something wrong, but I downloaded it and it
> appears to be empty.
> 
> Albert
> 

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Albert Vila <av...@imente.com>.

Ephraim, I still can't view the document.

Don't know if I'm doing something wrong, but I downloaded it and it
appears to be empty.

Albert

On 7 April 2011 09:32, Ephraim Ofir <Ep...@icq.com> wrote:
> You can't view it online, but you should be able to download it from:
> https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP
>
> Enjoy,
> Ephraim Ofir



-- 
Albert Vila Puig
<av...@imente.com>
iMente.com <http://www.imente.com>

RE: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Ephraim Ofir <Ep...@icq.com>.

You can't view it online, but you should be able to download it from:
https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP

Enjoy,
Ephraim Ofir


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Just a quick comment re LinkedIn's stuff. You can look at Zoie (also covered in
Lucene in Action 2), but you may be more interested in Sensei.

And yes, big systems like that need sharding and replication: multiple masters
and lots of slaves.
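
To make the "multiple masters" part concrete, here is a minimal sketch of how an indexing client might pick a shard master for each document. This is only an illustration under stated assumptions: the host names are placeholders, and hashing the document id is one common routing choice, not something prescribed in this thread.

```python
import hashlib

# Hypothetical shard masters; these host names are placeholders.
SHARD_MASTERS = [
    "http://master0:8983/solr",
    "http://master1:8983/solr",
    "http://master2:8983/solr",
]


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index using a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def master_for(doc_id: str) -> str:
    """Return the master that should receive updates for this document."""
    return SHARD_MASTERS[shard_for(doc_id, len(SHARD_MASTERS))]


if __name__ == "__main__":
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, "->", master_for(doc_id))
```

Because the hash is stable, an update for an existing document always lands on the master that already holds it. Note that changing the number of shards changes the mapping, so a real deployment would need a re-indexing or re-sharding plan.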

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by François Schiettecatte <fs...@gmail.com>.

You might also want to look at the Heritrix crawler:

	http://crawler.archive.org/

I have written three crawlers in the past, all for RSS feeds, and it is not easy. Happy to provide tips and help if you want to go down that route.

François

On Apr 8, 2011, at 1:53 AM, Andrea Campi wrote:

> On Fri, Apr 8, 2011 at 6:23 AM, Jens Mueller <su...@googlemail.com> wrote:
> 
>> Hello all,
>> 
>> thanks for your generous help.
>> 
>> I think I now know everything:  (What I want to do is to build a web
>> crawler
>> and index the documents found). I will start with the setup as suggested by
>> 
>> 
> Writing a web crawler from scratch is... ambitious.
> Have you looked at Nutch (http://nutch.apache.org/)?  It uses Solr for
> indexing, it may help you get a head start.
> If you've never used Hadoop before it may take some getting used to, but I
> have helped a customer implement it and helped a couple of their devs
> (medium-seniority) get up to speed, and it didn't take them too long to get
> used to it.
> 
> Andrea

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Andrea Campi <an...@zephirworks.com>.

On Fri, Apr 8, 2011 at 6:23 AM, Jens Mueller <su...@googlemail.com> wrote:

> Hello all,
>
> thanks for your generous help.
>
> I think I now know everything:  (What I want to do is to build a web
> crawler
> and index the documents found). I will start with the setup as suggested by
>
>
Writing a web crawler from scratch is... ambitious.
Have you looked at Nutch (http://nutch.apache.org/)?  It uses Solr for
indexing, it may help you get a head start.
If you've never used Hadoop before it may take some getting used to, but I
have helped a customer implement it and helped a couple of their devs
(medium-seniority) get up to speed, and it didn't take them too long to get
used to it.

Andrea

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Jens Mueller <su...@googlemail.com>.

Hello all,

thanks for your generous help.

I think I now know everything I need. (What I want to do is build a web
crawler and index the documents found.) I will start with the setup suggested
by Ephraim (several sharded masters, each with at least one slave for reads,
and some aggregators for querying). This is only a prototype to learn more...
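
A minimal sketch of how a query against such a prototype might be built, assuming Solr's distributed search `shards` request parameter. The host names (aggregator, slave0, slave1) and the helper function are hypothetical placeholders, not part of Ephraim's document.

```python
from urllib.parse import urlencode


def distributed_query_url(aggregator, shard_hosts, query, rows=10):
    """Build a select URL that asks the aggregator to fan out to the shards.

    shard_hosts are given as host:port/path entries without a scheme,
    joined with commas in the `shards` parameter.
    """
    params = {
        "q": query,
        "rows": rows,
        "shards": ",".join(shard_hosts),
    }
    return f"{aggregator}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = distributed_query_url(
        "http://aggregator:8983/solr",               # placeholder aggregator
        ["slave0:8983/solr", "slave1:8983/solr"],    # placeholder read slaves
        "solr scaling",
    )
    print(url)
```

The aggregator merges the per-shard results, so in this layout the load balancer only needs to spread requests across aggregators, and each aggregator can list one slave per shard.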

And the Google PDF from Walter is very interesting; that is something that I
can try if I hit the limits with the setup above. But before that, I
have to learn much more about indexing, index building, and
Solr/Lucene in general.

Thanks again for your help!!
best regards
jens


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Walter Underwood <wu...@wunderwood.org>.

On Apr 6, 2011, at 10:29 PM, Jens Mueller wrote:

> Walter, thanks for the advice: Well you are right, mentioning google. My
> question was also to understand how such large systems like google/facebook
> are actually working. So my numbers are just theoretical and made up. My
> system will be smaller, but I would be very happy to understand how such
> large systems are built and I think the approach Ephraim showed should be
> working quite well at large scale.

Understanding what Google does will NOT help you build your engine, just as understanding an F1 race car does not help you build a Toyota Camry. One is built for performance only and requires LOTS of support; the other is built for supportability and stability. Very different engineering goals and designs.

Here is one view of Google's search setup: http://www.linesave.co.uk/google_search_engine.html

This talk gives a lot more detail. Summary in the blog post, slides in the PDF. Google's search is entirely in-memory. They load off disk and run.

http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsdm-2009.html
http://research.google.com/people/jeff/WSDM09-keynote.pdf

How big will your system be? Does it require real-time updates?
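
For a sense of scale, here is a back-of-envelope sketch using the theoretical numbers from the original question (1 petabyte, about 5 KB per document, 100000 queries per second) and the 34K qps estimate mentioned earlier in the thread. The documents-per-shard figure is purely an assumption for illustration; measure your own hardware before trusting it.

```python
INDEX_BYTES = 10**15           # 1 petabyte (decimal), from the stated goal
DOC_BYTES = 5 * 10**3          # about 5 KB per document
TARGET_QPS = 100_000           # stated query goal
GOOGLE_QPS_ESTIMATE = 34_000   # Feb 2010 estimate quoted in the thread

# Assumption for illustration only: documents one shard can serve comfortably.
ASSUMED_DOCS_PER_SHARD = 100_000_000

total_docs = INDEX_BYTES // DOC_BYTES
shards_needed = -(-total_docs // ASSUMED_DOCS_PER_SHARD)  # ceiling division
qps_ratio = TARGET_QPS / GOOGLE_QPS_ESTIMATE

if __name__ == "__main__":
    print(f"documents:     {total_docs:,}")
    print(f"shards needed: {shards_needed:,}")
    print(f"qps vs. quoted estimate: {qps_ratio:.2f}x")
```

Under these assumptions the index holds 200 billion documents and needs on the order of 2,000 shards before any replication, which is why the answer depends so much on how big the real system will be.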

wunder
--
Walter Underwood
Lead Engineer, MarkLogic

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Jens Mueller <su...@googlemail.com>.

Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will try
to set up a demo system in the next few days and use your advice.
Load balancers are an important aspect of your design. Can you recommend one
LB specifically? (I would be using haproxy.1wt.eu.) I think the idea of
uploading your document is very good. However, Google Docs did not seem to be
working (at least for me with the docx format?), but maybe you can simply
output the document as PDF, which I think Google Docs handles, so all
the others can also have a look at your concept. The best approach would be
if you could upload your advice directly somewhere on the Solr wiki, as it is
really helpful. I found some other documents meanwhile, but yours is much
clearer and more complete, with the LBs and the aggregators
(http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf).

Lance, thanks. I will have a look at what LinkedIn is doing.

Walter, thanks for the advice. You are right to mention Google. My
question was also meant to understand how large systems like Google/Facebook
actually work, so my numbers are just theoretical and made up. My
system will be smaller, but I would be very happy to understand how such
large systems are built, and I think the approach Ephraim showed should
work quite well at large scale. If you know a good document (besides the
Bigtable research paper, which I already know) that describes technically and
in detail how Google works, that would be of great interest. You seem to be
working for a company that handles large datasets. Does Google use this
approach: sharding the index across N writers, with the produced index then
replicated to N "read-only searchers"?
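
The writer/searcher split asked about here presupposes some rule for
deciding which writer indexes which document. Below is a minimal sketch of
one common rule, hash-based routing. The names `shard_for` and `route` are
illustrative assumptions; nothing here is taken from Solr or from any
description of Google's design.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number in [0, num_shards)."""
    # A stable hash (not Python's per-process randomized hash()) keeps the
    # same document on the same shard across restarts and machines.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(doc_ids, num_shards):
    """Group document ids by the writer (shard) that should index them."""
    batches = {shard: [] for shard in range(num_shards)}
    for doc_id in doc_ids:
        batches[shard_for(doc_id, num_shards)].append(doc_id)
    return batches
```

Each writer then only sees the updates for its own shard, so the update load
is divided by the number of shards. Changing `num_shards` later moves most
documents to a different shard, which is why the shard count is usually
chosen with growth in mind.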

thank you all.
best regards
jens


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Walter Underwood <wu...@wunderwood.org>.

The bigger answer is that you cannot get to this size by just configuring Solr. You may have to invent a lot of stuff. Like all of Google.

Where did you get these numbers? The proposed query rate is almost three times Google's (Feb 2010 estimate, 34K qps).
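
A quick back-of-the-envelope check of the numbers quoted in this thread may help put them in perspective. The sketch below uses only the figures from the original question and the 34K qps estimate above; decimal units (1 PB = 10^15 bytes, 1 KB = 10^3 bytes) are an assumption.

```python
INDEX_BYTES = 10**15        # 1 petabyte, decimal units assumed
DOC_BYTES = 5 * 10**3       # about 5 KB per document
TARGET_QPS = 100_000        # proposed queries per second
TARGET_UPS = 100_000        # proposed updates per second
GOOGLE_QPS_2010 = 34_000    # Feb 2010 estimate quoted in this thread

# Number of documents implied by the index size and document size.
doc_count = INDEX_BYTES // DOC_BYTES

# How the proposed query rate compares with the quoted Google estimate.
qps_ratio = TARGET_QPS / GOOGLE_QPS_2010

# Time needed to update every document once at the proposed update rate.
days_to_touch_all = doc_count / TARGET_UPS / 86_400

print(f"documents:           {doc_count:,}")          # 200,000,000,000
print(f"query rate ratio:    {qps_ratio:.2f}")        # 2.94
print(f"days per full pass:  {days_to_touch_all:.1f}")  # 23.1
```

So the proposal amounts to roughly 200 billion documents, and even at 100,000 updates per second it would take about three weeks to touch each document once.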

I work at MarkLogic, and we scale to hundreds of terabytes, with fast update and query rates. If you want a real system that handles that, you might want to look at our product.

wunder


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Lance Norskog <go...@gmail.com>.

I would not use replication. LinkedIn consumer search is a flat system
where one process indexes new entries and does queries simultaneously.
It's a custom Lucene app called Zoie. Their code is on GitHub.

I would get documents to indexers via a multicast IP-based queueing
system. This scales very well and there's a lot of hardware support.

The problem with distributed search is that it a) is inherently slower
and b) has inherently more and longer jitter. The "airplane wing"
distribution of query times becomes longer and flatter.
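
One way to see why the tail gets longer: a fanned-out query is only as fast
as its slowest shard. The sketch below is a simplified model, not a
measurement. It assumes shard response times are independent and that each
shard answers "fast" (say, under its own 99th-percentile time) with
probability 0.99; the function name `p_slow_query` is made up for
illustration.

```python
def p_slow_query(num_shards: int, p_shard_fast: float = 0.99) -> float:
    """Probability that at least one of num_shards shards responds slowly.

    Assumes independent shards; the query waits for all of them, so one
    slow shard makes the whole query slow.
    """
    return 1.0 - p_shard_fast ** num_shards


for n in (1, 10, 100):
    print(f"{n:>3} shards: {p_slow_query(n):.1%} of queries hit a slow shard")
```

Under these assumptions, a response time that only 1% of single-shard
queries exceed is exceeded by well over half of the queries once each query
touches 100 shards.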

This is going to have to be a "federated" system, where the front-end
app aggregates results rather than Solr.
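
A rough sketch of what "the front-end app aggregates results" can mean in
practice: each back end returns its own top hits, already sorted by score,
and the front end merges them. This is an illustration under two
assumptions: hits are `(score, doc_id)` tuples sorted by descending score,
and scores from different back ends are comparable (which is not guaranteed
when each index computes its own term statistics). The name `merge_top_k`
is hypothetical.

```python
import heapq
from itertools import islice


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists (each sorted by descending score) into one top-k list."""
    # heapq.merge streams the already-sorted inputs, so only the first k
    # merged hits are ever materialized.
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.8, "b1"), (0.7, "b2")]
print(merge_top_k([shard_a, shard_b], 3))
# [(0.9, 'a1'), (0.8, 'b1'), (0.7, 'b2')]
```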


-- 
Lance Norskog
goksron@gmail.com

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by François Schiettecatte <fs...@gmail.com>.

And if you have control over machine placement, split them across racks so that a power outage on one rack does not take out your search cluster.
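
As an illustration of that placement rule, here is a small sketch that spreads the replicas of each shard over racks round-robin and then checks that losing any single rack still leaves every shard with a live replica. The rack names and the functions `place_replicas` and `survives_rack_loss` are assumptions made up for this example.

```python
def place_replicas(num_shards, replicas_per_shard, racks):
    """Assign each shard's replicas to racks round-robin, offset per shard."""
    placement = {}
    for shard in range(num_shards):
        placement[shard] = [
            racks[(shard + replica) % len(racks)]
            for replica in range(replicas_per_shard)
        ]
    return placement


def survives_rack_loss(placement, lost_rack):
    """True if every shard keeps at least one replica outside lost_rack."""
    return all(
        any(rack != lost_rack for rack in replica_racks)
        for replica_racks in placement.values()
    )


racks = ["rack-a", "rack-b", "rack-c"]
placement = place_replicas(num_shards=4, replicas_per_shard=2, racks=racks)
for rack in racks:
    print(rack, "lost -> cluster still serves all shards:",
          survives_rack_loss(placement, rack))
```

With at least two racks and at least two replicas per shard, this layout always puts a shard's replicas on more than one rack.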

François


RE: Very very large scale Solr Deployment = how to do (Expert Question)?

Posted by Ephraim Ofir <Ep...@icq.com>.

I'm not sure about the scale you're aiming for, but you probably want to
do both sharding and replication.  There's no central server which would
be the bottleneck. The guidelines should probably be something like:
1. Split your index into enough shards so it can keep up with the update
rate.
2. Have enough replicas of each shard master to keep up with the rate
of queries.
3. Have enough aggregators in front of the shard replicas so the
aggregation doesn't become a bottleneck.
4. Make sure you have good load balancing across your system.
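
The first two guidelines can be turned into a rough sizing calculation. The
sketch below is only that; the per-node throughput figures (2,000 updates/s
per shard master, 500 queries/s per replica) are invented placeholders that
would have to be measured on real hardware, and `plan_cluster` is a
hypothetical helper name.

```python
import math


def plan_cluster(update_rate, query_rate, updates_per_shard, queries_per_replica):
    """Estimate shard and replica counts from target and per-node rates."""
    # Guideline 1: enough shards to absorb the update rate.
    shards = math.ceil(update_rate / updates_per_shard)
    # Guideline 2: every query fans out to every shard, so each shard needs
    # enough replicas to sustain the full query rate on its own.
    replicas_per_shard = math.ceil(query_rate / queries_per_replica)
    return {
        "shards": shards,
        "replicas_per_shard": replicas_per_shard,
        "search_nodes": shards * replicas_per_shard,
    }


# Target rates from the original question, placeholder per-node capacities.
print(plan_cluster(update_rate=100_000, query_rate=100_000,
                   updates_per_shard=2_000, queries_per_replica=500))
# {'shards': 50, 'replicas_per_shard': 200, 'search_nodes': 10000}
```

The multiplication in the last field is the point: shard count and replica
count scale independently, but the machine count is their product.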

Attached is a diagram of the setup we have.  You might want to look into
SolrCloud as well.
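
For readers who have not used distributed search yet: an aggregator is an
ordinary Solr instance that receives a query with the `shards` request
parameter and fans it out to the listed shards. A minimal sketch of building
such a request URL follows. The host names, the port 8983 and the
`/solr/select` path are assumptions for illustration and depend on how the
instances are actually deployed; in the setup described above the shard
entries would normally be load-balancer addresses in front of each shard's
replicas.

```python
from urllib.parse import urlencode


def distributed_query_url(aggregator, shard_hosts, query, rows=10):
    """Build a Solr distributed-search URL sent to one aggregator node."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated list of shard base addresses, without "http://".
        "shards": ",".join(f"{host}/solr" for host in shard_hosts),
    }
    return f"http://{aggregator}/solr/select?{urlencode(params)}"


# Hypothetical hosts: one aggregator, two shard addresses.
url = distributed_query_url(
    "aggregator1.example.com:8983",
    ["shard1-lb.example.com:8983", "shard2-lb.example.com:8983"],
    "title:solr",
)
print(url)
```

Because any instance can play the aggregator role for a given request,
adding aggregators behind a load balancer addresses guideline 3.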

Ephraim Ofir

