Posted to solr-user@lucene.apache.org by Александр Вандышев <a-...@rambler.ru> on 2014/04/03 14:39:15 UTC

Solr interface

Is it possible to index files via something other than the HTTP interface?

Re: Solr interface

Posted by Jason Hellman <jh...@innoventsolutions.com>.
This.  And so much this.  As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta <mi...@appinions.com> wrote:

> The speed of ingest via HTTP improves greatly once you do two things:
> 
> 1. Batch multiple documents into a single request.
> 2. Index with multiple threads at once.


Re: Solr interface

Posted by Michael Della Bitta <mi...@appinions.com>.
The speed of ingest via HTTP improves greatly once you do two things:

1. Batch multiple documents into a single request.
2. Index with multiple threads at once.
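
A minimal SolrJ (4.x-era API) sketch of both points; the URL, core name, and
field names are placeholders. ConcurrentUpdateSolrServer queues adds and sends
them from background threads, so a single caller gets batching and parallelism:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Buffers up to 10000 docs, drained by 4 background sender threads.
            ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 10000, 4);

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_txt", "Document " + i);
                batch.add(doc);
                if (batch.size() == 1000) {  // point 1: many docs per request
                    server.add(batch);       // point 2: sent by background threads
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.blockUntilFinished();
            server.commit();
            server.shutdown();
        }
    }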

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions>
g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>



Re: Solr interface

Posted by Daniel Collins <da...@gmail.com>.
I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
~400M documents in total, with 4-way replication (so it's quite a big
setup!)  I had thought that HTTP would slow things down, so we recently
trialed a JNI approach (clients are C++) so we could call SolrJ and get the
benefits of JavaBin encoding for our indexing....

Once we had done benchmarks with both solutions, I think we saved about 1ms
per document (on average) with JNI, so it wasn't as big a gain as we were
expecting.  There are other benefits of SolrJ (zookeeper integration,
better routing, etc) and we were doing local HTTP (so it was literally just
a TCP port to localhost, no actual net traffic) but that just goes to prove
what other posters have said here.  Check whether HTTP really *is* the
bottleneck before you try to replace it!
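
For what it's worth, plain SolrJ over HTTP can already send updates as JavaBin
without any JNI; a minimal sketch, assuming the 4.x API and a placeholder URL:

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class JavaBinUpdates {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer(
                "http://localhost:8983/solr/collection1");
            // Send update requests as JavaBin instead of the default XML;
            // responses are already parsed as JavaBin.
            server.setRequestWriter(new BinaryRequestWriter());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }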



Re: Solr interface

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing 100s of millions of documents are doing this over http?  I have been using custom Lucene code to index files, as I thought this would be faster for many documents and I wanted some non-standard OCR and index fields.  Is there a better way?
>
> To the OP: You can also use Lucene to locally index files for Solr.

My sharded index has 94 million docs in it.  All normal indexing and 
maintenance is done with SolrJ, over http. Currently full rebuilds are 
done with the dataimport handler loading from MySQL, but that is 
legacy.  This is NOT a SolrCloud installation.  It is also not a 
replicated setup -- my indexing program keeps both copies up to date 
independently, similar to what happens behind the scenes with SolrCloud.

The single-thread DIH is very well optimized, and is faster than what I 
have written myself -- also single-threaded.

The real reason that we still use DIH for rebuilds is that I can run the 
DIH simultaneously on all shards.  A full rebuild that way takes about 5 
hours.  A SolrJ process feeding all shards with a single thread would 
take a lot longer.  Once I have time to work on it, I can make the SolrJ 
rebuild multi-threaded, and I expect it will be similar to DIH in 
rebuild speed.  Hopefully I can make it faster.
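
For illustration only, a SolrJ sketch of kicking off full-import on several
shards at once; the shard URLs and the /dataimport handler path are
assumptions, not details of Shawn's actual setup:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class ParallelDihKickoff {
        public static void main(String[] args) throws Exception {
            String[] shards = {  // hypothetical shard core URLs
                "http://idx1:8983/solr/s0", "http://idx1:8983/solr/s1",
                "http://idx2:8983/solr/s2", "http://idx2:8983/solr/s3"
            };
            for (String url : shards) {
                HttpSolrServer server = new HttpSolrServer(url);
                ModifiableSolrParams p = new ModifiableSolrParams();
                p.set("command", "full-import");
                QueryRequest req = new QueryRequest(p);
                req.setPath("/dataimport");  // wherever solrconfig.xml mounts DIH
                req.process(server);         // returns at once; import runs inside Solr
                server.shutdown();
            }
            // Every shard is now rebuilding concurrently in its own DIH thread.
        }
    }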

There is always overhead with HTTP.  On a gigabit LAN, I don't think 
it's high enough to matter.

Using Lucene to index files for Solr is an option -- but that requires 
writing a custom Lucene application, and knowledge about how to turn the 
Solr schema into Lucene code.  A lot of users on this list (me included) 
do not have the skills required.  I know SolrJ reasonably well, but 
Lucene is a nut that I haven't cracked.

Thanks,
Shawn


Re: Solr interface

Posted by Andre Bois-Crettez <an...@kelkoo.com>.
You can use SolrJ: https://wiki.apache.org/solr/Solrj
Anyway, even using HTTP, the performance is good.

André

--
André Bois-Crettez

Software Architect
Big Data Developer
http://www.kelkoo.com/


RE: Solr interface

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Mon, 2014-04-07 at 13:52 +0200, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http?

Some of us do. Our net archive indexer runs a lot of Tika processes that
send their analysed documents to Solr over HTTP. We're building 1TB indexes
of about 300-400M documents each. The Tika analysis is by far the heaviest
part of the setup: one Solr instance easily keeps up with 30 Tika processes
on a 24-core machine (or 48, depending on how you count). This setup makes
it easy to scale up & out, basically by starting new Tika processes on
whatever machines we have available.
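
Not Toke's actual indexer, but a stripped-down sketch of that split
(placeholder Solr URL and field names): Tika's AutoDetectParser does the
expensive part, and the hand-off to Solr is one cheap HTTP call.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaFeeder {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer(
                "http://localhost:8983/solr/collection1");
            AutoDetectParser parser = new AutoDetectParser();

            File file = new File(args[0]);
            BodyContentHandler text = new BodyContentHandler(-1); // no size limit
            Metadata meta = new Metadata();
            InputStream in = new FileInputStream(file);
            try {
                parser.parse(in, text, meta); // the expensive part happens here
            } finally {
                in.close();
            }

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getAbsolutePath());
            doc.addField("content_txt", text.toString());
            solr.add(doc);   // the cheap part: one HTTP round trip to Solr
            solr.commit();
            solr.shutdown();
        }
    }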

In other setups, where the pre-index analysis is lighter, the choice of
transport layer might matter more. As always, optimize where it is
needed.

- Toke Eskildsen, State and University Library, Denmark



RE: Solr interface

Posted by Jonathan Varsanik <jv...@exponent.com>.
Do you mean to tell me that the people on this list that are indexing 100s of millions of documents are doing this over http?  I have been using custom Lucene code to index files, as I thought this would be faster for many documents and I wanted some non-standard OCR and index fields.  Is there a better way?

To the OP: You can also use Lucene to locally index files for Solr.
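
As Shawn notes elsewhere in the thread, this only works if the Lucene field
types and analyzers mirror the core's schema.xml exactly. A bare-bones sketch
of that route, assuming Lucene 4.x, a placeholder index path, and made-up
field names:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LocalLuceneIndexer {
        public static void main(String[] args) throws Exception {
            // Must point at the core's data/index directory, and the analyzer
            // and field definitions must match what schema.xml declares.
            FSDirectory dir = FSDirectory.open(new File("/path/to/core/data/index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
            IndexWriter writer = new IndexWriter(dir, cfg);

            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            doc.add(new TextField("content_txt", "page text from OCR", Field.Store.YES));
            writer.addDocument(doc);
            writer.close(); // commits; reload the core before searching
        }
    }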




Re: Solr interface

Posted by Erik Hatcher <er...@gmail.com>.
Yes. But why?

DataImportHandler kinda does this (you still use HTTP to kick off the indexing job, but the documents themselves don't travel over HTTP).  And there's EmbeddedSolrServer too. 

    Erik

> On Apr 3, 2014, at 8:39, Александр Вандышев <a-...@rambler.ru> wrote:
> 
> Is it possible to index files via something other than the HTTP interface?
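
Following up on the EmbeddedSolrServer suggestion, a minimal sketch assuming a
Solr 4.x-style solr home containing solr.xml and a core named collection1 (the
path is a placeholder); nothing here touches HTTP:

    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedIndexer {
        public static void main(String[] args) throws Exception {
            // No HTTP involved: Solr runs inside this JVM.
            CoreContainer container = new CoreContainer("/path/to/solr/home");
            container.load();
            EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            server.add(doc);
            server.commit();
            server.shutdown(); // also shuts down the CoreContainer
        }
    }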