Posted to solr-user@lucene.apache.org by Michael Levy <Lu...@gmail.com> on 2006/05/12 19:02:02 UTC

One big XML file vs. many HTTP requests

Greetings,

I'm evaluating using Solr under Tomcat to replace a number of text 
searching projects that currently use UMASS's INQUERY, an older search 
engine.

One nice feature of INQUERY is that you can create one large SGML file, 
containing lots of records, each bracketed with <DOC> and </DOC> tags.  
Submitting that big SGML document for indexing goes very fast. 

I believe that Solr indexes one document at a time; each document 
requires a separate HTTP POST.

How efficient is making a separate HTTP request per-document, when there 
are millions of documents?  Do people ever use Solr's or Lucene's API 
directly for indexing large numbers of documents, and if so, what are 
the considerations pro and con?

Thanks to Yonik, Chris, and everyone for all your work; Solr looks really 
great.


Re: One big XML file vs. many HTTP requests

Posted by Yonik Seeley <ys...@gmail.com>.
On 5/12/06, Michael Levy <Lu...@gmail.com> wrote:
> How efficient is making a separate HTTP request per-document, when there
> are millions of documents?

If you use persistent connections and make multiple requests in
parallel, there won't be much difference compared with sending multiple
docs per request.

-Yonik
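
A minimal sketch of the persistent-connection idea described above, not
a tested Solr client. It sends one update command per POST but reuses a
single HTTP/1.1 connection for all of them. The function name
post_updates, the host and port defaults, and the /solr/update path are
assumptions for illustration; adjust them to your own installation.
Running several such loops in parallel (for example one per thread,
each with its own connection) is left out for brevity.

```python
import http.client


def post_updates(bodies, host="localhost", port=8080,
                 path="/solr/update", conn=None):
    """POST each XML body separately over one reused HTTP connection.

    host, port, and path are assumed values, not taken from the thread.
    A caller may pass in its own connection object via conn.
    """
    own_conn = conn is None
    if own_conn:
        # No socket is opened until the first request is sent.
        conn = http.client.HTTPConnection(host, port)
    statuses = []
    try:
        for body in bodies:
            conn.request(
                "POST", path, body=body.encode("utf-8"),
                headers={"Content-Type": "text/xml; charset=utf-8"},
            )
            resp = conn.getresponse()
            # The response must be read fully before the connection
            # can carry the next request.
            resp.read()
            statuses.append(resp.status)
    finally:
        if own_conn:
            conn.close()
    return statuses
```

Each string in bodies is one complete update command, such as a single
<add> or <delete> message.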

Re: One big XML file vs. many HTTP requests

Posted by Chris Hostetter <ho...@fucit.org>.
: ...but you can't simply
: <delete><query>FIELDNAME:*</query></delete>
: or
:  <delete><query>*</query></delete>

That's because the Lucene query parser doesn't support 100% wildcard
queries.

: What is the best way to delete all records, for example if you want to clear
: out the entire index and reindex everything?

if you really want to make sure *EVERYTHING* is gone -- delete the index
directory and bounce the port, solr will make a new one.

if you have a uniqueKey field, or some other field you are sure every
document contains an indexed value for, then just do an unbounded range
query on that field...

	 <delete><query>FIELDNAME:[* TO *]</query></delete>


-Hoss
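
A small sketch of building the unbounded range delete shown above, in
case the field name comes from a variable. The helper name
delete_all_message and the field name "id" in the usage note are
hypothetical; the message shape is the one quoted in this thread.

```python
import xml.etree.ElementTree as ET


def delete_all_message(field_name):
    """Build a delete-by-query message matching every document that has
    an indexed value in field_name (an unbounded range query)."""
    delete = ET.Element("delete")
    query = ET.SubElement(delete, "query")
    query.text = f"{field_name}:[* TO *]"
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(delete, encoding="unicode")
```

For a uniqueKey field called "id" (an assumed name),
delete_all_message("id") yields
<delete><query>id:[* TO *]</query></delete>.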


Re: One big XML file vs. many HTTP requests

Posted by Michael Levy <lu...@gmail.com>.
It seems you can do something like
<delete><query>FIELDNAME:a*</query></delete>
and
<delete><query>FIELDNAME:b*</query></delete>
...but you can't simply
<delete><query>FIELDNAME:*</query></delete>
or
 <delete><query>*</query></delete>

The demo post.sh returns <result status="400">Error parsing Lucene
query</result>
and the demo Solr Admin page shows
XML Parsing Error: syntax error
Location:
http://wiki.ushmm.org:8080/solr/select/?stylesheet=&q=*&version=2.1&start=0&rows=10&indent=on
Line Number 1, Column 1:org.apache.solr.core.SolrException: Error parsing
Lucene query
^

What is the best way to delete all records, for example if you want to clear
out the entire index and reindex everything?


On 5/21/06, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : But deleting multiple documents with just one POST is not possible,
> : right? Is there a special reason for that or is it because nobody asked
>
> delete by query will remove multiple documents with a single command .. but
> if you mean delete by id .. you may be right about it not having the same
> "loop" kludge that <add> has.
>
> As Yonik has mentioned before .. if you use persistent connections in your
> HTTP Client layer, there isn't really any advantage to sending multiple
> commands in one request, vs sending multiple requests.
>
> -Hoss
>
>

Re: One big XML file vs. many HTTP requests

Posted by Chris Hostetter <ho...@fucit.org>.
: But deleting multiple documents with just one POST is not possible,
: right? Is there a special reason for that or is it because nobody asked

delete by query will remove multiple documents with a single command .. but
if you mean delete by id .. you may be right about it not having the same
"loop" kludge that <add> has.

As Yonik has mentioned before .. if you use persistent connections in your
HTTP Client layer, there isn't really any advantage to sending multiple
commands in one request, vs sending multiple requests.

-Hoss


Re: One big XML file vs. many HTTP requests

Posted by Marcus Stratmann <st...@gmx.de>.
Erik Hatcher wrote:
>> I believe that Solr indexes one document at a time; each document  
>> requires a separate HTTP POST.
> Actually adding multiple documents per POST is possible
But deleting multiple documents with just one POST is not possible, 
right? Is there a special reason for that or is it because nobody asked 
for that yet? If so: I'd like to have it! ;-)

Thanks to Erik for the hint!

Marcus

Re: One big XML file vs. many HTTP requests

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On May 12, 2006, at 1:02 PM, Michael Levy wrote:
> One nice feature of INQUERY is that you can create one large SGML  
> file, containing lots of records, each bracketed with <DOC> and  
> </DOC> tags.  Submitting that big SGML document for indexing goes  
> very fast.
> I believe that Solr indexes one document at a time; each document  
> requires a separate HTTP POST.

Actually, adding multiple documents per POST is possible.

> How efficient is making a separate HTTP request per-document, when  
> there are millions of documents?  Do people ever use Solr's or  
> Lucene's API directly for indexing large numbers of documents, and  
> if so, what are the considerations pro and con?

Maybe Solr could evolve a facility for doing these types of bulk  
operations without HTTP, but still using Solr's engine somehow via  
API directly.  I guess this gets tricky when you have a live Solr  
system up and juggling write locks though.

But currently going through HTTP is the only way, and it is not likely  
to be much of a bottleneck, especially given that you can post multiple  
documents at a time (the wiki has an example, but I can't get to the  
web at the moment to post the link).

	Erik
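
A rough sketch of what posting multiple documents at a time can look
like: several <doc> elements wrapped in one <add> message. This is not
the wiki example mentioned above. The helper name build_add_message and
the field names in the usage note are made up for illustration, and the
<add>/<doc>/<field name="..."> layout is assumed to match the XML
update format of your Solr version.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build one <add> message containing a <doc> per input dict.

    docs is an iterable of dicts mapping field name to value.
    ElementTree escapes characters such as & and < in field values.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(add, encoding="unicode")
```

For example, build_add_message([{"id": "1", "title": "a & b"},
{"id": "2"}]) produces a single message with two <doc> elements that
can be sent in one POST.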