Posted to solr-user@lucene.apache.org by Tim Terlegård <ti...@gmail.com> on 2010/03/19 11:00:16 UTC

StreamingUpdateSolrServer is inefficient when adding is not as fast as emptying its queue

StreamingUpdateSolrServer logs "starting runner: ...", sends a POST
with <stream>...</stream> and I guess also opens a new HTTP connection
every time it has managed to empty its queue. In
StreamingUpdateSolrServer.java it says this:

    // info is ok since this should only happen once for each thread
    log.info( "starting runner: {}" , this );

But the comment is not correct. It will log every time its queue has
been emptied. I get "starting runner: ..." lots and lots in the log,
but I have only 4 threads.

Let's say I have this code:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Queue size 20, 4 runner threads
SolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);

for (String id : ids) {
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", id);
  doc.addField("text", "something");
  server.add(doc);
  // Simulate other slow work, such as fetching data from the database
  Thread.sleep(300);
}
server.commit();

Because there is a little delay after server.add(doc) the
StreamingUpdateSolrServer's runners will quickly empty the internal
queue of documents. Because of this the next time server.add(doc) is
called it will open a new HTTP connection and make a new POST and send
a <stream> with only one document. This is very inefficient.

Would it be possible to hold an HTTP connection open and hold a
<stream> open until commit is called?

I realize that the way to make this more efficient is to put all
documents in a list and call server.add(allDocs). I browsed the web to
find StreamingUpdateSolrServer examples and found one where it said
that nowadays it's just fine to call server.add(doc), that you don't
have to put everything in a list first. But that's obviously not
entirely correct.
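
The list-based workaround can be sketched roughly as follows. This is
only a minimal sketch in plain Java, so that it runs without SolrJ on
the classpath. BatchingBuffer and its sink callback are hypothetical
names, not part of SolrJ; in a real program the sink would be something
like a call to server.add(batch). The batch size of 3 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not part of SolrJ): collects items and hands them
// to a sink in batches. In a SolrJ program the sink would be something
// like a call to server.add(batch).
class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffer one item; send a batch once enough items have accumulated.
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered; call this once more before committing.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }

    public static void main(String[] args) {
        List<List<String>> sent = new ArrayList<>();
        BatchingBuffer<String> buffer = new BatchingBuffer<>(3, sent::add);
        for (int i = 1; i <= 7; i++) {
            buffer.add("doc" + i);
        }
        buffer.flush(); // push the final partial batch
        System.out.println(sent);
    }
}
```

In the demo, seven items leave the buffer as three batches of 3, 3 and
1, so each request carries several documents no matter how slowly they
are produced.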

Before hitting the send button I realized that the runner checks for
new documents in the queue for 250 milliseconds before it gives up. So
in my application this timeout isn't enough. Maybe we could modify the
class so the timeout could be changed with setTimeout() instead of
having it hardcoded to 250?
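
To illustrate the idea of a configurable timeout, here is a rough
sketch in plain Java. It is not the SolrJ source: PollingRunner,
setPollTimeoutMs() and drainOneRequest() are hypothetical names, and
the returned list merely stands in for one POST. The only detail taken
from the behaviour described above is that the runner stops once the
queue has stayed empty for the timeout, which defaults to 250 ms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the SolrJ implementation: a runner that keeps
// one "request" open for as long as items keep arriving within the timeout.
class PollingRunner<T> {
    private final BlockingQueue<T> queue;
    // 250 is the value described as hardcoded in the post.
    private volatile long pollTimeoutMs = 250;

    PollingRunner(BlockingQueue<T> queue) {
        this.queue = queue;
    }

    void setPollTimeoutMs(long ms) {
        this.pollTimeoutMs = ms;
    }

    // Collect items into one request; stop once the queue has stayed
    // empty for pollTimeoutMs, or when the thread is interrupted.
    List<T> drainOneRequest() {
        List<T> request = new ArrayList<>();
        try {
            T item;
            while ((item = queue.poll(pollTimeoutMs, TimeUnit.MILLISECONDS)) != null) {
                request.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return request;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        PollingRunner<String> runner = new PollingRunner<>(queue);
        runner.setPollTimeoutMs(500); // longer than the 300 ms gap between adds

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 3; i++) {
                    queue.put("doc" + i);
                    Thread.sleep(300);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        System.out.println(runner.drainOneRequest());
        producer.join();
    }
}
```

With a 500 ms timeout, documents arriving 300 ms apart would normally
end up in the same request, whereas with the default of 250 ms each one
would tend to get a request of its own.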

What is a good number of documents for server.add(docs)? Is there an
upper limit or is it ok to have a million documents?

/Tim