You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Ryan McKinley (JIRA)" <ji...@apache.org> on 2008/12/10 19:08:44 UTC

[jira] Created: (SOLR-906) Buffered / Streaming SolrServer implementaion

Buffered / Streaming SolrServer implementaion
---------------------------------------------

                 Key: SOLR-906
                 URL: https://issues.apache.org/jira/browse/SOLR-906
             Project: Solr
          Issue Type: New Feature
          Components: clients - java
            Reporter: Ryan McKinley
             Fix For: 1.4


While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.

With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.

For related discussion see:
http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

No, I haven't looked at the new patch yet. I'll try to look into it
tonight/tomorrow. Sorry about that.

On Tue, Dec 30, 2008 at 12:24 PM, Ryan McKinley (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659750#action_12659750]
>
> Ryan McKinley commented on SOLR-906:
> ------------------------------------
>
> Shalin, did you get a change to look at this version?
>
>
> > Buffered / Streaming SolrServer implementaion
> > ---------------------------------------------
> >
> >                 Key: SOLR-906
> >                 URL: https://issues.apache.org/jira/browse/SOLR-906
> >             Project: Solr
> >          Issue Type: New Feature
> >          Components: clients - java
> >            Reporter: Ryan McKinley
> >            Assignee: Shalin Shekhar Mangar
> >             Fix For: 1.4
> >
> >         Attachments: SOLR-906-StreamingHttpSolrServer.patch,
> SOLR-906-StreamingHttpSolrServer.patch,
> SOLR-906-StreamingHttpSolrServer.patch,
> SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
> >
> >
> > While indexing lots of documents, the CommonsHttpSolrServer add(
> SolrInputDocument ) is less then optimal.  This makes a new request for each
> document.
> > With a "StreamingHttpSolrServer", documents are buffered and then written
> to a single open Http connection.
> > For related discussion see:
> > http://www.nabble.com/solr-performance-tt9055437.html#a20833680
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Regards,
Shalin Shekhar Mangar.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657882#action_12657882 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

I suppose...  though I don't see it as a strict requirement -- if you need full error handling, use a different SolrServer implementation.

I think a more reasonable error API would be a callback function rather then polling -- the error could occur outside you loop (assuming you break at some point).  That callback could easily be converted to a polling api if desired.

The big thing to note with this API is that calling:
 solr.add( doc )
just adds it to the queue processes it in the background.  It is a BlockingQueue, so after it hits the max size the client will block before it can add -- but that should be transparent to the client.

the error caused by adding that doc may happen much later in time.

I'll go ahead and add that callback...

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-906:
-------------------------------

    Attachment: SOLR-906-StreamingHttpSolrServer.patch

removes @Override from interfaces

I guess:
  <property name="java.compat.version" value="1.5" />

does not take check everything!

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660457#action_12660457 ] 

Noble Paul commented on SOLR-906:
---------------------------------

One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time till your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory

We need to fix the comonsHttpSolrServer first. It must stream the docs .

Another enhancement is using a different format (SOLR-865). It uses javabin format and it can be extremely fast compared to XML and the payload can be reduced substantially.

Probably we can overcome the perf problems to a certain extent with these two fixes. 





> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-906:
-------------------------------

    Attachment: StreamingHttpSolrServer.java

This implementation buffers documents in a BlockingQueue<SolrInputDocument>
when a document is added, it makes sure a Thread is working on sending the queue to solr.

While this implementation only starts one thread, it would be easy to extent this so that multiple threads are writing to solr simultaneously and draining the same Queue.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657115#action_12657115 ] 

Shalin Shekhar Mangar commented on SOLR-906:
--------------------------------------------

bq. why not? All the tests pass for me... 
There are multiple places where SolrExampleTest calls commit without waitSearcher=true and proceeds to query and assert on results. The failure happens intermittently. Try varying the number of threads and you may be able to reproduce the failure.

bq. The add calls do not come to the process method.
I meant the request method. Sorry about that. The SolrServer.add() calls the request method but this implementation does not. If there are multiple threads using this class, new documents may get added to the queue before we acquire the lock inside blockUntilFinished due to the call to commit.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657911#action_12657911 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

| the actual error response (http error code, and the body?)
That is / can be encoded in the Throwable implementation no?  As is, it adds the same Exception that you get when running the standard one.

| the InputDocument that caused the failure, if there was one

I don't know if there is a good way to do this.  The scope that catches exceptions is way outside of the context where we knew what was written.  This is just like hitting an error somewhere in the add( List<Doc> ) -- the response has no way to know where it broke.  Ideally the error that Solr returns is enough information, but the current exception behavior is to barf or not.

| what else?

My thought was we could add any of the specialized parameters in finer grained Throwable implementations.  When we have a real 'error' response, this could be parsed and passed as an exception.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-906:
---------------------------------------

    Assignee:     (was: Shalin Shekhar Mangar)

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661285#action_12661285 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

| But we need to change the global lock to be final. 

why?  The global lock is used to block all the threads -- each worker checks if it is null to see if it should block or not.

The only place that sets the lock is within the same synchronized block:

{code:java}

  public synchronized void blockUntilFinished()
  {
    if( lock == null ) {
      lock = new ReentrantLock();
    }
    lock.lock();

    ...

    lock.unlock();
    lock = null;
  }
{code}

since nothing else changes lock, i think it is ok.


| What does the "<stream>" tag do?

That is just there so that multiple <add> commands can be in the same XML document.  it is just an arbitrary parent tag.  The parser on the other end only validates once it hits a known cmd tag.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661434#action_12661434 ] 

noble.paul edited comment on SOLR-906 at 1/6/09 8:15 PM:
---------------------------------------------------------

bq.That is just there so that multiple <add> commands can be in the same XML document.

It may not be necessary to have the <strream> tag , if we modify the XMLLoader to accept multiple <add> we can do away with the extra <stream> tag.

The XML format is a public interface and we should be conservative in adding stuff into that .

It would be nice to document the public changes in the JIRA itself so that other committers can see the changes which are going to come in w/o going through the patch itself

      was (Author: noble.paul):
    bq.That is just there so that multiple <add> commands can be in the same XML document.

It may not be necessary to have the <strream> tag , if we modify the XMLLoader to accept multiple <add> we can do away with the extra <stream> tag.

The XML format is a public interface and we should be conservative in adding stuff into that .
  
> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-906.
--------------------------------

    Resolution: Fixed

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657871#action_12657871 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

API is identical to SolrServer

rather then instantiating with CommonsHttpSolrServer, you just use StreamingSolrServer.

The constructor args are:
  StreamingHttpSolrServer(String solrServerUrl, int queueSize, int threadCount ) 

queueSize is how big you want the buffering queue to be
threadCount is the maximum number of threads that will get used to empty the queue

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657107#action_12657107 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

Thanks for looking at this!

| The if (req.getCommitWithin() < 0) should be > 0, right?

no -- if a commit within time is specified, we can not use the open request.  It needs to start a new request so that a new <add ...> command could be sent.  I experimeted with sending everything over the open connection, but we would need to add a new parent tag to the xml format.   That might not be a bad idea.  Then we could send:
<stream>
  <add>
     <doc...>
  </add>
  <add commitWithin="">
  ...
 <commit />
 <add ...>

and finally
</stream>


> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660610#action_12660610 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

I would like to go ahead and commit this patch soon.  Shalin - did the changes in the latest patch resolve the issues you referred to?

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655302#action_12655302 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

The error handling behavior is less then ideal....  (as it is for <add> with multiple documents already)

As is, it will break the connection when an error occures, but will not know which one.  However, it will continue indexing the Queue when an error occures.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657278#action_12657278 ] 

Noble Paul commented on SOLR-906:
---------------------------------

another observation :
why do we need a ScheduledExecutorService we only need a ThreadPoolExecutorService

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660457#action_12660457 ] 

noble.paul edited comment on SOLR-906 at 1/3/09 2:10 AM:
---------------------------------------------------------

 * One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time till your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory. We need to fix the comonsHttpSolrServer first. It must stream the docs .
 * We can enhance the SolrServer API by adding a method SolrServer#add(Iterator<SolrInputDocs> docs) . So CommonsHttpSolrServer can start writing the documents as and when you are producing your documents . We also have the advantage of not storing the huge list of docs in memory. 

 * Another enhancement is using a different format (SOLR-865). It uses javabin format and it can be extremely fast compared to XML and the payload can be reduced substantially.

Probably we can overcome the perf problems to a certain extent with these two fixes. 





      was (Author: noble.paul):
    One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time till your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory

We need to fix the comonsHttpSolrServer first. It must stream the docs .

Another enhancement is using a different format (SOLR-865). It uses javabin format and it can be extremely fast compared to XML and the payload can be reduced substantially.

Probably we can overcome the perf problems to a certain extent with these two fixes. 




  
> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reassigned SOLR-906:
------------------------------------------

    Assignee: Shalin Shekhar Mangar

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660496#action_12660496 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

| One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time till your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory. We need to fix the comonsHttpSolrServer first. It must stream the docs .

Really?!

Are you saying that the [RequestEntity.html#getContentLength()|http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/methods/RequestEntity.html#getContentLength()] does not behave as advertised?

This implementation returns -1 for the content length, and that tells the connection use chunk encoding to transmit the request entity.  

Where do you get the 40K number?  Is it from the log?  If so, that is the expected behavior -- the server continually processes documents until it reaches the end of the stream.  That may be 1 document that may be 1M docs...

If you are filling up a Collection<SolrInputDocument> with 40K docs, then sending it of course it is going to hold on to 40K docs at once.


| We can enhance the SolrServer API by adding a method SolrServer#add(Iterator<SolrInputDocs> docs) . So CommonsHttpSolrServer can start writing the documents as and when you are producing your documents . We also have the advantage of not storing the huge list of docs in memory.

I'm not following...  with the StreamingHttpSolrServer, you can send documents one at a time and each documents starts sending as soon as it can.  There is a BlockingQueue<UpdateRequest> that holds all UpdateRequests that come through the 'request' method.  BlockingQueue's only hold a fixed number of items and will block before adding something beyond the limit.

| Another enhancement is using a different format (SOLR-865). It uses javabin format and it can be extremely fast compared to XML and the payload can be reduced substantially.

That is a different issue altogether.  That relates to having something different running on the server.  Once that is in, then this should be able to leverage that as well...



> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-906:
-------------------------------

    Attachment: SOLR-906-StreamingHttpSolrServer.patch

Here is an updated version that lets you specify how many threads should work on emptying the Queue.  It also adds tests to make sure it passes all the same tests that CommonsHttpSolrServer and EmbeddedSolrServer already pass.   That is, it is a drop in replacement and passes all existing tests.

One big change is that calling commit or optimize with waitSearcher=true:
1. blocks adding new docs to the Queue
2. waits for the Queue to empty (send all docs)
3. waits for <commit waitSearcher=true /> to return
4. unblocks everything
5. finally continues execution.

My threading chops are not great, so I may be doing something really strange.  It would be good to get some more eyes on this!

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657889#action_12657889 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

This can be use like so:
{code:java}
SolrServer solr = new StreamingHttpSolrServer( url, 2, 5 ) {
  @Override
  public void handleError(Throwable ex) {
     // do somethign...
  }
};
{code}

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reassigned SOLR-906:
----------------------------------

    Assignee: Ryan McKinley

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657108#action_12657108 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

| The SolrExampleTest cannot be used directly

why not?  All the tests pass for me... 

sending a commit() (with waitSearcher=true) should wait for all the docs to get added, then issue the commit, then return.



> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657110#action_12657110 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

| The add calls do not come to the process method. Due to this some add calls may still get in before the commit acquires the lock (assuming multiple producers). Is this class strictly meant for a single document producer use-case?

I don't totally follow... but if possible, it would be good if multiple threads could fill the same queue.  This would let the StreamingHttpSolrServer  manage all solr communication

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655438#action_12655438 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

> how much of the 3.5minutes -> 30seconds is due to the logging?

~1 min.  When I turn off logging completely, the time is ~2.5 mins  (also, note that with 3 threads, it is down to 20sec)

RE: calling add( doc ) vs add( List<doc> )...  
yes, things are much better if you call add( List<doc> ) however, it is not the most convenient api if you are running though tons of things.

I would expect (but have not tried) adding 40K docs in one call to add( List<doc> ) would have the same time as this StreamingHttpSolrServer.  It is probably also similar if you buffer 100? 1,000? at a time, but I have not tried.

The StreamingHttpSolrServer essentially handles the buffering for you.  It keeps an http connection open as long as the Queue has docs to send.  It can start multiple threads and drain the same Queue simultaneously.

Essentially, this just offers an easier interface to get the best possible performance.  The trade off (for now) is that there is no good error reporting.



> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660572#action_12660572 ] 

Noble Paul commented on SOLR-906:
---------------------------------

Please ignore the number 40K docs. I just took it from your perf test numbers. I thought you were writing docs as a list

I am referring to the client code .The method in UpdateRequest
{code}
public Collection<ContentStream> getContentStreams() throws IOException {
    return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
}
{code}

This means that the getXML() method actually constructs a huge String which is the entire xml. It is not very good if we are writing out very large no:of docs

I am suggesting that ComonsHttpSolrServer has scope for improvement. Instead of building that String in memory  we can just start streaming it to the server. So the OutputStream can be passed on to UpdateRequest so that it can write the xml right into the stream. So there is streaming effectively on both ends

This is valid where users do bulk updates. Not when they write one doc at a time. 

The new method SolrServer#add(Iterator<SolrInputDocs> docs) can start writing the docs immedietly and the docs can be uploaded as and when they are being produced. It is not related to these issue exactly, But the intend of this issue is to make upload faster.


SOLR-865 is not very related to this issue. StreamingHttpSolrServer can use javabin format as well.

bq.with the StreamingHttpSolrServer, you can send documents one at a time and each documents starts sending as soon as it can
One drawback of a StreamingHttpSolrServer is that it ends up opening  multiple connections for uploading the documents

Another enhancement . We can add one (or more ) extra thread in the server to do the call updaterequestprocessor.processAdd() . 

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659750#action_12659750 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

Shalin, did you get a change to look at this version?


> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657102#action_12657102 ] 

Shalin Shekhar Mangar commented on SOLR-906:
--------------------------------------------

I've started taking a look at this. A couple of points:

* Instantiating the lock in blockUntilFinished and nulling it can cause a race condition. A thread in the 'add' method can find that the lock is not null, another thread can null it and the first thread proceeds to lock on it leading to NPE. In the same way, creation of multiple locks is possible in the blockUntilFinished method.
* The run method calling itself recursively looks suspicious. We may be in danger of overflowing the stack.
* The SolrExampleTest cannot be used directly because it depends on the order of the commands being executed. We must clearly document that clients should not depend on the order of commands being executed in the same order as they are given.
* The if (req.getCommitWithin() < 0) should be > 0, right?
* The add calls do not come to the process method. Due to this some add calls may still get in before the commit acquires the lock (assuming multiple producers). Is this class strictly meant for a single document producer use-case?
* The wait loop in blockUntilFinished is very CPU intensive. It can probably be optimized.

I'm experimenting with a slightly different implementation. Still trying to tie the loose ends. I hope to have a patch soon.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661440#action_12661440 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

| if we modify the XMLLoader to accept multiple <add>

The XMLLoader already accepts multiple <add> commands without any changes.

The <stream> tag (or <whatever> tag) is added so that the xml is valid -- you can not have multiple roots in an xml document.

Again -- *nothing* has changed with the parser. 

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655363#action_12655363 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

also, using the recent patch with a Queue size = 20 and thread count=3 (on a dual core machine), the indexing time dropped from 30 secs -> 20 secs.  In sum:  with the data I am working with, switch from CommonsHttpSolrServer => StreamingHttpSolrServer changes the index time from 3.5 min => 20 sec, or ~10x faster  

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ian Holsman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655431#action_12655431 ] 

Ian Holsman commented on SOLR-906:
----------------------------------


also..

The way I've always thought commonsHttpServer was designed to work was to batch up the documents in groups and use the add(List<SolrInputDocument>) function instead of the individual add(SolrInputDocument) function. 

is adding documents in batches (say 100 or 1,000 at a time) also as slow as 3.5m ? or wont this work as you won't get the feedback on which document failed to add if there was an error.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657278#action_12657278 ] 

noble.paul edited comment on SOLR-906 at 12/16/08 8:16 PM:
-----------------------------------------------------------

another observation :
why do we need a ScheduledExecutorService we only need a ThreadPoolExecutorService
The name of the class is somewhat misleading. We must document that this may be exclusively used for updates

How about renaming this to StreamingUpdateSolrServer

      was (Author: noble.paul):
    another observation :
why do we need a ScheduledExecutorService we only need a ThreadPoolExecutorService
  
> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657868#action_12657868 ] 

Yonik Seeley commented on SOLR-906:
-----------------------------------

I haven't looked at the code, but can someone outline what the API looks like (i.e. what would typical pseudo code that used the API look like).

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657893#action_12657893 ] 

Yonik Seeley commented on SOLR-906:
-----------------------------------

IMO callbacks can be tougher to deal with and require client code to be multi-threaded (like signal handling in C that caused tons of trouble for people because they weren't thinking about the fact that it could be called at any time concurrently with other code, esp back in the days when C library calls were not re-entrant).

But I guess as you say, polling could be added on top of a callback API if needed later.


> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660607#action_12660607 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

Are you looking at the patch or just brainstorming how this could be implemented?

{panel}
I am referring to the client code .The method in UpdateRequest

public Collection<ContentStream> getContentStreams() throws IOException {
    return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
}

This means that the getXML() method actually constructs a huge String which is the entire xml. It is not very good if we are writing out very large no:of docs
{panel}

This is not how the patch works...  for starters, it never calls getContentStreams() for UpdateRequest.  It opens a single connection and continually dumps the xml for each request.  Rather then call getXML() the patch adds a function writeXml( Writer ) that writes directly to the open buffer.

{panel}
I am suggesting that ComonsHttpSolrServer has scope for improvement. Instead of building that String in memory we can just start streaming it to the server. So the OutputStream can be passed on to UpdateRequest so that it can write the xml right into the stream. So there is streaming effectively on both ends
{panel}
The ComonsHttpSolrServer is fine, but you are right that each UpdateRequest *may* want to write the content directly to the open stream.  The ContentStream interface gives us all that control.  One thing to note is that if you do not specify the length, the HttpCommons server will use chunked encoding.

But I think adding the StreammingUpdateSolrServer resolves that for everyone.  Uses have either option.

{panel}
 One drawback of a StreamingHttpSolrServer is that it ends up opening multiple connections for uploading the documents
{panel}
Nonsense -- that is exactly what this avoids.  It opens a single connection and writes everything to it.  You can configure how many threads you want emptying the queue; each one will open a connection.

{panel}
Another enhancement . We can add one (or more ) extra thread in the server to do the call updaterequestprocessor.processAdd() . 
{panel}
That opens a whole can of worms...  perhaps better discussed on java-dev.  For now I think sticking to the 1 thread/prequest is a good model.  If you want multiple threads running on the server use multiple connections (it is even an argument in the StreammingHttpSolrServer)

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-906:
-------------------------------

    Attachment: SOLR-906-StreamingHttpSolrServer.patch

Here is an updated patch that bufferes UpdateRequests rather then SolrInputDocuments -- this is good because then everything is handled in request( final SolrRequest request ) so blocking can be easier.

Also this lets up submit submit <add commands with the commitWithin syntax.

All /update commands are streamed to server unless waitSearcher==true

Shalin -- can you check this over for threading issues or general improvements?

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657877#action_12657877 ] 

Yonik Seeley commented on SOLR-906:
-----------------------------------

Isn't there a need for a polling API to check for errors?
I think something like the following client code would be easiest for people to deal with:
{code}
while not done:
   myDoc = [...] // make the doc
   solr.addDoc(myDoc)
   List<Error> errors = solr.getErrors();
   if (errors != null) handleErrors(errors)
{code}

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661434#action_12661434 ] 

Noble Paul commented on SOLR-906:
---------------------------------

bq.That is just there so that multiple <add> commands can be in the same XML document.

It may not be necessary to have the <strream> tag , if we modify the XMLLoader to accept multiple <add> we can do away with the extra <stream> tag.

The XML format is a public interface and we should be conservative in adding stuff into that .

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655442#action_12655442 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

The other aspect no note is that StreamingHttpSolrServer sends the request in the background so after you call: add( doc ) or add( list ) that thread is free to keep working.  With the off the shelf CommonsHttpSolrServer the client needs to wait for the server to parse the request and index the data before it can continue.

This switches where the client gets blocked.
 * with CommonsHttpSolrServer it blocks while waiting for solr to *write the response*
 * with StreamingHttpSolrServer it blocks while waiting for solr to *read the request*


> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ian Holsman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655427#action_12655427 ] 

Ian Holsman commented on SOLR-906:
----------------------------------

how much of the 3.5minutes -> 30seconds is due to the logging?
would it be simpler to change the log message to 'DEBUG' instead of 'INFO' and see how the performance of the regular commons server behaves then?


> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660784#action_12660784 ] 

Shalin Shekhar Mangar commented on SOLR-906:
--------------------------------------------

Sorry Ryan for stalling this. The tests run fine. But we need to change the global lock to be final.

What does the "<stream>" tag do?

Please go ahead and take charge. It's all yours.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660662#action_12660662 ] 

Noble Paul commented on SOLR-906:
---------------------------------

Hi Ryan,
You got me wrong. I was trying to say how to make CommonsHttpSolrServer efficient by streaming docs as StreamingHttpSolrServer does when I add docs in bulk using 
{code}
SolrServer.add(List<SolrInputDocument> docs)
{code}

Yes , StreamingHttpSolrServer uses only one connection per thread and it closes the connection after waiting for 250ms for a new document.

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-906:
-------------------------------

    Attachment: SOLR-906-StreamingHttpSolrServer.patch

updated version that includes the callback function:
{code:java}
  public void handleError( Throwable ex )
  {
    log.error( "error", ex );
  }
{code}

this gets called whenever an error occurs.  We could have that keep a list of any errors or whatever.  I'm not sure what the default error behavior should be.  Collecting everything in a list?  just logging?

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655917#action_12655917 ] 

Shalin Shekhar Mangar commented on SOLR-906:
--------------------------------------------

Ryan, I'm seeing compile errors related to @Override with interface methods (that's a Java 6 feature). Also, new IOException( e ) is not defined (also Java 6, I guess).

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655298#action_12655298 ] 

Ryan McKinley commented on SOLR-906:
------------------------------------

One basic problem with calling add( SolrInputDocument) with the CommonsHttpSolrServer is that it logs a request for each document.  This can be a substantial impact.  For example while indexing 40K docs on my machine, it takes ~3 1/2 mins.  If I turn logging off the time drops to ! 2 1/2 mins.  With the streaming approach, the time drops to 20sec!   Some of that is obviously because it limits the logging:
{code}
INFO: {add=[id1,id2,id3,id4, ...(38293 more)]} 0 20714
{code}

> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>             Fix For: 1.4
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementaion

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657896#action_12657896 ] 

Yonik Seeley commented on SOLR-906:
-----------------------------------

bq. public void handleError(Throwable ex)

Ideally, we could give the user back
 - the actual error response (http error code, and the body?)
 - the InputDocument that caused the failure, if there was one
 - what else?

So it seems like we should add these parameters to handleError, and just document that they currently return null if we can't get that info yet.  Or perhaps more extensible, a SolrError class that has these fields?


> Buffered / Streaming SolrServer implementaion
> ---------------------------------------------
>
>                 Key: SOLR-906
>                 URL: https://issues.apache.org/jira/browse/SOLR-906
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java
>            Reporter: Ryan McKinley
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java
>
>
> While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less then optimal.  This makes a new request for each document.
> With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.
> For related discussion see:
> http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.