Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/08/04 21:01:45 UTC

Solrj ContentStreamUpdateRequest Slow

I'm running a slight variation of the example code referenced below and 
it takes a real long time to finally execute.  In fact it hangs for a 
long time at solr.request(up) before finally executing.  Is there 
anything I can look at or tweak to improve performance?

I am also indexing a local pdf file, there are no firewall issues, solr 
is running on the same machine, and I tried the actual host name in 
addition to localhost but nothing helps.


Thanks - Tod

http://wiki.apache.org/solr/ContentStreamUpdateRequestExample

Re: Solrj ContentStreamUpdateRequest Slow

Posted by Chris Hostetter <ho...@fucit.org>.
:     ContentStreamUpdateRequest req = new
: ContentStreamUpdateRequest("/update/extract");
: 
: System.out.println("setting params...");
:     req.setParam("stream.url", fileName);
:     req.setParam("literal.content_id", solrId);

ContentStreamUpdateRequest exists so that you can stream content directly 
from the client to the server -- you aren't doing that, you are asking the 
server to go fetch the stream.url itself.

The NullPointerException happens because you've never called 
ContentStreamUpdateRequest.addFile or 
ContentStreamUpdateRequest.addContentStream so it gets into a state where 
it doesn't know what it's doing (admittedly the error message is less than 
ideal).

If you just use a plain old regular "UpdateRequest" (or even a 
"QueryRequest") instead, your code works as written.

-Hoss



Re: Solrj ContentStreamUpdateRequest Slow

Posted by Lance Norskog <go...@gmail.com>.
There are no unit tests for stream.file or stream.url. Tests in
org.apache.solr.handler.TestCSVLoader.filename:loadLocal() intercept
them and do their own thing, feeding a local file instead of the
stream.file parameter. I see no proof that stream.file/stream.url
should work in SolrJ or in EmbeddedSolr (which uses the SolrJ API).
This comment is in all known source trees:

    // TODO: stop using locally defined streams once stream.file and
    // stream.body work everywhere

Sorry, you're stuck with the command line. Apologies for giving you
bad advice. I have filed a JIRA about the lack of unit tests:

https://issues.apache.org/jira/browse/SOLR-2060

Please add your source code if you are confident it should work, but does not.

Lance

On Thu, Aug 19, 2010 at 7:45 AM, Tod <li...@gmail.com> wrote:
> On 8/19/2010 1:45 AM, Lance Norskog wrote:
>>
>> 'stream.url' is just a simple parameter. You should be able to just
>> add it directly.
>
>
> I agree (code excluding imports):
>
> public class CommonTest {
>
>  public static void main(String[] args) {
> System.out.println("main...");
>    try {
>      String fileName = "http://remoteserver/test/test.pdf";
>      String solrId = "1234";
>      indexFilesSolrCell(fileName, solrId);
>
>    } catch (Exception ex) {
>      ex.printStackTrace();
>    }
>  }
>
>  /**
>   * Method to index all types of files into Solr.
>   * @param fileName
>   * @param solrId
>   * @throws IOException
>   * @throws SolrServerException
>   */
>  public static void indexFilesSolrCell(String fileName, String solrId)
>    throws IOException, SolrServerException {
>
> System.out.println("indexFilesSolrCell...");
>
>    String urlString = "http://localhost:9080/solr";
>
> System.out.println("getting connection...");
>    SolrServer solr = new CommonsHttpSolrServer(urlString);
>
> System.out.println("getting updaterequest handle...");
>    ContentStreamUpdateRequest req = new
> ContentStreamUpdateRequest("/update/extract");
>
> System.out.println("setting params...");
>    req.setParam("stream.url", fileName);
>    req.setParam("literal.content_id", solrId);
>
> System.out.println("making request...");
>    solr.request(req);
>
> System.out.println("committing...");
>    solr.commit();
>
> System.out.println("done...");
>  }
> }
>
>
> At "making request" I get:
>
> java.lang.NullPointerException
>        at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:381)
>        at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>        at CommonTest.indexFilesSolrCell(CommonTest.java:59)
>        at CommonTest.main(CommonTest.java:26)
>
> ... which is pointing to the solr.request(req) line.
>
>
>
> Thanks - Tod
>



-- 
Lance Norskog
goksron@gmail.com

Re: Solrj ContentStreamUpdateRequest Slow

Posted by Tod <li...@gmail.com>.
On 8/19/2010 1:45 AM, Lance Norskog wrote:
> 'stream.url' is just a simple parameter. You should be able to just
> add it directly.


I agree (code excluding imports):

public class CommonTest {

   public static void main(String[] args) {
System.out.println("main...");
     try {
       String fileName = "http://remoteserver/test/test.pdf";
       String solrId = "1234";
       indexFilesSolrCell(fileName, solrId);

     } catch (Exception ex) {
       ex.printStackTrace();
     }
   }

   /**
    * Method to index all types of files into Solr.
    * @param fileName
    * @param solrId
    * @throws IOException
    * @throws SolrServerException
    */
   public static void indexFilesSolrCell(String fileName, String solrId)
     throws IOException, SolrServerException {

System.out.println("indexFilesSolrCell...");

     String urlString = "http://localhost:9080/solr";

System.out.println("getting connection...");
     SolrServer solr = new CommonsHttpSolrServer(urlString);

System.out.println("getting updaterequest handle...");
     ContentStreamUpdateRequest req = new 
ContentStreamUpdateRequest("/update/extract");

System.out.println("setting params...");
     req.setParam("stream.url", fileName);
     req.setParam("literal.content_id", solrId);

System.out.println("making request...");
     solr.request(req);

System.out.println("committing...");
     solr.commit();

System.out.println("done...");
   }
}


At "making request" I get:

java.lang.NullPointerException
         at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:381)
         at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
         at CommonTest.indexFilesSolrCell(CommonTest.java:59)
         at CommonTest.main(CommonTest.java:26)

... which is pointing to the solr.request(req) line.



Thanks - Tod

Re: Solrj ContentStreamUpdateRequest Slow

Posted by Lance Norskog <go...@gmail.com>.
'stream.url' is just a simple parameter. You should be able to just
add it directly.

On Wed, Aug 18, 2010 at 5:35 AM, Tod <li...@gmail.com> wrote:
> On 8/16/2010 6:12 PM, Chris Hostetter wrote:
>>
>> : > I think your problem may be that StreamingUpdateSolrServer buffers up
>> : > commands and sends them in batches in a background thread.  if you want to
>> : > send individual updates in real time (and time them) you should just use
>> : > CommonsHttpSolrServer
>> :
>> : My goal is to batch updates.  My content lives somewhere else so I was trying
>> : to find a way to tell Solr where the document lived so it could go out and
>> : stream it into the index for me.  That's where I thought
>> : StreamingUpdateSolrServer would help.
>>
>> If your content lives on a machine which is not your "client" nor your
>> "server" and you want your client to tell your server to go fetch it
>> directly then the "stream.url" param is what you need -- that is unrelated
>> to whether you use StreamingUpdateSolrServer or not.
>
>
> Do you happen to have a code fragment lying around that demonstrates using
> CommonsHttpSolrServer and "stream.url"?  I've tried it in conjunction with
> ContentStreamUpdateRequest and I keep getting an annoying null pointer
> exception.  In the meantime I will check the examples...
>
>
>
>> Thinking about it some more, i suspect the reason you might be seeing a
>> delay when using StreamingUpdateSolrServer is because of this bug...
>>
>>   https://issues.apache.org/jira/browse/SOLR-1990
>>
>> ...if there are no actual documents in your UpdateRequest (because you are
>> using the stream.url param) then the StreamingUpdateSolrServer blocks until
>> all other requests are done, then delegates to the super class (so it never
>> actually puts your indexing requests in a buffered queue, it just delays and
>> then does them immediately)
>>
>> Not sure of a good way around this off the top of my head, but i'll note
>> it in SOLR-1990 as another problematic use case that needs to be dealt with.
>
> Perhaps I can execute an initial update request using a benign file before
> making the "stream.url" call?
>
> Also, to beat a dead horse, this:
> 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'
>
> ... works fine - I just want to do it a LOT and as efficiently as possible.
>  If I have to I can wrap it in a perl script and run a cURL or LWP loop but
> I'd prefer to use SolrJ if I can.
>
> Thanks for all your help.
>
>
> - Tod
>



-- 
Lance Norskog
goksron@gmail.com

Re: Solrj ContentStreamUpdateRequest Slow

Posted by Tod <li...@gmail.com>.
On 8/16/2010 6:12 PM, Chris Hostetter wrote:
> : > I think your problem may be that StreamingUpdateSolrServer buffers up
> : > commands and sends them in batches in a background thread.  if you want to
> : > send individual updates in real time (and time them) you should just use
> : > CommonsHttpSolrServer
> : 
> : My goal is to batch updates.  My content lives somewhere else so I was trying
> : to find a way to tell Solr where the document lived so it could go out and
> : stream it into the index for me.  That's where I thought
> : StreamingUpdateSolrServer would help.
> 
> If your content lives on a machine which is not your "client" nor your 
> "server" and you want your client to tell your server to go fetch it 
> directly then the "stream.url" param is what you need -- that is unrelated 
> to whether you use StreamingUpdateSolrServer or not.


Do you happen to have a code fragment lying around that demonstrates 
using CommonsHttpSolrServer and "stream.url"?  I've tried it in 
conjunction with ContentStreamUpdateRequest and I keep getting an 
annoying null pointer exception.  In the meantime I will check the 
examples...



> Thinking about it some more, i suspect the reason you might be seeing a 
> delay when using StreamingUpdateSolrServer is because of this bug...
> 
>    https://issues.apache.org/jira/browse/SOLR-1990
> 
> ...if there are no actual documents in your UpdateRequest (because you are 
> using the stream.url param) then the StreamingUpdateSolrServer blocks 
> until all other requests are done, then delegates to the super class (so 
> it never actually puts your indexing requests in a buffered queue, it just 
> delays and then does them immediately)
> 
> Not sure of a good way around this off the top of my head, but i'll note 
> it in SOLR-1990 as another problematic use case that needs to be dealt with.

Perhaps I can execute an initial update request using a benign file 
before making the "stream.url" call?

Also, to beat a dead horse, this:
'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'

... works fine - I just want to do it a LOT and as efficiently as 
possible.  If I have to I can wrap it in a perl script and run a cURL or 
LWP loop but I'd prefer to use SolrJ if I can.
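
For anyone who ends up doing the "loop over URLs" approach from Java rather than perl, here is a minimal sketch that uses only the JDK. It assumes the same /update/extract URL shape as the curl command above, and it assumes that committing once at the end (via /update?commit=true) is acceptable instead of commit=true on every request. The class name BatchExtractSketch, the helper names, and the sample document URLs are made up for illustration; this only builds the request URLs and is not SolrJ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchExtractSketch {

    // URL-encode a single query parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // One extract request per document (without commit), then one commit at the end.
    // Keys are content ids, values are the remote document URLs (both hypothetical).
    static List<String> buildBatch(String solrBase, Map<String, String> idToDocUrl) {
        List<String> requests = new ArrayList<String>();
        for (Map.Entry<String, String> e : idToDocUrl.entrySet()) {
            requests.add(solrBase + "/update/extract"
                    + "?stream.url=" + enc(e.getValue())
                    + "&literal.content_id=" + enc(e.getKey()));
        }
        requests.add(solrBase + "/update?commit=true");
        return requests;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("1", "http://remoteserver/test/a.pdf");
        docs.put("2", "http://remoteserver/test/b.pdf");
        // Print the requests in the order they would be issued.
        for (String request : buildBatch("http://localhost:8080/solr", docs)) {
            System.out.println(request);
        }
    }
}
```

Each printed URL can then be fetched with any HTTP client. Whether a single trailing commit is actually faster on a given setup is something to measure, not something this sketch shows.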

Thanks for all your help.


- Tod

Re: Solrj ContentStreamUpdateRequest Slow

Posted by Chris Hostetter <ho...@fucit.org>.
: > I think your problem may be that StreamingUpdateSolrServer buffers up
: > commands and sends them in batches in a background thread.  if you want to
: > send individual updates in real time (and time them) you should just use
: > CommonsHttpSolrServer
: 
: My goal is to batch updates.  My content lives somewhere else so I was trying
: to find a way to tell Solr where the document lived so it could go out and
: stream it into the index for me.  That's where I thought
: StreamingUpdateSolrServer would help.

If your content lives on a machine which is not your "client" nor your 
"server" and you want your client to tell your server to go fetch it 
directly then the "stream.url" param is what you need -- that is unrelated 
to whether you use StreamingUpdateSolrServer or not.

Thinking about it some more, i suspect the reason you might be seeing a 
delay when using StreamingUpdateSolrServer is because of this bug...

   https://issues.apache.org/jira/browse/SOLR-1990

...if there are no actual documents in your UpdateRequest (because you are 
using the stream.url param) then the StreamingUpdateSolrServer blocks 
until all other requests are done, then delegates to the super class (so 
it never actually puts your indexing requests in a buffered queue, it just 
delays and then does them immediately)

Not sure of a good way around this off the top of my head, but i'll note 
it in SOLR-1990 as another problematic use case that needs to be dealt with.
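
To make the behaviour described above concrete, here is a toy model using only the JDK. It is not the StreamingUpdateSolrServer source, and every name in it (BufferedSenderSketch, request, log, the labels) is invented for this sketch. It only mimics the described logic: requests that carry documents are handed to a background worker, while a request with no documents first waits for all queued work and is then executed directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BufferedSenderSketch {

    // Records the order in which work is actually carried out.
    final List<String> log = Collections.synchronizedList(new ArrayList<String>());

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<Future<?>> pending = new ArrayList<Future<?>>();

    // Requests with documents are queued; requests without documents
    // (e.g. ones that only carry a stream.url parameter) block, then run directly.
    void request(List<String> docs, String label) {
        if (docs.isEmpty()) {
            blockUntilFinished();
            log.add("direct:" + label);
        } else {
            pending.add(worker.submit(() -> {
                for (String doc : docs) {
                    log.add("queued:" + doc);
                }
            }));
        }
    }

    // Wait for every queued task to complete.
    void blockUntilFinished() {
        try {
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
        pending.clear();
    }

    // Finish outstanding work and stop the background thread.
    void close() {
        blockUntilFinished();
        worker.shutdown();
    }

    public static void main(String[] args) {
        BufferedSenderSketch sender = new BufferedSenderSketch();
        sender.request(Arrays.asList("a", "b"), "batch");
        sender.request(Collections.<String>emptyList(), "stream.url");
        sender.close();
        System.out.println(sender.log);
    }
}
```

In this model the document-less request never benefits from the queue; it simply pays for whatever is already queued ahead of it, which matches the delay described above.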


-Hoss


Re: Solrj ContentStreamUpdateRequest Slow

Posted by Tod <li...@gmail.com>.
On 8/12/2010 8:02 PM, Chris Hostetter wrote:
> : It returns in around a second.  When I execute the attached code it takes just
> : over three minutes.  Ideally, I'd like SolrJ to get closer to the
> : performance I'm seeing with curl.
> 
> I think your problem may be that StreamingUpdateSolrServer buffers up 
> commands and sends them in batches in a background thread.  if you want to 
> send individual updates in real time (and time them) you should just use 
> CommonsHttpSolrServer
> 
> 
> -Hoss


My goal is to batch updates.  My content lives somewhere else so I was 
trying to find a way to tell Solr where the document lived so it could 
go out and stream it into the index for me.  That's where I thought 
StreamingUpdateSolrServer would help.

- Tod

Re: Solrj ContentStreamUpdateRequest Slow

Posted by Chris Hostetter <ho...@fucit.org>.
: It returns in around a second.  When I execute the attached code it takes just
: over three minutes.  Ideally, I'd like SolrJ to get closer to the
: performance I'm seeing with curl.

I think your problem may be that StreamingUpdateSolrServer buffers up 
commands and sends them in batches in a background thread.  if you want to 
send individual updates in real time (and time them) you should just use 
CommonsHttpSolrServer


-Hoss


Re: Solrj ContentStreamUpdateRequest Slow

Posted by Tod <li...@gmail.com>.
On 8/4/2010 11:11 PM, jayendra patil wrote:
> ContentStreamUpdateRequest seems to read the file contents and transfer it
> over http, which slows down the indexing.
> 
> Try Using StreamingUpdateSolrServer with stream.file param @
> http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post
> 
> e.g.
> 
> SolrServer server = new StreamingUpdateSolrServer("Solr Server URL",20,8);
> UpdateRequest req = new UpdateRequest("/update/extract");
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.add("stream.file", new String[]{"local file path"});
> params.set("literal.id", value);
> req.setParams(params);
> server.request(req);
> server.commit();

Thanks for your suggestions.  Unfortunately, I'm still seeing poor 
performance.

To be clear, I am trying to have SOLR index multiple documents that 
exist on a remote server.  I'd prefer that SOLR stream the documents 
after I pass a pointer to them rather than me retrieving and pushing 
them so I can avoid network overhead.

When I do this:

curl 
'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'

It returns in around a second.  When I execute the attached code it 
takes just over three minutes.  Ideally, I'd like SolrJ to get closer 
to the performance I'm seeing with curl.
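
As a baseline for comparison, the curl request above can be reproduced from plain Java without SolrJ. The sketch below is only that: a JDK-only approximation of the curl call, with the query parameter values URL-encoded. The class name ExtractUrlSketch, the helper names, and the "--send" argument are made up for illustration. It assumes Solr answers a GET on /update/extract the same way it answers curl, and that remote streaming is enabled on the server, which the working curl call suggests.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;

public class ExtractUrlSketch {

    // URL-encode a single query parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Build the same request the curl command sends, with encoded parameter values.
    static String buildExtractUrl(String solrBase, String docUrl, String contentId, boolean commit) {
        StringBuilder sb = new StringBuilder(solrBase).append("/update/extract");
        sb.append("?stream.url=").append(enc(docUrl));
        sb.append("&stream.contentType=").append(enc("application/pdf"));
        sb.append("&literal.content_id=").append(enc(contentId));
        if (commit) {
            sb.append("&commit=true");
        }
        return sb.toString();
    }

    // Issue the GET and return the HTTP status code.
    static int send(String requestUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(requestUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = buildExtractUrl("http://localhost:8080/solr",
                "http://remote_server.mydomain.com/test.pdf", "12342", true);
        System.out.println(url);
        // Only touch the network when explicitly asked to ("--send" is a made-up flag).
        if (args.length > 0 && "--send".equals(args[0])) {
            long start = System.nanoTime();
            int status = send(url);
            long millis = (System.nanoTime() - start) / 1000000L;
            System.out.println("HTTP " + status + " in " + millis + " ms");
        }
    }
}
```

Timing this against the SolrJ version should show whether the three minutes are spent in Solr itself or on the client side.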

To be fair the SOLR server I am using is really a workstation class 
machine, plus I am still learning.  I have a feeling I'm doing something 
dumb but just can't seem to pinpoint the exact problem.


Thanks - Tod


--------code-----------


import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;

import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;


/**
  * @author EDaniel
  */
public class SolrExampleTests {

   public static void main(String[] args) {
System.out.println("main...");
     try {
//      String fileName = "/test/test.pdf";
       String fileName = "http://remoteserver/test/test.pdf";
       String solrId = "1234";
       indexFilesSolrCell(fileName, solrId);

     } catch (Exception ex) {
       System.out.println(ex.toString());
     }
   }

   /**
    * Method to index all types of files into Solr.
    * @param fileName
    * @param solrId
    * @throws IOException
    * @throws SolrServerException
    */
   public static void indexFilesSolrCell(String fileName, String solrId)
     throws IOException, SolrServerException {

System.out.println("indexFilesSolrCell...");

     String urlString = "http://localhost:8080/solr";

System.out.println("getting connection...");
//    SolrServer solr = new CommonsHttpSolrServer(urlString);
     SolrServer solr = new StreamingUpdateSolrServer(urlString,100,5);

System.out.println("getting updaterequest handle...");
//    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
     UpdateRequest up = new UpdateRequest("/update/extract");

     ModifiableSolrParams params = new ModifiableSolrParams();
//    params.add("stream.file", fileName);
     params.add("stream.url", fileName);
     params.set("literal.content_id", solrId);
     up.setParams(params);

System.out.println("making request...");
     solr.request(up);

System.out.println("committing...");
     solr.commit();

System.out.println("done...");
   }
}

Re: Solrj ContentStreamUpdateRequest Slow

Posted by jayendra patil <ja...@gmail.com>.
ContentStreamUpdateRequest seems to read the file contents and transfer it
over http, which slows down the indexing.

Try Using StreamingUpdateSolrServer with stream.file param @
http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post

e.g.

SolrServer server = new StreamingUpdateSolrServer("Solr Server URL",20,8);
UpdateRequest req = new UpdateRequest("/update/extract");
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("stream.file", new String[]{"local file path"});
params.set("literal.id", value);
req.setParams(params);
server.request(req);
server.commit();

Regards,
Jayendra

On Wed, Aug 4, 2010 at 3:01 PM, Tod <li...@gmail.com> wrote:

> I'm running a slight variation of the example code referenced below and it
> takes a real long time to finally execute.  In fact it hangs for a long time
> at solr.request(up) before finally executing.  Is there anything I can look
> at or tweak to improve performance?
>
> I am also indexing a local pdf file, there are no firewall issues, solr is
> running on the same machine, and I tried the actual host name in addition to
> localhost but nothing helps.
>
>
> Thanks - Tod
>
> http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
>