Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/08/04 21:01:45 UTC
Solrj ContentStreamUpdateRequest Slow
I'm running a slight variation of the example code referenced below and
it takes a real long time to finally execute. In fact it hangs for a
long time at solr.request(up) before finally executing. Is there
anything I can look at or tweak to improve performance?
I am also indexing a local pdf file, there are no firewall issues, solr
is running on the same machine, and I tried the actual host name in
addition to localhost but nothing helps.
Thanks - Tod
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Chris Hostetter <ho...@fucit.org>.
: ContentStreamUpdateRequest req = new
: ContentStreamUpdateRequest("/update/extract");
:
: System.out.println("setting params...");
: req.setParam("stream.url", fileName);
: req.setParam("literal.content_id", solrId);
ContentStreamUpdateRequest exists so that you can stream content directly
from the client to the server -- you aren't doing that, you are asking the
server to go fetch the stream.url itself.
The NullPointerException happens because you've never called
ContentStreamUpdateRequest.addFile or
ContentStreamUpdateRequest.addContentStream so it gets into a state where
it doesn't know what it's doing (admittedly the error message is less than
ideal)
If you just use a plain old regular "UpdateRequest" (or even a
"QueryRequest") instead, your code works as written.
-Hoss
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Lance Norskog <go...@gmail.com>.
There are no unit tests for stream.file or stream.url. Tests in
org.apache.solr.handler.TestCSVLoader.filename:loadLocal() intercept
them and do their own thing, feeding a local file instead of the
stream.file parameter. I see no proof that stream.file/stream.url
should work in SolrJ or in EmbeddedSolr (which uses the SolrJ API).
This comment is in all known source trees:
// TODO: stop using locally defined streams once stream.file and
// stream.body work everywhere
Sorry, you're stuck with the command line. Apologies for giving you
bad advice. I have filed a JIRA about the lack of unit tests:
https://issues.apache.org/jira/browse/SOLR-2060
Please add your source code if you are confident it should work, but does not.
Lance
On Thu, Aug 19, 2010 at 7:45 AM, Tod <li...@gmail.com> wrote:
> On 8/19/2010 1:45 AM, Lance Norskog wrote:
>>
>> 'stream.url' is just a simple parameter. You should be able to just
>> add it directly.
>
>
> I agree (code excluding imports):
>
> public class CommonTest {
>
> public static void main(String[] args) {
> System.out.println("main...");
> try {
> String fileName = "http://remoteserver/test/test.pdf";
> String solrId = "1234";
> indexFilesSolrCell(fileName, solrId);
>
> } catch (Exception ex) {
> ex.printStackTrace();
> }
> }
>
> /**
> * Method to index all types of files into Solr.
> * @param fileName
> * @param solrId
> * @throws IOException
> * @throws SolrServerException
> */
> public static void indexFilesSolrCell(String fileName, String solrId)
> throws IOException, SolrServerException {
>
> System.out.println("indexFilesSolrCell...");
>
> String urlString = "http://localhost:9080/solr";
>
> System.out.println("getting connection...");
> SolrServer solr = new CommonsHttpSolrServer(urlString);
>
> System.out.println("getting updaterequest handle...");
> ContentStreamUpdateRequest req = new
> ContentStreamUpdateRequest("/update/extract");
>
> System.out.println("setting params...");
> req.setParam("stream.url", fileName);
> req.setParam("literal.content_id", solrId);
>
> System.out.println("making request...");
> solr.request(req);
>
> System.out.println("committing...");
> solr.commit();
>
> System.out.println("done...");
> }
> }
>
>
> At "making request" I get:
>
> java.lang.NullPointerException
> at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:381)
> at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
> at CommonTest.indexFilesSolrCell(CommonTest.java:59)
> at CommonTest.main(CommonTest.java:26)
>
> ... which is pointing to the solr.request(req) line.
>
>
>
> Thanks - Tod
>
--
Lance Norskog
goksron@gmail.com
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Tod <li...@gmail.com>.
On 8/19/2010 1:45 AM, Lance Norskog wrote:
> 'stream.url' is just a simple parameter. You should be able to just
> add it directly.
I agree (code excluding imports):
public class CommonTest {

    public static void main(String[] args) {
        System.out.println("main...");
        try {
            String fileName = "http://remoteserver/test/test.pdf";
            String solrId = "1234";
            indexFilesSolrCell(fileName, solrId);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    /**
     * Method to index all types of files into Solr.
     * @param fileName
     * @param solrId
     * @throws IOException
     * @throws SolrServerException
     */
    public static void indexFilesSolrCell(String fileName, String solrId)
            throws IOException, SolrServerException {
        System.out.println("indexFilesSolrCell...");
        String urlString = "http://localhost:9080/solr";
        System.out.println("getting connection...");
        SolrServer solr = new CommonsHttpSolrServer(urlString);
        System.out.println("getting updaterequest handle...");
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        System.out.println("setting params...");
        req.setParam("stream.url", fileName);
        req.setParam("literal.content_id", solrId);
        System.out.println("making request...");
        solr.request(req);
        System.out.println("committing...");
        solr.commit();
        System.out.println("done...");
    }
}
At "making request" I get:
java.lang.NullPointerException
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:381)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
at CommonTest.indexFilesSolrCell(CommonTest.java:59)
at CommonTest.main(CommonTest.java:26)
... which is pointing to the solr.request(req) line.
Thanks - Tod
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Lance Norskog <go...@gmail.com>.
'stream.url' is just a simple parameter. You should be able to just
add it directly.
On Wed, Aug 18, 2010 at 5:35 AM, Tod <li...@gmail.com> wrote:
> On 8/16/2010 6:12 PM, Chris Hostetter wrote:
>>
>> : > I think your problem may be that StreamingUpdateSolrServer buffers up
>> : > commands and sends them in batches in a background thread. if you want to
>> : > send individual updates in real time (and time them) you should just use
>> : > CommonsHttpSolrServer
>> :
>> : My goal is to batch updates. My content lives somewhere else so I was trying
>> : to find a way to tell Solr where the document lived so it could go out and
>> : stream it into the index for me. That's where I thought
>> : StreamingUpdateSolrServer would help.
>>
>> If your content lives on a machine which is not your "client" nor your
>> "server" and you want your client to tell your server to go fetch it
>> directly then the "stream.url" param is what you need -- that is unrelated
>> to whether you use StreamingUpdateSolrServer or not.
>
>
> Do you happen to have a code fragment laying around that demonstrates using
> CommonsHttpSolrServer and "stream.url"? I've tried it in conjunction with
> ContentStreamUpdateRequest and I keep getting an annoying null pointer
> exception. In the meantime I will check the examples...
>
>
>
>> Thinking about it some more, i suspect the reason you might be seeing a
>> delay when using StreamingUpdateSolrServer is because of this bug...
>>
>> https://issues.apache.org/jira/browse/SOLR-1990
>>
>> ...if there are no actual documents in your UpdateRequest (because you are
>> using the stream.url param) then the StreamingUpdateSolrServer blocks until
>> all other requests are done, then delegates to the super class (so it never
>> actually puts your indexing requests in a buffered queue, it just delays and
>> then does them immediately)
>>
>> Not sure of a good way around this off the top of my head, but i'll note
>> it in SOLR-1990 as another problematic use case that needs dealt with.
>
> Perhaps I can execute an initial update request using a benign file before
> making the "stream.url" call?
>
> Also, to beat a dead horse, this:
> 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'
>
> ... works fine - I just want to do it a LOT and as efficiently as possible.
> If I have to I can wrap it in a perl script and run a cURL or LWP loop but
> I'd prefer to use SolrJ if I can.
>
> Thanks for all your help.
>
>
> - Tod
>
--
Lance Norskog
goksron@gmail.com
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Tod <li...@gmail.com>.
On 8/16/2010 6:12 PM, Chris Hostetter wrote:
> : > I think your problem may be that StreamingUpdateSolrServer buffers up
> : > commands and sends them in batches in a background thread. if you want to
> : > send individual updates in real time (and time them) you should just use
> : > CommonsHttpSolrServer
> :
> : My goal is to batch updates. My content lives somewhere else so I was trying
> : to find a way to tell Solr where the document lived so it could go out and
> : stream it into the index for me. That's where I thought
> : StreamingUpdateSolrServer would help.
>
> If your content lives on a machine which is not your "client" nor your
> "server" and you want your client to tell your server to go fetch it
> directly then the "stream.url" param is what you need -- that is unrelated
> to whether you use StreamingUpdateSolrServer or not.
Do you happen to have a code fragment laying around that demonstrates
using CommonsHttpSolrServer and "stream.url"? I've tried it in
conjunction with ContentStreamUpdateRequest and I keep getting an
annoying null pointer exception. In the meantime I will check the
examples...
> Thinking about it some more, i suspect the reason you might be seeing a
> delay when using StreamingUpdateSolrServer is because of this bug...
>
> https://issues.apache.org/jira/browse/SOLR-1990
>
> ...if there are no actual documents in your UpdateRequest (because you are
> using the stream.url param) then the StreamingUpdateSolrServer blocks
> until all other requests are done, then delegates to the super class (so
> it never actually puts your indexing requests in a buffered queue, it just
> delays and then does them immediately)
>
> Not sure of a good way around this off the top of my head, but i'll note
> it in SOLR-1990 as another problematic use case that needs dealt with.
Perhaps I can execute an initial update request using a benign file
before making the "stream.url" call?
Also, to beat a dead horse, this:
'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'
... works fine - I just want to do it a LOT and as efficiently as
possible. If I have to I can wrap it in a perl script and run a cURL or
LWP loop but I'd prefer to use SolrJ if I can.
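For what it's worth, the cURL loop can also be approximated from plain Java without SolrJ. The sketch below is only an illustration under stated assumptions: StreamUrlBatch and its methods are made-up names, it uses the JDK's HttpURLConnection rather than any SolrJ API, every document is assumed to be a PDF, and it assumes that sending commit=true only with the last request of a batch is acceptable.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; it is not part of SolrJ. */
class StreamUrlBatch {

    /** Percent-encodes one query-string component as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required by the Java platform", e);
        }
    }

    /**
     * Builds one /update/extract URL per document (key = content_id,
     * value = remote document URL). Only the last URL carries commit=true,
     * so the whole batch is committed once instead of once per document.
     */
    static List<String> buildUrls(String solrBase, Map<String, String> docUrlsById) {
        List<String> urls = new ArrayList<>();
        int remaining = docUrlsById.size();
        for (Map.Entry<String, String> doc : docUrlsById.entrySet()) {
            remaining--;
            StringBuilder url = new StringBuilder(solrBase)
                    .append("/update/extract")
                    .append("?stream.url=").append(enc(doc.getValue()))
                    .append("&stream.contentType=").append(enc("application/pdf"))
                    .append("&literal.content_id=").append(enc(doc.getKey()));
            if (remaining == 0) {
                url.append("&commit=true");
            }
            urls.add(url.toString());
        }
        return urls;
    }

    /** Issues each URL as an HTTP GET, as the cURL command does; fails on a non-200 status. */
    static void send(List<String> urls) throws IOException {
        for (String u : urls) {
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            try {
                int status = conn.getResponseCode();
                if (status != 200) {
                    throw new IOException("HTTP " + status + " for " + u);
                }
            } finally {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("12342", "http://remote_server.mydomain.com/test.pdf");
        for (String u : buildUrls("http://localhost:8080/solr", docs)) {
            System.out.println(u);
        }
        // send(...) is not called here so the sketch runs without a Solr instance.
    }
}
```

The parameter values are percent-encoded here, which the hand-typed cURL URL gets away without only because the sample document URL contains no characters such as & or spaces.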
Thanks for all your help.
- Tod
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Chris Hostetter <ho...@fucit.org>.
: > I think your problem may be that StreamingUpdateSolrServer buffers up
: > commands and sends them in batches in a background thread. if you want to
: > send individual updates in real time (and time them) you should just use
: > CommonsHttpSolrServer
:
: My goal is to batch updates. My content lives somewhere else so I was trying
: to find a way to tell Solr where the document lived so it could go out and
: stream it into the index for me. That's where I thought
: StreamingUpdateSolrServer would help.
If your content lives on a machine which is not your "client" nor your
"server" and you want your client to tell your server to go fetch it
directly then the "stream.url" param is what you need -- that is unrelated
to whether you use StreamingUpdateSolrServer or not.
Thinking about it some more, i suspect the reason you might be seeing a
delay when using StreamingUpdateSolrServer is because of this bug...
https://issues.apache.org/jira/browse/SOLR-1990
...if there are no actual documents in your UpdateRequest (because you are
using the stream.url param) then the StreamingUpdateSolrServer blocks
until all other requests are done, then delegates to the super class (so
it never actually puts your indexing requests in a buffered queue, it just
delays and then does them immediately)
Not sure of a good way around this off the top of my head, but i'll note
it in SOLR-1990 as another problematic use case that needs dealt with.
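The behaviour described above can be pictured with a small toy model. This is not SolrJ code and makes no claim about how StreamingUpdateSolrServer is actually written: ToyBufferingClient and all of its methods are hypothetical, and the background sender threads are replaced by draining the queue inline so that the ordering is deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

/** Toy model only; every name here is hypothetical and none of it is SolrJ. */
class ToyBufferingClient {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> sent = new ArrayList<>();

    /** Requests that carry documents are only queued; nothing is sent yet. */
    void request(List<String> docs, String params) {
        if (docs.isEmpty()) {
            // Nothing to buffer (for example only a stream.url parameter):
            // wait for all queued work, then send this request directly.
            blockUntilFinished();
            sent.add("direct " + params);
            return;
        }
        pending.addAll(docs);
    }

    /** Stands in for waiting on background senders; here the queue is drained inline. */
    void blockUntilFinished() {
        while (!pending.isEmpty()) {
            sent.add("buffered " + pending.remove());
        }
    }

    /** Read-only view of what has reached the "server", in order. */
    List<String> sent() {
        return Collections.unmodifiableList(sent);
    }

    public static void main(String[] args) {
        ToyBufferingClient client = new ToyBufferingClient();
        client.request(Arrays.asList("doc1", "doc2"), "");
        client.request(Collections.<String>emptyList(),
                "stream.url=http://remoteserver/test/test.pdf");
        System.out.println(client.sent());
    }
}
```

In this model the stream.url request never enters the queue; it simply pays for everything queued before it and is then sent on its own, which is the delay being described.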
-Hoss
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Tod <li...@gmail.com>.
On 8/12/2010 8:02 PM, Chris Hostetter wrote:
> : It returns in around a second. When I execute the attached code it takes just
> : over three minutes. The optimal for me would be to get closer to the
> : performance I'm seeing with curl using Solrj.
>
> I think your problem may be that StreamingUpdateSolrServer buffers up
> commands and sends them in batches in a background thread. if you want to
> send individual updates in real time (and time them) you should just use
> CommonsHttpSolrServer
>
>
> -Hoss
My goal is to batch updates. My content lives somewhere else so I was
trying to find a way to tell Solr where the document lived so it could
go out and stream it into the index for me. That's where I thought
StreamingUpdateSolrServer would help.
- Tod
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Chris Hostetter <ho...@fucit.org>.
: It returns in around a second. When I execute the attached code it takes just
: over three minutes. The optimal for me would be to get closer to the
: performance I'm seeing with curl using Solrj.
I think your problem may be that StreamingUpdateSolrServer buffers up
commands and sends them in batches in a background thread. if you want to
send individual updates in real time (and time them) you should just use
CommonsHttpSolrServer
-Hoss
Re: Solrj ContentStreamUpdateRequest Slow
Posted by Tod <li...@gmail.com>.
On 8/4/2010 11:11 PM, jayendra patil wrote:
> ContentStreamUpdateRequest seems to read the file contents and transfer it
> over http, which slows down the indexing.
>
> Try Using StreamingUpdateSolrServer with stream.file param @
> http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post
>
> e.g.
>
> SolrServer server = new StreamingUpdateSolrServer("Solr Server URL",20,8);
> UpdateRequest req = new UpdateRequest("/update/extract");
> ModifiableSolrParams params = null ;
> params = new ModifiableSolrParams();
> params.add("stream.file", new String[]{"local file path"});
> params.set("literal.id", value);
> req.setParams(params);
> server.request(req);
> server.commit();
Thanks for your suggestions. Unfortunately, I'm still seeing poor
performance.
To be clear, I am trying to have SOLR index multiple documents that
exist on a remote server. I'd prefer that SOLR stream the documents
after I pass a pointer to them rather than me retrieving and pushing
them so I can avoid network overhead.
When I do this:
curl 'http://localhost:8080/solr/update/extract?stream.url=http://remote_server.mydomain.com/test.pdf&stream.contentType=application/pdf&literal.content_id=12342&commit=true'
It returns in around a second. When I execute the attached code it
takes just over three minutes. The optimal for me would be to get
closer to the performance I'm seeing with curl using Solrj.
To be fair the SOLR server I am using is really a workstation class
machine, plus I am still learning. I have a feeling I'm doing something
dumb but just can't seem to pinpoint the exact problem.
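One way to pinpoint where the time goes is to time the request and the commit separately rather than the whole run. The sketch below is only an illustration: StageTimer is a made-up helper built on System.nanoTime, not anything from SolrJ, and it assumes Java 8 or later for the lambda syntax. The sleeps in main are stand-ins for the real calls.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; it is not part of SolrJ. */
class StageTimer {
    private final Map<String, Long> elapsedMillis = new LinkedHashMap<>();

    /** Runs the work, records its duration under the stage name, and returns its result. */
    <T> T time(String stage, Callable<T> work) {
        long start = System.nanoTime();
        try {
            return work.call();
        } catch (Exception e) {
            // Checked exceptions (e.g. SolrServerException) are wrapped to keep callers simple.
            throw new RuntimeException("stage '" + stage + "' failed", e);
        } finally {
            // Recorded even when the stage fails, so slow failures are visible too.
            elapsedMillis.put(stage, (System.nanoTime() - start) / 1_000_000L);
        }
    }

    /** Stage name to elapsed milliseconds, in the order the stages ran. */
    Map<String, Long> results() {
        return Collections.unmodifiableMap(elapsedMillis);
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // In the attached code these would wrap the real calls, for example:
        //   timer.time("request", () -> solr.request(up));
        //   timer.time("commit",  () -> solr.commit());
        timer.time("request", () -> { Thread.sleep(50); return "stand-in for solr.request(up)"; });
        timer.time("commit", () -> { Thread.sleep(10); return "stand-in for solr.commit()"; });
        System.out.println(timer.results());
    }
}
```

If nearly all of the elapsed time lands in the "request" stage, the client or the fetch of the remote file is the place to look; if it lands in "commit", the index side is.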
Thanks - Tod
--------code-----------
import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;
/**
* @author EDaniel
*/
public class SolrExampleTests {

    public static void main(String[] args) {
        System.out.println("main...");
        try {
            // String fileName = "/test/test.pdf";
            String fileName = "http://remoteserver/test/test.pdf";
            String solrId = "1234";
            indexFilesSolrCell(fileName, solrId);
        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
    }

    /**
     * Method to index all types of files into Solr.
     * @param fileName
     * @param solrId
     * @throws IOException
     * @throws SolrServerException
     */
    public static void indexFilesSolrCell(String fileName, String solrId)
            throws IOException, SolrServerException {
        System.out.println("indexFilesSolrCell...");
        String urlString = "http://localhost:8080/solr";
        System.out.println("getting connection...");
        // SolrServer solr = new CommonsHttpSolrServer(urlString);
        SolrServer solr = new StreamingUpdateSolrServer(urlString, 100, 5);
        System.out.println("getting updaterequest handle...");
        // ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        UpdateRequest up = new UpdateRequest("/update/extract");
        ModifiableSolrParams params = new ModifiableSolrParams();
        // params.add("stream.file", fileName);
        params.add("stream.url", fileName);
        params.set("literal.content_id", solrId);
        up.setParams(params);
        System.out.println("making request...");
        solr.request(up);
        System.out.println("committing...");
        solr.commit();
        System.out.println("done...");
    }
}
Re: Solrj ContentStreamUpdateRequest Slow
Posted by jayendra patil <ja...@gmail.com>.
ContentStreamUpdateRequest seems to read the file contents and transfer it
over http, which slows down the indexing.
Try Using StreamingUpdateSolrServer with stream.file param @
http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post
e.g.
SolrServer server = new StreamingUpdateSolrServer("Solr Server URL",20,8);
UpdateRequest req = new UpdateRequest("/update/extract");
ModifiableSolrParams params = null ;
params = new ModifiableSolrParams();
params.add("stream.file", new String[]{"local file path"});
params.set("literal.id", value);
req.setParams(params);
server.request(req);
server.commit();
Regards,
Jayendra
On Wed, Aug 4, 2010 at 3:01 PM, Tod <li...@gmail.com> wrote:
> I'm running a slight variation of the example code referenced below and it
> takes a real long time to finally execute. In fact it hangs for a long time
> at solr.request(up) before finally executing. Is there anything I can look
> at or tweak to improve performance?
>
> I am also indexing a local pdf file, there are no firewall issues, solr is
> running on the same machine, and I tried the actual host name in addition to
> localhost but nothing helps.
>
>
> Thanks - Tod
>
> http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
>