
Posted to solr-user@lucene.apache.org by vybe3142 <vy...@gmail.com> on 2012/03/16 23:55:42 UTC

Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Hi,
Is there a way for SOLR / SOLRJ to index files directly, bypassing HTTP
streaming?

Use case: 
* Text files to be indexed are on file server (A); some are potentially large
  (several hundred MB)
* SOLRJ client is on server (B)
* SOLR server is on server (C) running with dynamically created SOLR cores

Looking at how ContentStreamUpdateRequest is typically used in SOLRJ, it
looks like the files would be read from A to the client on B (across the
wire) and then sent across the wire via an HTTP request (in the body) to C
to be indexed. 

Is there a more efficient way to accomplish this, i.e. pass a path to the
file when making the request from B so that the SOLR server on C can read
directly from file server A?

Thanks


--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-for-SOLR-SOLRJ-to-index-files-directly-bypassing-HTTP-streaming-tp3833419p3833419.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by vybe3142 <vy...@gmail.com>.

BTW, .. using the client I pasted, I get the same error even with the
standard supplied executable SOLR jar.


Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Mon, Mar 19, 2012 at 5:48 PM, vybe3142 <vy...@gmail.com> wrote:
> Thanks for the response
>
> No, the file is plain text.
>
> All I'm trying to do is index plain ASCII text files via a remote reference
> to their file paths.

The XML update handler expects a specific format of XML.
The json, CSV, javabin update handlers likewise expect a specific
document format.
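
For illustration, here is a minimal sketch of the document format the XML
update handler accepts (an <add> element wrapping <doc> and <field>
elements). The class name, the helper names and the "text" field are
assumptions made for this example; the field names that actually work depend
on the schema of the core. It also shows why a raw text file is rejected by
these handlers: the content has to be wrapped and escaped first.

```java
/** Sketch: wrap plain text in the XML message format used by the XML update handler. */
class XmlUpdateMessage {

    /** Escapes the characters that are not allowed verbatim in XML text or attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an <add> message with an id field and one content field. */
    static String addDoc(String id, String fieldName, String body) {
        return "<add><doc>"
                + "<field name=\"id\">" + escape(id) + "</field>"
                + "<field name=\"" + escape(fieldName) + "\">" + escape(body) + "</field>"
                + "</doc></add>";
    }

    public static void main(String[] args) {
        // "text" is a hypothetical field name; use one defined in your schema.xml.
        System.out.println(addDoc("testid1", "text", "Tom & Jerry <draft>"));
    }
}
```

Note that with this format the file content still travels inside the request,
which is what the original question is trying to avoid.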

If you have Word, PDF, HTML, or plain text files, one way to index them is
http://wiki.apache.org/solr/ExtractingRequestHandler (aka Solr Cell)

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by vybe3142 <vy...@gmail.com>.

Thanks for the response

No, the file is plain text. 

All I'm trying to do is index plain ASCII text files via a remote reference
to their file paths. 

I guess what I need to do is specify the content type as text. I don't think
a "content-type" param will help, since this behavior is tied to the
BinaryRequestWriter(). There's got to be some built-in functionality in
SOLR that will enable me to achieve this.



Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Mon, Mar 19, 2012 at 4:38 PM, vybe3142 <vy...@gmail.com> wrote:
> Okay, I added the javabin handler snippet to the solrconfig.xml file
> (actually shared across all cores).  I got further (the request made it past
> tomcat and into SOLR) but  haven't quite succeeded yet.
>
> Server trace:
> Mar 19, 2012 3:31:35 PM org.apache.solr.core.SolrCore execute
> INFO: [testcore1] webapp=/solr path=/update/javabin params={waitSearcher=true&commit=true&literal.id=testid1&waitFlush=true&wt=javabin&stream.file=C:\work\SolrClient\data\justin2.txt&version=2} status=500 QTime=82

Is this justin2.txt file in the "javabin" format?  That's what you're
telling Solr by hitting the /update/javabin URL.
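
A possible reading of the "-17" in the error message, offered as a hedged
sketch rather than a confirmed diagnosis: the message suggests the decoder
compares the first byte of the stream with the expected version 2, and -17 is
the signed value of the byte 0xEF, which is also the first byte of a UTF-8
byte order mark (EF BB BF). That would be consistent with justin2.txt being a
text file that starts with a BOM, although the thread does not confirm this.
The class and method names below are made up for the example.

```java
/** Sketch: inspect the leading bytes of a file that was sent to /update/javabin. */
class FirstByteCheck {
    // Version value quoted in the error message ("expected 2").
    static final byte EXPECTED_VERSION = 2;

    /** Gives a rough description of the content based on its first bytes. */
    static String classify(byte[] head) {
        if (head.length == 0) {
            return "empty";
        }
        if (head[0] == EXPECTED_VERSION) {
            return "possibly javabin";
        }
        if (head.length >= 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            return "UTF-8 text with BOM";
        }
        return "not javabin (first byte " + head[0] + ")";
    }

    public static void main(String[] args) {
        // Hypothetical file content: UTF-8 BOM followed by ASCII text.
        byte[] textWithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println("first byte as signed value: " + textWithBom[0]);
        System.out.println(classify(textWithBom));
    }
}
```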

-Yonik

Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by vybe3142 <vy...@gmail.com>.

Okay, I added the javabin handler snippet to the solrconfig.xml file
(actually shared across all cores). I got further (the request made it past
Tomcat and into SOLR) but haven't quite succeeded yet.

Server trace:
Mar 19, 2012 3:31:35 PM org.apache.solr.core.SolrCore execute
INFO: [testcore1] webapp=/solr path=/update/javabin params={waitSearcher=true&commit=true&literal.id=testid1&waitFlush=true&wt=javabin&stream.file=C:\work\SolrClient\data\justin2.txt&version=2} status=500 QTime=82
Mar 19, 2012 3:31:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: Invalid version (expected 2, but -17) or the data in not in 'javabin' format
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:144)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:69)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:45)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:56)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

=================================================
SOLRJ client log:

Starting SOLR doc indexing client 2
Exception in thread "main" org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://localhost:8080/solr/testcore1/update/javabin
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)



Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by Erick Erickson <er...@gmail.com>.

My guess is that this isn't defined in the solrconfig.xml file
for your testcore1/conf:

  <requestHandler name="/update/javabin"
                  class="solr.BinaryUpdateRequestHandler" />


If you modeled your testcore1 after the solrconfig.xml files in the
example/multicore/core* directories, note that these are extremely
simplified. You might try copying the one from example/solr/conf and
removing stuff you don't need.
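
As a rough sketch of how to verify this without starting Solr, the following
parses the text of a solrconfig.xml and reports whether a request handler
with a given name is registered. The class and method names are made up for
the example, and the two-line configuration in main is a hypothetical
stand-in for a real file, which would normally be read from
testcore1/conf/solrconfig.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch: check whether a solrconfig.xml registers a given request handler. */
class HandlerCheck {

    /** Returns true if a requestHandler element with the given name attribute exists. */
    static boolean hasHandler(String solrconfigXml, String handlerName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            solrconfigXml.getBytes(StandardCharsets.UTF_8)));
            NodeList handlers = doc.getElementsByTagName("requestHandler");
            for (int i = 0; i < handlers.getLength(); i++) {
                Element handler = (Element) handlers.item(i);
                if (handlerName.equals(handler.getAttribute("name"))) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse solrconfig.xml content", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily simplified configuration with only the XML handler.
        String config = "<config>"
                + "<requestHandler name=\"/update\" class=\"solr.XmlUpdateRequestHandler\"/>"
                + "</config>";
        System.out.println("/update registered: " + hasHandler(config, "/update"));
        System.out.println("/update/javabin registered: " + hasHandler(config, "/update/javabin"));
    }
}
```

A missing handler would match the "Not Found" response quoted below, since
the request path does not map to anything in that core.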


Best
Erick

On Mon, Mar 19, 2012 at 3:22 PM, vybe3142 <vy...@gmail.com> wrote:
> Still no luck. Please help point out what I'm doing wrong. Neither the
> (commented-out) first approach (including the content with the request) nor
> the second approach seems to work. Nothing seems to be acknowledged at the
> Tomcat server either. I get the error:
>
>
> Starting SOLR doc indexing client 2
> Exception in thread "main" org.apache.solr.common.SolrException: Not Found
>
> Not Found
>
> request: http://localhost:8080/solr/testcore1/update/javabin
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
>        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>        at com.il.solrclient.SolrJClientIndexDocApp2.main(SolrJClientIndexDocApp2.java:41)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
>
> ========================================================
>
>
> public class SolrJClientIndexDocApp2 {
>    public static void main(String[] arg) throws Exception {
>        System.out.println("Starting SOLR doc indexing client 2");
>        String url = "http://localhost:8080/solr/testcore1";
>        CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
> //        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
> //        req.addFile(new File("C:\\work\\SolrClient\\data\\justin2.txt"));
> //        //req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
> //
> //        req.setParam("literal.id", "testid");
> //
> //        NamedList result = server.request(req);
> //        server.commit();
> //        System.out.println("Result: " + result);
>
>        server.setRequestWriter(new BinaryRequestWriter());
>        UpdateRequest request = new UpdateRequest();
>        request.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>        request.setParam("literal.id", "testid1");
>        request.setParam("stream.file", "C:\\work\\SolrClient\\data\\justin2.txt");
>        request.process(server);
>    }
> }
>
>

Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by vybe3142 <vy...@gmail.com>.

Still no luck. Please help point out what I'm doing wrong. Neither the
(commented-out) first approach (including the content with the request) nor
the second approach seems to work. Nothing seems to be acknowledged at the
Tomcat server either. I get the error:


Starting SOLR doc indexing client 2
Exception in thread "main" org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/testcore1/update/javabin
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
	at com.il.solrclient.SolrJClientIndexDocApp2.main(SolrJClientIndexDocApp2.java:41)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

========================================================


import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class SolrJClientIndexDocApp2 {
    public static void main(String[] arg) throws Exception {
        System.out.println("Starting SOLR doc indexing client 2");
        String url = "http://localhost:8080/solr/testcore1";
        CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

        // First approach (disabled): send the file content in the request body.
//        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
//        req.addFile(new File("C:\\work\\SolrClient\\data\\justin2.txt"));
//        //req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
//
//        req.setParam("literal.id", "testid");
//
//        NamedList result = server.request(req);
//        server.commit();
//        System.out.println("Result: " + result);

        // Second approach: pass only the path and let the server read the file.
        // With BinaryRequestWriter the request is sent to /update/javabin.
        server.setRequestWriter(new BinaryRequestWriter());
        UpdateRequest request = new UpdateRequest();
        request.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        request.setParam("literal.id", "testid1");
        request.setParam("stream.file", "C:\\work\\SolrClient\\data\\justin2.txt");
        request.process(server);
    }
}



Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by vybe3142 <vy...@gmail.com>.

I'm going to try the approach described here and see what happens

http://lucene.472066.n3.nabble.com/Fastest-way-to-use-solrj-td502659.html


Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by vybe3142 <vy...@gmail.com>.

Thanks much. I plan to try this tomorrow.

Can someone describe how to use remote streaming programmatically with
solrj? For example, see the basic clients described here:
http://androidyou.blogspot.com/2010/05/client-integration-with-solr-by-using.html
and observe that the data is transferred in the HTTP message (which I want
to avoid).
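
Independent of SolrJ, a remote-streaming request boils down to an update URL
that carries a stream.file parameter instead of a request body. The sketch
below only builds such a URL using the JDK; the class name, the example path
and the choice of the /update/extract handler are assumptions for
illustration. It also assumes remote streaming has been switched on in
solrconfig.xml (enableRemoteStreaming="true" on the requestParsers element)
and that the path is readable from the machine running Solr (C), not from
the client (B).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: build an update URL that asks Solr to read a file by itself. */
class RemoteStreamingUrl {

    /** Percent-encodes a single query-string component as UTF-8. */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Appends the parameters, in insertion order, to coreUrl + handlerPath. */
    static String build(String coreUrl, String handlerPath, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(coreUrl).append(handlerPath);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(enc(e.getKey()))
              .append('=')
              .append(enc(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        // Hypothetical path as seen from the Solr host, e.g. a mount of file server A.
        params.put("stream.file", "/mnt/fileserver/docs/report.txt");
        params.put("literal.id", "report-1");
        params.put("commit", "true");
        System.out.println(
                build("http://localhost:8080/solr/testcore1", "/update/extract", params));
    }
}
```

Only the path crosses the wire from B to C in this setup; the file bytes move
from A to C when Solr opens the file.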



Re: Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP streaming?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Sure it does

http://my.safaribooksonline.com/book/web-development/9781847195883/indexing-data/ch03lvl1sec03#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODE4NDcxOTU4ODMvNjg=

On Sat, Mar 17, 2012 at 2:55 AM, vybe3142 <vy...@gmail.com> wrote:

> Hi,
> Is there a way for SOLR / SOLRJ to index files directly bypassing HTTP
> streaming.
>
> Use case:
> * Text Files to be indexed are on file server (A) (some potentially large -
> several 100 MB)
> * SOLRJ client is on server (B)
> * SOLR server is on server (C) running with dynamically created SOLR cores
>
> Looking at how ContentStreamUpdateRequest is typically used in SOLRJ, it
> looks like the files would be read from A to the client on B (across the
> wire) and then sent across the wire via an HTTP request (in the body) to C
> to be indexed.
>
> Is there a more efficient way to accomplish this i.e. pass a path to the
> file when making the request from B so that the SOLR server on C can read
> directly from file server A ?
>
> Thanks
>
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Thanks All

Posted by Lance Norskog <go...@gmail.com>.

If you build it, they will come!

On Tue, Mar 20, 2012 at 12:59 PM, vybe3142 <vy...@gmail.com> wrote:

> I'm still puzzled that there are no readily available alternatives to using
> the Tika-based ExtractingRequestHandler in the situation where the input
> data is plain UTF-8 text files that SOLR needs to ingest and index. I may
> need to look into defining a custom Request Handler if that's the right way
> to go.
>



-- 
Lance Norskog
goksron@gmail.com

Re: Thanks All

Posted by Chris Hostetter <ho...@fucit.org>.

: To get this to work correctly, the following server side config was needed
: (I started from a barebones solr config)

: 1. Add apache-solr-cell-3.5.0.jar to the <solrhost>/lib directory (or
: wherever solr can access jars) as this contains the class
: ExtractingRequestHandler
: 2. Add the appropriate handler for /update/extract in the solrconfig.xml
: (this uses the ExtractingRequestHandler class).

what barebones solr config did you start with?

the example configs that ship with solr have included /update/extract 
since 1.4.0


-Hoss

Thanks All

Posted by vybe3142 <vy...@gmail.com>.

Here is the core of the SOLRJ client that ended up accomplishing what I
wanted:

        String fileName2 = "C:\\work\\SolrClient\\data\\worldwartwo.txt";
        // Queue size 20, 8 worker threads.
        SolrServer server = new StreamingUpdateSolrServer("http://localhost:8080/solr/", 20, 8);
        UpdateRequest req = new UpdateRequest("/update/extract");
        ModifiableSolrParams params = new ModifiableSolrParams();
        // stream.file is opened by the SOLR server, so the path must be valid on that host.
        params.set("stream.file", fileName2);
        params.set("literal.id", fileName2);
        params.set("captureAttr", "false");

        req.setParams(params);
        server.request(req);
        server.commit();

To get this to work correctly, the following server side config was needed
(I started from a barebones solr config)

1. Add apache-solr-cell-3.5.0.jar to the <solrhost>/lib directory (or
wherever solr can access jars) as this contains the class
ExtractingRequestHandler
2. Add the appropriate handler for /update/extract in the solrconfig.xml
(this uses the ExtractingRequestHandler class).

I'll blog about this later on for the benefit of the community at large.

I'm still puzzled that there are no readily available alternatives to using
the Tika-based ExtractingRequestHandler in the situation where the input
data is plain UTF-8 text files that SOLR needs to ingest and index. I may
need to look into defining a custom Request Handler if that's the right way
to go.

Thanks again
