Posted to solr-user@lucene.apache.org by alendo <al...@uniroma2.it> on 2010/02/09 10:23:18 UTC

Posting pdf file and posting from remote

I understand that Tika is able to index PDF content: is that true? I tried to
post a PDF from my local machine and saw another document appear in the
solr/admin schema browser, but when I search only the document id is available;
the document's content doesn't seem to be indexed. Do I need other products to
index PDF content?

Moreover, I want to send a file from a remote machine: it seems I must configure
Tika with a tika-config.xml file, enabling remote streaming as in the following:
<requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="{true|false}" multipartUploadLimitInKB="20480" />
</requestDispatcher>

but I'm not able to find an example tika-config.xml...
thanks a lot
Alessandra
-- 
View this message in context: http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512455.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Posting pdf file and posting from remote

Posted by alendo <al...@uniroma2.it>.
Thanks a lot: this tip was very important for me.
I tried PHP curl in order to send a file from Windows to Mac OS. After a day
I discovered that @filename doesn't work on Windows: the error was
"26 failed creating formpost data", and the reason is that PHP curl on Windows
(I don't know where the bug is) is not able to open the file passed as
@filename. This is with PHP version 5.2.4.
I tried:
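<?php
// Post paper.pdf to the Solr extract handler as a multipart form upload.
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
curl_setopt($ch, CURLOPT_POST, 1);
// The "@" prefix makes PHP's curl extension read and upload the file
// (PHP 5.2 behaviour; the path is resolved on the machine running PHP).
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@paper.pdf'));
$result = curl_exec($ch);
if ($result === false) {
    // Report the curl error number and message, e.g. error 26 mentioned above.
    echo curl_errno($ch) . ' ' . curl_error($ch);
}
curl_close($ch);
?>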
and it works fine; I hope it will also work from a remote Linux server.




Re: Posting pdf file and posting from remote

Posted by Lance Norskog <go...@gmail.com>.
stream.file= means: read a local file on the server that Solr runs on. It has
to be a complete path that is valid on that server. To send the file over HTTP
instead, use @filename so that curl opens the file itself. That path has to be
valid on the machine where you run curl, and relative paths work.
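
As a rough illustration of the @filename case, the sketch below wraps the
upload in two small shell functions. The function names, the form field name
myfile and the readability check are assumptions added for illustration; the
/update/extract path and the literal.id and commit parameters are the ones
used elsewhere in this thread.

```shell
#!/bin/sh
# Build the extract-handler URL.
# $1 = Solr base URL, $2 = document id (inserted as-is, so it must not
# contain characters that need URL-encoding).
extract_url() {
    printf '%s/update/extract?literal.id=%s&commit=true' "$1" "$2"
}

# Upload a PDF that lives on the machine running curl.
# $1 = Solr base URL, $2 = document id, $3 = local path to the PDF.
# The "@" makes curl open the file itself, so a relative path is resolved
# against the directory curl is started from.
post_pdf() {
    base=$1
    doc_id=$2
    pdf_path=$3
    # Fail early if curl would not be able to open the file.
    if [ ! -r "$pdf_path" ]; then
        echo "cannot read $pdf_path" >&2
        return 1
    fi
    curl -sS -F "myfile=@${pdf_path}" "$(extract_url "$base" "$doc_id")"
}
```

With a server at the address used in the PHP example in this thread, a call
might look like: post_pdf http://localhost:8010/solr doc2 paper.pdf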

Also, Tika does not save the PDF binary; it only pulls the words out of
the PDF and stores those.

There's a Tika example in solr/trunk/example/exampleDIH in the current
Solr trunk. (I don't remember whether it's in the Solr 1.4 release.) With
this you can save the PDF binary in one field and the extracted
text in another field. I'm doing this now with HTML.


-- 
Lance Norskog
goksron@gmail.com

Re: Posting pdf file and posting from remote

Posted by alendo <al...@uniroma2.it>.
OK, I'm making progress (maybe :).
I tried another curl command to send the file from a remote machine:

http://mysolr:xxxx/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf 

and the behaviour has changed: now I get an error in the Solr log file:

HTTP Status 500 - files/attach-8514.pdf (No such file or directory)
java.io.FileNotFoundException: files/attach-8514.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at org.apache.solr.common.util.ContentStreamBase$FileStream.getStream(ContentStreamBase.java:108)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

etc etc...
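
The FileNotFoundException above is consistent with stream.file being opened by
the Solr server itself: a relative path such as files/attach-8514.pdf is looked
up on the server, not on the machine that issued the request. A minimal sketch
of building the same request with a complete server-side path follows. The
function name and the /var/data prefix in the usage line are hypothetical, a
Unix-style server is assumed, and the request presumably only works when
remote streaming is enabled on the server.

```shell
#!/bin/sh
# Build an extract request that asks Solr to read a file from its own disk.
# $1 = Solr base URL, $2 = document id, $3 = path as seen by the Solr server.
# Values are inserted as-is, so they must not need URL-encoding.
stream_file_url() {
    base=$1
    doc_id=$2
    server_path=$3
    case $server_path in
        /*) ;;  # complete (absolute) path: accepted
        *)  echo "stream.file should be a complete server-side path: $server_path" >&2
            return 1 ;;
    esac
    printf '%s/update/extract?literal.id=%s&stream.file=%s&stream.contentType=application/pdf' \
        "$base" "$doc_id" "$server_path"
}
```

Usage, keeping the elided port from the original command as a placeholder:
curl -sS "$(stream_file_url http://mysolr:xxxx/solr 8514 /var/data/files/attach-8514.pdf)"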
