Posted to user@nutch.apache.org by dpverma <pa...@gmail.com> on 2012/09/14 01:59:58 UTC
Nutch/Solr - Pdf getting indexed but content is not showing in solr
Hi ,
I am facing the following problem:
I am able to index the PDF's file name but not able to see the content of
the file. Any help will be appreciated.
I have followed all the instructions suggested in the thread below but am still
not able to see the content of the PDF files.
http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-td3815336.html#a4006754
I am using tomcat6, nutch1.1 and solr1.4
following are the steps which I have done so far:
1. download the two missing libraries from:
http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/
I downloaded the additional JARs from the URL in step 1, but instead of
putting them in the "src/plugin/parse-pdf/lib" folder, I put them in the
"plugins" folder. I modified the plugin.xml in the same folder as per the
instructions in it. Then I enabled the 'parse-pdf' plugin in the
'nutch-site.xml' file as shown below (I just added 'parse-pdf' at the end).
I did not think I needed to rebuild Nutch.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
</property>
Then I got the following error:
WARN parse.Parser - Error parsing:
http://viterbi.usc.edu/aviation/assets/002/74092.pdf: failed(2,202): Content
truncated at 66251 bytes. Parser can't handle incomplete pdf file.
Solution: for this I changed file.content.limit and http.content.limit to -1
in both nutch-default.xml and nutch-site.xml.
Then I fixed the following error:
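For reference, the override described above would look roughly like the fragment below. This is a minimal sketch, assuming the properties go in conf/nutch-site.xml (which normally overrides nutch-default.xml, so editing the default file should not be necessary). A value of -1 is the documented way to disable truncation for both limits.

```xml
<!-- conf/nutch-site.xml (fragment): sketch only, merge into your existing <configuration> -->
<configuration>
  <!-- Do not truncate content fetched over HTTP -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Do not truncate content read from the local file system -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Documents that were fetched while the old limit was in place stay truncated in their segments, so they would presumably need to be re-fetched before the parser sees the full file.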
The parsing plugins: [org.apache.nutch.parse.pdf.PdfParser] are enabled via
the plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it in the
parse-plugins.xml file
Solution: in parse-plugins.xml under the Nutch config, I uncommented the PDF
section.
After doing this, I see no errors or warnings in the log file.
But there is still no text in the content section.
I have given a direct link to the PDF file in regex-urlfilter.txt:
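The mapping that the error message asks for would look roughly like this. It is a sketch based on the layout of the stock parse-plugins.xml, so check it against your own copy: the extension id is taken from the error message above, while the alias name parse-pdf is assumed to match the plugin id enabled in plugin.includes.

```xml
<!-- conf/parse-plugins.xml (fragment): sketch only, verify against your own copy -->
<parse-plugins>
  <!-- Route application/pdf content to the parse-pdf plugin -->
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>

  <aliases>
    <!-- Ties the plugin id to the parser class named in the error message -->
    <alias name="parse-pdf"
           extension-id="org.apache.nutch.parse.pdf.PdfParser" />
  </aliases>
</parse-plugins>
```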
+^http://([a-z0-9\-A-Z]*\.)*viterbi.usc.edu/aviation/assets/002/79884.pdf([a-z0-9\-A-Z]*\/)*
Thanks
--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Solr-Pdf-getting-indexed-but-content-is-not-showing-in-solr-tp4007657.html
Re: how to index the size of document ?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Try index-more
http://wiki.apache.org/nutch/FAQ#How_can_I_find_out.2BAC8-display_the_size_and_mime_type_of_the_hits_that_a_search_returns.3F
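As a rough illustration of that suggestion, enabling index-more usually means adding it to plugin.includes. The plugin list below is an assumption for illustration only; keep whatever plugins you already have and just add "more" to the index group. To my understanding index-more adds fields such as contentLength and type, so the Solr schema would also need matching fields.

```xml
<!-- conf/nutch-site.xml (fragment): illustrative plugin list, adjust to your own setup -->
<property>
  <name>plugin.includes</name>
  <!-- "more" added to the index-(...) group enables the index-more plugin -->
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```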
hth
Lewis
On Fri, Sep 14, 2012 at 9:22 PM, Eyeris Rodriguez Rueda <er...@uci.cu> wrote:
> Hi, all.
> I have been using Nutch and Solr for a year and I need to index the size (bytes or KBytes) of the documents found during the Nutch crawl process. Can I store this property in a field of the Solr index?
> Any suggestion will be appreciated.
>
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
--
Lewis
how to index the size of document ?
Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hi, all.
I have been using Nutch and Solr for a year and I need to index the size (bytes or KBytes) of the documents found during the Nutch crawl process. Can I store this property in a field of the Solr index?
Any suggestion will be appreciated.
Re: Nutch/Solr - Pdf getting indexed but content is not showing in solr
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
On Fri, Sep 14, 2012 at 12:59 AM, dpverma <pa...@gmail.com> wrote:
> I am using tomcat6, nutch1.1 and solr1.4
For starters this is probably your main mistake! I would seriously
urge you to upgrade your Nutch distribution.
I've just used parsechecker with -dumpText and your URL and I get a
whole pile of useful parse metadata from that PDF file.
I am sorry I can't be of more use... if you upgrade you are
eliminating a series of variables which *may* be leading to your local
copy not working properly.
hth
Lewis