Posted to user@nutch.apache.org by dpverma <pa...@gmail.com> on 2012/09/14 01:59:58 UTC

Nutch/Solr - Pdf getting indexed but content is not showing in solr

Hi,
I am facing the following problem:

I am able to index the PDF's file name, but I cannot see the content of
the file in Solr. Any help will be appreciated.

I have followed all the instructions suggested in the thread below, but I am
still not able to see the content of the PDF files.
http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-td3815336.html#a4006754

I am using tomcat6, nutch1.1 and solr1.4

These are the steps I have taken so far:
1. Download the two missing libraries from:
  http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

I downloaded the additional JARs from the URL in step 1, but instead of
putting them in the "src/plugin/parse-pdf/lib" folder, I put them in the
"plugins" folder. I modified the plugin.xml in the same folder as per the
instructions in it. Then I enabled the 'parse-pdf' plugin in the
'nutch-site.xml' file as shown below (I just added 'parse-pdf' at the end).
I did not think I needed to rebuild Nutch.
 
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
</property>

Then I got the following error:
WARN  parse.Parser - Error parsing:
http://viterbi.usc.edu/aviation/assets/002/74092.pdf: failed(2,202): Content
truncated at 66251 bytes. Parser can't handle incomplete pdf file.

Solution: for this I changed file.content.limit and http.content.limit to -1
in both nutch-default.xml and nutch-site.xml.
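
For reference, a minimal sketch of what that override could look like. The
two property names are the ones mentioned above; a negative value means the
fetched content is not truncated. Values in nutch-site.xml take precedence
over nutch-default.xml, so it should be enough to set them in nutch-site.xml
only, inside the <configuration> element:

```xml
<!-- nutch-site.xml: sketch only, place inside <configuration> -->
<property>
  <name>http.content.limit</name>
  <!-- negative value: do not truncate content fetched over HTTP -->
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- same setting for the file: protocol -->
  <value>-1</value>
</property>
```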

Then I fixed the following error:
The parsing plugins: [org.apache.nutch.parse.pdf.PdfParser] are enabled via
the plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file

Solution: in parse-plugins.xml, in the Nutch config, I uncommented the PDF
section.
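
A rough sketch of what such a mapping usually looks like is shown here. The
plugin id "parse-pdf" and the extension id are taken from the plugin name and
the error message above; the exact entries in a given parse-plugins.xml may
differ, so treat this as an assumption to check against your own file:

```xml
<!-- parse-plugins.xml: sketch of a PDF mapping, not copied from a real install -->
<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
</mimeType>

<!-- inside the <aliases> section: ties the plugin id to its parser class -->
<alias name="parse-pdf"
       extension-id="org.apache.nutch.parse.pdf.PdfParser" />
```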

After doing this, I see no errors or warnings in the log file.
But there is still no text in the content field.

I have given a direct link to the PDF file in regex-urlfilter.txt:
+^http://([a-z0-9\-A-Z]*\.)*viterbi.usc.edu/aviation/assets/002/79884.pdf([a-z0-9\-A-Z]*\/)* 

Thanks

Re: how to index the size of document ?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

Try index-more

http://wiki.apache.org/nutch/FAQ#How_can_I_find_out.2BAC8-display_the_size_and_mime_type_of_the_hits_that_a_search_returns.3F
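
As a sketch of what enabling it could look like: the only relevant change is
adding "more" to the index-(...) group of plugin.includes in nutch-site.xml.
The other entries in the value are illustrative placeholders, so keep
whatever your install already lists. The Solr schema also needs a field for
the size; I believe index-more calls it contentLength, but that is an
assumption to verify against the schema.xml shipped with your Nutch version.

```xml
<!-- nutch-site.xml: sketch only; entries other than index-(...) are placeholders -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor|more)|scoring-opic</value>
</property>
```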

hth

Lewis

On Fri, Sep 14, 2012 at 9:22 PM, Eyeris Rodriguez Rueda <er...@uci.cu> wrote:
> Hi, all.
> I have been using Nutch and Solr for a year, and I need to index the size (bytes or KBytes) of the documents found during the Nutch crawl. Can I save this property in a field of the Solr index?
> Any suggestions will be appreciated.



-- 
Lewis

how to index the size of document ?

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.

Hi, all.
I have been using Nutch and Solr for a year, and I need to index the size (bytes or KBytes) of the documents found during the Nutch crawl. Can I save this property in a field of the Solr index?
Any suggestions will be appreciated.

10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: Nutch/Solr - Pdf getting indexed but content is not showing in solr

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

On Fri, Sep 14, 2012 at 12:59 AM, dpverma <pa...@gmail.com> wrote:

> I am using tomcat6, nutch1.1 and solr1.4

For starters, this is probably your main mistake! I would seriously
urge you to upgrade your Nutch distribution.

I've just used parsechecker with -dumpText and your URL, and I get a
whole pile of useful parse metadata from that PDF file.

I am sorry I can't be of more use... if you upgrade, you are
eliminating a series of variables which *may* be leading to your local
copy not working properly.

hth

Lewis