Posted to user@nutch.apache.org by reddibabu <re...@gmail.com> on 2014/03/21 12:53:32 UTC

Nutch/Solr - Pdf content is not getting indexed

Nutch/Solr: a PDF is not getting indexed if it is large. I am not getting any
exceptions, but the content of the PDF is not indexed.

If I use a link to a small PDF that does not contain any images or URLs,
the content is indexed and shows up in Solr. But when I use links to PDFs
that contain more content, the data is not indexed.
I have changed file.content.limit in nutch-default.xml to -1 and
http.content.size in nutch-site.xml to -1, but it did not help.

I have followed the links below to get this working, but they did not
help. Any further help would be much appreciated:
http://grokbase.com/t/nutch/user/129ef77wa7/nutch-solr-pdf-getting-indexed-but-content-is-not-showing-in-solr
http://grokbase.com/t/nutch/user/131apskpxq/crawling-pdfs



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Solr-Pdf-content-is-not-getting-indexed-tp4125992.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch/Solr - Pdf content is not getting indexed

Posted by anupamk <an...@usc.edu>.
There is a good chance that the parse is failing.

Check the stats of the segment that contains the large PDF. Also dump the
segment and look at the result.

See whether Tika is able to parse the PDF or not ...
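
The exact commands are not shown in this message. As a rough sketch, assuming the Nutch 1.x `readseg` tool, the segment check described above could look like the following. The function names, the `/opt/nutch` fallback, and the segment and output paths are placeholders for illustration, not values from this thread.

```shell
# Sketch only: NUTCH_HOME falls back to a placeholder path (an assumption).
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"

# Hypothetical helper: list summary stats for one segment directory.
segment_stats() {
  "$NUTCH_HOME/bin/nutch" readseg -list "$1"
}

# Hypothetical helper: dump one segment as text into an output directory,
# so the parse text stored for the large PDF can be inspected by hand.
segment_dump() {
  "$NUTCH_HOME/bin/nutch" readseg -dump "$1" "$2"
}
```

For example, `segment_stats crawl/segments/<timestamp>` followed by `segment_dump crawl/segments/<timestamp> segdump`, where both paths are hypothetical. If the dump has no parse text for the PDF's URL, that would suggest the parse step is where the content is lost.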








Re: Nutch/Solr - Pdf content is not getting indexed

Posted by Sebastian Nagel <wa...@googlemail.com>.
> I have using nutch 1.7 version. Is it possible to crawl and index data into
> Solr below nutch 2.x versions.

Yes, of course!

> I have changed the file.content.limit in nutch-default.xml to -1 and
> http.content.size in nutch-site.xml to -1 but it did not helped.

1. The property is http.content.limit (not http.content.size); it needs to be set to -1.
2. It's recommended to set all customized properties in nutch-site.xml
   rather than editing nutch-default.xml.
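
A minimal sketch of what the two points above could look like in conf/nutch-site.xml, placed inside the existing <configuration> element. This assumes the PDFs are fetched over HTTP; file.content.limit is included only because it was mentioned in the question, and it applies to file: URLs.

```xml
<!-- Sketch only: overrides go in nutch-site.xml, not nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <value>-1</value> <!-- -1 disables truncation of HTTP content -->
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value> <!-- only relevant when crawling file: URLs -->
</property>
```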

It's best to check the configuration via

% $NUTCH_HOME/bin/nutch parsechecker -dumpText http://.../abc.pdf

(Nutch 1.8 only:) If the content is truncated, parsechecker reports it.

Sebastian

On 04/01/2014 08:33 AM, reddibabu wrote:
> Hi Talat,
> 
> Thanks for reply.
> 
> I have using nutch 1.7 version. Is it possible to crawl and index data into
> Solr below nutch 2.x versions.
> If possible then let me the specific configurations for crawling pdf files.
> 
> 
> Thanks,
> Reddi Babu


Re: Nutch/Solr - Pdf content is not getting indexed

Posted by reddibabu <re...@gmail.com>.
Hi Talat,

Thanks for the reply.

I am using Nutch 1.7. Is it possible to crawl and index data into
Solr with Nutch versions below 2.x?
If so, please let me know the specific configuration for crawling PDF files.


Thanks,
Reddi Babu




Re: Nutch/Solr - Pdf content is not getting indexed

Posted by Talat Uyarer <ta...@uyarer.com>.
What is your Nutch version? If I remember correctly, this is a bug in Nutch
2.2.1. It is fixed in 2.x.
On 21 Mar 2014 13:54, "reddibabu" <re...@gmail.com> wrote:

> Nutch/Solr - The pdf is not getting indexed if the pdf size is big enough, I
> am not getting any exceptions but the content in the pdf is not getting
> indexed.
>
> If I am using any small pdf link which does not have any images or urls,
> then the content is getting indexed and coming into solr. But when I am
> using the pdf links which contains more content the data is not getting
> indexed.
> I have changed the file.content.limit in nutch-default.xml to -1 and
> http.content.size in nutch-site.xml to -1 but it did not helped.
>
> I have followed the below links to get the thing worked but it did not
> helped, any further help would be much appreciated:
>
> http://grokbase.com/t/nutch/user/129ef77wa7/nutch-solr-pdf-getting-indexed-but-content-is-not-showing-in-solr
> http://grokbase.com/t/nutch/user/131apskpxq/crawling-pdfs