Posted to user@nutch.apache.org by reddibabu <re...@gmail.com> on 2014/04/01 08:33:49 UTC

Re: Nutch/Solr - Pdf content is not getting indexed

Hi Talat,

Thanks for the reply.

I am using Nutch 1.7. Is it possible to crawl and index data into
Solr with Nutch versions below 2.x?
If so, please let me know the specific configuration for crawling PDF files.


Thanks,
Reddi Babu



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Solr-Pdf-content-is-not-getting-indexed-tp4125992p4128347.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch/Solr - Pdf content is not getting indexed

Posted by Sebastian Nagel <wa...@googlemail.com>.
> I am using Nutch 1.7. Is it possible to crawl and index data into
> Solr with Nutch versions below 2.x?

Yes, of course!

> I have changed the file.content.limit in nutch-default.xml to -1 and
> http.content.size in nutch-site.xml to -1 but it did not help.

1. The property is http.content.limit (not http.content.size); it needs to be
   set to -1.
2. It's recommended to set all customized properties in nutch-site.xml
   rather than editing nutch-default.xml.
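
A minimal sketch of what the two points above could look like in
conf/nutch-site.xml. It assumes the PDFs are fetched over HTTP, and it also
carries over the file.content.limit setting from the quoted message for local
files. The comments are illustrative and not taken from a tested setup; a
negative value disables truncation for the respective protocol.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: PDFs are fetched over HTTP; -1 means "do not truncate". -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Only relevant when crawling file: URLs; same meaning for -1. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Properties set here override the values in nutch-default.xml, so that file
can stay untouched.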

It's best to check the configuration via

% $NUTCH_HOME/bin/nutch parsechecker -dumpText http://.../abc.pdf

(Nutch 1.8 only:) If the content is truncated, parsechecker reports this.

Sebastian
