Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/27 22:23:47 UTC

[jira] [Commented] (NUTCH-1018) Solr Document Size Limit

    [ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055731#comment-13055731 ] 

Markus Jelsma commented on NUTCH-1018:
--------------------------------------

This might be useful, though perhaps as an indexing plugin rather than as a Solr option. That way other future back ends, such as Elasticsearch (ES), would also benefit.

However, Solr can already do this: a copyField directive copies a source field to a destination field, and its maxChars attribute sets how many characters are copied over. This yields the same result.
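
A minimal sketch of what a back-end-agnostic limit could do at indexing time, written in Python for brevity. This is not Nutch or Solr code: the function name limit_content, the "content" field name, and the default limit are all assumptions for illustration. The limit here counts characters, not bytes, so it only approximates a "2MB" cap.

```python
def limit_content(doc, field="content", max_chars=2 * 1024 * 1024):
    """Return a copy of doc with doc[field] cut to at most max_chars characters.

    Hypothetical helper: the full text is assumed to have been fetched and
    parsed already, so only the indexed copy is shortened.
    """
    limited = dict(doc)  # shallow copy, the caller's document is left untouched
    value = limited.get(field)
    # Only string fields longer than the limit are truncated.
    if isinstance(value, str) and len(value) > max_chars:
        limited[field] = value[:max_chars]
    return limited


# Example: keep only the first 5 characters of the parsed text.
parsed = {"url": "http://example.com/a.pdf", "content": "0123456789"}
indexed = limit_content(parsed, max_chars=5)
```

Because the cut happens after parsing, the fetch limit (file.content.limit) can stay high enough for PDFs to parse completely while the indexed field stays small.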

> Solr Document Size Limit
> ------------------------
>
>                 Key: NUTCH-1018
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1018
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Mark Achee
>            Priority: Minor
>              Labels: solr
>
> There should be an option, perhaps named solr.content.limit, that defines the max size of documents added to Solr.  I've had issues with large documents in Solr, so I set the file.content.limit to 2MB.  However, this causes many files to not be parsed (mostly PDFs) because of only retrieving parts of the document.  With this new option, I could still correctly parse them, but only index the first 2MB (or however large it is set) in Solr.
