Posted to dev@nutch.apache.org by "Mark Achee (JIRA)" <ji...@apache.org> on 2011/06/27 22:01:48 UTC

[jira] [Created] (NUTCH-1018) Solr Document Size Limit

Solr Document Size Limit
------------------------

                 Key: NUTCH-1018
                 URL: https://issues.apache.org/jira/browse/NUTCH-1018
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
            Reporter: Mark Achee
            Priority: Minor


There should be an option, perhaps named solr.content.limit, that defines the maximum size of a document added to Solr.  I've had issues with large documents in Solr, so I set file.content.limit to 2MB.  However, this causes many files (mostly PDFs) to fail parsing, because only part of each document is retrieved.  With this new option, I could still parse them correctly but index only the first 2MB (or whatever the limit is set to) in Solr.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1018) Solr Document Size Limit

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055736#comment-13055736 ] 

Lewis John McGibbney commented on NUTCH-1018:
---------------------------------------------

Yes, this agrees with the roadmap for future releases. There is consensus that Nutch should not be restricted to Solr for indexing. Using the workaround that Markus mentioned to achieve the same result is a more sustainable option for this particular issue. If we could get a plugin implementation for the scope described above, it would make a valuable contribution to the wiki!

> Solr Document Size Limit
> ------------------------
>
>                 Key: NUTCH-1018
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1018
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Mark Achee
>            Priority: Minor
>              Labels: solr
>
> There should be an option, perhaps named solr.content.limit, that defines the max size of documents added to Solr.  I've had issues with large documents in Solr, so I set the file.content.limit to 2MB.  However, this causes many files to not be parsed (mostly PDFs) because of only retrieving parts of the document.  With this new option, I could still correctly parse them, but only index the first 2MB (or however large it is set) in Solr.


        

[jira] [Commented] (NUTCH-1018) Solr Document Size Limit

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055742#comment-13055742 ] 

Markus Jelsma commented on NUTCH-1018:
--------------------------------------

Implementing such an indexer extension plugin is really straightforward: check and truncate. See http://wiki.apache.org/nutch/WritingPluginExample-1.2
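
For illustration, the "check and truncate" step could look roughly like the sketch below. It is deliberately standalone and uses no Nutch classes; the class name, method name, and the negative-means-unlimited convention are assumptions made for this example, not part of any existing plugin. Note that it limits by character count, while the issue talks about a size in MB, so a real plugin would have to decide which unit the option uses.

```java
// Hypothetical helper: the core "check and truncate" logic only.
// In a real indexing plugin this would be applied to the content field
// before the document is handed to the indexing back end.
final class ContentTruncator {

    private ContentTruncator() {
        // static helper, not meant to be instantiated
    }

    /**
     * Returns content cut down to at most maxChars UTF-16 chars.
     * Assumption for this sketch: a negative maxChars means "no limit".
     */
    static String truncate(String content, int maxChars) {
        // Check: nothing to do for null, unlimited, or already-short content.
        if (content == null || maxChars < 0 || content.length() <= maxChars) {
            return content;
        }
        int end = maxChars;
        // Do not cut between the two halves of a surrogate pair.
        if (end > 0 && Character.isHighSurrogate(content.charAt(end - 1))) {
            end--;
        }
        // Truncate: keep only the leading part.
        return content.substring(0, end);
    }
}
```

A plugin built this way would read the limit from configuration (for example under the solr.content.limit name suggested in the issue, or a back-end-neutral name) and call the helper from its indexing filter, so the parser still sees the whole document and only the indexed text is shortened.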


        

[jira] [Commented] (NUTCH-1018) Solr Document Size Limit

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055731#comment-13055731 ] 

Markus Jelsma commented on NUTCH-1018:
--------------------------------------

This might be useful, though perhaps not as a Solr option but as an indexing plugin. This way, other future back ends such as ES would also benefit.

However, in Solr you can use copyField to copy a source field to a destination field and specify the maximum number of characters to copy. This yields the same result.
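
As a rough illustration of the copyField approach, a schema.xml fragment might look like the one below. The field names, the field type, and the 2097152-character limit are assumptions for this example and would need to match the actual schema; maxChars is the copyField attribute that caps how many characters are copied.

```xml
<!-- Assumed field names; adjust to the schema actually in use. -->
<!-- The source field receives the full text from Nutch. It is neither
     indexed nor stored here, so only the truncated copy ends up in the index. -->
<field name="content" type="text" indexed="false" stored="false"/>

<!-- Hypothetical destination field holding the truncated text. -->
<field name="content_truncated" type="text" indexed="true" stored="true"/>

<!-- Copy at most the first 2097152 characters of the source field. -->
<copyField source="content" dest="content_truncated" maxChars="2097152"/>
```

With this setup the full document is still sent to Solr; only what gets indexed and stored is limited, and searches would have to target the destination field.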
