Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2012/08/16 10:51:38 UTC

[jira] [Commented] (NUTCH-1458) Support for raw HTML field added to Solr

    [ https://issues.apache.org/jira/browse/NUTCH-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435850#comment-13435850 ] 

Markus Jelsma commented on NUTCH-1458:
--------------------------------------

The fieldType for this field should be binary; otherwise we can never successfully index PDF and other binary file types. Everything will be Base64 encoded for that field, so expect a significant decrease in performance if it's stored, which it most likely will be.
Also, the parser has to store the content, and the indexer has to load it. This is quite costly for 1.x, as all data must be shuffled to the reducer.
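
To put a rough number on the Base64 overhead mentioned above, here is a minimal sketch. It is not Nutch or Solr code: the class and method names are hypothetical, it relies only on java.util.Base64 from the JDK (Java 8 or later), and it does not show how Solr itself encodes a binary field. It only illustrates that Base64 turns every 3 raw bytes into 4 characters, so a stored raw-content field grows by roughly a third.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only (not part of Nutch or Solr).
public class RawContentBase64 {

    // Padded Base64 emits 4 characters for every group of up to 3 input bytes.
    static int encodedLength(int rawLength) {
        return 4 * ((rawLength + 2) / 3);
    }

    // Encode raw page bytes (HTML, PDF, ...) to a Base64 string.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decode the Base64 string back to the original bytes.
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Stand-in for fetched content; real pages may be arbitrary binary data.
        byte[] raw = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
        String encoded = encode(raw);

        System.out.println("raw bytes:    " + raw.length);
        System.out.println("Base64 chars: " + encoded.length());
        System.out.println("predicted:    " + encodedLength(raw.length));
    }
}
```

The size increase is only part of the cost; the encode and decode steps themselves, plus the extra data moved through the shuffle in 1.x, add to it.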
                
> Support for raw HTML field added to Solr
> ----------------------------------------
>
>                 Key: NUTCH-1458
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1458
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.5.1
>            Reporter: Max Dzyuba
>              Labels: html, nutch, raw, solr
>
> At the moment, the “content” field holds only the parsed text from the page. It would be nice to have a separate field in the Solr document that would hold the raw HTML from the crawled page.
