Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/08/22 21:18:12 UTC

[jira] [Commented] (NUTCH-1749) Title duplicated in document body

    [ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107326#comment-14107326 ] 

Sebastian Nagel commented on NUTCH-1749:
----------------------------------------

Hi, Greg! Indeed, it can sometimes be useful not to include the title in the content, e.g. if title and (short) content are both displayed in search results. However,
- this should be made optional via a property "indexer.content.with.title" (or similar). Otherwise users would need to adapt their search logic when words from the title are not contained in the body.
- the same should be done for parse-tika as well
- a hard-wired exclusion of <title> elements in the method {{getTextHelper}} is not really transparent, especially because that method is also used by {{getTitle}}, which is why you need the construct {{currentNode != node && "title".equalsIgnoreCase(nodeName)}}. Wouldn't it be much clearer (and more extensible) to add an extra argument with the excluded tags/elements (filled/set by the calling method)? Roughly:
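{code}
private boolean getTextHelper(..., Set<String> excludedElementNames) {
  ...
  // assumes nodeName and the set entries use the same case (e.g. both lower-cased)
  if (excludedElementNames.contains(nodeName)) {
    // do not descend into the excluded element, so its text is left out
    walker.skipChildren();
  }
  ...
}
{code}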
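
To make the idea concrete, here is a minimal, self-contained sketch using only the JDK's DOM classes. It is not the code in DOMContentUtils: the class and method names ({{ExcludedElementsText}}, {{exclusions}}, {{getText}}) are made up for illustration, plain recursion stands in for the node walker and its {{skipChildren()}}, and the boolean argument stands in for a setting like the "indexer.content.with.title" property proposed above.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
public class ExcludedElementsText {

  /** Parses well-formed XHTML; stands in for the DOM a real HTML parser would supply. */
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed: " + xhtml, e);
    }
  }

  /** The caller decides what to exclude, e.g. based on an assumed boolean setting. */
  static Set<String> exclusions(boolean contentWithTitle) {
    return contentWithTitle ? Set.of() : Set.of("title");
  }

  /** Collects the text below root, skipping elements whose lower-cased name is excluded. */
  static String getText(Node root, Set<String> excludedLowerCaseNames) {
    StringBuilder sb = new StringBuilder();
    collect(root, excludedLowerCaseNames, sb);
    return sb.toString();
  }

  private static void collect(Node node, Set<String> excluded, StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && excluded.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
      return; // skip the whole subtree of an excluded element
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' '); // separate text fragments with a single space
        }
        sb.append(text);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), excluded, sb);
    }
  }
}
```

With this shape, a title extractor can call {{getText}} with an empty set, while the content extractor passes {{exclusions(false)}} to drop the title, so neither caller needs a special case inside the helper.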


> Title duplicated in document body
> ---------------------------------
>
>                 Key: NUTCH-1749
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1749
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Greg Padiasek
>         Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts document title into document content. Since the title alone can be retrieved via DOMContentUtils.getTitle() and content is retrieved via DOMContentUtils.getText(), there is no need to duplicate title in the content. When title is included in the content it becomes difficult/impossible to extract document body without title. A need to extract document body without title is visible when user wants to index or display body and title separately.
> Attached is a patch which prevents including title in document content in the HTML parser plugin.



--
This message was sent by Atlassian JIRA
(v6.2#6252)