Posted to dev@nutch.apache.org by "Nguyen Manh Tien (JIRA)" <ji...@apache.org> on 2013/11/24 09:44:35 UTC

[jira] [Updated] (NUTCH-1673) Title isn't reset in MoreIndexingFilter

     [ https://issues.apache.org/jira/browse/NUTCH-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nguyen Manh Tien updated NUTCH-1673:
------------------------------------

    Attachment: NUTCH-1673.patch

> Title isn't reset in MoreIndexingFilter
> ---------------------------------------
>
>                 Key: NUTCH-1673
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1673
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 2.2.1
>            Reporter: Nguyen Manh Tien
>         Attachments: NUTCH-1673.patch
>
>
> In the resetTitle function, the title is added to the doc. We need to remove the old title before adding the new one. Currently this results in an error when indexing to Solr if the title field is not a multi-valued field.
>   private NutchDocument resetTitle(NutchDocument doc, WebPage page, String url) {
>     ...
>     for (int i = 0; i < patterns.length; i++) {
>       if (matcher.contains(contentDisposition.toString(), patterns[i])) {
>         ...
>         doc.add("title", result.group(1));
>         break;
>       }
>     }
>     return doc;
>   }
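
The remove-before-add idea from the description can be sketched as follows. This is only an illustration under stated assumptions: SimpleDoc is a hypothetical stand-in for a document with multi-valued fields, not the real NutchDocument class, and the simplified resetTitle signature here is made up for the example. The actual change is in the attached NUTCH-1673.patch and may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TitleResetSketch {

  /** Hypothetical stand-in for a document whose fields can hold several values. */
  static class SimpleDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    /** Appends a value; an existing value for the same field is kept. */
    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Drops every value stored under the given field name. */
    List<String> removeField(String name) {
      return fields.remove(name);
    }

    /** Returns the values of a field, or an empty list if the field is absent. */
    List<String> getValues(String name) {
      return fields.getOrDefault(name, new ArrayList<>());
    }
  }

  /** Replaces the title instead of appending a second one (assumed fix). */
  static SimpleDoc resetTitle(SimpleDoc doc, String newTitle) {
    doc.removeField("title");   // remove the old title first
    doc.add("title", newTitle); // the field now holds exactly one value
    return doc;
  }

  public static void main(String[] args) {
    SimpleDoc doc = new SimpleDoc();
    doc.add("title", "Title from HTML");
    resetTitle(doc, "report.pdf");
    System.out.println(doc.getValues("title")); // prints [report.pdf]
  }
}
```

Without the removeField call, the document in this sketch would end up with two title values, which is the situation a single-valued Solr field rejects.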



--
This message was sent by Atlassian JIRA
(v6.1#6144)