Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/06/08 22:19:01 UTC

[jira] [Updated] (NUTCH-1708) use same id when indexing and deleting redirects

     [ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1708:
-----------------------------------

    Attachment: NUTCH-1708-trunk-v1.patch
                NUTCH-1708-2x-v1.patch

Attached patches for both 1.x and 2.x to achieve consistent behavior:
* (1.x only) Add a field "id" in IndexerMapReduce which contains the "real" URL (not the reprUrl), and remove the copyField statement in solrindex-mapping.xml.
* Field "url" had the attribute "required=true" in the Solr schema*.xml files. Shouldn't this apply to field "id" instead? The "id" field is always required to allow for proper deletions and updates, so the "required" flag is now moved to "id" (1.x and 2.x).
* The Elasticsearch indexer now uses "id" as its ID field instead of "url" (both 1.x and 2.x).
* Add field "id" in IndexingFiltersChecker.

For now only the combination <trunk,Solr> has been tested.

Open question:
* index-basic in 2.x fills a field "orig" which contains the "real" URL (not the reprUrl). Is this field still required, given that the "id" field should now hold the same value?


> use same id when indexing and deleting redirects
> ------------------------------------------------
>
>                 Key: NUTCH-1708
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1708
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Sebastian Nagel
>         Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch
>
>
> Redirect targets are indexed using the "representative URL" (repr URL):
> * In Fetcher, the repr URL is determined by URLUtil.chooseRepr() and stored in CrawlDatum (CrawlDb). The repr URL is either the source or the target URL of the redirect pair.
> * NutchField "url" is filled by the basic indexing filter with the repr URL.
> * The id field used as unique key is filled from url per solrindex-mapping.xml.
> Deletion of redirects is done in IndexerMapReduce.reduce() by key, which is the URL of the redirect source. If the source URL is chosen as repr URL, a redirect target may get erroneously deleted.
> Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that the same URL is deleted and added:
> {code}
> delete  http://wiki.apache.org/nutch
> add     http://wiki.apache.org/nutch
> {code}
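
The collision described in the issue can be reproduced without Nutch. The following is a minimal, self-contained Java sketch, not code from the attached patches: the class RedirectIdSketch, its Op type, and the methods indexRedirect and hasCollision are hypothetical names made up for illustration, and redirect handling is reduced to one delete and one add. It assumes the source URL was chosen as repr URL, as in the test crawl above.

```java
import java.util.ArrayList;
import java.util.List;

public class RedirectIdSketch {

    /** One index operation, similar to a line logged by a dummy index writer. */
    public static final class Op {
        public final String action; // "delete" or "add"
        public final String id;     // value of the unique key field

        public Op(String action, String id) {
            this.action = action;
            this.id = id;
        }

        @Override
        public String toString() {
            return action + "\t" + id;
        }
    }

    /**
     * Hypothetical stand-in for the indexing step of one redirect pair.
     *
     * @param idFromReprUrl true: the id is copied from the "url" field, which
     *                      holds the repr URL (behavior described in the issue);
     *                      false: the id is the real URL of the fetched page
     */
    public static List<Op> indexRedirect(String source, String target,
                                         String reprUrl, boolean idFromReprUrl) {
        List<Op> ops = new ArrayList<>();
        // The redirect source is deleted by its own URL.
        ops.add(new Op("delete", source));
        // The redirect target is added under the chosen id.
        ops.add(new Op("add", idFromReprUrl ? reprUrl : target));
        return ops;
    }

    /** Returns true if any delete and any add refer to the same id. */
    public static boolean hasCollision(List<Op> ops) {
        for (Op del : ops) {
            if (!"delete".equals(del.action)) continue;
            for (Op add : ops) {
                if ("add".equals(add.action) && add.id.equals(del.id)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String source = "http://wiki.apache.org/nutch";
        String target = "http://wiki.apache.org/nutch/";
        String reprUrl = source; // assumption: the source was chosen as repr URL

        for (boolean idFromReprUrl : new boolean[] {true, false}) {
            List<Op> ops = indexRedirect(source, target, reprUrl, idFromReprUrl);
            System.out.println("id from repr URL: " + idFromReprUrl);
            for (Op op : ops) System.out.println("  " + op);
            System.out.println("  collision: " + hasCollision(ops));
        }
    }
}
```

With the id copied from the repr URL, the delete and the add carry the same id, so the outcome depends on the order in which the index applies them. With the real URL as id, the two ids differ by the trailing slash and the operations cannot interfere.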



--
This message was sent by Atlassian JIRA
(v6.2#6252)