Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2010/09/06 15:34:33 UTC
[jira] Issue Comment Edited: (NUTCH-716) Make subcollection index
filed multivalued
[ https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906488#action_12906488 ]
Markus Jelsma edited comment on NUTCH-716 at 9/6/10 9:32 AM:
-------------------------------------------------------------
This patch concatenates multiple values into a single string instead of adding each value separately to a multivalued field. For a test crawl I defined the following two subcollection definitions:
<subcollection>
<name>asdf</name>
<id>asdf-site</id>
<whitelist>http://asdf/</whitelist>
<blacklist/>
</subcollection>
<subcollection>
<name>news</name>
<id>asdf-nieuws</id>
<whitelist>http://asdf/news/</whitelist>
<blacklist/>
</subcollection>
Reindexing the segments by sending them to Solr yields the following results; the second document is a news URL:
<doc>
<arr name="subcollection">
<str>asdf</str>
</arr>
<str name="url">http://asdf/home/</str>
</doc>
<doc>
<arr name="subcollection">
<str>asdf news</str>
</arr>
<str name="url">http://asdf/news/</str>
</doc>
Instead, I expected the following result for the second document:
<doc>
<arr name="subcollection">
<str>asdf</str>
<str>news</str>
</arr>
<str name="url">http://asdf/news/</str>
</doc>
My Solr schema.xml has the following declaration for the subcollection field:
<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true" />
This is with the latest nightly build I could find:
nutch-2010-07-07_04-49-04
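To make the difference concrete, here is a minimal Java sketch of the two behaviours. It is not the code from the patch or from the subcollection plugin: the class `SubcollectionFieldSketch`, its methods, and the plain substring match against the whitelist are all assumptions made for illustration, and a list of strings stands in for the values of the Solr field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the indexing step; not Nutch code.
public class SubcollectionFieldSketch {

    // Names of all subcollections whose whitelist pattern occurs in the URL.
    // Plain substring matching is an assumption made for this sketch.
    static List<String> matchingNames(String url, Map<String, String> whitelistByName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> e : whitelistByName.entrySet()) {
            if (url.contains(e.getValue())) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    // Observed behaviour: all names joined into one field value.
    static List<String> concatenated(List<String> names) {
        List<String> values = new ArrayList<>();
        if (!names.isEmpty()) {
            values.add(String.join(" ", names));
        }
        return values;
    }

    // Expected behaviour: one field value per matching subcollection.
    static List<String> multiValued(List<String> names) {
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        // The two definitions from the test crawl, keyed by name.
        Map<String, String> defs = new LinkedHashMap<>();
        defs.put("asdf", "http://asdf/");
        defs.put("news", "http://asdf/news/");

        List<String> names = matchingNames("http://asdf/news/", defs);
        System.out.println(concatenated(names)); // [asdf news]
        System.out.println(multiValued(names));  // [asdf, news]
    }
}
```

With a field of type string, the concatenated form means a filter on subcollection:news would not match the news document, which is why the per-value form matters here.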
> Make subcollection index filed multivalued
> ------------------------------------------
>
> Key: NUTCH-716
> URL: https://issues.apache.org/jira/browse/NUTCH-716
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.0.0
> Reporter: Dmitry Lihachev
> Fix For: 1.2, 2.0
>
> Attachments: NUTCH-716-1_2.patch, NUTCH-716_multivalued_subcollection.patch
>
>
> Looks like a reasonable thing to do. Marking as 1.2 and will commit if no one objects