Posted to dev@nutch.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/08/09 16:27:42 UTC
[jira] Created: (NUTCH-541) Index url field untokenized
Index url field untokenized
---------------------------
Key: NUTCH-541
URL: https://issues.apache.org/jira/browse/NUTCH-541
Project: Nutch
Issue Type: New Feature
Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 1.0.0
The url field is indexed as Store.YES, Index.TOKENIZED. We also need an untokenized version of the url field in some contexts:
1. For deleting duplicates by url (at search time); see NUTCH-455.
2. For restricting the search to a certain url (may be used in the case of RSS search, where each entry in the RSS feed is added as a distinct document with (possibly) the same url).
query-url extends FieldQueryFilter so:
Query: url:http://www.apache.org/
Parsed: url:"http http-www http-www-apache www www-apache apache org"
Translated: +url:"http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org"
3. For accessing documents in the search servers (using a query plugin).
I suggest we add the url field as follows in index-basic, and implement a query-url-untoken plugin:
doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
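To illustrate why an exact-match field helps with cases 1 and 2, here is a minimal sketch in plain Java. It does not use Lucene or Nutch: the tokenize method is a hypothetical, simplified stand-in for an analyzer (lower-casing and splitting on non-alphanumeric characters), not the analyzer Nutch actually applies to the url field. The class and method names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class UrlTokenDemo {
    // Hypothetical, simplified analyzer: lower-case the input and split it
    // on runs of non-alphanumeric characters. Not Nutch's real analyzer.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.toLowerCase().split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Comparison as a tokenized field would see it: only the token sequence.
    static boolean sameTokens(String a, String b) {
        return tokenize(a).equals(tokenize(b));
    }

    // Comparison as an untokenized field would see it: the whole string is one term.
    static boolean sameTerm(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String a = "http://www.apache.org/";
        String b = "http://www.apache.org";
        System.out.println(tokenize(a));      // [http, www, apache, org]
        System.out.println(sameTokens(a, b)); // true: the token sequences are identical
        System.out.println(sameTerm(a, b));   // false: the raw strings differ
    }
}
```

Under these assumptions, two distinct URLs collapse to the same token sequence, so a tokenized field alone cannot serve as an exact key for deduplication or for restricting a search to one url, while a single-term field can.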
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-541) Index url field untokenized
Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated NUTCH-541:
------------------------------------
Fix Version/s: (was: 1.1)
- pushing this out per http://bit.ly/c7tBv9