Posted to dev@nutch.apache.org by "Rajasekar Karthik (JIRA)" <ji...@apache.org> on 2007/11/07 19:59:50 UTC
[jira] Created: (NUTCH-573) Multiple Domains - Query Search
Multiple Domains - Query Search
-------------------------------
Key: NUTCH-573
URL: https://issues.apache.org/jira/browse/NUTCH-573
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 0.9.0, 0.8.1, 0.8, 0.7.2, 0.7.1, 0.7, 0.6
Environment: All
Reporter: Rajasekar Karthik
Priority: Minor
Searching multiple domains can be done in Lucene, but not that efficiently in Nutch.
Query:
+content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")
works in Lucene, but the same concept does not work in Nutch.
In Lucene, it works with
org.apache.lucene.analysis.KeywordAnalyzer
org.apache.lucene.analysis.standard.StandardAnalyzer
but NOT on
org.apache.lucene.analysis.SimpleAnalyzer
Is the Nutch analyzer based on SimpleAnalyzer? If so, is there a workaround to make this work? Is there an option to change which analyzer Nutch uses?
Just FYI, another solution (inefficient, I believe) that seems to work in Nutch:
<query> -site:"ccc.com" -site:"ddd.com"
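The exclusion workaround above can be sketched in plain Java. This is only an illustration of why it scales poorly: the class name, the method name, and the idea of having a list of all crawled sites at hand are assumptions made for the example, and nothing here calls Nutch or Lucene.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper; not part of Nutch. */
final class ExclusionWorkaroundSketch {

    /**
     * Appends one -site:"..." clause for every known site that is NOT wanted.
     * The query grows with the number of excluded sites, which is why this
     * workaround is inefficient for large crawls.
     */
    static String excludeOthers(String query, List<String> allSites, Set<String> wanted) {
        StringBuilder sb = new StringBuilder(query);
        for (String site : allSites) {
            if (!wanted.contains(site)) {
                sb.append(" -site:\"").append(site).append('"');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> all = List.of("aaa.com", "bbb.com", "ccc.com", "ddd.com");
        System.out.println(excludeOthers("abc", all, Set.of("aaa.com", "bbb.com")));
    }
}
```

With four known sites and two wanted ones, the query already carries two exclusion clauses; a positive OR over the wanted sites would stay proportional to the number of sites actually requested.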
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542382 ]
Andrzej Bialecki commented on NUTCH-573:
-----------------------------------------
I don't think this is the right direction. Using commas is IMHO not intuitive - if we go that far, I think it would be better to refactor the parser and Query to support nested clauses, if we really want to support more complex query syntax. This way we could also support parentheses.
Also, I'm not sure if the original reporter asked for a generic solution that would work with every field - if the issue at hand is just the site: field, then we can use "raw field" and a RawQueryFilter to parse multiple terms within the SiteQueryFilter implementation, without changing the global query syntax.
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated NUTCH-573:
------------------------------------
Fix Version/s: (was: 1.1)
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Attachments: multiTermQuery_v1.patch
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Srikarthik Venkataraman (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776451#action_12776451 ]
Srikarthik Venkataraman commented on NUTCH-573:
-----------------------------------------------
I am very interested in using the Multiterm Query feature for searching in multiple domains.
Can you please let me know if this patch has been tested and is available in any of your release builds?
Can we expect this fix to be available in version 1.1, or could you provide an intermediate release?
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.1
>
> Attachments: multiTermQuery_v1.patch
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542422 ]
Doğacan Güney commented on NUTCH-573:
-------------------------------------
Why not fix NUTCH-479 and use it generally? If I understand correctly, after NUTCH-479 we can support a query like "site:www.apache.org OR site:www.mozilla.org" , can't we? IMHO, all-uppercase OR is more intuitive...
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sami Siren updated NUTCH-573:
-----------------------------
Patch Info: [Patch Available]
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated NUTCH-573:
--------------------------------
Attachment: multiTermQuery_v1.patch
Here is a patch that enables querying multiple values for the same field.
# The query syntax is changed to enable [<field>:]term1(,term2)* type queries, where multiple terms are converted to a boolean OR query.
# Query.Clause, Query.Term, and Query.Phrase are changed significantly.
This is an initial version of the patch for review; today I will test it a little more.
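As a rough illustration of the [<field>:]term1(,term2)* idea, the sketch below splits such a clause and renders it as a required group of optional terms in Lucene query syntax. It is not the code from multiTermQuery_v1.patch: the class and method names are hypothetical, it uses only the Java standard library, and it naively treats the first colon as the field separator.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not taken from the attached patch. */
final class MultiTermClauseSketch {

    /** Returns the field name before the first ':', or "" if there is none. */
    static String field(String clause) {
        int colon = clause.indexOf(':');
        return colon < 0 ? "" : clause.substring(0, colon);
    }

    /** Returns the comma-separated terms after the optional "field:" prefix. */
    static List<String> terms(String clause) {
        int colon = clause.indexOf(':');
        String values = colon < 0 ? clause : clause.substring(colon + 1);
        List<String> out = new ArrayList<>();
        for (String term : values.split(",")) {
            if (!term.isEmpty()) {
                out.add(term); // skip empty pieces such as "a,,b"
            }
        }
        return out;
    }

    /** Renders the clause as +(field:"t1" field:"t2" ...), i.e. an OR over the terms. */
    static String toLuceneSyntax(String clause) {
        String field = field(clause);
        List<String> terms = terms(clause);
        StringBuilder sb = new StringBuilder("+(");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (!field.isEmpty()) {
                sb.append(field).append(':');
            }
            sb.append('"').append(terms.get(i)).append('"');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(toLuceneSyntax("site:www.aaa.com,www.bbb.com"));
    }
}
```

For the input site:www.aaa.com,www.bbb.com this produces the same site group as the query in the original report, +(site:"www.aaa.com" site:"www.bbb.com").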
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Priority: Minor
> Attachments: multiTermQuery_v1.patch
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542414 ]
Andrzej Bialecki commented on NUTCH-573:
-----------------------------------------
I hope I didn't come across as arguing - your patch looks good from the technical point of view, I'm just trying to figure out the long-term impact of this patch.
I agree, the full Lucene syntax is too complex - but even the Google syntax falls into the "advanced" category, i.e. you need to learn how to construct such a query. As far as I could determine, Google indeed treats an infix comma as a list operator, but only for some fields, such as inurl:. Try the following queries:
{code}
site:www.apache.org server
site:www.cnn.com server
site:www.apache.org,www.cnn.com server
{code}
For other fields, such as intitle, inanchor it gives inconsistent results (maybe I discovered a Google bug :) ).
Regarding the question whether to enable it for any field: I think one important exception would be "raw fields", where a QueryFilter implementation wants to interpret the input token differently, and in such cases infix comma may be a valid token character. Perhaps we could add support for an escape character, which turns comma into a regular token character?
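The escape-character idea could look roughly like the sketch below, which splits a field value on commas unless a comma is preceded by a backslash. The choice of backslash as the escape character and the class and method names are assumptions made for illustration; the thread does not settle on any of them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of comma splitting with an escape character. */
final class EscapedCommaSketch {

    /**
     * Splits on unescaped commas. A backslash makes the next character
     * literal, so "\," yields a comma inside the term instead of a split.
     */
    static List<String> splitTerms(String value) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' && i + 1 < value.length()) {
                current.append(value.charAt(++i)); // keep the escaped character as-is
            } else if (c == ',') {
                terms.add(current.toString());     // unescaped comma ends the term
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        terms.add(current.toString());
        return terms;
    }

    public static void main(String[] args) {
        // The Java literal "a\\,b,c" is the input text a\,b,c
        System.out.println(splitTerms("a\\,b,c"));
    }
}
```

A raw-field QueryFilter that needs commas in its token would then receive a,b as a single term, while ordinary fields would still get the list behaviour.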
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated NUTCH-573:
--------------------------------
Fix Version/s: 1.0.0
Priority: Major (was: Minor)
Affects Version/s: (was: 0.8.1)
(was: 0.8)
(was: 0.7.2)
(was: 0.7.1)
(was: 0.7)
(was: 0.6)
Incrementing priority to major.
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542449 ]
Enis Soztutar commented on NUTCH-573:
-------------------------------------
@Andrzej
I recall Google supporting a comma-delimited syntax, but it doesn't work now, does it? Maybe I remembered wrong.
http://www.google.com/intl/en/help/operators.html confirms that comma-delimited syntax is not allowed, but we can make allintitle: ... type queries.
I think the raw fields, which are site, date, type and lang, are unlikely to contain commas, so we may not have to worry about escape characters. As far as I know, we treat a comma as white space, so searching for comma-containing phrases in raw fields is not enabled anyway. Of course, we may fix this should it be needed.
@Dogacan
I share the same concerns about the performance and complexity of NUTCH-479. However, it may be good if it were implemented correctly.
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated NUTCH-573:
------------------------------------
- pushing this out per http://bit.ly/c7tBv9
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Attachments: multiTermQuery_v1.patch
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543067 ]
Enis Soztutar commented on NUTCH-573:
-------------------------------------
So, how shall we proceed with this one?
I give +1 to committing this and dealing with NUTCH-479 in its own issue. Having both multi-term queries and OR syntax won't be too bad, I guess.
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Commented: (NUTCH-573) Multiple Domains - Query Search
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542389 ]
Enis Soztutar commented on NUTCH-573:
-------------------------------------
bq. Using commas is IMHO not intuitive
With respect, I have to disagree. We cannot expect search users to type queries of the form +(site:www.somesite.com site:www.foo.com). Last time I checked, Google used comma syntax. I think that supporting only a subset of the query syntax that Lucene supports was the initial reason for implementing a separate query parser for Nutch, so that ordinary search users do not get confused and can use the de facto syntax.
bq. Also, I'm not sure if the original reporter asked for a generic solution that would work with every field - if the issue at hand is just the site: field, then we can use "raw field" and a RawQueryFilter to parse multiple terms within the SiteQueryFilter implementation, without changing the global query syntax.
The original intention was to allow this only in site queries; however, I cannot see a reason not to enable this for other fields.
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Fix For: 1.0.0
>
> Attachments: multiTermQuery_v1.patch
[jira] Assigned: (NUTCH-573) Multiple Domains - Query Search
Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar reassigned NUTCH-573:
-----------------------------------
Assignee: Enis Soztutar
> Multiple Domains - Query Search
> -------------------------------
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Priority: Minor