Posted to dev@nutch.apache.org by "Rajasekar Karthik (JIRA)" <ji...@apache.org> on 2007/11/07 19:59:50 UTC

[jira] Created: (NUTCH-573) Multiple Domains - Query Search

Multiple Domains - Query Search
-------------------------------

                 Key: NUTCH-573
                 URL: https://issues.apache.org/jira/browse/NUTCH-573
             Project: Nutch
          Issue Type: Improvement
          Components: searcher
    Affects Versions: 0.9.0, 0.8.1, 0.8, 0.7.2, 0.7.1, 0.7, 0.6
         Environment: All
            Reporter: Rajasekar Karthik
            Priority: Minor


Searching multiple domains can be done in Lucene, but not as efficiently in Nutch.
Query:
+content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")

works in Lucene, but the same concept does not work in Nutch.

In Lucene, it works with 
org.apache.lucene.analysis.KeywordAnalyzer
org.apache.lucene.analysis.standard.StandardAnalyzer 

but NOT on
org.apache.lucene.analysis.SimpleAnalyzer 

Is the Nutch analyzer based on SimpleAnalyzer? If so, is there a workaround to make this work? Is there an option to change which analyzer Nutch uses?

Just FYI, another solution (inefficient, I believe) that seems to work in Nutch:
<query> -site:"ccc.com" -site:"ddd.com" 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542382 ] 

Andrzej Bialecki  commented on NUTCH-573:
-----------------------------------------

I don't think this is the right direction. Using commas is IMHO not intuitive - if we go that far, I think it would be better to refactor the parser and Query to support nested clauses, if we really want to support more complex query syntax. This way we could also support parentheses.

Also, I'm not sure if the original reporter asked for a generic solution that would work with every field - if the issue at hand is just the site: field, then we can use "raw field" and a RawQueryFilter to parse multiple terms within the SiteQueryFilter implementation, without changing the global query syntax.



[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-573:
------------------------------------

    Fix Version/s:     (was: 1.1)



[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Srikarthik Venkataraman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776451#action_12776451 ] 

Srikarthik Venkataraman commented on NUTCH-573:
-----------------------------------------------

I am very interested in using the multi-term query feature for searching in multiple domains.
Can you please let me know whether this patch has been tested and is available in any of your release builds?

Can we expect this fix to be available in version 1.1, or could you provide us with an intermediate release?




[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542422 ] 

Doğacan Güney commented on NUTCH-573:
-------------------------------------

Why not fix NUTCH-479 and use it generally? If I understand correctly, after NUTCH-479 we can support a query like "site:www.apache.org OR site:www.mozilla.org", can't we? IMHO, all-uppercase OR is more intuitive...



[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-573:
-----------------------------

    Patch Info: [Patch Available]



[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-573:
--------------------------------

    Attachment: multiTermQuery_v1.patch

Here is a patch that enables querying multiple values for the same field.
# The query syntax is changed to allow queries of the form [<field>:]term1(,term2)*, where multiple terms are converted to a boolean OR query.
# Query.Clause, Query.Term, and Query.Phrase are changed significantly.

This is an initial version of the patch for review; I will test it a bit more today.
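
The proposed syntax can be illustrated with a minimal sketch. This is not the code in multiTermQuery_v1.patch (which changes the Java classes Query.Clause, Query.Term, and Query.Phrase); it is a standalone Python illustration, and the names expand_clause, to_or_group, and the "DEFAULT" field label are hypothetical. It assumes the field name ends at the first colon and that terms contain no commas.

```python
def expand_clause(clause, default_field="DEFAULT"):
    """Split '[field:]term1(,term2)*' into (field, [terms]).

    Hypothetical helper: the field name is taken to end at the first
    colon, and empty terms (e.g. from a trailing comma) are dropped.
    """
    field, sep, rest = clause.partition(":")
    if not sep:
        # No explicit field: the whole clause is the term list.
        field, rest = default_field, clause
    terms = [t for t in rest.split(",") if t]
    return field, terms


def to_or_group(clause):
    """Render the expanded clause as a Lucene-style optional (OR) group."""
    field, terms = expand_clause(clause)
    return "(" + " ".join(f"{field}:{t}" for t in terms) + ")"


print(to_or_group("site:www.aaa.com,www.bbb.com"))
```

Under these assumptions, the comma-separated form expands to the same parenthesised group that the original report wrote by hand for Lucene.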




[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542414 ] 

Andrzej Bialecki  commented on NUTCH-573:
-----------------------------------------

I hope I didn't come across as arguing - your patch looks good from the technical point of view, I'm just trying to figure out the long-term impact of this patch.

I agree, the full Lucene syntax is too complex - but even the Google syntax falls into the "advanced" category, i.e. you need to learn how to construct such a query. As far as I could determine, Google indeed treats an infix comma as a list operator, but only for some fields, such as inurl:. Try the following queries:
{code}
site:www.apache.org server
site:www.cnn.com server
site:www.apache.org,www.cnn.com server
{code}

For other fields, such as intitle: and inanchor:, it gives inconsistent results (maybe I discovered a Google bug :) ).

Regarding the question whether to enable it for any field: I think one important exception would be "raw fields", where a QueryFilter implementation wants to interpret the input token differently, and in such cases infix comma may be a valid token character. Perhaps we could add support for an escape character, which turns comma into a regular token character?
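
To make the escape-character idea concrete, here is a minimal sketch of splitting a term list on unescaped commas only. It is an illustration, not part of Nutch or of the patch: the function name split_terms is hypothetical, and the choice of backslash as the escape character is an assumption, since no particular character is proposed above.

```python
def split_terms(value, escape="\\"):
    """Split on commas, except commas preceded by the escape character.

    Hypothetical helper: an escaped character is kept literally and the
    escape character itself is dropped; a trailing lone escape is kept.
    """
    terms, current, escaped = [], [], False
    for ch in value:
        if escaped:
            # Previous character was the escape: take this one literally.
            current.append(ch)
            escaped = False
        elif ch == escape:
            escaped = True
        elif ch == ",":
            # Unescaped comma ends the current term.
            terms.append("".join(current))
            current = []
        else:
            current.append(ch)
    if escaped:
        # Input ended with a lone escape character; keep it as-is.
        current.append(escape)
    terms.append("".join(current))
    return terms


print(split_terms(r"a\,b,c"))
```

With this scheme a raw-field filter would still receive a comma inside a single token when the user escapes it, while unescaped commas keep acting as the list separator.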



[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-573:
--------------------------------

        Fix Version/s: 1.0.0
             Priority: Major  (was: Minor)
    Affects Version/s:     (was: 0.8.1)
                           (was: 0.8)
                           (was: 0.7.2)
                           (was: 0.7.1)
                           (was: 0.7)
                           (was: 0.6)

Incrementing priority to major. 



[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542449 ] 

Enis Soztutar commented on NUTCH-573:
-------------------------------------

@Andrzej
I recall Google supporting comma-delimited syntax, but it doesn't work now, does it? Maybe I remembered wrong.
http://www.google.com/intl/en/help/operators.html confirms that comma-delimited syntax is not allowed, but we can make allintitle: ... type queries.

I think the raw fields, which are site, date, type and lang, are unlikely to contain commas, so we may not have to worry about escape characters. As far as I know, we treat a comma as white space, so searching for comma-containing phrases in raw fields is not possible anyway. Of course, we can fix this should it be needed.

@Dogacan
I share the same concerns about the performance and complexity of NUTCH-479. However, it may be good if it is implemented correctly.



[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-573:
------------------------------------


- pushing this out per http://bit.ly/c7tBv9



[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543067 ] 

Enis Soztutar commented on NUTCH-573:
-------------------------------------

So, how shall we proceed with this one?
I give +1 to committing this and dealing with NUTCH-479 in its own issue. Having both multi-term queries and OR syntax won't be too bad, I guess.




[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542389 ] 

Enis Soztutar commented on NUTCH-573:
-------------------------------------

bq. Using commas is IMHO not intuitive

With respect, I have to disagree. We cannot expect search users to type queries of the form +(site:www.somesite.com site:www.foo.com). Last time I checked, Google used comma syntax. I think the original reason for implementing a separate query parser for Nutch was to support only a subset of the Lucene query syntax, so that ordinary search users would not get confused and could use the de facto syntax.

bq. Also, I'm not sure if the original reporter asked for a generic solution that would work with every field - if the issue at hand is just the site: field, then we can use "raw field" and a RawQueryFilter to parse multiple terms within the SiteQueryFilter implementation, without changing the global query syntax.
The original intention was to allow this only in site queries; however, I cannot see a reason not to enable it for other fields.






[jira] Assigned: (NUTCH-573) Multiple Domains - Query Search

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar reassigned NUTCH-573:
-----------------------------------

    Assignee: Enis Soztutar
