Posted to java-user@lucene.apache.org by Chris May <ch...@warwick.ac.uk> on 2005/07/27 21:47:01 UTC

Searching a URL with a PrefixQuery / Too Many Clauses (again...)

First, apologies for what seems to be something of an FAQ.

However, I've not been able to find an answer either in LIA or in the
relevant section of the FAQ
(http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831)

My setup is as follows: I have an index of a few hundred thousand web
pages. I'd like to be able to construct queries that search for some
arbitrary text within a specified URL, kind of like Google's syntax:

searchterm +site:www.foo.com/some/section

So, I have the page title & content indexed, and the URL stored as a
Keyword field, and I imagined that I'd be able to construct a query
something like this:

String[] fields = new String[] {DocumentFields.TITLE, DocumentFields.CONTENT};
Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));

However, as soon as the PrefixQuery matches more than a thousand or so
documents (one term per unique URL), I get a BooleanQuery.TooManyClauses
exception, as you might expect.

AFAICS the solutions suggested in the FAQ don't apply here: I'm
already using a Filter, and that's not helping (pace suggestion 1);
I don't think I can reduce the number of terms in the index, else my
URLs wouldn't be unique any more; and increasing the number of clauses
seems like a poor choice from a scalability point of view - I
anticipate queries that could filter perhaps a hundred thousand
documents or so.
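
(For context: a PrefixQuery is rewritten into a BooleanQuery with one
clause per matching term, so the limit I'm hitting is BooleanQuery's
clause cap. The FAQ's 'increase the number of clauses' option amounts
to something like the line below - which works, but just postpones the
problem:)

BooleanQuery.setMaxClauseCount(100 * 1024);  // default cap is 1024; cost grows with it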

I'm guessing that it might be possible to do something smart by
splitting the URL up into multiple fields - for example, one for the
host and one for the path, or even one for the host and one for
host+path together - but I'm not clear on exactly how I'd use the two
fields, and how they'd help. Can someone enlighten me?

Thanks in advance

Chris







Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 28, 2005, at 12:37 PM, Chris May wrote:

> Works beautifully (at least on my 30K-document test index). I'll need
> to do some fiddling if I want to allow partial URLs (e.g.
> http://www2.warwick.ac.uk/ab* to match http://www2.warwick.ac.uk/about)
> but I can see how to do that, I think (and I'm not sure I need it anyway).
>
>  Thanks Scott!
>
> Incidentally, is there an easy way to make QueryParser not treat the
> colon in 'http://' as a term separator? It seems that URLs get broken
> into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before they
> get fed to my custom analyzer. I got round it by just constructing the
> PhraseQuery by hand, but I wonder if there's an easier way?

I'm not sure what string you're passing to QP, but the : denotes a
field selector (such as title:lucene).  There is no easy way for
QueryParser to deal with that differently - it'd be a custom parser at
that point.  You can backslash-escape it (\:), but that is probably
not desirable.  Or you could pre-process the string from the user
before handing it to QP and escape it under the covers.
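
For example, a trivial pre-processing helper (this escapes only the
colon, so the rest of the query syntax still works):

// Escape ':' so QueryParser sees the URL as a single term rather
// than as a field:term pair.
public static String escapeColons(String userText) {
    StringBuffer escaped = new StringBuffer();
    for (int i = 0; i < userText.length(); i++) {
        char c = userText.charAt(i);
        if (c == ':') {
            escaped.append('\\');
        }
        escaped.append(c);
    }
    return escaped.toString();
}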

     Erik






Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Posted by Chris May <ch...@warwick.ac.uk>.
Works beautifully (at least on my 30K-document test index). I'll need
to do some fiddling if I want to allow partial URLs (e.g.
http://www2.warwick.ac.uk/ab* to match http://www2.warwick.ac.uk/about)
but I can see how to do that, I think (and I'm not sure I need it anyway).

  Thanks Scott!

Incidentally, is there an easy way to make QueryParser not treat the
colon in 'http://' as a term separator? It seems that URLs get broken
into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before they
get fed to my custom analyzer. I got round it by just constructing the
PhraseQuery by hand, but I wonder if there's an easier way?
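
(For reference, the by-hand construction was only a few lines - roughly
this, using the same custom analyzer I use at index time and the
TokenStream API from 1.4:)

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Build an ordered, zero-slop phrase from the tokens the analyzer
// produces for the URL string (PhraseQuery's default slop is 0).
static PhraseQuery urlPhrase(Analyzer analyzer, String url) throws IOException {
    PhraseQuery phrase = new PhraseQuery();
    TokenStream tokens = analyzer.tokenStream("url", new StringReader(url));
    for (Token t = tokens.next(); t != null; t = tokens.next()) {
        phrase.add(new Term("url", t.termText()));
    }
    return phrase;
}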

Chris



Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Posted by Scott Ganyo <sc...@ganyo.com>.
Chris,

How about indexing the domain as one field and each part of the path
as separate terms in another field?  I'm sure you've probably already
thought of doing this... and maybe discarded the idea because you'd
lose the position information.  However, even though you can't simply
split the URL on '/' and shove it into the field, you can add the
position information back into each term and then put it into the
field.  Then you would be able to completely ditch the prefix query
and still retrieve the documents using the entire, ordered path in
(I think) the most efficient way possible.

For example:

http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/

becomes something like (using an n/ prefix to encode each segment's position):

domain: www2.warwick.ac.uk
path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees, 7/modules, 8/commonlaw

And you could search based on any prefix you desired.  For example,
searching for this:

http://www2.warwick.ac.uk/fac/soc/law/*

would end up being a Lucene search that looks something like this
(note: not query parser syntax!):

domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND path: 3/law
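
In code, an untested sketch of both halves might look like this
(1.4-era API; the field and method names are just examples):

import java.net.URL;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Index time: one untokenized term for the host, plus one positional
// term per path segment ("1/fac", "2/soc", ...).
static void addUrlFields(Document doc, URL url) {
    doc.add(Field.Keyword("domain", url.getHost()));
    String[] segments = url.getPath().split("/");
    int pos = 0;
    for (int i = 0; i < segments.length; i++) {
        if (segments[i].length() == 0) continue;  // skip empty segments
        pos++;
        doc.add(Field.Keyword("path", pos + "/" + segments[i]));
    }
}

// Query time: every segment of the requested prefix (with any trailing
// '*' already stripped) becomes a required TermQuery clause.
static BooleanQuery urlPrefixQuery(URL prefix) {
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("domain", prefix.getHost())), true, false);
    String[] segments = prefix.getPath().split("/");
    int pos = 0;
    for (int i = 0; i < segments.length; i++) {
        if (segments[i].length() == 0) continue;
        pos++;
        query.add(new TermQuery(new Term("path", pos + "/" + segments[i])), true, false);
    }
    return query;
}

And the nice property: the clause count is just the depth of the
requested path, so it never gets anywhere near the BooleanQuery limit.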

Does that make sense?  Would it work for you?

S



Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 27, 2005, at 4:56 PM, Chris May wrote:

> Always domain + part of a path e.g.
>
> url:http://blogs.warwick.ac.uk/chrismay/*
>
> or
>
> url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*
>
> or
>
> url:http://www2.warwick.ac.uk/services/its/*
>
>
> ... and so on. Part of the problem is that we may need to go an
> arbitrary number of levels down the path to get an acceptably small
> set of documents to start from - we couldn't impose a rule that said
> something like 'specify the first 2 directories on the path'
> (cf. my second example). We wouldn't need to query for the same path
> over different domains though (e.g. url:*.warwick.ac.uk/about/*)

Here's an idea off the top of my head that I haven't fully explored.
Maybe it'll do the trick...

Adjust the analysis of the url field to tokenize something like
http://lucene.apache.org/nutch/ into [http://lucene.apache.org] and
[/nutch].  Prefix queries, as you've shown, would be turned into an
ordered zero-slop PhraseQuery.  The tokenization shouldn't be too
difficult.  But getting QueryParser happy will require subclassing
and overriding getWildcardQuery to return an appropriate PhraseQuery.
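
Roughly this, though I haven't tried compiling it (1.4 API; depending
on the Lucene version a trailing-* term may be routed to
getPrefixQuery instead, so that is delegated too):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class UrlQueryParser extends QueryParser {
    private final Analyzer urlAnalyzer;

    public UrlQueryParser(String defaultField, Analyzer analyzer) {
        super(defaultField, analyzer);
        this.urlAnalyzer = analyzer;
    }

    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        if (!"url".equals(field)) {
            return super.getWildcardQuery(field, termStr);
        }
        // url:http://host/a/b/* becomes a zero-slop PhraseQuery over the
        // tokens of the prefix (e.g. [http://host], [/a], [/b]).
        String prefix = termStr.endsWith("*")
                ? termStr.substring(0, termStr.length() - 1) : termStr;
        PhraseQuery phrase = new PhraseQuery();  // default slop 0: ordered, adjacent
        try {
            TokenStream tokens = urlAnalyzer.tokenStream(field, new StringReader(prefix));
            for (Token t = tokens.next(); t != null; t = tokens.next()) {
                phrase.add(new Term(field, t.termText()));
            }
        } catch (IOException e) {
            throw new ParseException(e.getMessage());
        }
        return phrase;
    }

    protected Query getPrefixQuery(String field, String termStr) throws ParseException {
        return getWildcardQuery(field, termStr + "*");
    }
}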

That will definitely solve the issue of too many clauses!  Are there
any issues with this approach?

     Erik




Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Posted by Chris May <ch...@warwick.ac.uk>.
Always domain + part of a path e.g.

url:http://blogs.warwick.ac.uk/chrismay/*

or

url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*

or

url:http://www2.warwick.ac.uk/services/its/*


... and so on. Part of the problem is that we may need to go an
arbitrary number of levels down the path to get an acceptably small
set of documents to start from - we couldn't impose a rule that said
something like 'specify the first 2 directories on the path'
(cf. my second example). We wouldn't need to query for the same path
over different domains though (e.g. url:*.warwick.ac.uk/about/*)

thanks

Chris




On 27 Jul 2005, at 21:33, Erik Hatcher wrote:

> Could you give some examples of the types of PrefixQueries you'd
> like to use?   Is it always at a granularity of domain and path?
> Or are you wanting to do a prefix on pieces of the domain and path?
>
>     Erik
>




Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Could you give some examples of the types of PrefixQueries you'd like
to use?   Is it always at a granularity of domain and path?  Or are
you wanting to do a prefix on pieces of the domain and path?

     Erik
