Posted to java-user@lucene.apache.org by Pradeep Kumar K <pr...@robosoftin.com> on 2002/05/11 10:48:11 UTC

Re: contains

Joshua

Thanks a lot for the reply. I think I did not explain my question clearly.
I am not searching only for a single letter; the search string can be a
letter or a set of letters.

Example: Consider the sentences
1) "God is love"
2) "Life is beutiful"
Currently, using Lucene, we can index these sentences with different
Analyzers. When I create a query to search the index, as far as I know
there are different types of queries: TermQuery, BooleanQuery,
PrefixQuery, etc. But when we supply a query to an IndexSearcher, the
results it returns show that unless you give the whole word, it finds
nothing.
I.e., if I search for "beuti" it doesn't return any results, but if we
search for "beutiful" it returns 1 hit.
Is there any way to create a query, using any of the Lucene Query
objects, that makes the IndexSearcher match a set of letters within a
word?

Best Wishes
Pradeep

Joshua O'Madadhain wrote:

>Pradeep:
>
>I think what Peter was trying to get at was the question "when is it
>useful for a search engine user to be able to search for words that
>contain a particular letter?"
>
>For a language like Chinese, it would certainly be useful to be able to
>search for a single character.  However, the informational content of a
>single letter in an alphabet-based language (such as English) is so low
>that I have trouble believing that it would be useful to be able to do
>this kind of search.
>
>That is to say: unless this feature has been presented to you as a
>requirement, you may want to think about how it might be used in practice
>before you spend a lot of time implementing it.
>
>Regards,
>
>Joshua O'Madadhain
>
> jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
>  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
> It's that moment of dawning comprehension that I live for--Bill Watterson
>My opinions are too rational and insightful to be those of any organization.
>
>On Thu, 11 Jul 2002, Pradeep Kumar K wrote:
>
>>Hi Peter
>>  I want to include an option called "contains" in my search application,
>>for example: Name "contains" 'p', like that...
>>Thanks for the reply.
>>Pradeep
>>
>>Peter Carlson wrote:
>>
>>>Do you really want to be able to find items by letter? Or do you have some
>>>other purpose that tokenizing by letter is trying to get around?
>>>
>>>If you do want to tokenize by letter, you can create your own tokenizer
>>>which breaks up items by letter. See the current tokenizers under
>>>org.apache.lucene.analysis.
>>>
>>>--Peter
>>>
>>>On 7/10/02 10:26 AM, "Pradeep Kumar K" <pr...@robosoftin.com> wrote:
>>>
>>>>Is it possible to search for a word that contains some letters?
>>>>Example: "God is love"
>>>>
>>>>How can I create a query to search for sentences containing "d"?
>>>>I found that Lucene tokenizes a sentence into words, not into letters.
>>>>Is this possible using Lucene? Can anybody give a clue for this?
>
>--
>To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
>For additional commands, e-mail: <ma...@jakarta.apache.org>
>
>  
>


Re: Too many open files?

Posted by Rosen Marinov <ro...@sirma.bg>.
Just close your Searcher once you have finished working with it.
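
A minimal sketch of that advice, written in Python rather than against Lucene's Java API. `ToySearcher`, `search_once` and `make_toy_index` are hypothetical names invented for illustration (none of them is a Lucene class); the toy searcher simply holds one open file handle per index file. The only point is the try/finally shape: each searcher opened for a request releases its descriptors even when the search fails, so they cannot pile up across concurrent searches.

```python
import os
import tempfile


class ToySearcher:
    """Hypothetical stand-in for a searcher: one open handle per index file."""

    def __init__(self, index_dir):
        self.handles = [
            open(os.path.join(index_dir, name), "rb")
            for name in sorted(os.listdir(index_dir))
        ]

    def search(self, term):
        # Placeholder "search": list the files whose bytes contain the term.
        needle = term.encode("utf-8")
        hits = []
        for handle in self.handles:
            handle.seek(0)
            if needle in handle.read():
                hits.append(os.path.basename(handle.name))
        return hits

    def close(self):
        for handle in self.handles:
            handle.close()
        self.handles = []


def search_once(index_dir, term):
    searcher = ToySearcher(index_dir)
    try:
        return searcher.search(term)
    finally:
        # Always release the descriptors, even if search() raises.
        searcher.close()


def make_toy_index(num_files=24):
    """Create a throwaway directory with 24 small files, as in the report."""
    index_dir = tempfile.mkdtemp()
    for i in range(num_files):
        with open(os.path.join(index_dir, "seg%02d" % i), "wb") as f:
            f.write(b"beutiful" if i == 3 else b"love")
    return index_dir
```

With 24 files per index, a searcher that is never closed costs 24 descriptors in this toy model, so a few hundred leaked searchers would be enough to exhaust a small limit.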

----- Original Message -----
From: "Hang Li" <hx...@careersite.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, July 23, 2002 5:59 PM
Subject: Too many open files?


> I have seen a lot of postings about this topic. Any final thoughts?
>
> We did a simple stress test; Lucene produced this error at somewhere
> between 30 and 80 concurrent searches.  The index directory has 24 files
> (15 fields), and
>
> ulimit -n
> 32768
>
> so there should be more than enough FDs.  Note that we did not write to
> the index while we were searching.  Any ideas? Thx.
>


Too many open files?

Posted by Hang Li <hx...@careersite.com>.
I have seen a lot of postings about this topic. Any final thoughts?

We did a simple stress test; Lucene produced this error at somewhere between
30 and 80 concurrent searches.  The index directory has 24 files (15 fields),
and

ulimit -n
32768

so there should be more than enough FDs.  Note that we did not write to the
index while we were searching.  Any ideas? Thx.





RE: contains

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.
On Tue, 16 Jul 2002, Lothar Simon wrote:

> Just to correct a few points:
> - The factor would be 2 * (average no of chars per word)/2 = (average
> no of chars per word).

Actually, I made a mistake in my earlier analysis, but your factor is also
inaccurate.  Ignoring other factors and overhead, storing just the set of
strings S = {s_1, s_2, ..., s_n} with corresponding lengths L = {l_1, l_2,
..., l_n} requires

(sum_i l_i) [sum of all lengths in L]

= (mean(L) * n) space
[where mean(L) = the above sum divided by the size of L (n)].


Storing all prefixes of each string requires

(sum_i sum_j j) [where j varies from 1 to l_i]

= (sum_i (l_i(l_i + 1))/2)

which is greater than 1/2(sum_i (l_i^2)).

Storing both prefixes and suffixes therefore requires

(sum_i (l_i^2 + l_i)), which is more than (sum_i (l_i^2))

!= (l_i * sum_i l_i) [because you can't take the extra factor of l_i
outside of the sum]

The mistake that you and I both made is this: the sum of the squares of
the lengths is not the same as the mean length squared times n.
  
E.g.: L = {2,3,4,6,7,8} [mean 5]
sum_i l_i = 30
2 * (sum_i sum_j j) = 208; 208/30 = 6.93 [correct multiplicative factor]
n * mean(L) * (mean(L) + 1) = 180; 180/30 = 6 [this is *wrong*]
and the original hypothesis was that the additional factor should have
been the mean, 5.
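
The sums in this example can be recomputed mechanically, which is a quick way to check the figures. The following is a minimal Python sketch; the function names are illustrative and nothing here is Lucene-specific. It counts characters only, ignoring all other index overhead, as the analysis above does.

```python
def plain_cost(lengths):
    """Characters needed to store each string once: sum_i l_i."""
    return sum(lengths)


def prefix_suffix_cost(lengths):
    """Characters needed to store every prefix and every suffix of each string."""
    # A string of length l has prefixes of length 1..l, costing l*(l+1)/2
    # characters; its suffixes cost the same again, giving l*(l+1) in total.
    return sum(l * (l + 1) for l in lengths)


def mean_based_estimate(lengths):
    """The same cost, estimated by substituting the mean length for every l."""
    n = len(lengths)
    mean = sum(lengths) / n
    return n * mean * (mean + 1)


L = [2, 3, 4, 6, 7, 8]
exact = prefix_suffix_cost(L)
estimate = mean_based_estimate(L)
print("plain:", plain_cost(L))
print("exact:", exact, "factor:", exact / plain_cost(L))
print("mean-based:", estimate, "factor:", estimate / plain_cost(L))
```

Because l*(l+1) is convex in l, the mean-based estimate can never exceed the exact cost, whatever the length distribution.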

Some additional calculations suggest that if you assume an exponential
distribution of word lengths (cf. Zipf's Law) that basing your guess on a
mean word length will cause you to underestimate by a factor of something
like 17%.

This is all just a fancy way of demonstrating that having long strings
hurts you more than having short ones helps you, i.e., using a mean value
in place of the sum is inaccurate in general.

> - One would probably create a set of 2 * (maximum number of chars per word)
> as Fields for a document. If this could work was actually my question...
> - Most important: my proposal is exactly (and almost only) designed to solve
> the substring ("*uti*") problem !!! One field in the first group of fields
> in my example contains "utiful" and would be found by "uti*", a field in the
> other group of fields contains "itueb" and would be found by "itu*". Voila!

You are correct; I goofed.
 
> I still think my idea would work (given you spend the space for the index).

I still don't see how you deal with the problem that I mentioned before:

'Another problem with this is that in order to be able to get from "ful"
to "beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a prefix or suffix.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or
"n".'

In any case, I personally would consider the expected overhead of space to
be prohibitive.  However, so long as you address the remaining issue I
just mentioned, yes, I think that your scheme would work.
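
The pointer problem can be made concrete with a toy dictionary. The sketch below is Python and assumes a very simple model: every prefix and suffix of a word is an index entry that points back to the complete word. The word list and the name `build_fragment_index` are made up for illustration. Short fragments such as "e" collect pointers to a large share of the vocabulary.

```python
from collections import defaultdict


def build_fragment_index(words):
    """Map every prefix and suffix of each word to the set of complete words."""
    index = defaultdict(set)
    for word in words:
        for i in range(1, len(word) + 1):
            index[word[:i]].add(word)   # prefix of length i
            index[word[-i:]].add(word)  # suffix of length i
    return index


words = ["beautiful", "useful", "fulfil", "love", "life", "one", "nine"]
index = build_fragment_index(words)

# Total number of (fragment -> word) pointers the index has to store.
pointers = sum(len(targets) for targets in index.values())

print(sorted(index["ful"]))  # words with "ful" as a prefix or suffix
print(sorted(index["e"]))    # words with "e" as a prefix or suffix
print(len(index), "entries,", pointers, "pointers")
```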

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




RE: contains

Posted by Lothar Simon <lo...@eidon.de>.
Just to correct a few points:
- The factor would be 2 * (average no of chars per word)/2 = (average no of
chars per word).
- One would probably create a set of 2 * (maximum number of chars per word)
Fields for a document. Whether this could work was actually my question...
- Most important: my proposal is exactly (and almost only) designed to solve
the substring ("*uti*") problem !!! One field in the first group of fields
in my example contains "utiful" and would be found by "uti*", a field in the
other group of fields contains "itueb" and would be found by "itu*". Voila!

I still think my idea would work (given you spend the space for the index).

Lothar


-----Original Message-----
From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
Sent: Friday, July 12, 2002 6:45 PM
To: Lucene Users List
Subject: RE: contains


On Fri, 12 Jul 2002, Lothar Simon wrote:

[in response to Peter Carlson pointing out that searching for *xyz* is a
difficult problem]
> Of course you are right. And I am surely more the last then the first
> one to try to come up with THE solution for this. But still... Could
> the following work?
>
> If space (ok, a lot) is available you could store "beutiful",
> "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS its
> inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
> "eb", "b") in the index. Space needed would be something like (average
> no of chars per word) as much as in a "normal" index.

Actually it would be twice that, because you're storing backward and
forward versions.  I'd hazard a guess that this factor alone would mean
something like a 10- or 12-fold increase in index size (the average length
of a word is less than 5 or 6 letters, but by throwing out stop words you
throw out a lot of the words that drag the average down).

Another problem with this is that in order to be able to get from "ful" to
"beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a substring.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or "n".

Finally, you're not including all substrings: your scheme doesn't allow me
to search for "*uti*" and find "beautiful".  If you did, the number of
entries would then be multiplied by a factor of the _square_ of the
average number of characters per word.  (You might be able to avoid this
by doing prefix and suffix searches--which are difficult but less so--on
the strings you specify, though.)

There might be some clever way to get around these problems, but I suspect
that developing one would be a dissertation topic.  :)

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.






RE: contains

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.
On Fri, 12 Jul 2002, Lothar Simon wrote:

[in response to Peter Carlson pointing out that searching for *xyz* is a
difficult problem]
> Of course you are right. And I am surely more the last then the first
> one to try to come up with THE solution for this. But still... Could
> the following work?
> 
> If space (ok, a lot) is available you could store "beutiful",
> "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS its
> inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
> "eb", "b") in the index. Space needed would be something like (average
> no of chars per word) as much as in a "normal" index.

Actually it would be twice that, because you're storing backward and
forward versions.  I'd hazard a guess that this factor alone would mean
something like a 10- or 12-fold increase in index size (the average length
of a word is less than 5 or 6 letters, but by throwing out stop words you
throw out a lot of the words that drag the average down).

Another problem with this is that in order to be able to get from "ful" to
"beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a substring.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or "n".

Finally, you're not including all substrings: your scheme doesn't allow me
to search for "*uti*" and find "beautiful".  If you did, the number of
entries would then be multiplied by a factor of the _square_ of the
average number of characters per word.  (You might be able to avoid this
by doing prefix and suffix searches--which are difficult but less so--on
the strings you specify, though.)
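
A rough way to compare the two index sizes discussed here, as a Python sketch with illustrative helper names: a word of length l has l suffixes but up to l(l+1)/2 distinct substrings. The last check shows why prefix searches over the stored suffixes can stand in for storing every substring.

```python
def suffixes(word):
    """One entry per starting position: "beutiful", "eutiful", ..., "l"."""
    return [word[i:] for i in range(len(word))]


def substrings(word):
    """Every distinct contiguous fragment of the word."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
    }


word = "beutiful"
n = len(word)
print(len(suffixes(word)), "suffixes")
print(len(substrings(word)), "distinct substrings, upper bound", n * (n + 1) // 2)

# Every substring is a prefix of some suffix, so a prefix search over the
# suffix entries can answer "*uti*" without storing all substrings.
covered = all(
    any(s.startswith(sub) for s in suffixes(word)) for sub in substrings(word)
)
print(covered)
```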

There might be some clever way to get around these problems, but I suspect
that developing one would be a dissertation topic.  :)

Regards,

Joshua O'Madadhain
 
 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.






RE: contains

Posted by Lothar Simon <lo...@eidon.de>.
Of course you are right. And I am surely more likely to be the last than the
first one to come up with THE solution for this. But still... Could the
following work?

If space (ok, a lot) is available you could store "beutiful", "eutiful",
"utiful", "tiful", "iful", "ful", "ul", "l" PLUS its inversions ("lufitueb",
"ufitueb", "fitueb", "itueb", "tueb", "ueb", "eb", "b") in the index. Space
needed would be something like (average no. of chars per word) times as much
as in a "normal" index.

Say you want to search for "*tif*", you would actually search for "tif*" in
the first group AND for "fit*" in the second group and voila hit "beutiful".
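
The proposal can be sketched in a few lines of Python. This is only a model of the idea for a single word, using plain lists where a real index would use Fields and prefix queries; the function names are invented for illustration.

```python
def forward_group(word):
    """All suffixes of the word: "beutiful", "eutiful", ..., "l"."""
    return [word[i:] for i in range(len(word))]


def reversed_group(word):
    """All suffixes of the reversed word: "lufitueb", "ufitueb", ..., "b"."""
    return forward_group(word[::-1])


def contains(word, fragment):
    """Emulate "*fragment*" with two prefix searches, one per group."""
    forward_hit = any(s.startswith(fragment) for s in forward_group(word))
    reversed_hit = any(
        s.startswith(fragment[::-1]) for s in reversed_group(word)
    )
    return forward_hit and reversed_hit


print(contains("beutiful", "tif"))  # "tif*" in group one, "fit*" in group two
print(contains("beutiful", "xyz"))
```

In this model the forward group alone already decides the answer, since any substring is a prefix of some suffix; the reversed group is kept only to mirror the proposal as written.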

Regards,
Lothar Simon
eidon


-----Original Message-----
From: Peter Carlson [mailto:carlson@bookandhammer.com]
Sent: Thursday, July 11, 2002 6:02 PM
To: Lucene Users List
Subject: Re: contains


Just as a note, there is a big difference between

*xyz
abc*

And

*bcxy*

For the first two, there are techniques that can be used to search much
faster. For the 3rd option, the only way I can think of how to solve it is
brute force.

--Peter


On 7/11/02 8:31 AM, "Pradeep Kumar K" <pr...@robosoftin.com> wrote:

> How can we search for words having  "ful"
>
> Thanks
> Pradeep
>
>
> Ilya Khandamirov wrote:
>
>> Try searching for "beuti*"
>>
>> Regards,
>> Ilya




Re: contains

Posted by Peter Carlson <ca...@bookandhammer.com>.
Just as a note, there is a big difference between

*xyz
abc*

And

*bcxy*

For the first two, there are techniques that can be used to search much
faster. For the third, the only way I can think of to solve it is brute
force.
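
One way to picture the difference, as a Python sketch over a small sorted term list. This illustrates the general techniques, under the assumption that terms are kept in sorted order; it does not describe how Lucene implements these queries, and the function names are invented.

```python
import bisect


def prefix_matches(sorted_terms, prefix):
    """abc*: binary search for the start of the range, then read in order."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def suffix_matches(sorted_reversed_terms, suffix):
    """*xyz: the same prefix search over a sorted list of reversed terms."""
    hits = prefix_matches(sorted_reversed_terms, suffix[::-1])
    return sorted(term[::-1] for term in hits)


def infix_matches(terms, fragment):
    """*bcxy*: brute force, testing every term in the dictionary."""
    return [term for term in terms if fragment in term]


terms = sorted(["beutiful", "god", "is", "life", "love", "useful"])
reversed_terms = sorted(term[::-1] for term in terms)

print(prefix_matches(terms, "beuti"))
print(suffix_matches(reversed_terms, "ful"))
print(infix_matches(terms, "uti"))
```

The first two functions touch only the matching range of the sorted list, while the third has to visit every term.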

--Peter


On 7/11/02 8:31 AM, "Pradeep Kumar K" <pr...@robosoftin.com> wrote:

> How can we search for words having  "ful"
> 
> Thanks
> Pradeep
> 
> 
> Ilya Khandamirov wrote:
> 
>> Try searching for "beuti*"
>> 
>> Regards,
>> Ilya




Re: contains

Posted by Pradeep Kumar K <pr...@robosoftin.com>.
How can we search for words containing "ful"?

Thanks
Pradeep


Ilya Khandamirov wrote:

>Try searching for "beuti*"
>
>Regards,
>Ilya
>
>
>-----Original Message-----
>From: Pradeep Kumar K [mailto:pradeepk@robosoftin.com] 
>Sent: Samstag, 11. Mai 2002 10:48
>To: Lucene Users List
>Subject: Re: contains
>
>
>Joshua
>
>Thanks a lot for the reply. I think I did not explain my question clearly.
>I am not searching only for a single letter; the search string can be a
>letter or a set of letters.
>
>Example: Consider the sentences
>1) "God is love"
>2) "Life is beutiful"
>Currently, using Lucene, we can index these sentences with different
>Analyzers. When I create a query to search the index, as far as I know
>there are different types of queries: TermQuery, BooleanQuery,
>PrefixQuery, etc. But when we supply a query to an IndexSearcher, the
>results it returns show that unless you give the whole word, it finds
>nothing.
>I.e., if I search for "beuti" it doesn't return any results, but if we
>search for "beutiful" it returns 1 hit.
>Is there any way to create a query, using any of the Lucene Query
>objects, that makes the IndexSearcher match a set of letters within a
>word?
>
>Best Wishes
>Pradeep
>
>Joshua O'Madadhain wrote:
>
>>Pradeep:
>>
>>I think what Peter was trying to get at was the question "when is it
>>useful for a search engine user to be able to search for words that
>>contain a particular letter?"
>>
>>For a language like Chinese, it would certainly be useful to be able to
>>search for a single character.  However, the informational content of a
>>single letter in an alphabet-based language (such as English) is so low
>>that I have trouble believing that it would be useful to be able to do
>>this kind of search.
>>
>>That is to say: unless this feature has been presented to you as a
>>requirement, you may want to think about how it might be used in
>>practice before you spend a lot of time implementing it.
>>
>>Regards,
>>
>>Joshua O'Madadhain
>>
>> jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
>>  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
>> It's that moment of dawning comprehension that I live for--Bill Watterson
>>My opinions are too rational and insightful to be those of any organization.
>>
>>On Thu, 11 Jul 2002, Pradeep Kumar K wrote:
>>
>>>Hi Peter
>>>  I want to include an option called "contains" in my search application,
>>>for example: Name "contains" 'p', like that...
>>>Thanks for the reply.
>>>Pradeep
>>>
>>>Peter Carlson wrote:
>>>
>>>>Do you really want to be able to find items by letter? Or do you have some
>>>>other purpose that tokenizing by letter is trying to get around?
>>>>
>>>>If you do want to tokenize by letter, you can create your own tokenizer
>>>>which breaks up items by letter. See the current tokenizers under
>>>>org.apache.lucene.analysis.
>>>>
>>>>--Peter
>>>>
>>>>On 7/10/02 10:26 AM, "Pradeep Kumar K" <pr...@robosoftin.com> wrote:
>>>>
>>>>>Is it possible to search for a word that contains some letters?
>>>>>Example: "God is love"
>>>>>
>>>>>How can I create a query to search for sentences containing "d"?
>>>>>I found that Lucene tokenizes a sentence into words, not into letters.
>>>>>Is this possible using Lucene? Can anybody give a clue for this?
>


RE: contains

Posted by Ilya Khandamirov <ik...@startext.de>.
Try searching for "beuti*"

Regards,
Ilya
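
Why "beuti" finds nothing while "beuti*" does can be shown with a toy inverted index. This is a Python sketch, not Lucene code: `build_index`, `term_search` and `prefix_search` are invented names, and tokens are assumed to be lower-cased, whitespace-separated words. An exact lookup must equal an indexed term, whereas a prefix query expands to every indexed term that starts with the given letters.

```python
from collections import defaultdict


def build_index(docs):
    """Inverted index: lower-cased whitespace token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def term_search(index, term):
    """Exact lookup: the query must equal an indexed term."""
    return sorted(index.get(term, set()))


def prefix_search(index, prefix):
    """Prefix lookup ("beuti*"): union the postings of all matching terms."""
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return sorted(hits)


docs = {1: "God is love", 2: "Life is beutiful"}
index = build_index(docs)

print(term_search(index, "beuti"))     # no indexed term equals "beuti"
print(term_search(index, "beutiful"))  # whole word matches
print(prefix_search(index, "beuti"))   # expands to "beutiful"
```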


-----Original Message-----
From: Pradeep Kumar K [mailto:pradeepk@robosoftin.com] 
Sent: Samstag, 11. Mai 2002 10:48
To: Lucene Users List
Subject: Re: contains


Joshua

Thanks a lot for the reply. I think I did not explain my question clearly.
I am not searching only for a single letter; the search string can be a
letter or a set of letters.

Example: Consider the sentences
1) "God is love"
2) "Life is beutiful"
Currently, using Lucene, we can index these sentences with different
Analyzers. When I create a query to search the index, as far as I know
there are different types of queries: TermQuery, BooleanQuery,
PrefixQuery, etc. But when we supply a query to an IndexSearcher, the
results it returns show that unless you give the whole word, it finds
nothing.
I.e., if I search for "beuti" it doesn't return any results, but if we
search for "beutiful" it returns 1 hit.
Is there any way to create a query, using any of the Lucene Query
objects, that makes the IndexSearcher match a set of letters within a
word?

Best Wishes
Pradeep

Joshua O'Madadhain wrote:

>Pradeep:
>
>I think what Peter was trying to get at was the question "when is it
>useful for a search engine user to be able to search for words that
>contain a particular letter?"
>
>For a language like Chinese, it would certainly be useful to be able to
>search for a single character.  However, the informational content of a
>single letter in an alphabet-based language (such as English) is so low
>that I have trouble believing that it would be useful to be able to do
>this kind of search.
>
>That is to say: unless this feature has been presented to you as a
>requirement, you may want to think about how it might be used in
>practice before you spend a lot of time implementing it.
>
>Regards,
>
>Joshua O'Madadhain
>
> jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
>  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
> It's that moment of dawning comprehension that I live for--Bill Watterson
>My opinions are too rational and insightful to be those of any organization.
>
>On Thu, 11 Jul 2002, Pradeep Kumar K wrote:
>
>>Hi Peter
>>  I want to include an option called "contains" in my search application,
>>for example: Name "contains" 'p', like that...
>>Thanks for the reply.
>>Pradeep
>>
>>Peter Carlson wrote:
>>
>>>Do you really want to be able to find items by letter? Or do you have some
>>>other purpose that tokenizing by letter is trying to get around?
>>>
>>>If you do want to tokenize by letter, you can create your own tokenizer
>>>which breaks up items by letter. See the current tokenizers under
>>>org.apache.lucene.analysis.
>>>
>>>--Peter
>>>
>>>On 7/10/02 10:26 AM, "Pradeep Kumar K" <pr...@robosoftin.com> wrote:
>>>
>>>>Is it possible to search for a word that contains some letters?
>>>>Example: "God is love"
>>>>
>>>>How can I create a query to search for sentences containing "d"?
>>>>I found that Lucene tokenizes a sentence into words, not into letters.
>>>>Is this possible using Lucene? Can anybody give a clue for this?
>