
Posted to java-user@lucene.apache.org by ya...@bloglines.com on 2005/02/04 02:36:11 UTC

Optimize not deleting all files

Hi,

When I run an optimize in our production environment, old index files are
left in the directory and are not deleted.

My understanding is that an
optimize will create new index files and all existing index files should be
deleted.  Is this correct?

We are running Lucene 1.4.2 on Windows.  


Any help is appreciated.  Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Optimize not deleting all files

Posted by Ernesto De Santis <er...@colaborativa.net>.

Hi all

We have the same problem.
We suspect the problem is that Windows locks the files.

Our environment:
Windows 2000
Tomcat 5.5.4

Ernesto.

yahootintin.1247688@bloglines.com wrote:

>Hi,
>
>When I run an optimize in our production environment, old index are
>left in the directory and are not deleted.  
>
>My understanding is that an
>optimize will create new index files and all existing index files should be
>deleted.  Is this correct?
>
>We are running Lucene 1.4.2 on Windows.  
>
>
>Any help is appreciated.  Thanks!
>



Re: Optimize not deleting all files

Posted by 张瑾 <pr...@gmail.com>.

Your understanding is right!

The old files should be deleted, and new files will be built in their place.


On Thu, 03 Feb 2005 17:36:27 -0800 (PST),
yahootintin.1247688@bloglines.com <ya...@bloglines.com>
wrote:
> Hi,
> 
> When I run an optimize in our production environment, old index are
> left in the directory and are not deleted.
> 
> My understanding is that an
> optimize will create new index files and all existing index files should be
> deleted.  Is this correct?
> 
> We are running Lucene 1.4.2 on Windows.
> 
> Any help is appreciated.  Thanks!
> 


-- 
Wishing you happiness every day


Re: Optimize not deleting all files

Posted by Steven Rowe <sa...@syr.edu>.

Hi Patricio,

Is it the case that the "old index files" are not removed from session to
session, or only within the same session?  The discussion below pertains to
the latter case, that is, where the "old index files" are used in the same
process as the files replacing them.

I was having a similar problem, and tracked the source down to IndexReaders
not being closed in my application.  

As far as I can tell, in order for IndexReaders to present a consistent
view of an index while changes are being made to it, read-only copies
of the index are kept around until all IndexReaders using them are
closed.  If any IndexReaders are open on the index, IndexWriters first
make a copy, then operate on the copy.  If you track down all of these
open IndexReaders and close them before optimization, all of the
"old index files" should be deleted.  (Lucene Gurus, please correct this
if I have misrepresented the situation).

In my application, I had a bad interaction between IndexReader caching,
garbage collection, and incremental indexing, in which a new IndexReader
was being opened on an index after each indexing increment, without
closing the already-opened IndexReaders.

On Windows, operating-system level file locking caused by IndexReaders
left open was disallowing index re-creation, because the IndexWriter
wasn't allowed to delete the index files opened by the abandoned
IndexReaders.

In short, if you need to write to an index more than once in a single
session, be sure to keep careful track of your IndexReaders.
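
A minimal sketch of one way to keep track of readers, in plain Java with no Lucene dependency. ReaderHolder and its methods are hypothetical names, java.io.Closeable stands in for an IndexReader, and the sketch assumes no search is still using the old reader at the moment it is swapped out.

```java
import java.io.Closeable;
import java.io.IOException;

/** Holds the one "current" reader and closes the previous one on every swap. */
public class ReaderHolder {
    private Closeable current; // in a real application this would be an IndexReader

    /** Installs a freshly opened reader and closes the one it replaces. */
    public synchronized void swap(Closeable fresh) throws IOException {
        Closeable old = current;
        current = fresh;
        if (old != null) {
            old.close(); // releases the file handles held by the old reader
        }
    }

    /** Closes whatever reader is still open, e.g. before optimizing or at shutdown. */
    public synchronized void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }

    public static void main(String[] args) throws IOException {
        final int[] closed = {0};
        ReaderHolder holder = new ReaderHolder();
        for (int i = 0; i < 3; i++) {
            // Stand-in for opening a new reader after each indexing increment.
            holder.swap(() -> closed[0]++);
        }
        holder.close();
        System.out.println("readers closed: " + closed[0]); // prints "readers closed: 3"
    }
}
```

Routing every newly opened reader through a single holder like this means an abandoned reader cannot pile up after each indexing increment, which is the situation described above.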

Hope it helps,
Steve

Patricio Keilty wrote:
> Hi Otis, tried version 1.4.3 without success, old index files still 
> remain in the directory.
> Also tried not calling optimize(), and still getting the same behaviour, 
> maybe our problem is not related to optimize() call at all.
> 
> --p
> 
> Otis Gospodnetic wrote:
> 
>> Get and try Lucene 1.4.3.  One of the older versions had a bug that was
>> not deleting old index files.
>>
>> Otis
>>
>> --- yahootintin.1247688@bloglines.com wrote:
>>
>>
>>> Hi,
>>>
>>> When I run an optimize in our production environment, old index are
>>> left in the directory and are not deleted. 
>>> My understanding is that an
>>> optimize will create new index files and all existing index files
>>> should be
>>> deleted.  Is this correct?
>>>
>>> We are running Lucene 1.4.2 on Windows. 
>>>
>>> Any help is appreciated.  Thanks!


Re: Optimize not deleting all files

Posted by Patricio Keilty <pa...@infovia.com.ar>.

Hi Otis, I tried version 1.4.3 without success; old index files still 
remain in the directory.
I also tried not calling optimize() and still get the same behaviour, so 
maybe our problem is not related to the optimize() call at all.

--p

Otis Gospodnetic wrote:

>Get and try Lucene 1.4.3.  One of the older versions had a bug that was
>not deleting old index files.
>
>Otis
>
>--- yahootintin.1247688@bloglines.com wrote:
>
>
>>Hi,
>>
>>When I run an optimize in our production environment, old index are
>>left in the directory and are not deleted.  
>>
>>My understanding is that an
>>optimize will create new index files and all existing index files
>>should be
>>deleted.  Is this correct?
>>
>>We are running Lucene 1.4.2 on Windows.  
>>
>>
>>Any help is appreciated.  Thanks!
>>

Re: Starts With x and Ends With x Queries

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.

Erik Hatcher wrote:

>
> On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
>
>> Hi Erik,
>>
>>> I'm not changing any functionality.  WildcardQuery will still 
>>> support leading wildcard characters, QueryParser will still disallow 
>>> them.  All I'm going to change is the javadoc that makes it sound 
>>> like WildcardQuery does not support leading wildcard characters.
>>>
>>>     Erik
>>
>>
>> From what I was reading in the mailing list there are more lucene 
>> users that would like to be able to construct sufix queries.
>> They are very usefull for german language, because it has many long 
>> composite words , created by concatenation of other simple words.
>> This is one of the requirements of our system. Therefore I needed to 
>> patch lucene to make QueryParser to allow SufixQueries.
>>
>> Now I will need to update lucene library to the latest version, and I 
>> need to patch it again.
>> Do you think it will be possible in the future to have a field in 
>> QueryParser,  boolean ALLOW_SUFFIX_QUERIES?
>
>
> I have no objections to that type of switch.  Please submit a patch to 
> QueryParser.jj that implements this as an option with the default to 
> disallow suffix queries, along with a test case and I'd be happy to 
> apply it.

I'm pleased to hear that. I'm not very skilled at writing .jj files, but 
I will try to do it in the next few days.

Sergiu

>
>     Erik
>
>

Re: Starts With x and Ends With x Queries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:

> Hi Erik,
>
>> I'm not changing any functionality.  WildcardQuery will still support 
>> leading wildcard characters, QueryParser will still disallow them.  
>> All I'm going to change is the javadoc that makes it sound like 
>> WildcardQuery does not support leading wildcard characters.
>>
>>     Erik
>
> From what I was reading in the mailing list there are more lucene 
> users that would like to be able to construct sufix queries.
> They are very usefull for german language, because it has many long 
> composite words , created by concatenation of other simple words.
> This is one of the requirements of our system. Therefore I needed to 
> patch lucene to make QueryParser to allow SufixQueries.
>
> Now I will need to update lucene library to the latest version, and I 
> need to patch it again.
> Do you think it will be possible in the future to have a field in 
> QueryParser,  boolean ALLOW_SUFFIX_QUERIES?

I have no objections to that type of switch.  Please submit a patch to 
QueryParser.jj that implements this as an option with the default to 
disallow suffix queries, along with a test case and I'd be happy to 
apply it.

	Erik



Re: Starts With x and Ends With x Queries

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.

 Hi Erik,

> I'm not changing any functionality.  WildcardQuery will still support 
> leading wildcard characters, QueryParser will still disallow them.  
> All I'm going to change is the javadoc that makes it sound like 
> WildcardQuery does not support leading wildcard characters.
>
>     Erik

 From what I have read on the mailing list, there are more Lucene users 
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long 
compound words created by concatenating simpler words.
This is one of the requirements of our system, so I had to 
patch Lucene to make QueryParser allow suffix queries.

Now I need to update the Lucene library to the latest version, and I 
will have to patch it again.
Do you think it will be possible in the future to have a field in 
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

 Thanks for understanding,

  Sergiu


Re: Starts With x and Ends With x Queries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote:

> Hi Erick,
>
>>
>>
>> "In order to prevent extremely slow WildcardQueries, a Wildcard term 
>> must not start with one of the wildcards <code>*</code> or 
>> <code>?</code>."
>>
>> I don't read that as saying you cannot use an initial wildcard 
>> character, but rather as if you use a leading wildcard character you 
>> risk performance issues.  I'm going to change "must" to "should".
>
> Will this change available in the next realease of lucene? How do you 
> plan to implement this? Will this be available as an atributte of  
> QueryParser?

I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  All 
I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

	Erik



Re: Starts With x and Ends With x Queries

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.

Hi Erik,

>
>
> "In order to prevent extremely slow WildcardQueries, a Wildcard term 
> must not start with one of the wildcards <code>*</code> or 
> <code>?</code>."
>
> I don't read that as saying you cannot use an initial wildcard 
> character, but rather as if you use a leading wildcard character you 
> risk performance issues.  I'm going to change "must" to "should". 

Will this change be available in the next release of Lucene? How do you 
plan to implement it? Will it be available as an attribute of 
QueryParser?

  Best,

  Sergiu



Re: RangeQuery With Date

Posted by Luke Shannon <ls...@futurebrand.com>.

Bingo. Thanks!

Luke

----- Original Message ----- 
From: "Chris Hostetter" <ho...@fucit.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, February 07, 2005 5:10 PM
Subject: Re: RangeQuery With Date


> : Your dates need to be stored in lexicographical order for the RangeQuery
> : to work.
> :
> : Index them using this date format: YYYYMMDD.
> :
> : Also, I'm not sure if the QueryParser can handle range queries with only
> : one end point. You may need to create this query programmatically.
>
> and when creating them programmatically, you need to use the exact same
> format they were indexed in.  Assuming I've correctly guessed what your
> indexing code looks like, you probably want...
>
> Query query = new RangeQuery(null, new Term("modified", "20041111"), false);
>
>
>
>
> -Hoss
>
>

Re: RangeQuery With Date

Posted by Chris Hostetter <ho...@fucit.org>.

: Your dates need to be stored in lexicographical order for the RangeQuery
: to work.
:
: Index them using this date format: YYYYMMDD.
:
: Also, I'm not sure if the QueryParser can handle range queries with only
: one end point. You may need to create this query programmatically.

and when creating them programmatically, you need to use the exact same
format they were indexed in.  Assuming I've correctly guessed what your
indexing code looks like, you probably want...

Query query = new RangeQuery(null, new Term("modified", "20041111"), false);




-Hoss



Re: RangeQuery With Date

Posted by Luke Francl <lu...@stellent.com>.

Your dates need to be stored in lexicographical order for the RangeQuery
to work.

Index them using this date format: YYYYMMDD.
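
A small sketch of why the format matters, using only the JDK. The class name DateTerms and the helper toTerm are made up for illustration; the point is that yyyyMMdd strings compare in the same order as the dates they represent, while MM/dd/yy strings do not.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class DateTerms {
    /** Formats a date as yyyyMMdd so that string order equals date order. */
    static String toTerm(Date d) {
        return new SimpleDateFormat("yyyyMMdd", Locale.ROOT).format(d);
    }

    public static void main(String[] args) {
        Date nov2004 = new GregorianCalendar(2004, Calendar.NOVEMBER, 11).getTime();
        Date dec2003 = new GregorianCalendar(2003, Calendar.DECEMBER, 25).getTime();

        // MM/dd/yy strings sort by month first, so the 2003 date looks "later".
        System.out.println("12/25/03".compareTo("11/11/04") < 0); // false
        // yyyyMMdd strings sort chronologically.
        System.out.println(toTerm(dec2003).compareTo(toTerm(nov2004)) < 0); // true
    }
}
```

The same formatter would have to be used both when indexing the field and when building the query term, so that the two sides agree.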

Also, I'm not sure if the QueryParser can handle range queries with only
one end point. You may need to create this query programmatically.

Regards,
Luke Francl

RangeQuery With Date

Posted by Luke Shannon <ls...@futurebrand.com>.

Hi;

I am working on a set of queries that allow you to find modification dates
before, after and equal to a given date.

Here are some of the "before" queries I have been playing with. I want a query
that pulls up documents modified before Nov 11 2004:

Query query = new RangeQuery(null, new Term("modified", "11/11/04"), false);

This one doesn't work. It turns up all the documents in the index.

Query query = QueryParser.parse("modified:[1/1/00 TO 11/11/04]", "subject",
new StandardAnalyzer());

This works but I don't like having to specify the begin date like this.

Query query = QueryParser.parse("modified:[null TO 11/11/04]", "subject",
new StandardAnalyzer());

This throws an exception.

How are others doing a query like this?

Thanks,

Luke




Re: Starts With x and Ends With x Queries

Posted by Chris Hostetter <ho...@fucit.org>.

: book Managing Gigabytes, making "*string*" queries drastically more
: efficient for searching (though also impacting index size).  Take the
: term "cat".  It would be indexed with all rotated variations with an
: end of word marker added:
    ...
: The query for "*at*" would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for "at*" as a
: PrefixQuery.  A wildcard in the middle of a string like "c*t" would
: become a prefix query for "t$c*".

That's a pretty slick trick.

Considering how many Terms the index would wind up containing in order to
denormalize the data in that way, I wonder if it would be more practical
to index each of the characters as a separate term, with the word repeated
after the "end of word" character, making wildcard searches into "phrase"
searches (after doing preprocessing and rotating as you described).

Ie, index "cat" as:   c a t $ c a t
  search for "*at*" as a phrase search for "a t"
  search for "*at"  as a phrase search for "a t $"
  search for "c*t"  as a phrase search for "t $ c"

...I'm fairly certain that would keep the index size much smaller (the
number of terms would be much smaller, while the average term frequency
wouldn't really increase), but I'm not sure if it would actually be any
faster.  It depends on the algorithm/performance of PhraseQuery -- which is
something I haven't really looked into.  It could very well be
significantly slower.
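
To make the character-per-term idea concrete, here is a rough sketch in plain Java. It does not use Lucene or PhraseQuery; a "phrase match" is modelled simply as a contiguous run of tokens, and CharPhrase, indexTokens and phraseMatches are hypothetical names. It says nothing about speed, only about which words the scheme would match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CharPhrase {
    /** "cat" -> [c, a, t, $, c, a, t]: the word, an end marker, then the word again. */
    static List<String> indexTokens(String word) {
        List<String> tokens = new ArrayList<>();
        for (char ch : (word + "$" + word).toCharArray()) {
            tokens.add(String.valueOf(ch));
        }
        return tokens;
    }

    /** A phrase matches when its tokens appear contiguously and in order. */
    static boolean phraseMatches(String word, String... phrase) {
        return Collections.indexOfSubList(indexTokens(word), Arrays.asList(phrase)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("cat", "a", "t"));      // "*at*" -> true
        System.out.println(phraseMatches("cat", "a", "t", "$")); // "*at"  -> true
        System.out.println(phraseMatches("cat", "t", "$", "c")); // "c*t"  -> true
        System.out.println(phraseMatches("cot", "a", "t"));      // "*at*" -> false
    }
}
```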


-Hoss



Re: Starts With x and Ends With x Queries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:
> If you want to start doing suffix queries (ie: all names ending with
> "s", or all names ending with "Smith") one approach would be to use
> WildcardQuery, which as Erik mentioned, will allow you to use a query
> Term that starts with a "*". ie...
>
>    Query q3 = new WildcardQuery(new Term("name","*s"));
>    Query q4 = new WildcardQuery(new Term("name","*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say
> you can't. I'll assume the docs are wrong and Erik is correct.)

I assume you mean this comment on WildcardQuery's javadocs:

"In order to prevent extremely slow WildcardQueries, a Wildcard term 
must not start with one of the wildcards <code>*</code> or 
<code>?</code>."

I don't read that as saying you cannot use an initial wildcard 
character, but rather as if you use a leading wildcard character you 
risk performance issues.  I'm going to change "must" to "should".  And 
yes, WildcardQuery itself supports a leading wildcard character exactly 
as you have shown.

> Which leads me to my point: if you denormalize your data so that you
> store both the Term you want, and the *reverse* of the term you want,
> then a Suffix query is just a Prefix query on a reversed field -- by
> sacrificing space, you can get all the speed efficiencies of a
> PrefixQuery when doing a SuffixQuery...
>
>    D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>    D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>    D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>    D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>    // The Term text passed to PrefixQuery is the prefix itself (no trailing "*").
>    Query q1 = new PrefixQuery(new Term("name","J"));
>    Query q2 = new PrefixQuery(new Term("name","Sue"));
>    Query q3 = new PrefixQuery(new Term("rname","s"));
>    Query q4 = new PrefixQuery(new Term("rname","htimS"));
>
>
> (If anyone sees a flaw in my theory, please chime in)

This trick has been mentioned on this list before, and is a good one.  
I'll go one step further and mention another technique I found in the 
book Managing Gigabytes, making "*string*" queries drastically more 
efficient for searching (though also impacting index size).  Take the 
term "cat".  It would be indexed with all rotated variations with an 
end of word marker added:

	cat$
	at$c
	t$ca
	$cat

The query for "*at*" would be preprocessed and rotated such that the 
wildcards are collapsed at the end to search for "at*" as a 
PrefixQuery.  A wildcard in the middle of a string like "c*t" would 
become a prefix query for "t$c*".

Has anyone tried this technique with Lucene?
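
For illustration only, the rotation scheme can be sketched in plain Java with a sorted map standing in for the term dictionary. Nothing here is Lucene API: RotatedTerms, build, toPrefix and search are hypothetical names, and the sketch assumes terms never contain "$" or "*" and that a pattern has either one "*" or a leading and a trailing "*".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RotatedTerms {
    /** All rotations of term + "$", e.g. cat -> cat$, at$c, t$ca, $cat. */
    static List<String> rotations(String term) {
        String marked = term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    /** Sorted map from every rotation back to the term it came from. */
    static TreeMap<String, String> build(String... terms) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String term : terms) {
            for (String rotation : rotations(term)) {
                index.put(rotation, term);
            }
        }
        return index;
    }

    /** Rotates a wildcard pattern so that the wildcard ends up at the end. */
    static String toPrefix(String pattern) {
        if (pattern.length() > 1 && pattern.startsWith("*") && pattern.endsWith("*")) {
            return pattern.substring(1, pattern.length() - 1); // *at* -> at
        }
        int star = pattern.indexOf('*');
        if (star < 0) {
            return pattern + "$"; // no wildcard: exact term
        }
        // X*Y -> Y$X, which also covers X* (Y empty) and *Y (X empty).
        return pattern.substring(star + 1) + "$" + pattern.substring(0, star);
    }

    /** Prefix scan: seek to the prefix, stop at the first key that no longer matches. */
    static Set<String> search(TreeMap<String, String> index, String pattern) {
        String prefix = toPrefix(pattern);
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> e : index.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.add(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = build("cat", "cot", "cart", "tack", "atlas");
        System.out.println(search(index, "*at*")); // [atlas, cat]
        System.out.println(search(index, "c*t"));  // [cart, cat, cot]
    }
}
```

The cost is visible in build: a term of n characters contributes n + 1 entries, which is the index-size impact mentioned above.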

	Erik


Re: Starts With x and Ends With x Queries

Posted by Luke Shannon <ls...@futurebrand.com>.

I implemented this concept for my ends with query. It works very well!

----- Original Message ----- 
From: "Chris Hostetter" <ho...@fucit.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, February 04, 2005 9:37 PM
Subject: Re: Starts With x and Ends With x Queries


>
> : Also keep in mind that QueryParser only allows a trailing asterisk,
> : creating a PrefixQuery.  However, if you use a WildcardQuery directly,
> : you can use an asterisk as the starting character (at the risk of
> : performance).
>
> On the issue of "ends with" wildcard queries, I wanted to throw out and
> idea that i've seen used to deal with matches like this in other systems.
> I've never acctually tried this with Lucene, but I've seen it used
> effectively with other systems where the goal is to "sort" strings by the
> least significant (ie: right most) characters first.  I think it could
> apply nicely to people who have compelling needs for efficent 'ends with'
> queries.
>
>
>
> Imagine you have a field call name, which you can already do efficient
> prefix matching on using the PrefixQuery class.  Your docs and query may
> look something like this...
>
>    D1> name:"Adam Smith" age:13 state:CA ...
>    D2> name:"Joe Bob" age:42 state:WA ...
>    D3> name:"John Adams" age:35 state:NV ...
>    D3> name:"Sue Smith" age:33 state:CA ...
>
> ...and your queries may look something like...
>
>    Query q1 = new PrefixQuery(new Term("name","J*"));
>    Query q2 = new PrefixQuery(new Term("name","Sue*"));
>
> If you want to start doing suffix queries (ie: all names ending with
> "s", or all names ending with "Smith") one approach would be to use
> WildcarQuery, which as Erik mentioned, will allow you to use a quey Term
> that starts with a "*". ie...
>
>    Query q3 = new WildcardQuery(new Term("name","*s"));
>    Query q4 = new WildcardQuery(new Term("name","*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say you
> can't I'll assume the docs are wrong and Erik is correct.)
>
> The problem is that this is horrendously inefficient.  In order to find
> the docs that contain Terms which match your suffix, WildcardQuery must
> first identify what all of those Terms are, by iterating over every Term
> in your index to see if they match the suffix.  This is much slower then a
> PrefixQuery, or even a WildcardQuery that has just 1 initial character
> before a "*" (ie: "s*foobar"), because it can then seek to directly to the
> first Term that starts with that character, and also stop iterating as
> soon as it encounters a Term that no longer begins with that character.
>
> Which leads me to my point: if you denormalize your data so that you store
> both the Term you want, and the *reverse* of the term you want, then a
> Suffix query is just a Prefix query on a reversed field -- by sacrificing
> space, you can get all the speed efficiencies of a PrefixQuery when doing
> a SuffixQuery...
>
>    D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>    D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>    D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>    D3> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>    Query q1 = new PrefixQuery(new Term("name","J*"));
>    Query q2 = new PrefixQuery(new Term("name","Sue*"));
>    Query q3 = new PrefixQuery(new Term("rname","s*"));
>    Query q4 = new PrefixQuery(new Term("rname","htimS*"));
>
>
> (If anyone sees a flaw in my theory, please chime in)
>
>
> -Hoss
>
>

Re: Starts With x and Ends With x Queries

Posted by Chris Hostetter <ho...@fucit.org>.

: Also keep in mind that QueryParser only allows a trailing asterisk,
: creating a PrefixQuery.  However, if you use a WildcardQuery directly,
: you can use an asterisk as the starting character (at the risk of
: performance).

On the issue of "ends with" wildcard queries, I wanted to throw out an
idea that I've seen used to deal with matches like this in other systems.
I've never actually tried this with Lucene, but I've seen it used
effectively with other systems where the goal is to "sort" strings by the
least significant (ie: right most) characters first.  I think it could
apply nicely to people who have compelling needs for efficient 'ends with'
queries.



Imagine you have a field called name, which you can already do efficient
prefix matching on using the PrefixQuery class.  Your docs and query may
look something like this...

   D1> name:"Adam Smith" age:13 state:CA ...
   D2> name:"Joe Bob" age:42 state:WA ...
   D3> name:"John Adams" age:35 state:NV ...
   D4> name:"Sue Smith" age:33 state:CA ...

...and your queries may look something like...

   // The Term text passed to PrefixQuery is the prefix itself (no trailing "*").
   Query q1 = new PrefixQuery(new Term("name","J"));
   Query q2 = new PrefixQuery(new Term("name","Sue"));

If you want to start doing suffix queries (ie: all names ending with
"s", or all names ending with "Smith") one approach would be to use
WildcardQuery, which as Erik mentioned, will allow you to use a query Term
that starts with a "*". ie...

   Query q3 = new WildcardQuery(new Term("name","*s"));
   Query q4 = new WildcardQuery(new Term("name","*Smith"));

(NOTE: Erik says you can do this, but the docs for WildcardQuery say you
can't. I'll assume the docs are wrong and Erik is correct.)

The problem is that this is horrendously inefficient.  In order to find
the docs that contain Terms which match your suffix, WildcardQuery must
first identify what all of those Terms are, by iterating over every Term
in your index to see if they match the suffix.  This is much slower than a
PrefixQuery, or even a WildcardQuery that has just 1 initial character
before a "*" (ie: "s*foobar"), because those can seek directly to the
first Term that starts with that character, and also stop iterating as
soon as they encounter a Term that no longer begins with that character.

Which leads me to my point: if you denormalize your data so that you store
both the Term you want, and the *reverse* of the term you want, then a
Suffix query is just a Prefix query on a reversed field -- by sacrificing
space, you can get all the speed efficiencies of a PrefixQuery when doing
a SuffixQuery...

   D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
   D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
   D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
   D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...

   // The Term text passed to PrefixQuery is the prefix itself (no trailing "*").
   Query q1 = new PrefixQuery(new Term("name","J"));
   Query q2 = new PrefixQuery(new Term("name","Sue"));
   Query q3 = new PrefixQuery(new Term("rname","s"));
   Query q4 = new PrefixQuery(new Term("rname","htimS"));


(If anyone sees a flaw in my theory, please chime in)
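
As a rough illustration of the reversed-field idea, here is a plain-Java sketch in which a sorted map plays the role of the rname term dictionary. ReversedField, buildReversed and endsWith are hypothetical names, the values are treated as single untokenized terms, and none of this is Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReversedField {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Sorted map from each reversed value ("rname") back to the original name. */
    static TreeMap<String, String> buildReversed(String... names) {
        TreeMap<String, String> rindex = new TreeMap<>();
        for (String name : names) {
            rindex.put(reverse(name), name);
        }
        return rindex;
    }

    /** A suffix search becomes a prefix scan over the reversed keys. */
    static List<String> endsWith(TreeMap<String, String> rindex, String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        // Seek straight to the first candidate, stop once keys leave the prefix range.
        for (Map.Entry<String, String> e : rindex.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.add(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> rindex =
            buildReversed("Adam Smith", "Joe Bob", "John Adams", "Sue Smith");
        System.out.println(endsWith(rindex, "Smith")); // [Sue Smith, Adam Smith]
        System.out.println(endsWith(rindex, "s"));     // [John Adams]
    }
}
```

Hits come back in the order of the reversed keys, which is why "Sue Smith" ("htimS euS") precedes "Adam Smith" ("htimS madA").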


-Hoss



Re: Starts With x and Ends With x Queries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

It matches both because you're tokenizing the name field.  In both 
documents, the name field has a "testing" term in it (it gets 
lowercased also).  A PrefixQuery matches terms that start with the 
prefix.  Use an untokenized field type (Field.Keyword) if you want to 
keep the entire original string as-is for searching purposes - however 
you'd have issues with case-sensitivity in your example.

Also keep in mind that QueryParser only allows a trailing asterisk, 
creating a PrefixQuery.  However, if you use a WildcardQuery directly, 
you can use an asterisk as the starting character (at the risk of 
performance).
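
The difference between a tokenized and an untokenized field can be mimicked in a few lines of plain Java. This is only a sketch: tokenize is a crude stand-in for an analyzer (lowercase, split on whitespace), and PrefixOnTokens and its methods are hypothetical names rather than Lucene classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PrefixOnTokens {
    /** Crude stand-in for an analyzer: lowercase, then split on whitespace. */
    static List<String> tokenize(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Tokenized field: the prefix is tested against each individual term. */
    static boolean matchesTokenized(String value, String prefix) {
        for (String term : tokenize(value)) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Untokenized field: the whole original string is the only term. */
    static boolean matchesUntokenized(String value, String prefix) {
        return value.startsWith(prefix);
    }

    public static void main(String[] args) {
        // Tokenized: both documents contain a term starting with "testing".
        System.out.println(matchesTokenized("FutureBrand Testing", "testing"));   // true
        System.out.println(matchesTokenized("testing stuff", "testing"));         // true
        // Untokenized: only the value that really starts with "testing" matches,
        // and the comparison is case-sensitive.
        System.out.println(matchesUntokenized("FutureBrand Testing", "testing")); // false
        System.out.println(matchesUntokenized("testing stuff", "testing"));       // true
        System.out.println(matchesUntokenized("Testing stuff", "testing"));       // false
    }
}
```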

	Erik


On Feb 4, 2005, at 7:50 PM, Luke Shannon wrote:

> Hello;
>
> I have these two documents:
>
> Text<sort:9>
> Keyword<modified:0e1as4og8>
> Text<progress_ref:1099927045180>
> Text<name:FutureBrand Testing>
> Text<desc:Demo>
> Text<anouncement:We are testing our project>
> Text<category:Category 1>
> Text<olfaithfull:stillhere>
> Text<poster:hello>
> Text<urgent:yes>
> Text<provider:Mo>
>
>
> Text<sort:1>
> Text<Author:cbalom>
> Text<Creator:PScript5.dll Version 5.2.2>
> Keyword<modified:0e1bgsfk0>
> Keyword<modified:0e1bgsfk0>
> Text<Producer:Acrobat Distiller 5.0.5 (Windows)>
> Text<progress_ref:1099957931806>
> Text<name:testing stuff>
> Text<desc:testing>
> Text<category:Category 1>
> Text<olfaithfull:stillhere>
> Text<poster:hello>
> Text<Title:Microsoft Word - FINAL-FutureBrand Creates, Launches 'Air 
> Canada'
> Brand Ide.>
> Text<provider:Ray>
> Text<kcfileupload:aircanada3.pdf>
>
> I would like to be able to match a name fields that starts with testing
> (specifically) and those that end with it.
>
> I thought the below code would parse to a Prefix Query that would 
> satisfy my
> starting requirment (maybe I don't understand what this query is for). 
> But
> this matches both.
>
> Query query = QueryParser.parse("testing*", "name", new 
> StandardAnalyzer());
>
> Has anyone done this before? Any tips?
>
> Thanks,
>
> Luke
>
>
>

Re: Starts With x and Ends With x Queries

Posted by Peter Pimley <pp...@semantico.com>.

I sent this to the wrong address.  Sorry.

Peter Pimley wrote:

>
>
> Well done.




Re: Starts With x and Ends With x Queries

Posted by Peter Pimley <pp...@semantico.com>.


Well done.

I was so annoyed with the humiliation-for-kicks this afternoon that I 
just practised my self-destruction techniques with some friends this 
evening ;)

As for configuration, java.lang.System.getenv will give you access to an 
environment variable.

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html
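
A minimal usage sketch, assuming a Java 5 or later runtime (as far as I know, the 1.4 implementation of getenv was deprecated and threw an Error). EnvConfig, envOrDefault and the variable name INDEX_DIR are made up for illustration.

```java
public class EnvConfig {
    /** Returns the environment variable's value, or a fallback when it is not set. */
    static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name); // null when the variable is not defined
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        // INDEX_DIR is a hypothetical variable name; the fallback is just a string.
        System.out.println(envOrDefault("INDEX_DIR", "/tmp/index"));
    }
}
```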


Starts With x and Ends With x Queries

Posted by Luke Shannon <ls...@futurebrand.com>.

Hello;

I have these two documents:

Text<sort:9>
Keyword<modified:0e1as4og8>
Text<progress_ref:1099927045180>
Text<name:FutureBrand Testing>
Text<desc:Demo>
Text<anouncement:We are testing our project>
Text<category:Category 1>
Text<olfaithfull:stillhere>
Text<poster:hello>
Text<urgent:yes>
Text<provider:Mo>


Text<sort:1>
Text<Author:cbalom>
Text<Creator:PScript5.dll Version 5.2.2>
Keyword<modified:0e1bgsfk0>
Keyword<modified:0e1bgsfk0>
Text<Producer:Acrobat Distiller 5.0.5 (Windows)>
Text<progress_ref:1099957931806>
Text<name:testing stuff>
Text<desc:testing>
Text<category:Category 1>
Text<olfaithfull:stillhere>
Text<poster:hello>
Text<Title:Microsoft Word - FINAL-FutureBrand Creates, Launches 'Air Canada' Brand Ide.>
Text<provider:Ray>
Text<kcfileupload:aircanada3.pdf>

I would like to be able to match name fields that start with testing
(specifically) and those that end with it.

I thought the code below would parse to a PrefixQuery that would satisfy my
"starts with" requirement (maybe I don't understand what this query is for). But
it matches both documents.

Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer());

Has anyone done this before? Any tips?

Thanks,

Luke




Re: Optimize not deleting all files

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Get and try Lucene 1.4.3.  One of the older versions had a bug that was
not deleting old index files.

Otis

--- yahootintin.1247688@bloglines.com wrote:

> Hi,
> 
> When I run an optimize in our production environment, old index are
> left in the directory and are not deleted.  
> 
> My understanding is that an
> optimize will create new index files and all existing index files
> should be
> deleted.  Is this correct?
> 
> We are running Lucene 1.4.2 on Windows.  
> 
> 
> Any help is appreciated.  Thanks!
> 