Posted to java-user@lucene.apache.org by Kalani Ruwanpathirana <ka...@gmail.com> on 2008/08/04 12:05:48 UTC

escaping special characters

Hi,

I followed this procedure to escape special characters:

String escapedKeywords = QueryParser.escape(keywords);
Query query = new QueryParser("content",
    new StandardAnalyzer()).parse(escapedKeywords);

This works with most of the special characters, like * and ~, but not with \ . I
can't do a search for a keyword like "ho\w" and get results.
Am I doing anything wrong here?


Thanks,
Kalani
-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa

Re: Clarification on deletion process...

Posted by Michael McCandless <lu...@mikemccandless.com>.
Some more details below...

<Ar...@equifax.com> wrote:
> The documentation for delete operation seems to be confusing (i am going
> thru the book and also posted in the books forums...), so appreciate if
> someone can let me know if my below understanding is correct.
>
> When i delete a document from the index
>
> 1) It is marked for deletion in the BUFFER until I commit/close the
> writer. Does that mean the document is still visible for the Searcher?

Right, IndexWriter simply records the fact that you want to delete all
docs matching query X or term Y, in RAM.

> 2) Once i commit/close the writer then IT IS JUST MARKED for delete in the
> Index. At this time the document is NOT visible for the Searcher, but the
> document is still taking up the space in the index.

Yes, every so often (or, when you explicitly commit or close)
IndexWriter will translate the buffered delete requests into _X_N.del
files, which record exactly which docIDs are now deleted.  If you
reopen a searcher after this point the documents won't be seen.

> 3) Once the index is merged (optimized), it is removed from the index

As Hoss said, ordinary merges also reclaim the space consumed by
deleted docs.  You can also call expungeDeletes, which forces any
segments containing deletions to be merged.

Note that with ConcurrentMergeScheduler, ordinary merges are kicked
off and complete in background threads.
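The three stages above can be modeled in a few lines of plain Java. This is an illustrative simulation only, with no Lucene dependency; the class and method names are made up for the sketch and are not Lucene's API:

```java
import java.util.*;

// Illustrative model of the delete lifecycle (not Lucene's implementation).
class SegmentModel {
    final List<String> docs = new ArrayList<>();         // stored docs
    final Set<Integer> deleted = new HashSet<>();        // plays the role of a _X_N.del file
    final Set<String> bufferedDeletes = new HashSet<>(); // delete terms held in RAM

    void add(String doc) { docs.add(doc); }

    // Stage 1: the writer just records the delete term in RAM.
    void deleteDocuments(String term) { bufferedDeletes.add(term); }

    // Stage 2: commit resolves buffered deletes to docIDs and marks them.
    void commit() {
        for (int i = 0; i < docs.size(); i++)
            if (bufferedDeletes.contains(docs.get(i))) deleted.add(i);
        bufferedDeletes.clear();
    }

    // A reopened searcher skips docs that are marked deleted.
    boolean search(String term) {
        for (int i = 0; i < docs.size(); i++)
            if (docs.get(i).equals(term) && !deleted.contains(i)) return true;
        return false;
    }

    int sizeOnDisk() { return docs.size(); } // marked docs still take space

    // Stage 3: a merge rewrites the segment without the deleted docs.
    void merge() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++)
            if (!deleted.contains(i)) live.add(docs.get(i));
        docs.clear(); docs.addAll(live); deleted.clear();
    }
}
```

Before commit() the doc is still found by search(); after commit() it is hidden but still counted by sizeOnDisk(); only merge() actually reclaims the space.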

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Clarification on deletion process...

Posted by Chris Hostetter <ho...@fucit.org>.
: When i delete a document from the index
	...

The answer to all of your questions is yes; note, however, that documents 
marked for deletion are also "removed" from segments whenever the segments 
are merged, which can happen on any add.

PS...

: In-Reply-To: <48...@gmail.com>
: Subject: Clarification on deletion process...

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss




Clarification on deletion process...

Posted by Ar...@equifax.com.
The documentation for the delete operation seems confusing (I am going 
through the book and have also posted in the book's forums), so I would 
appreciate it if someone could let me know whether my understanding below 
is correct.

When I delete a document from the index:

1) It is marked for deletion in the BUFFER until I commit/close the 
writer. Does that mean the document is still visible to the Searcher?

2) Once I commit/close the writer, it is JUST MARKED as deleted in the 
index. At this point the document is NOT visible to the Searcher, but it 
is still taking up space in the index.

3) Once the index is merged (optimized), it is removed from the index.

Regards, 
Aravind R Yarram 
This message contains information from Equifax Inc. which may be confidential and privileged.  If you are not an intended recipient, please refrain from any disclosure, copying, distribution or use of this information and note that such actions are prohibited.  If you have received this transmission in error, please notify by e-mail postmaster@equifax.com.

Re: Field sizes: maxFieldLength

Posted by Chris Hostetter <ho...@fucit.org>.
: In-Reply-To: <48...@gmail.com>
: Subject: Field sizes: maxFieldLength

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss




Re: Field sizes: maxFieldLength

Posted by Mark Miller <ma...@gmail.com>.
No. It is just a simple limit on the number of terms taken from a document,
used on the assumption that text beyond that point stops adding value,
starts adding noise, or runs into diminishing returns. No optimizations
are based on it.

On Mon, Aug 11, 2008 at 5:20 PM, <Ar...@equifax.com> wrote:

> tx for the response but i think i didnt make my question clear...
>
> If i am indexing a filed that can at the max contain 1000 fileds, does it
> help in improving performance if i let Lucene know IN ADVANCE about 1000?
>
> Mark Miller <ma...@gmail.com> wrote on 08/11/2008 05:13 PM to
> java-user@lucene.apache.org (Subject: Re: Field sizes: maxFieldLength):
>
> Aravind.Yarram@equifax.com wrote:
> > Hi all -
> >
> > I know in advance that each of the fileds i index doesnt go more than
> > 1000, Can i gain any performance improvement while writing the index by
> > limiting the maxFieldLength to 200?
> >
> > tx
> > Regards,
> > Aravind R Yarram
> >
> >
> Its 10000. Sure, if you have a lot of docs between 200 and 10000,
> indexing less will be faster. But you will only be able to search on
> those first 200 tokens for any doc longer.
>

Re: Field sizes: maxFieldLength

Posted by Mark Miller <ma...@gmail.com>.
The gist is: it doesn't help. maxFieldLength simply cuts long documents off 
at the knees, on the assumption that the text is long enough already and 
that more won't add much value (and may add noise). It is not used for any 
sort of optimization; it is strictly "just use the first n tokens from a 
document".

Aravind.Yarram@equifax.com wrote:
> tx for the response but i think i didnt make my question clear...
>
> If i am indexing a filed that can at the max contain 1000 fileds, does it 
> help in improving performance if i let Lucene know IN ADVANCE about 1000?
>
> Mark Miller <ma...@gmail.com> wrote on 08/11/2008 05:13 PM to
> java-user@lucene.apache.org (Subject: Re: Field sizes: maxFieldLength):
>
> Aravind.Yarram@equifax.com wrote:
>   
>> Hi all -
>>
>> I know in advance that each of the fileds i index doesnt go more than 
>> 1000, Can i gain any performance improvement while writing the index by 
>> limiting the maxFieldLength to 200? 
>>
>> tx
>> Regards, 
>> Aravind R Yarram 
> Its 10000. Sure, if you have a lot of docs between 200 and 10000, 
> indexing less will be faster. But you will only be able to search on 
> those first 200 tokens for any doc longer.
>
>
>   




Re: Field sizes: maxFieldLength

Posted by Ar...@equifax.com.
Thanks for the response, but I think I didn't make my question clear.

If I am indexing a field that at most contains 1000 terms, does it help 
performance if I let Lucene know about the 1000 limit IN ADVANCE?

Mark Miller <ma...@gmail.com> wrote on 08/11/2008 05:13 PM to
java-user@lucene.apache.org (Subject: Re: Field sizes: maxFieldLength):

Aravind.Yarram@equifax.com wrote:
> Hi all -
>
> I know in advance that each of the fileds i index doesnt go more than 
> 1000, Can i gain any performance improvement while writing the index by 
> limiting the maxFieldLength to 200? 
>
> tx
> Regards, 
> Aravind R Yarram 
>
> 
The default is 10000. Sure, if you have a lot of docs with between 200 and 
10000 tokens, indexing less will be faster. But you will only be able to 
search on the first 200 tokens of any longer doc.





Re: Field sizes: maxFieldLength

Posted by Mark Miller <ma...@gmail.com>.
Aravind.Yarram@equifax.com wrote:
> Hi all -
>
> I know in advance that each of the fileds i index doesnt go more than 
> 1000, Can i gain any performance improvement while writing the index by 
> limiting the maxFieldLength to 200? 
>
> tx
> Regards, 
> Aravind R Yarram 
>
>   
The default is 10000. Sure, if you have a lot of docs with between 200 and 
10000 tokens, indexing less will be faster. But you will only be able to 
search on the first 200 tokens of any longer doc.
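In other words, the limit amounts to plain truncation. A sketch of the effect (illustrative, not Lucene's implementation; the class name is made up):

```java
import java.util.Arrays;
import java.util.List;

// What a maxFieldLength-style limit amounts to: only the first n tokens of
// a field are indexed; everything past the cutoff is invisible to search.
class FieldTruncator {
    static List<String> truncate(List<String> tokens, int maxFieldLength) {
        return tokens.subList(0, Math.min(tokens.size(), maxFieldLength));
    }
}
```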



Field sizes: maxFieldLength

Posted by Ar...@equifax.com.
Hi all -

I know in advance that none of the fields I index goes beyond 1000 terms. 
Can I gain any performance improvement while writing the index by limiting 
maxFieldLength to 200?

tx
Regards, 
Aravind R Yarram 

Re: escaping special characters

Posted by Mark Miller <ma...@gmail.com>.
Steven A Rowe wrote:
> On 08/11/2008 at 2:14 PM, Chris Hostetter wrote:
>   
>> Aravind R Yarram wrote:
>>     
>>> can i escape built in lucene keywords like OR, AND aswell?
>>>       
>> as of the last time i checked: no, they're baked into the grammar.
>>     
>
> I have not tested this, but I've read somewhere on this list that enclosing OR and AND in double quotes effectively escapes them.
>   
Yeah, this works: it short-circuits treating the token as an operator by 
triggering a quoted match instead, which eventually just pops out the 
single term in the quotes.

But also, have you tried escaping it with a simple backslash? That seems to 
work for me in a simple test.
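If you want to apply the quoting trick programmatically, a small pre-processing helper could look like this (OperatorQuoter is a hypothetical name, not a Lucene class; the sketch assumes single-space-separated input):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: wrap the query operators AND/OR/NOT in double
// quotes so the query parser treats them as plain terms (a one-term
// quoted match) instead of operators.
class OperatorQuoter {
    private static final Set<String> OPS = new HashSet<>(Arrays.asList("AND", "OR", "NOT"));

    static String quoteOperators(String query) {
        StringBuilder out = new StringBuilder();
        for (String tok : query.split(" ")) {
            if (out.length() > 0) out.append(' ');
            out.append(OPS.contains(tok) ? "\"" + tok + "\"" : tok);
        }
        return out.toString();
    }
}
```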



Re: escaping special characters

Posted by Matthew Hall <mh...@informatics.jax.org>.
You can simply change your input string to lowercase before passing it 
to the analyzers, which gives you the effect of escaping the boolean 
operators (i.e., you will now search on "and", "or" and "not"). Remember, 
however, that these are extremely common words, and chances are high that 
your analyzer is removing them via its stop-word list. This also assumes 
you are using an analyzer that does lowercasing as part of its normal 
processing, which many do.
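A minimal sketch of that pre-step (the class name is made up; it is just String.toLowerCase applied before the query reaches the parser):

```java
import java.util.Locale;

// Lowercasing the raw query strips OR/AND/NOT of their operator meaning,
// since the query parser only recognizes the upper-case forms. Whether
// the resulting "and"/"or"/"not" terms survive indexing then depends on
// the analyzer's stop-word list, as noted above.
class QueryLowercaser {
    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT);
    }
}
```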

Matt

Steven A Rowe wrote:
> On 08/11/2008 at 2:14 PM, Chris Hostetter wrote:
>   
>> Aravind R Yarram wrote:
>>     
>>> can i escape built in lucene keywords like OR, AND aswell?
>>>       
>> as of the last time i checked: no, they're baked into the grammar.
>>     
>
> I have not tested this, but I've read somewhere on this list that enclosing OR and AND in double quotes effectively escapes them.
>
>   
>> (that may have changed when it switched from a JavaCC to a JFlex grammar
>> though, so i'm not 100% positive)
>>     
>
> Although the StandardTokenizer was switched about a year ago from a JavaCC to a JFlex grammar, QueryParser's grammar remains in the JavaCC camp.
>
> Steve
>
>
>
>   

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012





RE: escaping special characters

Posted by Steven A Rowe <sa...@syr.edu>.
On 08/11/2008 at 2:14 PM, Chris Hostetter wrote:
> Aravind R Yarram wrote:
> > can i escape built in lucene keywords like OR, AND aswell?
> 
> as of the last time i checked: no, they're baked into the grammar.

I have not tested this, but I've read somewhere on this list that enclosing OR and AND in double quotes effectively escapes them.

> (that may have changed when it switched from a JavaCC to a JFlex grammar
> though, so i'm not 100% positive)

Although the StandardTokenizer was switched about a year ago from a JavaCC to a JFlex grammar, QueryParser's grammar remains in the JavaCC camp.

Steve



Re: escaping special characters

Posted by Chris Hostetter <ho...@fucit.org>.
: can i escape built-in lucene keywords like OR and AND as well?

as of the last time i checked: no, they're baked into the grammar.

(that may have changed when it switched from a JavaCC to a JFlex grammar 
though, so i'm not 100% positive)


-Hoss




Re: escaping special characters

Posted by Ar...@equifax.com.
can i escape built-in lucene keywords like OR and AND as well?

Regards, 
Aravind R Yarram
 

Chris Hostetter <ho...@fucit.org> wrote on 08/06/2008 07:05 PM to
java-user@lucene.apache.org (Subject: Re: escaping special characters):


: String escapedKeywords = QueryParser.escape(keywords);
: Query query = new QueryParser("content", new
: StandardAnalyzer()).parse(escapedKeywords);
: 
: this works with most of the special characters like * and ~ except \ . I
: can't do a search for a keyword like "ho\w" and get results.
: Am I doing anything wrong here?

QueryParser.escape will in fact escape a backslash, but keep in mind 
StandardAnalyzer splits on backslash so that may be what's confusing you.


-Hoss






Re: escaping special characters

Posted by Chris Hostetter <ho...@fucit.org>.
: String escapedKeywords = QueryParser.escape(keywords);
: Query query = new QueryParser("content", new
: StandardAnalyzer()).parse(escapedKeywords);
: 
: this works with most of the special characters like * and ~ except \ . I
: can't do a search for a keyword like "ho\w" and get results.
: Am I doing anything wrong here?

QueryParser.escape will in fact escape a backslash, but keep in mind 
StandardAnalyzer splits on backslash so that may be what's confusing you.
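For illustration, the escaping rule amounts to prefixing each query-syntax character with a backslash. Here is a standalone re-implementation sketch (use QueryParser.escape itself in real code; the exact character set may vary by Lucene version):

```java
// Standalone sketch of the escaping rule (not Lucene's own code): put a
// backslash before every character the query parser treats as syntax.
class QueryEscaper {
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }
}
```

So "ho\w" does get escaped, to "ho\\w"; the search still fails because StandardAnalyzer splits on the backslash at analysis time, so no single token ho\w ever reaches the index.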


-Hoss

