You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Wei Ho <we...@princeton.edu> on 2010/04/29 21:51:05 UTC

Lucene QueryParser and Analyzer

Hello,

I'm using Lucene to index and search through a collection of Chinese 
documents. However, I'm noticing an odd behavior in query parsing/searching.

Given the two queries below:

(Ci refers to Chinese character i)
Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 returns absolutely nothing, while Input2 (replacing the commas 
with spaces) works as expected. I'm a bit confused why this would be 
happening - it seems that QueryParser uses the Analyzer passed to it to 
tokenize the input query string, so if the Analyzer ignores the 
punctuations, it seems that Input1 and Input2 should return identical 
results. Is there some pre-Analyzer filtering or whatever that 
QueryParser does? I've tried this with the StandardAnalyzer, 
SmartChineseAnalyzer, and an analyzer that I implemented which 
explicitly skips over punctuations and whitespaces in tokenizing the 
query string, but to no avail.

-------sample code-------------
Analyzer analyzer = new LingPipeAnalyzer();
Searcher searcher = new IndexSearcher(directory);
QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30, 
SEARCH_FIELDS, analyzer);
Query query = qParser.parse(queryLine[1]);
ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
-----------------------------------

I'm probably just doing something dumb, but any help would be greatly 
appreciated!

Thanks,
Wei Ho

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene QueryParser and Analyzer

Posted by Robert Muir <rc...@gmail.com>.
FYI: I opened a jira issue for this bug here:
https://issues.apache.org/jira/browse/LUCENE-2458

On Thu, Apr 29, 2010 at 7:01 PM, Wei Ho <we...@princeton.edu> wrote:
> I think I've figured out what the problem is. Given the inputs,
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 gets parsed as
> Query1: (text: "C1C2  C3C4  C5C6  C7  C8C9C10")
> whereas Input2 gets parsed as
> Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text:
> "C8C9C10")
>
> That is, Lucene constructs the query and then pass the query text through
> the analyzer. Is there any way to
> force QueryParser to pass the input string through the analyzer before
> creating the query? That is, force Lucene
> to create Query2 for both Input1 and Input2.
>
> Thanks,
> Wei
>
>
> -------- Original Message  --------
> Subject: Re: Lucene QueryParser and Analyzer
> From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
> To: java-user@lucene.apache.org
> Date: 4/29/2010 4:54 PM
>>
>> -------sample code-------------
>>
>>>>
>>>> Analyzer analyzer = new LingPipeAnalyzer();
>>>> Searcher searcher = new IndexSearcher(directory);
>>>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>>>> SEARCH_FIELDS, analyzer);
>>>> Query query = qParser.parse(queryLine[1]);
>>>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
>>>>
>>
>> qParser will use the analyzer LingPipeAnalyzer() before forming the
>> query.
>>
>>
>> Sincerely,
>> Sithu D Sudarsan
>>
>>
>> -----Original Message-----
>> From: Wei Ho [mailto:weiho@princeton.edu]
>> Sent: Thursday, April 29, 2010 4:44 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene QueryParser and Analyzer
>>
>> Sorry, I guess "discarding the punctuation" was a bit misleading.
>> I meant that given the two input strings,
>>
>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>
>> The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2",
>> "C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the
>> punctuation in the tokenization. I'm assuming that QueryParser is simply
>>
>> passing the entire input string to the analyzer and taking the tokens,
>> in which case Input1 and Input2 should be considered identifcal. Does
>> QueryParser doing any sort of pre-processing or filtering beforehand? If
>>
>> so, how can I turn it off?
>>
>> Aside from stopping tokens at punctuations, my analyzer is also doing
>> Chinese word segmentation, so I'd like to be sure that QueryParser is
>> using the analyzer the way I expect it to.
>>
>> Thanks,
>> Wei
>>
>>
>>
>> -------- Original Message  --------
>> Subject: Re: Lucene QueryParser and Analyzer
>> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
>> To: java-user@lucene.apache.org
>> Date: 4/29/2010 4:08 PM
>>
>>>
>>> If so,
>>>
>>> Input1:  c1c2c3c4c5c6c7....
>>> Input2: c1c2 c3c4 ...
>>>
>>> I guess, they are different! Add a whitespace after commas and see if
>>> that works...
>>>
>>> Sincerely,
>>> Sithu D Sudarsan
>>>
>>>
>>> -----Original Message-----
>>> From: Wei Ho [mailto:weiho@princeton.edu]
>>> Sent: Thursday, April 29, 2010 4:04 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene QueryParser and Analyzer
>>>
>>> No, there is no whitespace after the comma in Input1
>>>
>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>>
>>> Input1 is basically one big long word with commas and Chinese
>>>
>>
>> characters
>>
>>>
>>> one after the other. Input2 is where I manually separated the string
>>> into the component terms by replacing the comma with whitespace. My
>>> confusion stems from the fact that I thought it should not matter
>>>
>>
>> since
>>
>>>
>>> the analyzer should be discarding the punctuation anyway? So the
>>> tokenization process should be the same for both Input1 and Input2? If
>>> that is not the case, what do I need to change?
>>>
>>> Thanks,
>>> Wei Ho
>>>
>>> -------- Original Message  --------
>>> Subject: Re: Lucene QueryParser and Analyzer
>>> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
>>> To: java-user@lucene.apache.org
>>> Date: 4/29/2010 3:54 PM
>>>
>>>
>>>>
>>>> Hi,
>>>>
>>>> Is there a whitespace after the comma?
>>>>
>>>>
>>>> Sincerely,
>>>> Sithu D Sudarsan
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Wei Ho [mailto:weiho@princeton.edu]
>>>> Sent: Thursday, April 29, 2010 3:51 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Lucene QueryParser and Analyzer
>>>>
>>>> Hello,
>>>>
>>>> I'm using Lucene to index and search through a collection of Chinese
>>>> documents. However, I'm noticing an odd behavior in query
>>>> parsing/searching.
>>>>
>>>> Given the two queries below:
>>>>
>>>> (Ci refers to Chinese character i)
>>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>>>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>>>
>>>> Input1 returns absolutely nothing, while Input2 (replacing the commas
>>>> with spaces) works as expected. I'm a bit confused why this would be
>>>> happening - it seems that QueryParser uses the Analyzer passed to it
>>>>
>>>>
>>>
>>> to
>>>
>>>
>>>>
>>>> tokenize the input query string, so if the Analyzer ignores the
>>>> punctuations, it seems that Input1 and Input2 should return identical
>>>> results. Is there some pre-Analyzer filtering or whatever that
>>>> QueryParser does? I've tried this with the StandardAnalyzer,
>>>> SmartChineseAnalyzer, and an analyzer that I implemented which
>>>> explicitly skips over punctuations and whitespaces in tokenizing the
>>>> query string, but to no avail.
>>>>
>>>>
>>>> -----------------------------------
>>>>
>>>> I'm probably just doing something dumb, but any help would be greatly
>>>> appreciated!
>>>>
>>>> Thanks,
>>>> Wei Ho
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Lucene QueryParser and Analyzer

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
 

Quick fix:

Create a filter to replace commas with white space and then run your
code.

Sincerely,
Sithu D Sudarsan



-----Original Message-----
From: Wei Ho [mailto:weiho@princeton.edu] 
Sent: Thursday, April 29, 2010 7:01 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene QueryParser and Analyzer

I think I've figured out what the problem is. Given the inputs,

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 gets parsed as
Query1: (text: "C1C2  C3C4  C5C6  C7  C8C9C10")
whereas Input2 gets parsed as
Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text:

"C8C9C10")

That is, Lucene constructs the query and then pass the query text 
through the analyzer. Is there any way to
force QueryParser to pass the input string through the analyzer before 
creating the query? That is, force Lucene
to create Query2 for both Input1 and Input2.

Thanks,
Wei


-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
To: java-user@lucene.apache.org
Date: 4/29/2010 4:54 PM
>
> -------sample code-------------
>    
>>> Analyzer analyzer = new LingPipeAnalyzer();
>>> Searcher searcher = new IndexSearcher(directory);
>>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>>> SEARCH_FIELDS, analyzer);
>>> Query query = qParser.parse(queryLine[1]);
>>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
>>>        
> qParser will use the analyzer LingPipeAnalyzer() before forming the
> query.
>
>
> Sincerely,
> Sithu D Sudarsan
>
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 4:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene QueryParser and Analyzer
>
> Sorry, I guess "discarding the punctuation" was a bit misleading.
> I meant that given the two input strings,
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2",
> "C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the
> punctuation in the tokenization. I'm assuming that QueryParser is
simply
>
> passing the entire input string to the analyzer and taking the tokens,
> in which case Input1 and Input2 should be considered identifcal. Does
> QueryParser doing any sort of pre-processing or filtering beforehand?
If
>
> so, how can I turn it off?
>
> Aside from stopping tokens at punctuations, my analyzer is also doing
> Chinese word segmentation, so I'd like to be sure that QueryParser is
> using the analyzer the way I expect it to.
>
> Thanks,
> Wei
>
>
>
> -------- Original Message  --------
> Subject: Re: Lucene QueryParser and Analyzer
> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
> To: java-user@lucene.apache.org
> Date: 4/29/2010 4:08 PM
>    
>> If so,
>>
>> Input1:  c1c2c3c4c5c6c7....
>> Input2: c1c2 c3c4 ...
>>
>> I guess, they are different! Add a whitespace after commas and see if
>> that works...
>>
>> Sincerely,
>> Sithu D Sudarsan
>>
>>
>> -----Original Message-----
>> From: Wei Ho [mailto:weiho@princeton.edu]
>> Sent: Thursday, April 29, 2010 4:04 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene QueryParser and Analyzer
>>
>> No, there is no whitespace after the comma in Input1
>>
>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>
>> Input1 is basically one big long word with commas and Chinese
>>      
> characters
>    
>> one after the other. Input2 is where I manually separated the string
>> into the component terms by replacing the comma with whitespace. My
>> confusion stems from the fact that I thought it should not matter
>>      
> since
>    
>> the analyzer should be discarding the punctuation anyway? So the
>> tokenization process should be the same for both Input1 and Input2?
If
>> that is not the case, what do I need to change?
>>
>> Thanks,
>> Wei Ho
>>
>> -------- Original Message  --------
>> Subject: Re: Lucene QueryParser and Analyzer
>> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
>> To: java-user@lucene.apache.org
>> Date: 4/29/2010 3:54 PM
>>
>>      
>>> Hi,
>>>
>>> Is there a whitespace after the comma?
>>>
>>>
>>> Sincerely,
>>> Sithu D Sudarsan
>>>
>>>
>>> -----Original Message-----
>>> From: Wei Ho [mailto:weiho@princeton.edu]
>>> Sent: Thursday, April 29, 2010 3:51 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Lucene QueryParser and Analyzer
>>>
>>> Hello,
>>>
>>> I'm using Lucene to index and search through a collection of Chinese
>>> documents. However, I'm noticing an odd behavior in query
>>> parsing/searching.
>>>
>>> Given the two queries below:
>>>
>>> (Ci refers to Chinese character i)
>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>>
>>> Input1 returns absolutely nothing, while Input2 (replacing the
commas
>>> with spaces) works as expected. I'm a bit confused why this would be
>>> happening - it seems that QueryParser uses the Analyzer passed to it
>>>
>>>        
>> to
>>
>>      
>>> tokenize the input query string, so if the Analyzer ignores the
>>> punctuations, it seems that Input1 and Input2 should return
identical
>>> results. Is there some pre-Analyzer filtering or whatever that
>>> QueryParser does? I've tried this with the StandardAnalyzer,
>>> SmartChineseAnalyzer, and an analyzer that I implemented which
>>> explicitly skips over punctuations and whitespaces in tokenizing the
>>> query string, but to no avail.
>>>
>>>
>>> -----------------------------------
>>>
>>> I'm probably just doing something dumb, but any help would be
greatly
>>> appreciated!
>>>
>>> Thanks,
>>> Wei Ho
>>>
>>>
---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>        
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene QueryParser and Analyzer

Posted by Ahmet Arslan <io...@yahoo.com>.
> I think I've figured out what the
> problem is. Given the inputs,
> 
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
> 
> Input1 gets parsed as
> Query1: (text: "C1C2  C3C4  C5C6  C7 
> C8C9C10")
> whereas Input2 gets parsed as
> Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text:
> "C7") (text: 
> "C8C9C10")
> 
> That is, Lucene constructs the query and then pass the
> query text 
> through the analyzer. Is there any way to
> force QueryParser to pass the input string through the
> analyzer before 
> creating the query? That is, force Lucene
> to create Query2 for both Input1 and Input2.

QueryParser tokenizes input query string using white-spaces. You can alter this behavior to ways: 

1-) escaping whitespaces with backslash, e.g. C1C2\ C3C4\ C5C6\ C7\ C8C9C10
2-) using quotes, e.g. "C1C2  C3C4  C5C6  C7  C8C9C10"


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene QueryParser and Analyzer

Posted by Wei Ho <we...@princeton.edu>.
I think I've figured out what the problem is. Given the inputs,

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 gets parsed as
Query1: (text: "C1C2  C3C4  C5C6  C7  C8C9C10")
whereas Input2 gets parsed as
Query2: (text: "C1C2") (text: "C3C4") (text: "C5C6") (text: "C7") (text: 
"C8C9C10")

That is, Lucene constructs the query and then pass the query text 
through the analyzer. Is there any way to
force QueryParser to pass the input string through the analyzer before 
creating the query? That is, force Lucene
to create Query2 for both Input1 and Input2.

Thanks,
Wei


-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
To: java-user@lucene.apache.org
Date: 4/29/2010 4:54 PM
>
> -------sample code-------------
>    
>>> Analyzer analyzer = new LingPipeAnalyzer();
>>> Searcher searcher = new IndexSearcher(directory);
>>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>>> SEARCH_FIELDS, analyzer);
>>> Query query = qParser.parse(queryLine[1]);
>>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
>>>        
> qParser will use the analyzer LingPipeAnalyzer() before forming the
> query.
>
>
> Sincerely,
> Sithu D Sudarsan
>
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 4:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene QueryParser and Analyzer
>
> Sorry, I guess "discarding the punctuation" was a bit misleading.
> I meant that given the two input strings,
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2",
> "C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the
> punctuation in the tokenization. I'm assuming that QueryParser is simply
>
> passing the entire input string to the analyzer and taking the tokens,
> in which case Input1 and Input2 should be considered identifcal. Does
> QueryParser doing any sort of pre-processing or filtering beforehand? If
>
> so, how can I turn it off?
>
> Aside from stopping tokens at punctuations, my analyzer is also doing
> Chinese word segmentation, so I'd like to be sure that QueryParser is
> using the analyzer the way I expect it to.
>
> Thanks,
> Wei
>
>
>
> -------- Original Message  --------
> Subject: Re: Lucene QueryParser and Analyzer
> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
> To: java-user@lucene.apache.org
> Date: 4/29/2010 4:08 PM
>    
>> If so,
>>
>> Input1:  c1c2c3c4c5c6c7....
>> Input2: c1c2 c3c4 ...
>>
>> I guess, they are different! Add a whitespace after commas and see if
>> that works...
>>
>> Sincerely,
>> Sithu D Sudarsan
>>
>>
>> -----Original Message-----
>> From: Wei Ho [mailto:weiho@princeton.edu]
>> Sent: Thursday, April 29, 2010 4:04 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene QueryParser and Analyzer
>>
>> No, there is no whitespace after the comma in Input1
>>
>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>
>> Input1 is basically one big long word with commas and Chinese
>>      
> characters
>    
>> one after the other. Input2 is where I manually separated the string
>> into the component terms by replacing the comma with whitespace. My
>> confusion stems from the fact that I thought it should not matter
>>      
> since
>    
>> the analyzer should be discarding the punctuation anyway? So the
>> tokenization process should be the same for both Input1 and Input2? If
>> that is not the case, what do I need to change?
>>
>> Thanks,
>> Wei Ho
>>
>> -------- Original Message  --------
>> Subject: Re: Lucene QueryParser and Analyzer
>> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
>> To: java-user@lucene.apache.org
>> Date: 4/29/2010 3:54 PM
>>
>>      
>>> Hi,
>>>
>>> Is there a whitespace after the comma?
>>>
>>>
>>> Sincerely,
>>> Sithu D Sudarsan
>>>
>>>
>>> -----Original Message-----
>>> From: Wei Ho [mailto:weiho@princeton.edu]
>>> Sent: Thursday, April 29, 2010 3:51 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Lucene QueryParser and Analyzer
>>>
>>> Hello,
>>>
>>> I'm using Lucene to index and search through a collection of Chinese
>>> documents. However, I'm noticing an odd behavior in query
>>> parsing/searching.
>>>
>>> Given the two queries below:
>>>
>>> (Ci refers to Chinese character i)
>>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>>
>>> Input1 returns absolutely nothing, while Input2 (replacing the commas
>>> with spaces) works as expected. I'm a bit confused why this would be
>>> happening - it seems that QueryParser uses the Analyzer passed to it
>>>
>>>        
>> to
>>
>>      
>>> tokenize the input query string, so if the Analyzer ignores the
>>> punctuations, it seems that Input1 and Input2 should return identical
>>> results. Is there some pre-Analyzer filtering or whatever that
>>> QueryParser does? I've tried this with the StandardAnalyzer,
>>> SmartChineseAnalyzer, and an analyzer that I implemented which
>>> explicitly skips over punctuations and whitespaces in tokenizing the
>>> query string, but to no avail.
>>>
>>>
>>> -----------------------------------
>>>
>>> I'm probably just doing something dumb, but any help would be greatly
>>> appreciated!
>>>
>>> Thanks,
>>> Wei Ho
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>        
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Lucene QueryParser and Analyzer

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
 
-------sample code-------------
>> Analyzer analyzer = new LingPipeAnalyzer();
>> Searcher searcher = new IndexSearcher(directory);
>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>> SEARCH_FIELDS, analyzer);
>> Query query = qParser.parse(queryLine[1]);
>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;

qParser will use the analyzer LingPipeAnalyzer() before forming the
query.


Sincerely,
Sithu D Sudarsan


-----Original Message-----
From: Wei Ho [mailto:weiho@princeton.edu] 
Sent: Thursday, April 29, 2010 4:44 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene QueryParser and Analyzer

Sorry, I guess "discarding the punctuation" was a bit misleading.
I meant that given the two input strings,

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2", 
"C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the 
punctuation in the tokenization. I'm assuming that QueryParser is simply

passing the entire input string to the analyzer and taking the tokens, 
in which case Input1 and Input2 should be considered identifcal. Does 
QueryParser doing any sort of pre-processing or filtering beforehand? If

so, how can I turn it off?

Aside from stopping tokens at punctuations, my analyzer is also doing 
Chinese word segmentation, so I'd like to be sure that QueryParser is 
using the analyzer the way I expect it to.

Thanks,
Wei



-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
To: java-user@lucene.apache.org
Date: 4/29/2010 4:08 PM
>
> If so,
>
> Input1:  c1c2c3c4c5c6c7....
> Input2: c1c2 c3c4 ...
>
> I guess, they are different! Add a whitespace after commas and see if
> that works...
>
> Sincerely,
> Sithu D Sudarsan
>
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 4:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene QueryParser and Analyzer
>
> No, there is no whitespace after the comma in Input1
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 is basically one big long word with commas and Chinese
characters
>
> one after the other. Input2 is where I manually separated the string
> into the component terms by replacing the comma with whitespace. My
> confusion stems from the fact that I thought it should not matter
since
> the analyzer should be discarding the punctuation anyway? So the
> tokenization process should be the same for both Input1 and Input2? If
> that is not the case, what do I need to change?
>
> Thanks,
> Wei Ho
>
> -------- Original Message  --------
> Subject: Re: Lucene QueryParser and Analyzer
> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
> To: java-user@lucene.apache.org
> Date: 4/29/2010 3:54 PM
>    
>> Hi,
>>
>> Is there a whitespace after the comma?
>>
>>
>> Sincerely,
>> Sithu D Sudarsan
>>
>>
>> -----Original Message-----
>> From: Wei Ho [mailto:weiho@princeton.edu]
>> Sent: Thursday, April 29, 2010 3:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Lucene QueryParser and Analyzer
>>
>> Hello,
>>
>> I'm using Lucene to index and search through a collection of Chinese
>> documents. However, I'm noticing an odd behavior in query
>> parsing/searching.
>>
>> Given the two queries below:
>>
>> (Ci refers to Chinese character i)
>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>
>> Input1 returns absolutely nothing, while Input2 (replacing the commas
>> with spaces) works as expected. I'm a bit confused why this would be
>> happening - it seems that QueryParser uses the Analyzer passed to it
>>      
> to
>    
>> tokenize the input query string, so if the Analyzer ignores the
>> punctuations, it seems that Input1 and Input2 should return identical
>> results. Is there some pre-Analyzer filtering or whatever that
>> QueryParser does? I've tried this with the StandardAnalyzer,
>> SmartChineseAnalyzer, and an analyzer that I implemented which
>> explicitly skips over punctuations and whitespaces in tokenizing the
>> query string, but to no avail.
>>
>> 
>> -----------------------------------
>>
>> I'm probably just doing something dumb, but any help would be greatly
>> appreciated!
>>
>> Thanks,
>> Wei Ho
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene QueryParser and Analyzer

Posted by Wei Ho <we...@princeton.edu>.
Sorry, I guess "discarding the punctuation" was a bit misleading.
I meant that given the two input strings,

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

The analyzer I implemented tokenizes both Input1 and Input2 as "C1C2", 
"C3C4", "C5C6", "C7", "C8C9C10" - that is, it doesn't include the 
punctuation in the tokenization. I'm assuming that QueryParser is simply 
passing the entire input string to the analyzer and taking the tokens, 
in which case Input1 and Input2 should be considered identifcal. Does 
QueryParser doing any sort of pre-processing or filtering beforehand? If 
so, how can I turn it off?

Aside from stopping tokens at punctuations, my analyzer is also doing 
Chinese word segmentation, so I'd like to be sure that QueryParser is 
using the analyzer the way I expect it to.

Thanks,
Wei



-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
To: java-user@lucene.apache.org
Date: 4/29/2010 4:08 PM
>
> If so,
>
> Input1:  c1c2c3c4c5c6c7....
> Input2: c1c2 c3c4 ...
>
> I guess, they are different! Add a whitespace after commas and see if
> that works...
>
> Sincerely,
> Sithu D Sudarsan
>
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 4:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene QueryParser and Analyzer
>
> No, there is no whitespace after the comma in Input1
>
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 is basically one big long word with commas and Chinese characters
>
> one after the other. Input2 is where I manually separated the string
> into the component terms by replacing the comma with whitespace. My
> confusion stems from the fact that I thought it should not matter since
> the analyzer should be discarding the punctuation anyway? So the
> tokenization process should be the same for both Input1 and Input2? If
> that is not the case, what do I need to change?
>
> Thanks,
> Wei Ho
>
> -------- Original Message  --------
> Subject: Re: Lucene QueryParser and Analyzer
> From: Sudarsan, Sithu D.<Si...@fda.hhs.gov>
> To: java-user@lucene.apache.org
> Date: 4/29/2010 3:54 PM
>    
>> Hi,
>>
>> Is there a whitespace after the comma?
>>
>>
>> Sincerely,
>> Sithu D Sudarsan
>>
>>
>> -----Original Message-----
>> From: Wei Ho [mailto:weiho@princeton.edu]
>> Sent: Thursday, April 29, 2010 3:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Lucene QueryParser and Analyzer
>>
>> Hello,
>>
>> I'm using Lucene to index and search through a collection of Chinese
>> documents. However, I'm noticing an odd behavior in query
>> parsing/searching.
>>
>> Given the two queries below:
>>
>> (Ci refers to Chinese character i)
>> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
>> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>>
>> Input1 returns absolutely nothing, while Input2 (replacing the commas
>> with spaces) works as expected. I'm a bit confused why this would be
>> happening - it seems that QueryParser uses the Analyzer passed to it
>>      
> to
>    
>> tokenize the input query string, so if the Analyzer ignores the
>> punctuations, it seems that Input1 and Input2 should return identical
>> results. Is there some pre-Analyzer filtering or whatever that
>> QueryParser does? I've tried this with the StandardAnalyzer,
>> SmartChineseAnalyzer, and an analyzer that I implemented which
>> explicitly skips over punctuations and whitespaces in tokenizing the
>> query string, but to no avail.
>>
>> -------sample code-------------
>> Analyzer analyzer = new LingPipeAnalyzer();
>> Searcher searcher = new IndexSearcher(directory);
>> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
>> SEARCH_FIELDS, analyzer);
>> Query query = qParser.parse(queryLine[1]);
>> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
>> -----------------------------------
>>
>> I'm probably just doing something dumb, but any help would be greatly
>> appreciated!
>>
>> Thanks,
>> Wei Ho
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Lucene QueryParser and Analyzer

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
 
If so,

Input1:  c1c2c3c4c5c6c7....
Input2: c1c2 c3c4 ...

I guess, they are different! Add a whitespace after commas and see if
that works...

Sincerely,
Sithu D Sudarsan


-----Original Message-----
From: Wei Ho [mailto:weiho@princeton.edu] 
Sent: Thursday, April 29, 2010 4:04 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene QueryParser and Analyzer

No, there is no whitespace after the comma in Input1

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 is basically one big long word with commas and Chinese characters

one after the other. Input2 is where I manually separated the string 
into the component terms by replacing the comma with whitespace. My 
confusion stems from the fact that I thought it should not matter since 
the analyzer should be discarding the punctuation anyway? So the 
tokenization process should be the same for both Input1 and Input2? If 
that is not the case, what do I need to change?

Thanks,
Wei Ho

-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
To: java-user@lucene.apache.org
Date: 4/29/2010 3:54 PM
> Hi,
>
> Is there a whitespace after the comma?
>
>
> Sincerely,
> Sithu D Sudarsan
>
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 3:51 PM
> To: java-user@lucene.apache.org
> Subject: Lucene QueryParser and Analyzer
>
> Hello,
>
> I'm using Lucene to index and search through a collection of Chinese
> documents. However, I'm noticing an odd behavior in query
> parsing/searching.
>
> Given the two queries below:
>
> (Ci refers to Chinese character i)
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 returns absolutely nothing, while Input2 (replacing the commas
> with spaces) works as expected. I'm a bit confused why this would be
> happening - it seems that QueryParser uses the Analyzer passed to it
to
> tokenize the input query string, so if the Analyzer ignores the
> punctuations, it seems that Input1 and Input2 should return identical
> results. Is there some pre-Analyzer filtering or whatever that
> QueryParser does? I've tried this with the StandardAnalyzer,
> SmartChineseAnalyzer, and an analyzer that I implemented which
> explicitly skips over punctuations and whitespaces in tokenizing the
> query string, but to no avail.
>
> -------sample code-------------
> Analyzer analyzer = new LingPipeAnalyzer();
> Searcher searcher = new IndexSearcher(directory);
> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
> SEARCH_FIELDS, analyzer);
> Query query = qParser.parse(queryLine[1]);
> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
> -----------------------------------
>
> I'm probably just doing something dumb, but any help would be greatly
> appreciated!
>
> Thanks,
> Wei Ho
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene QueryParser and Analyzer

Posted by Wei Ho <we...@princeton.edu>.
No, there is no whitespace after the comma in Input1

Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 is basically one big long word with commas and Chinese characters 
one after the other. Input2 is where I manually separated the string 
into the component terms by replacing the comma with whitespace. My 
confusion stems from the fact that I thought it should not matter since 
the analyzer should be discarding the punctuation anyway? So the 
tokenization process should be the same for both Input1 and Input2? If 
that is not the case, what do I need to change?

Thanks,
Wei Ho

-------- Original Message  --------
Subject: Re: Lucene QueryParser and Analyzer
From: Sudarsan, Sithu D. <Si...@fda.hhs.gov>
To: java-user@lucene.apache.org
Date: 4/29/2010 3:54 PM
> Hi,
>
> Is there a whitespace after the comma?
>
>
> Sincerely,
> Sithu D Sudarsan
>
>
> -----Original Message-----
> From: Wei Ho [mailto:weiho@princeton.edu]
> Sent: Thursday, April 29, 2010 3:51 PM
> To: java-user@lucene.apache.org
> Subject: Lucene QueryParser and Analyzer
>
> Hello,
>
> I'm using Lucene to index and search through a collection of Chinese
> documents. However, I'm noticing an odd behavior in query
> parsing/searching.
>
> Given the two queries below:
>
> (Ci refers to Chinese character i)
> Input1: C1C2,C3C4,C5C6,C7,C8C9C10
> Input2: C1C2  C3C4  C5C6  C7  C8C9C10
>
> Input1 returns absolutely nothing, while Input2 (replacing the commas
> with spaces) works as expected. I'm a bit confused why this would be
> happening - it seems that QueryParser uses the Analyzer passed to it to
> tokenize the input query string, so if the Analyzer ignores the
> punctuations, it seems that Input1 and Input2 should return identical
> results. Is there some pre-Analyzer filtering or whatever that
> QueryParser does? I've tried this with the StandardAnalyzer,
> SmartChineseAnalyzer, and an analyzer that I implemented which
> explicitly skips over punctuations and whitespaces in tokenizing the
> query string, but to no avail.
>
> -------sample code-------------
> Analyzer analyzer = new LingPipeAnalyzer();
> Searcher searcher = new IndexSearcher(directory);
> QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
> SEARCH_FIELDS, analyzer);
> Query query = qParser.parse(queryLine[1]);
> ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
> -----------------------------------
>
> I'm probably just doing something dumb, but any help would be greatly
> appreciated!
>
> Thanks,
> Wei Ho
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>    


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Lucene QueryParser and Analyzer

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.
Hi,

Is there a whitespace after the comma? 


Sincerely,
Sithu D Sudarsan


-----Original Message-----
From: Wei Ho [mailto:weiho@princeton.edu] 
Sent: Thursday, April 29, 2010 3:51 PM
To: java-user@lucene.apache.org
Subject: Lucene QueryParser and Analyzer

Hello,

I'm using Lucene to index and search through a collection of Chinese 
documents. However, I'm noticing an odd behavior in query
parsing/searching.

Given the two queries below:

(Ci refers to Chinese character i)
Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2  C3C4  C5C6  C7  C8C9C10

Input1 returns absolutely nothing, while Input2 (replacing the commas 
with spaces) works as expected. I'm a bit confused why this would be 
happening - it seems that QueryParser uses the Analyzer passed to it to 
tokenize the input query string, so if the Analyzer ignores the 
punctuations, it seems that Input1 and Input2 should return identical 
results. Is there some pre-Analyzer filtering or whatever that 
QueryParser does? I've tried this with the StandardAnalyzer, 
SmartChineseAnalyzer, and an analyzer that I implemented which 
explicitly skips over punctuations and whitespaces in tokenizing the 
query string, but to no avail.

-------sample code-------------
Analyzer analyzer = new LingPipeAnalyzer();
Searcher searcher = new IndexSearcher(directory);
QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30, 
SEARCH_FIELDS, analyzer);
Query query = qParser.parse(queryLine[1]);
ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
-----------------------------------

I'm probably just doing something dumb, but any help would be greatly 
appreciated!

Thanks,
Wei Ho

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org