Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2014/04/10 19:53:32 UTC

Another Japanese analysis problem

My analysis chain includes CJKBigramFilter on both the index and query.  
I have outputUnigrams enabled on the index side, but it is disabled on 
the query side.  This has resulted in a problem with phrase queries.  
This is a subset of my index analysis for the three terms you can see in 
the ICUNF step, separated by spaces:

https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png

Note that in the CJKBF step, the second unigram is output at position 2, 
pushing the English terms to 3 and 4.

When the customer runs a phrase filter query (lucene query parser) for the 
first two terms on this specific field, it doesn't match, because the 
query analysis doesn't output the unigrams and therefore the positions 
don't match.
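
The position mismatch can be sketched with a small Python model.  This is
not Lucene code: analyze() and phrase_matches() are hypothetical helpers
that only imitate the positions described above, and the terms 日本, foo
and bar are made-up stand-ins for the ones in the screenshot.

```python
def is_cjk(ch):
    # Rough check (assumption): CJK Unified Ideographs, Hiragana, Katakana.
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"


def analyze(terms, output_unigrams):
    """Return (token, position) pairs for a list of whitespace-split terms."""
    tokens = []
    pos = 0
    for term in terms:
        if len(term) > 1 and all(is_cjk(c) for c in term):
            bigrams = [term[i:i + 2] for i in range(len(term) - 1)]
            if output_unigrams:
                # Each unigram advances the position; the bigram that
                # starts at that character shares the unigram's position.
                for i, ch in enumerate(term):
                    pos += 1
                    tokens.append((ch, pos))
                    if i < len(bigrams):
                        tokens.append((bigrams[i], pos))
            else:
                # Bigrams only: one position per bigram.
                for bigram in bigrams:
                    pos += 1
                    tokens.append((bigram, pos))
        else:
            pos += 1
            tokens.append((term, pos))
    return tokens


def phrase_matches(index_tokens, query_tokens):
    """True if the query tokens occur in the index at the same relative positions."""
    indexed = set(index_tokens)
    first_term, first_pos = query_tokens[0]
    for term, pos in index_tokens:
        if term != first_term:
            continue
        shift = pos - first_pos
        if all((t, p + shift) in indexed for t, p in query_tokens):
            return True
    return False


index_tokens = analyze(["日本", "foo", "bar"], output_unigrams=True)
query_tokens = analyze(["日本", "foo"], output_unigrams=False)

print(index_tokens)  # [('日', 1), ('日本', 1), ('本', 2), ('foo', 3), ('bar', 4)]
print(query_tokens)  # [('日本', 1), ('foo', 2)]
print(phrase_matches(index_tokens, query_tokens))  # False
```

In this model the index puts foo two positions after the bigram, while
the query expects it one position after, so the phrase cannot match.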

I would have expected both unigrams to be at position 1.  Is this a bug 
or expected behavior?

Thanks,
Shawn

Re: Another Japanese analysis problem

Posted by Shawn Heisey <so...@elyograg.org>.

On 4/18/2014 12:04 AM, Alexandre Rafalovitch wrote:
> Did you read through the CJK article series? Maybe there is something
> in there? http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
> 
> Sorry, no help on actual Japanese.

Almost everything I know about the Japanese language has been learned in
the last few weeks, working on this Solr config!

That blog series looks like really awesome information.  I will be
trying out some of what they've mentioned.  Thank you for pointing me
in that direction.  The author's index is a lot more complex than ours ...
I'm really hoping to avoid having a lot of copies of each field.  The
index is already relatively large.

I think I'll take my discussion about a possible bug in CJKBigramFilter
to the dev list.

Thanks,
Shawn

Re: Another Japanese analysis problem

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Did you read through the CJK article series? Maybe there is something
in there? http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

Sorry, no help on actual Japanese.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 18, 2014 at 12:50 PM, Shawn Heisey <so...@elyograg.org> wrote:
> On 4/10/2014 11:53 AM, Shawn Heisey wrote:
>> My analysis chain includes CJKBigramFilter on both the index and query.
>> I have outputUnigrams enabled on the index side, but it is disabled on
>> the query side.  This has resulted in a problem with phrase queries.
>> This is a subset of my index analysis for the three terms you can see in
>> the ICUNF step, separated by spaces:
>>
>> https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png
>>
>> Note that in the CJKBF step, the second unigram is output at position 2,
>> pushing the English terms to 3 and 4.
>>
>> When the customer runs a phrase filter query (lucene query parser) for the
>> first two terms on this specific field, it doesn't match, because the
>> query analysis doesn't output the unigrams and therefore the positions
>> don't match.
>>
>> I would have expected both unigrams to be at position 1.  Is this a bug
>> or expected behavior?
>
> It's been a week with no reply.
>
> First I worked around this problem by disabling outputUnigrams on the
> index side, to match the query side.  At that point, the customer was
> unable to do a search for a single character and find longer strings
> containing that character.  I knew this would happen ... I did tell our
> project manager, but I do not know whether it was communicated to the
> customer.
>
> Then I tried setting outputUnigrams to true on both index and query.
> Just as I had anticipated, the customer was unhappy with getting results
> where a "word" containing only one character of their multi-character
> search string was present.
>
> Re-stating the underlying problem and my question:
>
> The outputUnigrams option sets one of the unigrams from each bigram to
> the same position as the bigram, but then puts the other one at the next
> position, breaking phrase queries.  This sounds like a bug.  Is it a
> bug?  If not, I would REALLY like a config option to produce the
> behavior that I expected.
>
> Thanks,
> Shawn
>

Re: Another Japanese analysis problem

Posted by Shawn Heisey <so...@elyograg.org>.

On 4/10/2014 11:53 AM, Shawn Heisey wrote:
> My analysis chain includes CJKBigramFilter on both the index and query. 
> I have outputUnigrams enabled on the index side, but it is disabled on
> the query side.  This has resulted in a problem with phrase queries. 
> This is a subset of my index analysis for the three terms you can see in
> the ICUNF step, separated by spaces:
> 
> https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png
> 
> Note that in the CJKBF step, the second unigram is output at position 2,
> pushing the English terms to 3 and 4.
> 
> When the customer runs a phrase filter query (lucene query parser) for the
> first two terms on this specific field, it doesn't match, because the
> query analysis doesn't output the unigrams and therefore the positions
> don't match.
> 
> I would have expected both unigrams to be at position 1.  Is this a bug
> or expected behavior?

It's been a week with no reply.

First I worked around this problem by disabling outputUnigrams on the
index side, to match the query side.  At that point, the customer was
unable to do a search for a single character and find longer strings
containing that character.  I knew this would happen ... I did tell our
project manager, but I do not know whether it was communicated to the
customer.

Then I tried setting outputUnigrams to true on both index and query.
Just as I had anticipated, the customer was unhappy with getting results
where a "word" containing only one character of their multi-character
search string was present.
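
The trade-off between the three configurations can be sketched in Python.
This is a deliberately crude model, not Solr's query logic:
emitted_terms() and matches() are hypothetical helpers, a document is
assumed to match when it shares any term with the query, positions are
ignored, and the sample strings are invented.

```python
def emitted_terms(word, output_unigrams):
    """Terms a simplified bigram filter emits for one all-CJK word (assumed model)."""
    if len(word) == 1:
        # A character with no neighbour cannot form a bigram,
        # so it comes out as a unigram in either setting.
        return {word}
    bigrams = {word[i:i + 2] for i in range(len(word) - 1)}
    return bigrams | set(word) if output_unigrams else bigrams


def matches(doc_word, query_word, index_unigrams, query_unigrams):
    """True if the query shares at least one term with the document (assumed OR semantics)."""
    doc_terms = emitted_terms(doc_word, index_unigrams)
    query_terms = emitted_terms(query_word, query_unigrams)
    return bool(doc_terms & query_terms)


# Unigrams off on both sides: a single character no longer finds longer strings.
print(matches("日本語", "本", index_unigrams=False, query_unigrams=False))  # False

# Unigrams on both sides: one shared character is enough to match.
print(matches("日曜", "日本", index_unigrams=True, query_unigrams=True))  # True

# Unigrams on the index side only: both cases behave as the customer wants.
print(matches("日本語", "本", index_unigrams=True, query_unigrams=False))  # True
print(matches("日曜", "日本", index_unigrams=True, query_unigrams=False))  # False
```

In this model only the mixed setup gives both results the customer wants,
which is why the position problem with phrase queries matters.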

Re-stating the underlying problem and my question:

The outputUnigrams option sets one of the unigrams from each bigram to
the same position as the bigram, but then puts the other one at the next
position, breaking phrase queries.  This sounds like a bug.  Is it a
bug?  If not, I would REALLY like a config option to produce the
behavior that I expected.

Thanks,
Shawn