Posted to solr-user@lucene.apache.org by David Shettler <ds...@gmail.com> on 2009/01/17 04:27:15 UTC

Word Delimiter struggles

This has likely been covered, and I've tried searching through the
archives, but I'm having trouble finding an answer.

On OSVDB.org, if you search for:

title:PHPGroupWare

You get...nothing

if you search for:

title:phpGroupWare

(which is how the entry is indexed originally), you get a match of course.

same with phpgroupware

If I get rid of word delimiter, then things are fine, unless you want
to search for PHP GroupWare and get a match...

Basically, I need to get a match on any of these searches:

PHPGroupWare
PHPGroupware
phpGroupware
phpGroupWare
phpgroupware
php groupware
php group ware
PHPGroup ware

etc.

We've been dealing with this problem for about 36 months now, but
there has to be a better way...or am I dreaming? :)

Can anyone suggest a schema that would accommodate this?  I've
tried every combination of word delimiter settings that I can think of,
but I'm no expert on the topic.

I can also manipulate input prior to search and indexing if you can
think of a way there.  I want the best of both SQL's SELECT ... LIKE and
solr's voodoo...perhaps I'm wanting too much!

Cheers,

Dave
OSVDB.org

Re: Word Delimiter struggles

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Mon, Jan 19, 2009 at 9:42 PM, David Shettler <ds...@gmail.com> wrote:

> Thank you Shalin, I'm in the process of implementing your suggestion,
> and it works marvelously.  Had to upgrade to solr 1.3, and had to hack
> up acts_as_solr to function correctly.
>
> Is there a way to receive a search for a given field, and have solr
> know to automatically check the two fields?  I suppose not.

If you use DisMax (defType=dismax) instead of the standard handler, the qf
parameter can be used to specify all the fields you want to search for the
given query.

http://wiki.apache.org/solr/DisMaxRequestHandler
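
As a rough sketch of what such a request could look like from client code, the snippet below builds a select URL with defType=dismax and a qf listing the two fields from this thread (titlew and titlec). The host, port and path in SELECT_URL are assumptions, not taken from anyone's setup. Note that, as far as I know, dismax treats q as plain user terms rather than fielded syntax, so the "title:" prefix would have to be dropped before sending. It uses the URLEncoder overload that takes a Charset, which needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DismaxUrlSketch {
    // Hypothetical endpoint; adjust host, port and path for your install.
    static final String SELECT_URL = "http://localhost:8983/solr/select";

    // Builds a dismax request that searches both title fields for the raw user terms.
    static String buildUrl(String userTerms) {
        return SELECT_URL
                + "?defType=dismax"
                + "&qf=" + enc("titlew titlec")   // field names from this thread
                + "&q=" + enc(userTerms);
    }

    // Form-encodes a parameter value (spaces become '+').
    static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("php groupware"));
    }
}
```

The qf list could also be set as a default in the request handler configuration so clients only send q.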

-- 
Regards,
Shalin Shekhar Mangar.

Re: Word Delimiter struggles

Posted by David Shettler <ds...@gmail.com>.

Thank you Shalin, I'm in the process of implementing your suggestion,
and it works marvelously.  Had to upgrade to solr 1.3, and had to hack
up acts_as_solr to function correctly.

Is there a way to receive a search for a given field, and have solr
know to automatically check the two fields?  I suppose not.

I'm trying to avoid having to manipulate user input too much, so
hoping to be able to have a user search for:

title:phpGroupWare

and have it search the two fields automatically.  Right now, in
implementing your solution, I take their title search and convert it
to (titlew:(phpGroupWare) OR titlec:(phpGroupWare)) and it works
marvelously, but of course would be easier if I could just let it go
as is.

(titlew being wdf_wordparts and titlec being wdf_catenatewords)
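
For anyone wanting to do the same rewrite, here is a minimal sketch of the conversion described above, assuming the user types either title:term or title:(several terms). The class and method names are made up for illustration, and this is a regex shortcut rather than a real query parser: quoted phrases, escapes and nested parentheses are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleQueryRewriter {
    // Matches title:(a b c) or title:term; group 1 or group 2 holds the terms.
    private static final Pattern TITLE =
            Pattern.compile("\\btitle:(?:\\(([^)]*)\\)|([^\\s()]+))");

    // Expands each title: clause into an OR over the two analyzed fields.
    static String rewrite(String query) {
        Matcher m = TITLE.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String terms = m.group(1) != null ? m.group(1) : m.group(2);
            String expanded = "(titlew:(" + terms + ") OR titlec:(" + terms + "))";
            // quoteReplacement keeps '$' and '\' in user input from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(expanded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("title:phpGroupWare"));
        System.out.println(rewrite("title:(php groupware) AND vendor:foo"));
    }
}
```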

Thank you kindly, we've grown to depend strongly on solr for OSVDB.org
and datalossdb.org -- it is a fantastic tool.

Dave


Re: Word Delimiter struggles

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Dave,

A quick experiment found that the following fieldtypes work with your
queries. Use a copyField to copy one field into the other and search on both:

<fieldtype name="wdf_wordparts" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<fieldtype name="wdf_catenatewords" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

I added the following test to TestWordDelimiterFilter.java

public void testDave() {

    assertU(adoc("id", "191",
            "wdf_preserve", "phpGroupWare"));
    assertU(commit());

    assertQ("preserving original word",
            req("wdf_preserve:PHPGroupWare"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:phpGroupWare wdf_catenatewords:phpGroupWare"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:PHPGroupware wdf_catenatewords:PHPGroupware"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:phpGroupware wdf_catenatewords:phpGroupware"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:phpgroupware wdf_catenatewords:phpgroupware"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:(php groupware) wdf_catenatewords:(php groupware)"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:(php group ware) wdf_catenatewords:(php group ware)"),
            "//result[@numFound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:(PHPGroup ware) wdf_catenatewords:(PHPGroup ware)"),
            "//result[@numFound=1]"
    );

}

I'll let someone else comment if there is an easier way to do this (without
two fields).
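
For some intuition about why "phpGroupWare" and "PHPGroupWare" analyze differently with splitOnCaseChange, here is a deliberately simplified sketch of the rule as I understand it: a token is split wherever a lowercase letter is followed by an uppercase one. This is not the WordDelimiterFilter implementation; it ignores digits, punctuation and all the other options, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CaseChangeSplitSketch {
    // Splits a token at every lowercase-to-uppercase transition.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            boolean lowerThenUpper = Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i));
            if (lowerThenUpper) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        parts.add(token.substring(start)); // trailing part (the whole token if no split happened)
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("phpGroupWare")); // [php, Group, Ware]
        System.out.println(split("PHPGroupWare")); // [PHPGroup, Ware]
        System.out.println(split("phpgroupware")); // [phpgroupware]
    }
}
```

Under this rule the two spellings yield different word parts ("php" versus "PHPGroup"), which is why the word-parts field alone is not enough and the catenated field is searched as well.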




-- 
Regards,
Shalin Shekhar Mangar.

Re: Word Delimiter struggles

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Sorry I typed without thinking too much. Please disregard my previous mail.

I'll run a few tests and let you know.




-- 
Regards,
Shalin Shekhar Mangar.

Re: Word Delimiter struggles

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Dave,

WordDelimiterFilterFactory has an attribute, preserveOriginal="true",
which should keep the original string. I think if you put LowerCaseFilter
before WordDelimiterFilterFactory with the preserveOriginal setting, it
should do what you have outlined.




-- 
Regards,
Shalin Shekhar Mangar.