Posted to solr-user@lucene.apache.org by "@Nandan@" <na...@gmail.com> on 2017/06/15 01:23:22 UTC

Reg:- StrField Analyzer Issue

Hi,

I am using Apache Solr to do advanced searching over my big data.

When I create a Solr core, text fields default to the TextField data
type and class.

Can you please tell me how to change TextField to StrField? My table
contains records in English as well as Chinese.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.StrField" name="StrField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/>
    <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
  </types>
  <fields>
    <field indexed="true" multiValued="false" name="title" stored="true" type="StrField"/>
    <field indexed="true" multiValued="false" name="isbn" stored="true" type="StrField"/>
    <field indexed="true" multiValued="false" name="publisher" stored="true" type="StrField"/>
    <field indexed="true" multiValued="false" name="author" stored="true" type="StrField"/>
    <field docValues="true" indexed="true" multiValued="false" name="id" stored="true" type="UUIDField"/>
    <field docValues="true" indexed="true" multiValued="false" name="date" stored="true" type="TrieIntField"/>
  </fields>


Please guide me on how to define the StrField type correctly.

Thanks.

Re: Reg:- StrField Analyzer Issue

Posted by "@Nandan@" <na...@gmail.com>.
Thanks, Erick, for the clear explanation.

The issue with my data is as follows.
I have a few rows in my books table.

cqlsh:nandan> select * from books;

 id                                   | author   | date | isbn     | solr_query | title
--------------------------------------+----------+------+----------+------------+-----------
 3910b29d-c957-4312-9b8b-738b1d0e25d0 |  Chandan | 2015 |  1asd33s |       null |      Solr
 d7534021-80c2-4315-8027-84f04bf92f53 | 现在有货 | 2015 | 现在有货 |       null |      Solr
 780b5163-ca6b-40bf-a523-af2c075ef7df |   在有货 | 2015 |   在有货 |       null |      Solr
 e6229268-d0fd-485b-ad89-bbde73a07ed6 |       货 | 2015 |   现有货 |       null |      Solr
 76461e7e-6c31-4a4b-8a36-0df5ce746d50 |   Nandan | 2017 |    11111 |       null |  Datastax
 9a9c66c2-cd34-460e-a301-6d8e7eb14e55 |   Kundan | 2016 |     12ws |       null | Cassandra
 7e87dc3a-5e4e-4653-84cc-3d83239708d4 |   现有货 | 2015 |   现有货 |       null |      Solr
 6971976e-2528-4956-94a8-345deefe5796 |     现货 | 2015 |     现货 |       null |      Solr

When I select from the table by author:

cqlsh:nandan> SELECT * from books where solr_query = 'author:现有货';

 id                                   | author   | date | isbn     | solr_query | title
--------------------------------------+----------+------+----------+------------+-------
 d7534021-80c2-4315-8027-84f04bf92f53 | 现在有货 | 2015 | 现在有货 |       null |  Solr
 7e87dc3a-5e4e-4653-84cc-3d83239708d4 |   现有货 | 2015 |   现有货 |       null |  Solr
 6971976e-2528-4956-94a8-345deefe5796 |     现货 | 2015 |     现货 |       null |  Solr
 780b5163-ca6b-40bf-a523-af2c075ef7df |   在有货 | 2015 |   在有货 |       null |  Solr

It should return one row, but I am getting other records as well.

But when I try to retrieve it another way, using wildcards, it returns
0 rows:

cqlsh:nandan> SELECT * from books where solr_query = 'author:*现有货*';

 id | author | date | isbn | solr_query | title
----+--------+------+------+------------+-------

(0 rows)

cqlsh:nandan> SELECT * from books where solr_query = 'author:*现有货';

 id | author | date | isbn | solr_query | title
----+--------+------+------+------------+-------

(0 rows)

cqlsh:nandan> SELECT * from books where solr_query = 'author:现有货*';

 id | author | date | isbn | solr_query | title
----+--------+------+------+------------+-------

(0 rows)


In some cases I get the correct data, but in other cases I get the
wrong data. Please check.

Thanks

Nandan


Re: Reg:- StrField Analyzer Issue

Posted by Erick Erickson <er...@gmail.com>.
Back up a bit and tell us why you want to use StrField, because what
you're trying to do is somewhat confused.

First of all, StrFields are totally unanalyzed. So defining an
<analyzer> as part of a StrField type definition is totally
unsupported. I'm a bit surprised that Solr even starts up.

Second, you can't search a StrField unless you search the whole thing
exactly. That is, if your title field is "My dog has fleas", there
are only a few ways to match anything in that field:

1> search "My dog has fleas" exactly. Even "my dog has fleas" wouldn't
match because of the capitalization. "My dog has fleas." would also
fail because of the period. StrField types are intended for data that
should be invariant and not tokenized.

2> prefix search as "My dog*"

3> pre-and-postfix as "*dog*"

<2> is actually reasonable if you have more than, say, 3 or 4 "real"
characters before the wildcard.

<3> performs very poorly at any kind of scale.

A search for "dog" would not match. A search for "fleas" wouldn't
match. You see where this is going.

If those restrictions are OK, just use the already-defined "string" type.
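
As a minimal sketch of that suggestion, assuming the type and field names
from the schema posted in this thread: a StrField type carries no
<analyzer> element at all. In a stock Solr schema the equivalent
predefined type is named "string"; the "StrField" name below is simply
what the posted schema uses.

```xml
<!-- Sketch only: StrField is unanalyzed, so the type has no <analyzer> child. -->
<fieldType class="org.apache.solr.schema.StrField" name="StrField"/>

<!-- The field definition itself is unchanged from the posted schema. -->
<field indexed="true" multiValued="false" name="author" stored="true" type="StrField"/>
```

With this definition, only the exact-match, prefix, and wildcard searches
described above can be expected to match.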

As for the English/Chinese, that's actually kind of a tough one.
Splitting Chinese up into searchable tokens is nothing like breaking
English up. There are examples in the managed-schema file that have
field definitions for Chinese, but I know of no way to have a single
field type share the two different analysis chains. One solution
people have used is to have a title_ch and a title_en field and search
both, or to search one or the other preferentially if the input is in
one language or the other.
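
A rough sketch of the two-field approach, under stated assumptions: the
type names text_en_simple and text_cjk_bigram are made up for
illustration, the analysis chains are one plausible choice (the CJK chain
mirrors the bigram-based example shipped with stock Solr), and how values
get into both fields (copyField or the indexing client) is left open.
Field mapping in a Cassandra-backed setup may impose further constraints.

```xml
<!-- Sketch only: type names and analysis chains are illustrative assumptions. -->

<!-- English: split on word boundaries, then lowercase. -->
<fieldType name="text_en_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Chinese: normalize character width, then index overlapping bigrams. -->
<fieldType name="text_cjk_bigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

<!-- One field per language, as suggested above. -->
<field name="title_en" type="text_en_simple" indexed="true" stored="true"/>
<field name="title_ch" type="text_cjk_bigram" indexed="true" stored="true"/>
```

A query could then target both fields, or only the one matching the
language of the input.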

I strongly advise you to use the admin UI >> Analysis page to understand
the effects of tokenization; it's the heart of searching.

Best,
Erick
