Posted to solr-user@lucene.apache.org by "anurag.jain" <an...@gmail.com> on 2013/03/21 15:10:25 UTC

CommaSplit and query is free text search

I have a field named worked_company_name.

In my JSON input I am sending values like this:

{
"worked_company_name":["Dell","Microsoft,Facebook"]
}

The data is messy: a single value may itself contain commas, among other
problems.

<field name="worked_company_name" type="comaSplitwithsearch" indexed="true"
stored="true"/>

Can you please tell me how the field type should be defined? What should
comaSplitwithsearch look like?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/CommaSplit-and-query-is-free-text-search-tp4049734.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: CommaSplit and query is free text search

Posted by Jack Krupansky <ja...@basetechnology.com>.
You should clean up your data before sending it to Solr. Theoretically, you 
could develop a custom update processor to do that cleanup within Solr, but 
it probably wouldn't be worth the extra effort.

Once you have decided what the clean input format is, then you can decide 
what the details of the Solr schema should be.

Actually, the first question is what schema your applications will be 
expecting to see. I mean, is it simply a multivalued string field, or do 
they want to do keyword search? Decide how the app will consume the data, 
then design the rough schema, then decide what the clean data will look 
like, then tune the schema for any nuances. For example, maybe you want both 
a multivalued list of strings (e.g., for a formatted display) and a 
multivalued list of keyword text values. Or, maybe you want just a simple 
keyword text field for the whole list as one value.
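
As a rough illustration of the "both" option, the schema could carry an
untokenized string field for display plus a tokenized copy for keyword
search. This is only a sketch: the worked_company_name_txt field name is
hypothetical, and it assumes the stock string and text_general types from
the example schema are available.

```xml
<!-- Sketch only: worked_company_name_txt is a made-up name, not from the thread. -->

<!-- Exact, untokenized values, e.g. for formatted display -->
<field name="worked_company_name" type="string" indexed="true"
       stored="true" multiValued="true"/>

<!-- Tokenized copy of the same values, for keyword search -->
<field name="worked_company_name_txt" type="text_general" indexed="true"
       stored="false" multiValued="true"/>

<!-- Each incoming value is copied into the searchable field at index time -->
<copyField source="worked_company_name" dest="worked_company_name_txt"/>
```

Note that copyField copies values as they arrive, so a value such as
"Microsoft,Facebook" would still be a single value in both fields unless
the input is cleaned first.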

In any case, start with the app usage requirements.

-- Jack Krupansky


Re: CommaSplit and query is free text search

Posted by "anurag.jain" <an...@gmail.com>.
I tried the text_general type with multiValued="true", and it works for my
needs.

<field name="worked_company_name" type="text_general" indexed="true"
       stored="true" multiValued="true"/>

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


Re: CommaSplit and query is free text search

Posted by Jack Krupansky <ja...@basetechnology.com>.
That's fine as far as it goes, but the input is multi-valued, so merely 
splitting tokens on comma doesn't make the tokens separate values.

Given:

{"worked_company_name":["Dell","Microsoft,Facebook"] }

The comma-splitting regex would produce the equivalent of:

{"worked_company_name":["Dell","Microsoft Facebook"] }

Or is the desired goal:

{"worked_company_name":["Dell","Microsoft","Facebook"] }

Or, something else?

-- Jack Krupansky


RE: CommaSplit and query is free text search

Posted by "Keswani, Nitin - BLS CTR" <Ke...@bls.gov>.
You can use the type defined below to split on commas. Note that I have not added any filters.
Depending on your requirements, you may want to add filters for further processing after tokenisation:

<!-- A text field that only splits on comma for exact matching of words -->
    <fieldType name="text_split_on_comma" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="," />
      </analyzer>
    </fieldType>
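
For instance, a variant that also tolerates spaces around the commas and
matches case-insensitively might look like the following. This is a sketch
that has not been tested against the poster's data, and the name
text_split_on_comma_lc is made up for illustration. Like the type above, it
only splits tokens within a value; it does not turn "Microsoft,Facebook"
into two separate field values.

```xml
<!-- Sketch only: the type name text_split_on_comma_lc is hypothetical. -->
<fieldType name="text_split_on_comma_lc" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- Split on commas, swallowing any whitespace around each comma -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
    <!-- Remove leading/trailing whitespace left at the ends of the value -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- Lowercase so that "dell" matches "Dell" -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```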

Thanks.

Regards,

Nitin Keswani

