Posted to solr-user@lucene.apache.org by vatuska <va...@yandex.ru> on 2013/10/22 12:59:26 UTC

Language detection for multivalued field

Is there any way to detect the language of a multivalued field?
It doesn't seem to work when a document has several values in different
languages.

*I have a multivalued field in schema.xml*
...
<field name="tag" type="text_general" indexed="true" stored="true"
required="false" multiValued="true"/>
...
<dynamicField name="*_undfnd" type="text_general" indexed="true"
stored="true" multiValued="true"/>
<dynamicField name="*_en" type="text_en_splitting" indexed="true"
stored="true" multiValued="true"/>

*And I have configured an UpdateRequestProcessorChain*

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">tag</str>
    <str name="langid.langField">lang_global</str>
    <str name="langid.langFields">langs</str>
    <bool name="langid.map">true</bool>
    <bool name="langid.map.individual">true</bool>
    <bool name="langid.map.keepOrig">true</bool>
    <str name="langid.fallback">undfnd</str>
    <str name="langid.whitelist">en,en_GB,en_US</str>
    <str name="langid.map.individual.fl">title,source,tag,creatorName,description</str>
    <str name="langid.map.lcmap">en_GB:en en_US:en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

*All works fine for a document like:*
...
<field name="tag">My test tag</field>
...

*And all works fine for a document like*
...
<field name="tag">test</field>
<field name="tag">first</field>
<field name="tag">My tag</field>
...

*But for* 
...
<field name="tag">español</field>
<field name="tag">first</field>
<field name="tag">My tag</field>
...
*There isn't tag indexed*
*But I expect*
tag_en : first, My tag
tag_undfnd : español

Is there any way to fix this?



--
View this message in context: http://lucene.472066.n3.nabble.com/Language-detection-for-multivalued-field-tp4096996.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Language detection for multivalued field

Posted by lsanchez <ls...@scarlet.be>.
Hi all,
I don't know if this will help anybody, but I've changed the process method
of the LanguageIdentifierUpdateProcessor class to support multivalued
fields, and it works pretty well:


protected SolrInputDocument process(SolrInputDocument doc) {
    String docLang = null;
    HashSet<String> docLangs = new HashSet<String>();
    String fallbackLang = getFallbackLang(doc, fallbackFields, fallbackValue);

    if (langField == null || !doc.containsKey(langField) || (doc.containsKey(langField) && overwrite)) {
      String allText = concatFields(doc, inputFields);
      List<DetectedLanguage> languagelist = detectLanguage(allText);
      docLang = resolveLanguage(languagelist, fallbackLang);
      docLangs.add(docLang);
      log.debug("Detected main document language from fields " + inputFields.toString() + ": " + docLang);

      if (doc.containsKey(langField) && overwrite) {
        log.debug("Overwritten old value " + doc.getFieldValue(langField));
      }
      if (langField != null && langField.length() != 0) {
        doc.setField(langField, docLang);
      }
    } else {
      // langField is set, we sanity check it against whitelist and fallback
      docLang = resolveLanguage((String) doc.getFieldValue(langField), fallbackLang);
      docLangs.add(docLang);
      log.debug("Field " + langField + " already contained value " + docLang + ", not overwriting.");
    }

    if (enableMapping) {
      for (String fieldName : allMapFieldsSet) {
        if (doc.containsKey(fieldName)) {
          String fieldLang = "";
          if (mapIndividual && mapIndividualFieldsSet.contains(fieldName)) {
            // Read the boost once, before the source field can be removed below.
            float boost = doc.getField(fieldName).getBoost();
            Collection<Object> values = doc.getFieldValues(fieldName);
            boolean mapped = false;

            // Detect the language of each value separately; non-String values are skipped.
            for (Object o : values) {
              if (o instanceof String) {
                List<DetectedLanguage> languagelist = detectLanguage((String) o);
                fieldLang = resolveLanguage(languagelist, docLang);
                docLangs.add(fieldLang);
                log.debug("Mapping multivalued field " + fieldName + " using individually detected language " + fieldLang);
                String mappedOutputField = getMappedField(fieldName, fieldLang);
                if (mappedOutputField != null) {
                  log.debug("Mapping multivalued field {} to {}", doc.getFieldValue(docIdField), fieldLang);
                  Collection<Object> currentContent = doc.getFieldValues(mappedOutputField);
                  if (currentContent != null && currentContent.size() > 0) {
                    // Output field already has values: append this one.
                    doc.addField(mappedOutputField, o);
                  } else {
                    // First value for this output field: create it with the source boost.
                    doc.setField(mappedOutputField, o, boost);
                  }
                  mapped = true;
                } else {
                  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Invalid output field mapping for "
                          + fieldName + " field and language: " + fieldLang);
                }
              }
            }

            // Remove the original only after all values are mapped. Removing it inside
            // the loop made doc.getField(fieldName) return null for later values.
            if (mapped && !mapKeepOrig) {
              log.debug("Removing old field {}", fieldName);
              doc.removeField(fieldName);
            }
          } else {
            fieldLang = docLang;
            log.debug("Mapping field " + fieldName + " using document global language " + fieldLang);
            String mappedOutputField = getMappedField(fieldName, fieldLang);

            if (mappedOutputField != null) {
              log.debug("Mapping field {} to {}", doc.getFieldValue(docIdField), fieldLang);
              SolrInputField inField = doc.getField(fieldName);
              doc.setField(mappedOutputField, inField.getValue(), inField.getBoost());
              if (!mapKeepOrig) {
                log.debug("Removing old field {}", fieldName);
                doc.removeField(fieldName);
              }
            } else {
              throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Invalid output field mapping for "
                      + fieldName + " field and language: " + fieldLang);
            }
          }
        }
      }
    }

    // Set the languages field to an array of all detected languages
    if (langsField != null && langsField.length() != 0) {
      doc.setField(langsField, docLangs.toArray());
    }

    return doc;
  }



--
View this message in context: http://lucene.472066.n3.nabble.com/Language-detection-for-multivalued-field-tp4096996p4157573.html

Re: Language detection for multivalued field

Posted by vatuska <va...@yandex.ru>.
And if I use dynamic fields to split the multivalued field into separate
fields, can I use those dynamic fields in *updateRequestProcessorChain*? I've
tried this, but it seems dynamic fields aren't supported in
langid.map.individual.fl.



--
View this message in context: http://lucene.472066.n3.nabble.com/Language-detection-for-multivalued-field-tp4096996p4098570.html

Re: Language detection for multivalued field

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

First, the feature will only detect ONE language per field, even if it is a multi-valued field. In your case there is VERY little text for the detector, so do not expect great detection quality. But I believe the detector chose ES as language and mapped the whole field as tag_es. The reason you do not see tag_es in the first schema version is naturally because you have it defined as stored="false".

If you want individual detection of each value, please send the values in differently named fields, or file a JIRA feature request for individual language detection of the values in a multiValued field.
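
A minimal client-side sketch of the "differently named fields" idea, in plain Java with a Map standing in for the document. The class, the helper names and the tag_1, tag_2, ... naming scheme are all hypothetical, not Solr APIs. Each generated field would presumably have to be defined in schema.xml and listed explicitly in langid.fl and langid.map.individual.fl, since dynamic field patterns reportedly are not accepted there (see the follow-up question in this thread).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr: spreads the values of one
// multivalued field over individually named single-valued fields.
final class SplitTagsSketch {

    // Assigns each value its own field name: <baseName>_1, <baseName>_2, ...
    static Map<String, String> splitIntoFields(String baseName, List<String> values) {
        Map<String, String> fields = new LinkedHashMap<>();
        int i = 1;
        for (String value : values) {
            fields.put(baseName + "_" + i++, value);
        }
        return fields;
    }

    // Comma-separated field names, in the form the langid.* field lists use.
    static String fieldList(Map<String, String> fields) {
        return String.join(",", fields.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> fields =
                splitIntoFields("tag", Arrays.asList("espa\u00f1ol", "first", "My tag"));
        System.out.println(fields);
        System.out.println(fieldList(fields));
    }
}
```

Because the processor configuration is static, this only works with a fixed upper bound on the number of values per document.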

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22 Oct 2013, at 14:16, vatuska <va...@yandex.ru> wrote:

> *Can you elaborate on your comment "There isn't tag indexed". Are you saying
> that your multiValued "tag" field is not indexed at all, gone, missing? *
> There aren't any tag_... field despite of indexed=true stored=true for
> dynamicField 
> 
> I found the reason, but I don't understand why
> If I specify
> <str name="langid.whitelist">en,es</str> 
> 
> There aren't any tag_... field for document
> ...
> <field name="tag">español</field> 
> <field name="tag">first</field> 
> <field name="tag">My tag</field>
> ...
> 
> If there are these lines in schema.xml:
> <dynamicField name="*_undfnd" type="text_general" indexed="true" stored="true" multiValued="true"/>
> <dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
> <dynamicField name="*_es" type="text_es" indexed="true" stored="false" multiValued="true"/>
> 
> But if I specify
> <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>
> 
> There is a *tag_es* : español , first,  My tag
> in the stored document
> 
> Could you explain, please, how does it work? 
> 
> 
> 


Re: Language detection for multivalued field

Posted by vatuska <va...@yandex.ru>.
*Can you elaborate on your comment "There isn't tag indexed". Are you saying
that your multiValued "tag" field is not indexed at all, gone, missing? *
There aren't any tag_... fields, even though the dynamicFields are defined
with indexed="true" stored="true".

I found the cause, but I don't understand it.
If I specify
<str name="langid.whitelist">en,es</str>

There aren't any tag_... fields for the document
...
<field name="tag">español</field> 
<field name="tag">first</field> 
<field name="tag">My tag</field>
...

If there are these lines in schema.xml:
<dynamicField name="*_undfnd" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_es" type="text_es" indexed="true" stored="false" multiValued="true"/>

But if I specify
<dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>

There is *tag_es* : español, first, My tag
in the stored document.

Could you please explain how this works?



--
View this message in context: http://lucene.472066.n3.nabble.com/Language-detection-for-multivalued-field-tp4096996p4097013.html

Re: Language detection for multivalued field

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

The feature is designed to detect exactly one language per field.
In the case of a multiValued field, it concatenates all values before detection.

Can you elaborate on your comment "There isn't tag indexed". Are you saying that your multiValued "tag" field is not indexed at all, gone, missing?

If you have a requirement for detecting language per field-value and then map those into multiple language specific fields, please add a JIRA feature request which will then be considered for future inclusion.
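
To make the difference concrete, here is a toy sketch in plain Java that contrasts the two behaviours: detecting once on the concatenated values versus detecting each value separately. Nothing in it is a Solr API. The class, the method names and especially the "detector" (which answers "es" whenever the text contains an ñ, and "en" otherwise) are assumptions made purely for illustration. The whitelist and fallback arguments only loosely mimic langid.whitelist and langid.fallback.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model only: none of these names are Solr APIs.
final class LangIdSketch {

    // Hypothetical stand-in for a real detector: "es" if the text contains
    // the letter n-tilde (\u00f1), otherwise "en".
    static final Function<String, String> TOY_DETECTOR =
            text -> text.indexOf('\u00f1') >= 0 ? "es" : "en";

    // A detected language outside the whitelist is replaced by the fallback code.
    static String resolve(String detected, Set<String> whitelist, String fallback) {
        return whitelist.contains(detected) ? detected : fallback;
    }

    // One language per field: join all values, detect once, map every value together.
    static Map<String, List<String>> mapWholeField(String field, List<String> values,
                                                   Set<String> whitelist, String fallback) {
        String lang = resolve(TOY_DETECTOR.apply(String.join(" ", values)), whitelist, fallback);
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put(field + "_" + lang, new ArrayList<>(values));
        return out;
    }

    // One language per value: detect each value alone and group values by language.
    static Map<String, List<String>> mapPerValue(String field, List<String> values,
                                                 Set<String> whitelist, String fallback) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String value : values) {
            String lang = resolve(TOY_DETECTOR.apply(value), whitelist, fallback);
            out.computeIfAbsent(field + "_" + lang, k -> new ArrayList<>()).add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("espa\u00f1ol", "first", "My tag");
        Set<String> enEs = new HashSet<>(Arrays.asList("en", "es"));
        Set<String> enOnly = new HashSet<>(Arrays.asList("en"));
        System.out.println(mapWholeField("tag", tags, enEs, "undfnd"));
        System.out.println(mapPerValue("tag", tags, enOnly, "undfnd"));
    }
}
```

With this toy detector, the whole-field variant puts all three values into tag_es, while the per-value variant yields tag_undfnd for español and tag_en for the other two values, which is the split the original question was hoping for.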

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22 Oct 2013, at 12:59, vatuska <va...@yandex.ru> wrote:

> Is there any way to define language for multivalued field?
> Seems it doesn't work if there are several values with different languages
> in the documents.
> 
> *I have multivalued field in schema.xml*
> ...
> <field name="tag" type="text_general" indexed="true" stored="true"
> required="false" multiValued="true"/>
> ...
> <dynamicField name="*_undfnd" type="text_general" indexed="true"
> stored="true" multiValued="true"/>
> <dynamicField name="*_en" type="text_en_splitting" indexed="true"
> stored="true" multiValued="true"/>
> 
> *And I have configured UpdateRequestProcessorChain*
> 
> <updateRequestProcessorChain name="langid">
>       <processor
> class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
>         <str name="langid.fl">tag</str>
>         <str name="langid.langField">lang_global</str>
> 		 <str name="langid.langFields">langs</str>
> 		 <bool name="langid.map">true</bool>
> 		 <bool name="langid.map.individual">true</bool>
> 		 <bool name="langid.map.keepOrig">true</bool>
> 		 <str name="langid.fallback">undfnd</str>
> 		 <str name="langid.whitelist">en,en_GB,en_US</str>
> 		 <str
> name="langid.map.individual.fl">title,source,tag,creatorName,description</str>
> 		 <str name="langid.map.lcmap">en_GB:en en_US:en</str>
>       </processor>
>       <processor class="solr.LogUpdateProcessorFactory"/>
>       <processor class="solr.RunUpdateProcessorFactory"/>
>     </updateRequestProcessorChain>
> 
> *All works fine for document like:*
> ...
> <field name="tag">My test tag</field>
> ...
> 
> *And all works fine for document like*
> ...
> <field name="tag">test</field>
> <field name="tag">first</field>
> <field name="tag">My tag</field>
> ...
> 
> *But for* 
> ...
> <field name="tag">español</field>
> <field name="tag">first</field>
> <field name="tag">My tag</field>
> ...
> *There isn't tag indexed*
> *But I expect*
> tag_en : first, My tag
> tag_undfnd : español
> 
> Is there any way to fix this?
> 
> 
> 