Posted to solr-user@lucene.apache.org by Chantal Ackermann <ch...@btelligent.de> on 2009/08/03 14:12:56 UTC

Copy Field Question

Dear all,

Before I search through the source code, maybe one of you can answer 
this easily:

When, and based on what, are the tokenizer and filters applied when 
copying fields? Can fields be analyzed twice (once when the first field 
is created, and a second time when they are copied to another field)?


Here is an example from my current setup.
I have the following type defined in schema.xml:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.StandardTokenizerFactory" />
		<filter class="solr.LengthFilterFactory" min="2" max="5000" />
		<filter class="solr.StopFilterFactory" ignoreCase="true"
			words="stopwords_de.txt" />
		<filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1"
			catenateWords="1" catenateNumbers="1"
			catenateAll="0" splitOnCaseChange="1" />
		<filter class="solr.LowerCaseFilterFactory" />
		<filter class="solr.SnowballPorterFilterFactory" language="German" />
		<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.StandardTokenizerFactory" />
		<filter class="solr.StopFilterFactory" ignoreCase="true"
			words="stopwords_de.txt" />
		<filter class="solr.WordDelimiterFilterFactory"
			generateWordParts="1" generateNumberParts="1"
			catenateWords="0" catenateNumbers="0"
			catenateAll="0" splitOnCaseChange="1" />
		<filter class="solr.LowerCaseFilterFactory" />
		<filter class="solr.SnowballPorterFilterFactory" language="German" />
		<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
	</analyzer>
</fieldType>

It is used for these fields (title has a different type, keyword):

<field name="title" type="keyword" indexed="true" stored="true"
	required="true" />
<field name="title_de" type="text_de" indexed="true" stored="false"
	required="false" />
<field name="subtitle_text_de" type="text_de" indexed="true" stored="true"
	required="false" />
<field name="dtext_de" type="text_de" indexed="true" stored="false"
	required="false" />

These fields are used to populate the following field via the copyField 
directive:

<field name="all_text_de" type="text_de" indexed="true" stored="false"
	multiValued="true" />

like this (at least, that is what I do at the moment):

<copyField source="title" dest="title_de" />
<copyField source="title" dest="all_text_de" />
<copyField source="subtitle_text_de" dest="all_text_de" />
<copyField source="dtext_de" dest="all_text_de" />


I am copying fields of different types to all_text_de; for example, the 
type of title differs from that of subtitle_text_de. Is the value copied 
to the destination field the raw (input) value or the already analyzed one?


Thanks!
Chantal


-- 
Chantal Ackermann

Re: Copy Field Question

Posted by Chantal Ackermann <ch...@btelligent.de>.

Thanks, Mark!


Mark Miller schrieb:
> It's the pre-analyzed form (the raw input value) that's copied. The field
> that it's copied to will determine the analyzer/filters for that field.
> If you want to check out the code doing it, it's
> in org.apache.solr.update.DocumentBuilder

-- 
Chantal Ackermann

Re: Copy Field Question

Posted by Mark Miller <ma...@gmail.com>.

It's the pre-analyzed form (the raw input value) that's copied. The field
that it's copied to will determine the analyzer/filters for that field.
If you want to check out the code doing it, it's
in org.apache.solr.update.DocumentBuilder
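
To make the point concrete, here is a toy Python model of that behaviour.
It is only a sketch and not Solr code: the two analyzer functions, the
FIELD_ANALYZERS and COPY_FIELDS tables, and build_document are hypothetical
names made up for illustration, and the "text" analyzer is a crude stand-in
for a type like text_de. It shows the raw value being copied first and each
field then being analyzed exactly once, by its own type.

```python
# Toy model of copyField semantics. This is NOT Solr code; every name
# below is a hypothetical stand-in chosen for illustration.

def keyword_analyzer(value):
    # A keyword-style type keeps the whole input as a single token.
    return [value]

def text_analyzer(value):
    # Crude stand-in for a text type such as text_de:
    # split on whitespace and lowercase each token.
    return [token.lower() for token in value.split()]

# Field name -> analyzer of that field's type (assumed schema).
FIELD_ANALYZERS = {
    "title": keyword_analyzer,
    "title_de": text_analyzer,
    "all_text_de": text_analyzer,
}

# (source, dest) pairs, mirroring <copyField source="..." dest="..." />.
COPY_FIELDS = [("title", "title_de"), ("title", "all_text_de")]

def build_document(raw_doc):
    """Copy raw values first, then analyze every field exactly once."""
    values = {name: list(vals) for name, vals in raw_doc.items()}
    for source, dest in COPY_FIELDS:
        # Read from raw_doc, so only the original input is ever copied.
        for raw in raw_doc.get(source, []):
            values.setdefault(dest, []).append(raw)
    # Each field's own analyzer runs on the raw values it received.
    return {
        name: [tok for raw in vals for tok in FIELD_ANALYZERS[name](raw)]
        for name, vals in values.items()
    }

indexed = build_document({"title": ["Der Zauberberg"]})
print(indexed)
```

In this model the title field keeps its single keyword token, while
title_de and all_text_de each receive the untouched input string and
tokenize it themselves, so nothing is analyzed twice.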

-- 
- Mark

http://www.lucidimagination.com
