Posted to solr-user@lucene.apache.org by Chantal Ackermann <ch...@btelligent.de> on 2009/11/03 14:33:57 UTC

DIH : RegexTransformer with groupNames requires all groups to be not empty?

Dear all,

my DIH config contains the following directive for the RegexTransformer:

<field column="person" groupNames="participant,role"
regex="([^\|]+)\|\d+,\d+,\d+,(.+)" />

(this is SOLR 1.4.0 RC downloaded yesterday from Grant's URL)

It expects input of the kind (version A):
Daniel Radcliffe|24897,1,1,Harry Potter

It should also work with (version B):
Daniel Radcliffe|24897,1,1,

In my index, however, I can only find documents that either contain
participant and role or neither. Of course, I didn't check all
documents. But for both fields, Luke shows the same number of documents:
Docs:  47015

(There are definitely datasets that contain participants without role.)
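A minimal sketch of what seems to happen with version B, using Python's re module as a stand-in for the Java regex engine (an assumption made for illustration only; this is not the RegexTransformer code, and the function name extract is hypothetical). Because the second group uses (.+), the pattern as a whole fails on version B, so there is no match from which even the first group could be read:

```python
import re

# The regex from the groupNames configuration above.
PATTERN = re.compile(r"([^\|]+)\|\d+,\d+,\d+,(.+)")

def extract(value):
    """Return (participant, role) if the pattern matches, else None.

    Hypothetical helper: it only mimics "take the groups of one match".
    """
    m = PATTERN.search(value)
    return m.groups() if m else None

version_a = extract("Daniel Radcliffe|24897,1,1,Harry Potter")
version_b = extract("Daniel Radcliffe|24897,1,1,")

print(version_a)  # ('Daniel Radcliffe', 'Harry Potter')
print(version_b)  # None: (.+) needs at least one character after the last comma
```

If the transformer only assigns group values after a successful match, this would explain why participant and role always appear together.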

I'll check the code and try a different configuration (using
sourceColName). But I thought I'd spread the news before the release is final.

Thanks,
Chantal



DIH : RegexTransformer with groupNames requires all groups to be not empty?

Posted by Chantal Ackermann <ch...@btelligent.de>.
Ok, I can confirm that the following configuration for RegexTransformer 
works as I would expect it:


<field column="participant" sourceColName="person" regex="([^\|]+)\|.*" />
<field column="role" sourceColName="person" 
regex="[^\|]+\|\d+,\d+,\d+,(.+)" />

A value is added to the multivalued fields participant and role only if
the corresponding regex matches.
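The effect of one regex per field can be sketched as follows, again with Python's re module standing in for the Java regex engine (an assumption; extract_fields is a hypothetical helper, not part of DIH). Each pattern is applied to the same source value independently, so a failing role regex no longer prevents the participant from being added:

```python
import re

# One pattern per target field, both reading the same source column.
PARTICIPANT = re.compile(r"([^\|]+)\|.*")
ROLE = re.compile(r"[^\|]+\|\d+,\d+,\d+,(.+)")

def extract_fields(person):
    """Apply each regex on its own and keep only the fields that matched."""
    doc = {}
    for field, pattern in (("participant", PARTICIPANT), ("role", ROLE)):
        m = pattern.search(person)
        if m:
            doc[field] = m.group(1)
    return doc

with_role = extract_fields("Daniel Radcliffe|24897,1,1,Harry Potter")
without_role = extract_fields("Daniel Radcliffe|24897,1,1,")

print(with_role)     # {'participant': 'Daniel Radcliffe', 'role': 'Harry Potter'}
print(without_role)  # {'participant': 'Daniel Radcliffe'}
```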



The following configuration does not add a matched value to any field
if one (or more) of the groups does not match. It adds values to the
fields in groupNames only if all groups match:

<!--field column="person" groupNames="participant,role" 
regex="([^\|]+)\|\d+,\d+,\d+,(.+)" /-->


Chantal



DIH : RegexTransformer with groupNames requires all groups to be not empty?

Posted by Chantal Ackermann <ch...@btelligent.de>.
follow-up:


regex="([^\|]+)\|\d+,\d+,\d+,(.+)"

is the version I chose after I had the following problems with
regex="([^\|]+)\|\d+,\d+,\d+,(.*)"
(changed * into + for the second group):

The role field contained empty values even though I added a
TrimFilterFactory with minimum length of 1. So, I changed the regular
expression to find only non-empty values. Well, it does now, but if it
cannot find a value for the second group, it doesn't even add the value
for the first group.
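The difference between the two variants can be sketched like this, with Python's re module standing in for the Java regex engine (an assumption for illustration; the variable names are hypothetical). With (.*) the pattern still matches a line without a role, but the second group is the empty string; with (.+) there is no match at all:

```python
import re

STAR = re.compile(r"([^\|]+)\|\d+,\d+,\d+,(.*)")  # second group may be empty
PLUS = re.compile(r"([^\|]+)\|\d+,\d+,\d+,(.+)")  # second group must be non-empty

line = "Daniel Radcliffe|24897,1,1,"

star_match = STAR.search(line)
plus_match = PLUS.search(line)

star_groups = star_match.groups() if star_match else None
plus_groups = plus_match.groups() if plus_match else None

print(star_groups)  # ('Daniel Radcliffe', ''): role is present but empty
print(plus_groups)  # None: nothing matched, so no participant either
```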

Any help on getting this solved is greatly appreciated.
It boils down to this question:

- How can I make the RegexTransformer add a value only if that value
is non-empty, without it adding values only when all of the groups
contain values?

Maybe the configuration with groupNames is meant to work like that. If
that is the case, it's probably worth adding this information to the
Wiki. I will change back to using the sourceColName attribute, as
https://issues.apache.org/jira/browse/SOLR-1498
should be fixed in this 1.4.0 RC version now.

Thanks!
Chantal
