Posted to solr-user@lucene.apache.org by Bradley Belyeu <br...@life.church> on 2017/12/08 16:56:37 UTC

Copy field and regex

I’m struggling a bit getting a copy field & regex tokenizer to work like I think it should…
I have an open source project I’m just starting out with here: https://github.com/youversion/solrcloud
I have a uniqueKey field USFM defined as:
<field name="usfm" type="string" indexed="true" required="true" stored="true" />
And a USFM will always be in the pattern of 3 characters followed by a period followed by one or more digits followed by another period and finally one or more digits.
Optionally, after the final digits there may be a hyphen and one or more further digits.
E.g.: JHN.3.16 or MAT.6.33-34

I’m wanting to do a result grouping by the first three characters, period, & digit(s). For example, docs with the unique keys JHN.3.16 & JHN.3.17 I would want grouped together.
So my thought was to define another field and then copy the USFM into it and use the regex tokenizer defined as so:

    <fieldType name="chapter" class="solr.TextField" positionIncrementGap="0">
        <analyzer>
            <tokenizer class="solr.PatternTokenizerFactory" pattern="^(\w+\.\d+)\.\d+-*\d*$" group="1" />
        </analyzer>
    </fieldType>
    <field name="chapter" type="chapter" indexed="true" required="true" stored="true" />
    <copyField source="usfm" dest="chapter" />

BUT, when I import my data the entire USFM is being stored inside the chapter field. And I get query results that look like:
      {
        "usfm":"MAT.10.1",
        "chapter":"MAT.10.1",
        "devo_keywords_en":"fear",
        "_version_":1586184983451533312},
      {
        "usfm":"MAT.10.10",
        "chapter":"MAT.10.10",
        "devo_keywords_en":"fear",
        "_version_":1586184983451533314},
      {
        "usfm":"MAT.10.11",
        "chapter":"MAT.10.11",
        "devo_keywords_en":"fear",
        "_version_":1586184983451533316},
      {
        "usfm":"MAT.10.12",
        "chapter":"MAT.10.12",
        "devo_keywords_en":"fear",
        "_version_":1586184983451533318}

It’s probably something simple I’ve missed, but I’ve been banging my head for long enough I thought I’d ask for help.
Thanks in advance!

Re: Copy field and regex

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/8/2017 9:56 AM, Bradley Belyeu wrote:
> I’m wanting to do a result grouping by the first three characters, period, & digit(s). For example, docs with the unique keys JHN.3.16 & JHN.3.17 I would want grouped together.
> So my thought was to define another field and then copy the USFM into it and use the regex tokenizer defined as so:
>
>     <fieldType name="chapter" class="solr.TextField" positionIncrementGap="0">
>         <analyzer>
>             <tokenizer class="solr.PatternTokenizerFactory" pattern="^(\w+\.\d+)\.\d+-*\d*$" group="1" />
>         </analyzer>
>     </fieldType>
>     <field name="chapter" type="chapter" indexed="true" required="true" stored="true" />
>     <copyField source="usfm" dest="chapter" />
>
> BUT, when I import my data the entire USFM is being stored inside the chapter field. And I get query results that look like:

Analysis only affects indexed terms.  The field content in query
results is *ALWAYS* the original input text -- analysis *CANNOT*
affect the fields returned for a document.  The copyField feature does
not copy the results of analysis; it always copies the original input.
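
As a rough illustration of that distinction, here is a toy Python model of
one field. It is only a sketch: Python's `re` stands in for the Java regex
engine Solr runs, and `simulate_field` is a made-up helper, not a Solr API.
It assumes the tokenizer emits capture group 1 for each match, as the
group="1" attribute in the question asks for.

```python
import re

# Same pattern as the tokenizer in the question. Python's `re` is used here
# only as a stand-in for the Java regex engine inside Solr (an assumption).
CHAPTER_PATTERN = re.compile(r"^(\w+\.\d+)\.\d+-*\d*$")


def simulate_field(value: str) -> dict:
    """Toy model of one field: the stored value vs. the analyzed terms."""
    match = CHAPTER_PATTERN.match(value)
    indexed_terms = [match.group(1)] if match else []
    # The stored value is the untouched input; only the terms are analyzed.
    return {"stored": value, "indexed_terms": indexed_terms}


field = simulate_field("MAT.10.1")
print(field["stored"])         # the input, as returned in query results
print(field["indexed_terms"])  # what analysis would put in the index
```

In this model the stored value stays "MAT.10.1" while the only term is
"MAT.10", which is why the query results above still show the full USFM.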

Since this is a "solr.TextField" type, you cannot define docValues on
it, which means that the Result Grouping feature in Solr will use the
indexed terms.  Note that if your index is distributed, you probably
won't be able to use the grouping feature -- that seems to require
docValues.  But if your index has a single shard, you should be OK.
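
A minimal sketch of what a grouping request on that field might look like,
built in Python. `group` and `group.field` are the standard Result Grouping
parameters; the host, the collection name "devotionals" and the helper
`grouping_query_url` are assumptions made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def grouping_query_url(base_url: str, collection: str, group_field: str,
                       query: str = "*:*") -> str:
    """Build a /select URL that asks Solr to group results by one field."""
    params = {
        "q": query,
        "group": "true",             # turn on Result Grouping
        "group.field": group_field,  # field to group on
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = grouping_query_url("http://localhost:8983/solr", "devotionals",
                         "chapter", query="devo_keywords_en:fear")
print(url)
print(parse_qs(urlsplit(url).query))
```

Sending this to a single core and then to a distributed setup would be one
way to check whether the docValues restriction applies on a given version.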

Thanks,
Shawn

Re: Copy field and regex

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/8/2017 1:03 PM, Erick Erickson wrote:
> Second, grouping works fine in distributed mode with a couple of
> restrictions, see the reference guide. Collapse/Expand (an alternative
> to standard grouping) requires that all the members of a group be on
> the same shard.

In 5.x, distributed grouping seems to require docValues, while in 4.x it 
didn't.  I filed an issue about it.  Yonik said that this is additional 
fallout from LUCENE-5666.  I haven't tried it on 6.x or 7.x.

https://issues.apache.org/jira/browse/SOLR-8088

If I send a request to one core, the grouping works, but if I make it 
distributed and the field doesn't have docValues, I get an exception. 
In order to accommodate the data-mining grouping queries I needed on my 
dev server (all production is still running 4.x versions), I used 
copyField to a string type with docValues and reindexed.

I am not running SolrCloud for these indexes.

Thanks,
Shawn

Re: Copy field and regex

Posted by Erick Erickson <er...@gmail.com>.

Grouping does _not_ require docValues, it's just that with
docValues=false, the uninverted structure is built on the heap at run
time. When docValues=true, the uninverted structure is written to disk
at index time and MMapped into the OS's memory space rather than the
Java heap.

Second, grouping works fine in distributed mode with a couple of
restrictions, see the reference guide. Collapse/Expand (an alternative
to standard grouping) requires that all the members of a group be on
the same shard.

Right, text fields aren't eligible for docValues, only "simple" types
(string included). If you want to use docValues, I'd recommend doing
the extraction on the client side. You can also put that in an update
component, but that's probably overkill.
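
A sketch of what client-side extraction could look like in Python, assuming
the USFM format described in the original question. The helper names
`chapter_of` and `with_chapter` are hypothetical, and the documents would
still need to be sent to Solr by whatever indexing client is in use. With
this approach "chapter" could be a plain string field with docValues, and no
copyField or regex tokenizer would be needed.

```python
import re

# Three word characters, a period, chapter digits, a period, verse digits,
# and an optional "-digits" verse range (the format given in the question).
USFM_PATTERN = re.compile(r"(?P<chapter>\w{3}\.\d+)\.\d+(?:-\d+)?")


def chapter_of(usfm: str) -> str:
    """Return the book-and-chapter prefix of a USFM key, e.g. JHN.3."""
    match = USFM_PATTERN.fullmatch(usfm)
    if match is None:
        raise ValueError(f"not a USFM reference: {usfm!r}")
    return match.group("chapter")


def with_chapter(doc: dict) -> dict:
    """Copy a document and add a 'chapter' value derived from 'usfm'."""
    return {**doc, "chapter": chapter_of(doc["usfm"])}


docs = [{"usfm": "JHN.3.16"}, {"usfm": "MAT.6.33-34"}]
prepared = [with_chapter(doc) for doc in docs]
print(prepared)
```

Because the value is computed before ingestion, the stored value and the
value used for grouping are the same string.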

Best,
Erick

Re: Copy field and regex

Posted by Bradley Belyeu <br...@life.church>.

Ah, thank you Erick & Shawn. That makes perfect sense. And yes when this goes to prod it will be distributed. Good point about docValues and needing a single shard, thanks!
I’m new to result grouping, so I’m still prototyping to confirm that it will work for what I need.

Re: Copy field and regex

Posted by Erick Erickson <er...@gmail.com>.

I think you're getting confused by seeing the _stored_ data rather
than the indexed data. When you return fields in documents, you get
the stored data, which is a verbatim copy of the input with no analysis
done at all. To see what's in the index (and thus what would be
grouped on) look at:

adminUI>>analysis>>(your field) and put some sample values in and see
what the regex tokenizer does. NOTE: uncheck the "verbose" box for
less clutter.
or
adminUI>>(select core)>>schema browser
or
the TermsComponent
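
For the last option, a small Python sketch that builds a terms request for
one field. It assumes a `/terms` request handler is available on the
collection, which depends on the Solr version and solrconfig.xml. The host,
the collection name "devotionals" and the helper `terms_url` are made up for
illustration; `terms`, `terms.fl` and `terms.limit` are TermsComponent
parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def terms_url(base_url: str, collection: str, field: str,
              limit: int = 20) -> str:
    """Build a URL that lists indexed terms for one field."""
    params = {
        "terms": "true",       # enable the TermsComponent
        "terms.fl": field,     # field whose indexed terms to list
        "terms.limit": limit,  # maximum number of terms to return
        "wt": "json",
    }
    # Assumes a /terms handler exists for this collection.
    return f"{base_url}/{collection}/terms?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = terms_url("http://localhost:8983/solr", "devotionals", "chapter")
print(url)
print(parse_qs(urlsplit(url).query))
```

If the tokenizer is doing what was intended, the terms listed for "chapter"
should be chapter prefixes rather than full USFM keys.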

If you require the stored value to be different, you have several choices:
1> change it on the client side before ingestion
2> use one of the field mutating update processor classes

Most often, people don't bother storing the copyField destination, since
the stored value is available in the original field. The destination is
just used for things like what you're interested in.

Best,
Erick
