Posted to users@solr.apache.org by Bill Tantzen <ta...@umn.edu.INVALID> on 2023/05/02 16:40:03 UTC

standard tokenizer seemingly splitting on dot

In my solr 9.2 schema, I am leveraging the dynamicField

<dynamicField name="*_txt" type="text_general" indexed="true"
stored="true"/>

which tokenizes with solr.StandardTokenizerFactory for index and query.

However, when I query with, for example,
<str name="q">metadata_txt:XYZ.tif</str>

I see many more hits than I expect.  When I add debug=true to the query, I
see:
<str name="rawquerystring">metadata_txt:XYZ.tif</str>
<str name="querystring">metadata_txt:XYZ.tif</str>
<str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>

But I expect that dots not followed by whitespace will be kept as part of
the token, that is, the parsed query should remain "metadata_txt:XYZ.tif"
but solr appears to be splitting into two tokens.

Can somebody point out what I am misunderstanding?
Thanks,
~~Bill

Re: standard tokenizer seemingly splitting on dot

Posted by Gus Heck <gu...@gmail.com>.
I concur that the docs clearly describe the behavior you expect:
Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters. Delimiter characters are discarded, with the
following exceptions:

   - Periods (dots) that are not followed by whitespace are kept as part of
     the token, including Internet domain names.

   - The "@" character is among the set of token-splitting punctuation, so
     email addresses are not preserved as single tokens.


However, when I use the analysis screen against the _default configset in a
local Solr (it happens to be running a 10-snapshot because of something I am
working on, but that shouldn't matter), I get the documented behavior:
ST:  XYZ.tif
SF:  XYZ.tif
LCF: xyz.tif
Can you check that your text_general type or your *_txt dynamic type hasn't
been re-defined?
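For comparison, the stock text_general type in the _default configset looks roughly like this (sketched from memory of a 9.x configset, so treat it as approximate; your managed-schema file is the source of truth):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If your copy has gained any WordDelimiter-style filter in either chain, that alone would account for extra splitting on the period.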


On Tue, May 2, 2023 at 3:17 PM Bill Tantzen <ta...@umn.edu.invalid>
wrote:

> Thanks Dave!
> Using a string field instead would work fine for my purposes I think...
> I'm just trying to understand why it doesn't work with a field of type
> text_general which uses the standard tokenizer in both the index and the
> query analyzer.  The docs state:
>
> This tokenizer splits the text field into tokens, treating whitespace and
> punctuation as delimiters.
> Delimiter characters are discarded, with the following exceptions:
> Periods (dots) that are not followed by whitespace are kept as part of the
> token, including Internet domain names.
>
> That's what is confusing me...  Meanwhile, I'm going to take your
> suggestion and convert the field to a string!
> ~~Bill
>
> On Tue, May 2, 2023 at 1:40 PM Dave <ha...@gmail.com> wrote:
>
> > You’re not doing anything wrong, a dot is not a character so it splits
> the
> > field in the index and the query. If you used a string instead it
> > theoretically would maintain the non characters but also lead to more
> > strict search constraints. If you tried this you need to re index a
> couple
> > documents to
> > Make sure you are getting what you want.
> >
> > -Dave
> >
> > > On May 2, 2023, at 2:22 PM, Bill Tantzen <ta...@umn.edu.invalid>
> > wrote:
> > >
> > > I'm using the solrconfig.xml from the distribution,
> > > ./server/solr/configsets/_default/conf/solrconfig.xml
> > >
> > > But this problem extends to the index as well; using the initial
> example,
> > > if I search for <str name="parsedquery">metadata_txt:ab00001</str>
> > (instead
> > > of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> > > ab00001.png, etc so the tokens in the index are split on dot as well,
> not
> > > just the query.
> > >
> > > I'm doing something wrong, or I'm misunderstanding something!!
> > > ~~Bill
> > >
> > >> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <mk...@apache.org>
> > wrote:
> > >>
> > >> Analyzer is configured in schema.xml. But literally, splitting on dot
> is
> > >> what I expect from StandardTokenizer.
> > >>
> > >> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <tantz001@umn.edu.invalid
> >
> > >> wrote:
> > >>
> > >>> Mikhail,
> > >>> Thanks for the quick reply.  Here is the parser info:
> > >>>
> > >>> <str name="QParser">LuceneQParser</str>
> > >>>
> > >>> ~~Bill
> > >>>
> > >>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <mk...@apache.org>
> > >> wrote:
> > >>>
> > >>>> Hello Bill,
> > >>>> Which analyzer is configured for metadata_txt?  Perhaps you need to
> > >> tune
> > >>> it
> > >>>> accordingly.
> > >>>>
> > >>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen
> <tantz001@umn.edu.invalid
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> In my solr 9.2 schema, I am leveraging the dynamicField
> > >>>>>
> > >>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
> > >>>>> stored="true"/>
> > >>>>>
> > >>>>> which tokenizes with solr.StandardTokenizerFactory for index and
> > >> query.
> > >>>>>
> > >>>>> However, when I query with, for example,
> > >>>>> <str name="q">metadata_txt:XYZ.tif</str>
> > >>>>>
> > >>>>> I see many more hits than I expect.  When I add debug=true to the
> > >>> query,
> > >>>> I
> > >>>>> see:
> > >>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
> > >>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
> > >>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
> > >>>>>
> > >>>>> But I expect that dots not followed by whitespace will be kept as
> > >> part
> > >>> of
> > >>>>> the token, that is, the parsed query should remain
> > >>> "metadata_txt:XYZ.tif"
> > >>>>> but solr appears to be splitting into two tokens.
> > >>>>>
> > >>>>> Can somebody point out what I am misunderstanding?
> > >>>>> Thanks,
> > >>>>> ~~Bill
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Sincerely yours
> > >>>> Mikhail Khludnev
> > >>>> https://t.me/MUST_SEARCH
> > >>>> A caveat: Cyrillic!
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Human wheels spin round and round
> > >>> While the clock keeps the pace... -- John Mellencamp
> > >>> ________________________________________________________________
> > >>> Bill Tantzen    University of Minnesota Libraries
> > >>> 612-626-9949 (U of M)    612-325-1777 (cell)
> > >>>
> > >>
> > >>
> > >> --
> > >> Sincerely yours
> > >> Mikhail Khludnev
> > >> https://t.me/MUST_SEARCH
> > >> A caveat: Cyrillic!
> > >>
> > >
> > >
> > > --
> > > Human wheels spin round and round
> > > While the clock keeps the pace... -- John Mellencamp
> > > ________________________________________________________________
> > > Bill Tantzen    University of Minnesota Libraries
> > > 612-626-9949 (U of M)    612-325-1777 (cell)
> >
>
>
> --
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen    University of Minnesota Libraries
> 612-626-9949 (U of M)    612-325-1777 (cell)
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: standard tokenizer seemingly splitting on dot

Posted by Bill Tantzen <ta...@umn.edu.INVALID>.
Rahul,
No, I do not, but note that this behavior has been observed by others and
reported as a possible issue.
Thank you!
~~Bill

On Thu, May 4, 2023 at 1:07 PM Rahul Goswami <ra...@gmail.com> wrote:

> Bill,
> Do you have a WordDelimiterFilterFactory in the analysis chain (with the
> "preserveOriginal" attribute likely set to 0)?
> That would split the token on the period downstream in the analysis chain
> even if StandardTokenizer doesn't.
>
> -Rahul
>
> On Thu, May 4, 2023 at 6:22 AM Mikhail Khludnev <mk...@apache.org> wrote:
>
> > Raised https://github.com/apache/lucene/issues/12264.
> > Let's look at what devs say.
> >
> > On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <ta...@umn.edu.invalid>
> > wrote:
> >
> > > Shawn,
> > > No, email addresses are not preserved -- from the docs:
> > >
> > >
> > >    - The "@" character is among the set of token-splitting punctuation,
> > >      so email addresses are not preserved as single tokens.
> > >
> > >
> > > but the non-split on "test.com" vs the split on "test7.com" is
> > unexpected!
> > > ~~Bill
> > >
> > >
> > > On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <ap...@elyograg.org>
> > wrote:
> > >
> > > > On 5/2/23 15:30, Bill Tantzen wrote:
> > > > > This works as I expected:
> > > > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > > > >
> > > > > This doesn't work as I expected
> > > > > ab003.tif -- tokenizes with a result of ab003 and tif
> > > >
> > > > I got the same behavior with ICUTokenizer, which uses ICU4J for
> Unicode
> > > > handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.
> > I
> > > > think StandardTokenizer is using a different implementation.
> > > >
> > > > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > > > reference icu4j version 70.1, which is dated Oct 28, 2021 on maven
> > > central.
> > > >
> > > > Two different Unicode implementations are doing exactly the same
> thing.
> > > > Is it a bug, or expected behavior?  It does mean filenames are
> > sometimes
> > > > not being handled in the way you expect.
> > > >
> > > > I ran another check ... I had thought that StandardTokenizer
> preserved
> > > > email addresses as a single token ... but I am seeing that
> > test@test.com
> > > > is split into two terms.  It splits test@test7.com into three terms.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > >
> > >
> > > --
> > > Human wheels spin round and round
> > > While the clock keeps the pace... -- John Mellencamp
> > > ________________________________________________________________
> > > Bill Tantzen    University of Minnesota Libraries
> > > 612-626-9949 (U of M)    612-325-1777 (cell)
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Rahul Goswami <ra...@gmail.com>.
Bill,
Do you have a WordDelimiterFilterFactory in the analysis chain (with the
"preserveOriginal" attribute likely set to 0)?
That would split the token on the period downstream in the analysis chain
even if StandardTokenizer doesn't.

-Rahul

On Thu, May 4, 2023 at 6:22 AM Mikhail Khludnev <mk...@apache.org> wrote:

> Raised https://github.com/apache/lucene/issues/12264.
> Let's look at what devs say.
>
> On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <ta...@umn.edu.invalid>
> wrote:
>
> > Shawn,
> > No, email addresses are not preserved -- from the docs:
> >
> >
> >    - The "@" character is among the set of token-splitting punctuation, so
> >      email addresses are not preserved as single tokens.
> >
> >
> > but the non-split on "test.com" vs the split on "test7.com" is
> unexpected!
> > ~~Bill
> >
> >
> > On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <ap...@elyograg.org>
> wrote:
> >
> > > On 5/2/23 15:30, Bill Tantzen wrote:
> > > > This works as I expected:
> > > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > > >
> > > > This doesn't work as I expected
> > > > ab003.tif -- tokenizes with a result of ab003 and tif
> > >
> > > I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> > > handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.
> I
> > > think StandardTokenizer is using a different implementation.
> > >
> > > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > > reference icu4j version 70.1, which is dated Oct 28, 2021 on maven
> > central.
> > >
> > > Two different Unicode implementations are doing exactly the same thing.
> > > Is it a bug, or expected behavior?  It does mean filenames are
> sometimes
> > > not being handled in the way you expect.
> > >
> > > I ran another check ... I had thought that StandardTokenizer preserved
> > > email addresses as a single token ... but I am seeing that
> test@test.com
> > > is split into two terms.  It splits test@test7.com into three terms.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
> >
> > --
> > Human wheels spin round and round
> > While the clock keeps the pace... -- John Mellencamp
> > ________________________________________________________________
> > Bill Tantzen    University of Minnesota Libraries
> > 612-626-9949 (U of M)    612-325-1777 (cell)
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>
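To illustrate Rahul's point with a hypothetical chain (not taken from Bill's actual schema): a WordDelimiter-style filter placed after StandardTokenizer would re-split "ab00c.tif" on the period even though the tokenizer kept it whole:

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- With preserveOriginal="0" the unsplit token is discarded, so only
       "ab00c" and "tif" reach the index -->
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          preserveOriginal="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With preserveOriginal="1" the filter would emit "ab00c.tif" alongside the split parts, so exact queries would still match.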

Re: standard tokenizer seemingly splitting on dot

Posted by Mikhail Khludnev <mk...@apache.org>.
Raised https://github.com/apache/lucene/issues/12264.
Let's look at what devs say.

On Wed, May 3, 2023 at 6:13 PM Bill Tantzen <ta...@umn.edu.invalid>
wrote:

> Shawn,
> No, email addresses are not preserved -- from the docs:
>
>
>    - The "@" character is among the set of token-splitting punctuation, so
>      email addresses are not preserved as single tokens.
>
>
> but the non-split on "test.com" vs the split on "test7.com" is unexpected!
> ~~Bill
>
>
> On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 5/2/23 15:30, Bill Tantzen wrote:
> > > This works as I expected:
> > > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> > >
> > > This doesn't work as I expected
> > > ab003.tif -- tokenizes with a result of ab003 and tif
> >
> > I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> > handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I
> > think StandardTokenizer is using a different implementation.
> >
> > I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> > reference icu4j version 70.1, which is dated Oct 28, 2021 on maven
> central.
> >
> > Two different Unicode implementations are doing exactly the same thing.
> > Is it a bug, or expected behavior?  It does mean filenames are sometimes
> > not being handled in the way you expect.
> >
> > I ran another check ... I had thought that StandardTokenizer preserved
> > email addresses as a single token ... but I am seeing that test@test.com
> > is split into two terms.  It splits test@test7.com into three terms.
> >
> > Thanks,
> > Shawn
> >
>
>
> --
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen    University of Minnesota Libraries
> 612-626-9949 (U of M)    612-325-1777 (cell)
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: standard tokenizer seemingly splitting on dot

Posted by Bill Tantzen <ta...@umn.edu.INVALID>.
Shawn,
No, email addresses are not preserved -- from the docs:


   - The "@" character is among the set of token-splitting punctuation, so
     email addresses are not preserved as single tokens.


but the non-split on "test.com" vs the split on "test7.com" is unexpected!
~~Bill


On Wed, May 3, 2023 at 10:04 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/2/23 15:30, Bill Tantzen wrote:
> > This works as I expected:
> > ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> >
> > This doesn't work as I expected
> > ab003.tif -- tokenizes with a result of ab003 and tif
>
> I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
> handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I
> think StandardTokenizer is using a different implementation.
>
> I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses
> reference icu4j version 70.1, which is dated Oct 28, 2021 on maven central.
>
> Two different Unicode implementations are doing exactly the same thing.
> Is it a bug, or expected behavior?  It does mean filenames are sometimes
> not being handled in the way you expect.
>
> I ran another check ... I had thought that StandardTokenizer preserved
> email addresses as a single token ... but I am seeing that test@test.com
> is split into two terms.  It splits test@test7.com into three terms.
>
> Thanks,
> Shawn
>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/2/23 15:30, Bill Tantzen wrote:
> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
> 
> This doesn't work as I expected
> ab003.tif -- tokenizes with a result of ab003 and tif

I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode 
handling.  I am pretty sure ICU4J is IBM's implementation of Unicode.  I 
think StandardTokenizer is using a different implementation.

I'm on Solr 9.3.0-SNAPSHOT ... the ICU analysis components it uses 
reference icu4j version 70.1, which is dated Oct 28, 2021 on maven central.

Two different Unicode implementations are doing exactly the same thing. 
Is it a bug, or expected behavior?  It does mean filenames are sometimes 
not being handled in the way you expect.

I ran another check ... I had thought that StandardTokenizer preserved 
email addresses as a single token ... but I am seeing that test@test.com 
is split into two terms.  It splits test@test7.com into three terms.

Thanks,
Shawn
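The behavior Shawn describes is consistent with both tokenizers implementing the Unicode UAX#29 word-break rules: a period has the MidNumLet property and is kept only between two letters (rules WB6/WB7) or two digits (WB11/WB12), while "@" never joins tokens. A toy Python sketch of just those rules (not a full UAX#29 implementation, which covers many more properties and scripts) reproduces every example in this thread:

```python
# Toy sketch of the UAX#29 word-break rules relevant to this thread
# (WB5-WB12 only, ASCII only); NOT a full implementation.

def wb_class(ch: str) -> str:
    """Very rough word-break property lookup."""
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    if ch == ".":
        return "MidNumLet"
    return "Other"  # '@', whitespace, etc.: always a break

def tokenize(text: str) -> list[str]:
    tokens, cur = [], ""
    for i, ch in enumerate(text):
        k = wb_class(ch)
        if k in ("ALetter", "Numeric"):
            cur += ch
            continue
        prev = wb_class(text[i - 1]) if i > 0 else "Other"
        nxt = wb_class(text[i + 1]) if i + 1 < len(text) else "Other"
        # keep the dot only between two letters or two digits
        if k == "MidNumLet" and prev == nxt and prev in ("ALetter", "Numeric"):
            cur += ch
        else:
            if cur:
                tokens.append(cur)
            cur = ""
    if cur:
        tokens.append(cur)
    return tokens

for s in ["ab00c.tif", "ab003.tif", "XYZ123.123tif",
          "test@test.com", "test@test7.com"]:
    print(s, "->", tokenize(s))
```

On that reading, the split in "ab003.tif" (digit before the dot, letter after) is spec-compliant rather than a bug, and test@test7.com splitting into three terms follows for the same reason.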

Re: standard tokenizer seemingly splitting on dot

Posted by Gus Heck <gu...@gmail.com>.
That looks like a bug. It seems to split when the character classes before
and after the dot differ, but not when they are the same.

ST:  XYZ123 | tif
SF:  XYZ123 | tif
LCF: xyz123 | tif

and

ST:  XYZ | 123tif
SF:  XYZ | 123tif
LCF: xyz | 123tif


But...

ST:  XYZ123.123tif
SF:  XYZ123.123tif
LCF: xyz123.123tif

On Tue, May 2, 2023 at 5:30 PM Bill Tantzen <ta...@umn.edu.invalid>
wrote:

> OK, I see what's going on.  I should not have used a generic example like
> XYZ.
>
> In my specific case, as you can see, I'm working with filenames.
>
> This works as I expected:
> ab00c.tif -- tokenizes as it should with a value of ab00c.tif
>
> This doesn't work as I expected
> ab003.tif -- tokenizes with a result of ab003 and tif
>
> That is, the standard tokenizer treats dot as described in the docs when it
> is preceded by an alpha character.
> It treats dot as any other delimiter when it is preceded by a numeric
> character, that is, it creates two tokens.
>
> (This is maybe documented in the linked unicode.org page in that section
> of
> the docs, but honestly that page went way over my head...)
>
> So at least it works as advertised except in the edge case where the dot is
> preceded by a numeric.  I don't know why that is the case, but I can work
> with that!
>
> Thanks to everybody who weighed in on this!
> ~~Bill
>
>
>
>
>
> On Tue, May 2, 2023 at 3:56 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 5/2/23 13:16, Bill Tantzen wrote:
> > > This tokenizer splits the text field into tokens, treating whitespace
> and
> > > punctuation as delimiters.
> > > Delimiter characters are discarded, with the following exceptions:
> > > Periods (dots) that are not followed by whitespace are kept as part of
> > the
> > > token, including Internet domain names.
> >
> > I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does
> > indeed do exactly what the docs say.
> >
> > The analysis definition in the fieldType probably has things beyond the
> > StandardTokenizer, one or more filters that DO break up terms on a
> period.
> >
> > Thanks,
> > Shawn
> >
>
>
> --
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen    University of Minnesota Libraries
> 612-626-9949 (U of M)    612-325-1777 (cell)
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: standard tokenizer seemingly splitting on dot

Posted by Bill Tantzen <ta...@umn.edu.INVALID>.
OK, I see what's going on.  I should not have used a generic example like
XYZ.

In my specific case, as you can see, I'm working with filenames.

This works as I expected:
ab00c.tif -- tokenizes as it should with a value of ab00c.tif

This doesn't work as I expected
ab003.tif -- tokenizes with a result of ab003 and tif

That is, the standard tokenizer treats dot as described in the docs when it
is preceded by an alpha character.
It treats dot as any other delimiter when it is preceded by a numeric
character, that is, it creates two tokens.

(This is maybe documented in the linked unicode.org page in that section of
the docs, but honestly that page went way over my head...)

So at least it works as advertised except in the edge case where the dot is
preceded by a numeric.  I don't know why that is the case, but I can work
with that!

Thanks to everybody who weighed in on this!
~~Bill
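For filename fields like these, one workaround in the spirit of Dave's earlier suggestion is to keep the tokenized field for free-text search and copy the raw value into an untokenized field for exact matches. A hypothetical sketch (field names invented here, not from Bill's schema):

```xml
<!-- keep the tokenized field for free-text search -->
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<!-- exact, untokenized copy for whole-filename matches -->
<dynamicField name="*_txt_s" type="string" indexed="true" stored="false"/>
<copyField source="*_txt" dest="*_txt_s"/>
```

A query like metadata_txt_s:ab003.tif would then match only the exact value; note that string fields are case-sensitive and that existing documents must be re-indexed for the copy to exist.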





On Tue, May 2, 2023 at 3:56 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/2/23 13:16, Bill Tantzen wrote:
> > This tokenizer splits the text field into tokens, treating whitespace and
> > punctuation as delimiters.
> > Delimiter characters are discarded, with the following exceptions:
> > Periods (dots) that are not followed by whitespace are kept as part of
> the
> > token, including Internet domain names.
>
> I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does
> indeed do exactly what the docs say.
>
> The analysis definition in the fieldType probably has things beyond the
> StandardTokenizer, one or more filters that DO break up terms on a period.
>
> Thanks,
> Shawn
>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/2/23 13:16, Bill Tantzen wrote:
> This tokenizer splits the text field into tokens, treating whitespace and
> punctuation as delimiters.
> Delimiter characters are discarded, with the following exceptions:
> Periods (dots) that are not followed by whitespace are kept as part of the
> token, including Internet domain names.

I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does 
indeed do exactly what the docs say.

The analysis definition in the fieldType probably has things beyond the 
StandardTokenizer, one or more filters that DO break up terms on a period.

Thanks,
Shawn

Re: standard tokenizer seemingly splitting on dot

Posted by Bill Tantzen <ta...@umn.edu.INVALID>.
Thanks Dave!
Using a string field instead would work fine for my purposes I think...
I'm just trying to understand why it doesn't work with a field of type
text_general which uses the standard tokenizer in both the index and the
query analyzer.  The docs state:

This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters.
Delimiter characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names.

That's what is confusing me...  Meanwhile, I'm going to take your
suggestion and convert the field to a string!
~~Bill

On Tue, May 2, 2023 at 1:40 PM Dave <ha...@gmail.com> wrote:

> You’re not doing anything wrong, a dot is not a character so it splits the
> field in the index and the query. If you used a string instead it
> theoretically would maintain the non characters but also lead to more
> strict search constraints. If you tried this you need to re index a couple
> documents to
> Make sure you are getting what you want.
>
> -Dave
>
> > On May 2, 2023, at 2:22 PM, Bill Tantzen <ta...@umn.edu.invalid>
> wrote:
> >
> > I'm using the solrconfig.xml from the distribution,
> > ./server/solr/configsets/_default/conf/solrconfig.xml
> >
> > But this problem extends to the index as well; using the initial example,
> > if I search for <str name="parsedquery">metadata_txt:ab00001</str>
> (instead
> > of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> > ab00001.png, etc so the tokens in the index are split on dot as well, not
> > just the query.
> >
> > I'm doing something wrong, or I'm misunderstanding something!!
> > ~~Bill
> >
> >> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <mk...@apache.org>
> wrote:
> >>
> >> Analyzer is configured in schema.xml. But literally, splitting on dot is
> >> what I expect from StandardTokenizer.
> >>
> >> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <ta...@umn.edu.invalid>
> >> wrote:
> >>
> >>> Mikhail,
> >>> Thanks for the quick reply.  Here is the parser info:
> >>>
> >>> <str name="QParser">LuceneQParser</str>
> >>>
> >>> ~~Bill
> >>>
> >>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <mk...@apache.org>
> >> wrote:
> >>>
> >>>> Hello Bill,
> >>>> Which analyzer is configured for metadata_txt?  Perhaps you need to
> >> tune
> >>> it
> >>>> accordingly.
> >>>>
> >>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <tantz001@umn.edu.invalid
> >
> >>>> wrote:
> >>>>
> >>>>> In my solr 9.2 schema, I am leveraging the dynamicField
> >>>>>
> >>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
> >>>>> stored="true"/>
> >>>>>
> >>>>> which tokenizes with solr.StandardTokenizerFactory for index and
> >> query.
> >>>>>
> >>>>> However, when I query with, for example,
> >>>>> <str name="q">metadata_txt:XYZ.tif</str>
> >>>>>
> >>>>> I see many more hits than I expect.  When I add debug=true to the
> >>> query,
> >>>> I
> >>>>> see:
> >>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
> >>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
> >>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
> >>>>>
> >>>>> But I expect that dots not followed by whitespace will be kept as
> >> part
> >>> of
> >>>>> the token, that is, the parsed query should remain
> >>> "metadata_txt:XYZ.tif"
> >>>>> but solr appears to be splitting into two tokens.
> >>>>>
> >>>>> Can somebody point out what I am misunderstanding?
> >>>>> Thanks,
> >>>>> ~~Bill
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Sincerely yours
> >>>> Mikhail Khludnev
> >>>> https://t.me/MUST_SEARCH
> >>>> A caveat: Cyrillic!
> >>>>
> >>>
> >>>
> >>> --
> >>> Human wheels spin round and round
> >>> While the clock keeps the pace... -- John Mellencamp
> >>> ________________________________________________________________
> >>> Bill Tantzen    University of Minnesota Libraries
> >>> 612-626-9949 (U of M)    612-325-1777 (cell)
> >>>
> >>
> >>
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> https://t.me/MUST_SEARCH
> >> A caveat: Cyrillic!
> >>
> >
> >
> > --
> > Human wheels spin round and round
> > While the clock keeps the pace... -- John Mellencamp
> > ________________________________________________________________
> > Bill Tantzen    University of Minnesota Libraries
> > 612-626-9949 (U of M)    612-325-1777 (cell)
>


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Dave <ha...@gmail.com>.
You’re not doing anything wrong: a dot is not a word character, so it splits
the field in both the index and the query. If you used a string field instead,
it would in theory preserve the non-word characters, but it would also impose
stricter search constraints. If you try this, re-index a couple of documents
first to make sure you are getting what you want.

-Dave

> On May 2, 2023, at 2:22 PM, Bill Tantzen <ta...@umn.edu.invalid> wrote:
> 
> I'm using the solrconfig.xml from the distribution,
> ./server/solr/configsets/_default/conf/solrconfig.xml
> 
> But this problem extends to the index as well; using the initial example,
> if I search for <str name="parsedquery">metadata_txt:ab00001</str> (instead
> of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
> ab00001.png, etc so the tokens in the index are split on dot as well, not
> just the query.
> 
> I'm doing something wrong, or I'm misunderstanding something!!
> ~~Bill
> 
>> On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <mk...@apache.org> wrote:
>> 
>> Analyzer is configured in schema.xml. But literally, splitting on dot is
>> what I expect from StandardTokenizer.
>> 
>> On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <ta...@umn.edu.invalid>
>> wrote:
>> 
>>> Mikhail,
>>> Thanks for the quick reply.  Here is the parser info:
>>> 
>>> <str name="QParser">LuceneQParser</str>
>>> 
>>> ~~Bill
>>> 
>>> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <mk...@apache.org>
>> wrote:
>>> 
>>>> Hello Bill,
>>>> Which analyzer is configured for metadata_txt?  Perhaps you need to
>> tune
>>> it
>>>> accordingly.
>>>> 
>>>> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <ta...@umn.edu.invalid>
>>>> wrote:
>>>> 
>>>>> In my solr 9.2 schema, I am leveraging the dynamicField
>>>>> 
>>>>> <dynamicField name="*_txt" type="text_general" indexed="true"
>>>>> stored="true"/>
>>>>> 
>>>>> which tokenizes with solr.StandardTokenizerFactory for index and
>> query.
>>>>> 
>>>>> However, when I query with, for example,
>>>>> <str name="q">metadata_txt:XYZ.tif</str>
>>>>> 
>>>>> I see many more hits than I expect.  When I add debug=true to the
>>> query,
>>>> I
>>>>> see:
>>>>> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
>>>>> <str name="querystring">metadata_txt:XYZ.tif</str>
>>>>> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
>>>>> 
>>>>> But I expect that dots not followed by whitespace will be kept as
>> part
>>> of
>>>>> the token, that is, the parsed query should remain
>>> "metadata_txt:XYZ.tif"
>>>>> but solr appears to be splitting into two tokens.
>>>>> 
>>>>> Can somebody point out what I am misunderstanding?
>>>>> Thanks,
>>>>> ~~Bill
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> https://t.me/MUST_SEARCH
>>>> A caveat: Cyrillic!
>>>> 
>>> 
>>> 
>>> --
>>> Human wheels spin round and round
>>> While the clock keeps the pace... -- John Mellencamp
>>> ________________________________________________________________
>>> Bill Tantzen    University of Minnesota Libraries
>>> 612-626-9949 (U of M)    612-325-1777 (cell)
>>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> https://t.me/MUST_SEARCH
>> A caveat: Cyrillic!
>> 
> 
> 
> -- 
> Human wheels spin round and round
> While the clock keeps the pace... -- John Mellencamp
> ________________________________________________________________
> Bill Tantzen    University of Minnesota Libraries
> 612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Bill Tantzen <ta...@umn.edu.INVALID>.
I'm using the solrconfig.xml from the distribution,
./server/solr/configsets/_default/conf/solrconfig.xml

But this problem extends to the index as well; using the initial example,
if I search for <str name="parsedquery">metadata_txt:ab00001</str> (instead
of ab00001.tif), my result set includes ab00001.tif, ab00001.jpg,
ab00001.png, etc so the tokens in the index are split on dot as well, not
just the query.

I'm doing something wrong, or I'm misunderstanding something!!
~~Bill

On Tue, May 2, 2023 at 1:02 PM Mikhail Khludnev <mk...@apache.org> wrote:

> Analyzer is configured in schema.xml. But literally, splitting on dot is
> what I expect from StandardTokenizer.


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Mikhail Khludnev <mk...@apache.org>.
The analyzer is configured in schema.xml. But splitting on the dot is
literally what I would expect from StandardTokenizer.
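
The two behaviors under discussion can be sketched with a toy model: plain
Python regexes, only an approximation and not Solr's actual UAX#29-based
implementation. One tokenizer treats every dot as a delimiter; the other
keeps dots that are not followed by whitespace, which is what the Solr
Reference Guide documents for the Standard Tokenizer.

```python
import re

TEXT = "see XYZ.tif and ab00001.jpg. done"

def split_on_dot(text):
    """Toy model: treat '.' as a delimiter like any other punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def keep_interior_dots(text):
    """Rough toy model of the documented rule: a dot followed by another
    alphanumeric run stays inside the token; a trailing dot is dropped."""
    return re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)

split_on_dot(TEXT)        # ['see', 'XYZ', 'tif', 'and', 'ab00001', 'jpg', 'done']
keep_interior_dots(TEXT)  # ['see', 'XYZ.tif', 'and', 'ab00001.jpg', 'done']
```

Bill's debug output matches the first behavior; the documentation quoted
earlier in the thread describes the second.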

On Tue, May 2, 2023 at 8:48 PM Bill Tantzen <ta...@umn.edu.invalid>
wrote:

> Mikhail,
> Thanks for the quick reply.  Here is the parser info:
>
> <str name="QParser">LuceneQParser</str>
>
> ~~Bill


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: standard tokenizer seemingly splitting on dot

Posted by Bill Tantzen <ta...@umn.edu.INVALID>.
Mikhail,
Thanks for the quick reply.  Here is the parser info:

<str name="QParser">LuceneQParser</str>

~~Bill

On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev <mk...@apache.org> wrote:

> Hello Bill,
> Which analyzer is configured for metadata_txt?  Perhaps you need to tune it
> accordingly.


-- 
Human wheels spin round and round
While the clock keeps the pace... -- John Mellencamp
________________________________________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)    612-325-1777 (cell)

Re: standard tokenizer seemingly splitting on dot

Posted by Mikhail Khludnev <mk...@apache.org>.
Hello Bill,
Which analyzer is configured for metadata_txt?  Perhaps you need to tune it
accordingly.
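
One way to answer that question is the Schema API, which returns a field
type's definition, including its index- and query-time analyzer chains.
This is a sketch assuming Solr at localhost:8983 and a hypothetical
collection named "mycoll".

```python
# Sketch: fetch the definition of one field type via the Schema API.
# Assumptions (not from the thread): Solr at localhost:8983, collection "mycoll".

def fieldtype_url(base, collection, type_name):
    """Build a Schema API URL that returns a single field type's
    definition, including its tokenizer and filters."""
    return f"{base}/solr/{collection}/schema/fieldtypes/{type_name}"

url = fieldtype_url("http://localhost:8983", "mycoll", "text_general")
# urllib.request.urlopen(url) would return the JSON definition
# (requires a running Solr instance, so it is not executed here).
```

Comparing the result with the stock _default configset definition shows
whether text_general was redefined, which is what Gus suggests checking
elsewhere in this thread.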

On Tue, May 2, 2023 at 7:40 PM Bill Tantzen <ta...@umn.edu.invalid>
wrote:

> In my solr 9.2 schema, I am leveraging the dynamicField
>
> <dynamicField name="*_txt" type="text_general" indexed="true"
> stored="true"/>
>
> which tokenizes with solr.StandardTokenizerFactory for index and query.
>
> However, when I query with, for example,
> <str name="q">metadata_txt:XYZ.tif</str>
>
> I see many more hits than I expect.  When I add debug=true to the query, I
> see:
> <str name="rawquerystring">metadata_txt:XYZ.tif</str>
> <str name="querystring">metadata_txt:XYZ.tif</str>
> <str name="parsedquery">metadata_txt:XYZ metadata_txt:tif</str>
>
> But I expect that dots not followed by whitespace will be kept as part of
> the token, that is, the parsed query should remain "metadata_txt:XYZ.tif"
> but solr appears to be splitting into two tokens.
>
> Can somebody point out what I am misunderstanding?
> Thanks,
> ~~Bill
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!