Posted to users@solr.apache.org by Scott Derrick <sc...@tnstaafl.net> on 2021/08/26 23:25:23 UTC

limiting term facets

A terms facet (type: terms) has a mincount option, which limits the returned buckets to those with at least this count (default 1).

I need a maxcount, limiting the returned buckets to those with fewer than a specified count.

I'm interested in the rare words, not the common words.

thanks,

Scott

Re: trailing space added to fields

Posted by Scott Derrick <sc...@tnstaafl.net>.

Alexandre,

Perfect!!!

There is a built-in whitespace trim factory, TrimFieldUpdateProcessorFactory, that I added to the default chain, and now all is good!

Thanks again,

Scott
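
For anyone landing here later, a chain of the kind described above might look roughly like this in solrconfig.xml. This is a sketch rather than the exact configuration used in this thread: the chain name trim-fields is made up for illustration, and marking it default="true" assumes no other chain is already flagged as the default.

```xml
<!-- Hypothetical chain name; default="true" makes it apply to updates
     that do not name a chain explicitly. -->
<updateRequestProcessorChain name="trim-fields" default="true">
  <!-- Trims leading and trailing whitespace from string values.
       With no selector it applies to every field. -->
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory must come last: it performs the actual update. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Since the trim only touches the two ends of a value, whitespace inside a flattened title would remain. Existing documents also need to be reindexed before they pick up the change.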

On 9/7/21 7:32 AM, Alexandre Rafalovitch wrote:
> The general answer is to add an UpdateRequestProcessor pipeline. That gives
> you a lot of post-processing flexibility.
> 
> But you can also try having the xpath specify .../text(); maybe that will
> deal with the space specifically. I did not test it myself though, just a
> thought.
> 
> Regards,
>      Alex
> 

Re: trailing space added to fields

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

The general answer is to add an UpdateRequestProcessor pipeline. That gives
you a lot of post-processing flexibility.

But you can also try having the xpath specify .../text(); maybe that will
deal with the space specifically. I did not test it myself though, just a
thought.

Regards,
    Alex

On Mon., Sep. 6, 2021, 11:10 p.m. Scott Derrick, <sc...@tnstaafl.net> wrote:

> I'm indexing .xml documents and using the XPathEntityProcessor for data
> importing.
>
> [...]
>
> Any ideas how I can get rid of the trailing space?

trailing space added to fields

Posted by Scott Derrick <sc...@tnstaafl.net>.

I'm indexing .xml documents and using the XPathEntityProcessor for data importing.  Here is a snippet of my conf file

       <entity name="meta"
           dataSource="myfilereader"
           processor="XPathEntityProcessor"
           url="${jcurrent.fileAbsolutePath}"
           stream="false"
           forEach="/TEI/teiHeader/fileDesc"
           xsl="xslt/meta.xsl"
           >
           <field column="title" xpath="/TEI/teiHeader//title" flatten="true"/>
           <field column="author" xpath="/TEI/teiHeader//author" />
           <field column="publisher" xpath="/TEI/teiHeader//publisher" />
           <field column="accession" xpath="/TEI/teiHeader//idno" />
           <field column="date" xpath="/TEI/teiHeader//date" flatten="true" />
           <field column="origin" xpath="/TEI/teiHeader//origin" />
           <field column="origPlace" xpath="/TEI/teiHeader//origPlace" />
           <field column="origGeo" xpath="/TEI/teiHeader//origGeo" />
           <field column="settlement" xpath="/TEI/teiHeader//settlement" />
           <field column="region" xpath="/TEI/teiHeader//region" />
           <field column="country" xpath="/TEI/teiHeader//country" />
           <field column="when" xpath="/TEI/teiHeader//when" />
           <field column="when-custom" xpath="/TEI/teiHeader//when-custom" />
           <field column="notAfter" xpath="/TEI/teiHeader//notAfter" />
           <field column="notBefore" xpath="/TEI/teiHeader//notBefore" />
           <field column="note" xpath="/TEI/teiHeader//note" flatten="true" />
           <field column="annotator" xpath="/TEI/teiHeader//annotator" />
           <field column="scribe" xpath="/TEI/teiHeader//scribe" />
           <field column="recipient" xpath="/TEI/teiHeader//recipient" />
        </entity>

I noticed spaces at the ends of my elements when exporting a result into json or xml.

I thought it was my JavaScript fetch call that was appending the space, but looking at the query page on the Solr admin site I can clearly see a trailing space. It doesn't matter how the field is stored; string or text_general gives the same result.

Here is a snippet of the query response:

{
  "date":"1884-09-09 September 9, 1884 ",
  "note":"Handwritten by Mary on a postcard from Boston, Massachusetts. ",
  "country":"USA ",
  "origGeo":"42.3584308 -71.0597732 ",
  "author":"Mary ",
  "authorString":"Mary ",
  "origin":"1884-09-09 ",
  "originSort":"1884-09-09 ",
  "accession":"639P3.65.026 ",
  "accessionSort":"639P3.65.026 ",
  "title":"\n Mary to Mary Baker Eddy, \n September 9, 1884 \n \n ",
  "titleSort":"\n Mary to Mary Baker Eddy, \n September 9, 1884 \n \n ",
  "when":"1884-09-09 ",
  "settlement":"Boston ",
  "recipient":"Mary Baker Eddy",
  "recipientString":"Mary Baker Eddy",
  "publisher":"The Mary Baker Eddy Library ",
  "origPlace":"places.xml#boston_ma ",
  "region":"MA ",
  "type":"incoming_correspondence",
  "places":"Boston ",
  "placesString":"Boston ",
  "people":"Mary ",
  "peopleString":"Mary ",
  "body":"Paper rec received Thanks, Just looked it over, good . Have moved at last! Will find me at cor: Shawmut Ave. & Pleasant St. a few doors from 66 S. Ave, further downtown. Hope you will find time to come in. Not yet settled, but like much better. Hope you are prospering. Wanted to see you last Sabbath eve but too tired In love Mary – ",
  "closer":"Boston Sept 9. 1884 . ",
  "id":"3272bf21-e6c2-4053-85ef-db3ec5a7f0ae",
  "_version_":1710182653070671872
},



I'm guessing it's the XPathEntityProcessor that is doing it, but I'm certainly open to pilot error!

Any ideas how I can get rid of the trailing space?

thanks,

Scott

Re: Does a type of facet.contains=. exist?

Posted by Niko Himanen <nh...@alpha-sense.com>.

Hey,

Sounds like you need to modify your indexing pipeline so that the faceted
fields contain the person information as you want it to be counted (full
names or partial). If, for example, you use a string-type field, names are
not split on spaces as they would be with a field type that uses tokenizers.
In that case you would get counts for "Micky Mouse" when the field is
faceted on.

Sure, you can also have multiple fields with different tokenization
strategies, use copyField to map the original value to those fields, and
combine the counts later in your search pipeline for more versatility.
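
A minimal schema sketch of that copyField approach, assuming the person names are indexed into a field called person. The field names person and personString are hypothetical; text_general and string are the stock field types mentioned elsewhere in this thread.

```xml
<!-- Tokenized field for searching: "Micky Mouse" is split into two terms. -->
<field name="person" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- Untokenized copy for faceting: the whole value is kept as a single term. -->
<field name="personString" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>

<!-- Copies the raw input value, before any analysis, into the string field. -->
<copyField source="person" dest="personString"/>
```

Faceting on personString (for example facet.field=personString) should then count "Micky Mouse" as one bucket. Any trailing whitespace in the source value is copied too, so "Micky Mouse " would show up as a separate bucket.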

Hope I answered your question.

Br,

Niko

On Fri, Aug 27, 2021 at 6:27 AM Scott Derrick <sc...@tnstaafl.net> wrote:

> I have <person> elements scattered through our document set. Each document
> may have many <person> elements of the same person.
>
> Like <person>Micky Mouse</person>
>
> I want the facet count of the entire name, "Micky Mouse", not "Micky" and
> "Mouse".
>
> How can I tell Solr to facet on the entire element?
>
> I tried facet.contains=. and did not get the desired results.
>
> thanks,
>
> Scott
>
>
>


Does a type of facet.contains=. exist?

Posted by Scott Derrick <sc...@tnstaafl.net>.

I have <person> elements scattered through our document set. Each document may have many <person> elements of the same person.

Like <person>Micky Mouse</person>

I want the facet count of the entire name, "Micky Mouse", not "Micky" and "Mouse".

How can I tell Solr to facet on the entire element?

I tried facet.contains=. and did not get the desired results.

thanks,

Scott

Re: limiting term facets

Posted by Eric Pugh <ep...@opensourceconnections.com>.

This could also be an opportunity to use Streaming Expressions, especially if there is other filtering or manipulation you would want to do on top of your first list of rare words.

> On Aug 27, 2021, at 8:16 AM, Kyle Fransham <ky...@superna.net> wrote:
> 
> Using the json facet, I think a combination of "limit" (to limit the number
> of records to be equal or less than the value you're looking for) and
> "sort" (something like sort:{count:asc} ) would give you what you're
> describing.
> 
> This taken from:
> https://solr.apache.org/guide/8_8/json-facet-api.html#terms-facet
> 
> Kyle
> 


Re: limiting term facets

Posted by Kyle Fransham <ky...@superna.net>.

Using the JSON facet API, I think a combination of "limit" (to cap the
number of buckets returned at the value you're looking for) and "sort"
(something like sort:{count:asc}) would give you what you're describing.

This is taken from:
https://solr.apache.org/guide/8_8/json-facet-api.html#terms-facet
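
As a rough sketch, such a facet could be written as request-handler defaults in solrconfig.xml, to stay with the XML used elsewhere in this thread. The handler name /rare-terms, the facet label rare_words, the field name body and the limit of 20 are all made up for illustration. It also assumes that json.facet is accepted as a default request parameter, which is worth checking on your Solr version. The same JSON can be sent directly as the json.facet parameter of a query instead.

```xml
<!-- Hypothetical handler; every name and number below is illustrative. -->
<requestHandler name="/rare-terms" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>
    <!-- Only the facet buckets are of interest, so return no documents. -->
    <int name="rows">0</int>
    <!-- Terms facet sorted by ascending count, so the rarest terms come first. -->
    <str name="json.facet">
      {
        "rare_words": {
          "type": "terms",
          "field": "body",
          "sort": { "count": "asc" },
          "limit": 20,
          "mincount": 1
        }
      }
    </str>
  </lst>
</requestHandler>
```

Note that this returns the 20 rarest terms rather than every term below a given count, so a true maxcount cut-off would still need to be applied to the returned buckets by the caller.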

Kyle

On Thu, Aug 26, 2021 at 7:25 PM Scott Derrick <sc...@tnstaafl.net> wrote:

> There is a mincount when requesting a type: terms facet, limiting the
> return buckets to at least this number, default 1
>
> I need a maxcount, limiting the return buckets to less than a specified
> number.
>
> I'm interested in the rare words, not the common words.
>
> thanks,
>
> Scott
>

