Posted to solr-user@lucene.apache.org by Dan Bolser <db...@ebi.ac.uk> on 2014/07/03 18:10:29 UTC

Field for 'species' data?

Hi,

Does anyone on the list have experience with hierarchical facets,
specifically for species data?

I have a variety of 'messy' species names that I'd like to tidy up at
analysis time and use as the basis for taxonomically guided
hierarchical facets at query time.

I was wondering if there's some schema.xml with custom analyser
pipelines and config files that I can work from if people have done
this before?

Here are some example species names from my source data:
Solanum lycopersicum
Solanum tuberosum
Hordeum vulgare
Vitis vinifera
Arabidopsis thaliana
Arabidopsis lyrata
Brassica rapa
Musa acuminata
Oryza glaberrima
Oryza brachyantha
Physcomitrella patens
Arabis
Triticum sp
Hordeum sp
Zea mays L.
Zea mays
Hordeum vulgare L. convar. vulgare var. hybernum Viborg
Phaseolus vulgaris L. subsp. vulgaris var. nanus Asch
Phaseolus vulgaris L. subsp. vulgaris var. vulgaris
Triticum aestivum L. var. lutescens
Hordeum vulgare L. convar. distichon
Solanum tuberosum L. subsp. tuberosum L
Triticum aestivum L. var. aestivum
Pisum sp
Lupinus sp
Lycopersicon esculentum Mill
Dactylis glomerata L
Avena sp
Nicotiana tabacum


If you're not familiar with the species taxonomy, there are many
hierarchical 'sub groups' that I can define over the species in this
list, not to mention the hierarchies implicit in their names, such as
Solanum lycopersicum vs. Solanum tuberosum, both species in the
Solanum genus, and Hordeum vulgare vs. Hordeum vulgare L. convar.
vulgare var. hybernum Viborg, a specific variety of Hordeum vulgare...
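
As a rough illustration of the hierarchy implicit in the names, the sketch below splits a raw string into nested levels (genus, species, then any rank-marked epithets). The function name `implicit_levels` and the set of rank markers are assumptions drawn only from the sample list above; author strings such as 'L.' or 'Mill' are simply dropped, and real botanical names will have cases this does not handle.

```python
# Rank markers seen in the sample names above; this is an assumption,
# not a complete list of botanical rank abbreviations.
RANK_MARKERS = {"subsp.", "convar.", "var."}


def implicit_levels(name):
    """Return the hierarchy levels implied by a raw species string.

    Hypothetical helper: keeps the genus, the species epithet and any
    rank-marked epithets, and drops everything else ('L.', 'Mill', 'sp').
    """
    tokens = name.split()
    if not tokens:
        return []
    levels = [tokens[0]]  # genus, e.g. 'Hordeum'
    # A lowercase second token other than 'sp' is treated as the species epithet.
    if len(tokens) > 1 and tokens[1].islower() and tokens[1].rstrip(".") != "sp":
        levels.append(f"{tokens[0]} {tokens[1]}")
        # Each rank marker followed by an epithet adds one deeper level.
        i = 2
        while i < len(tokens) - 1:
            if tokens[i] in RANK_MARKERS:
                levels.append(f"{levels[-1]} {tokens[i]} {tokens[i + 1]}")
                i += 2
            else:
                i += 1
    return levels


for raw in ("Zea mays L.", "Triticum sp", "Hordeum vulgare L. convar. distichon"):
    print(raw, "->", implicit_levels(raw))
```

Each returned list runs from the broadest level to the most specific one, so the last element is the tidied 'leaf' name.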

I figure I can't be the first person to look at this?


Thanks for any tips,
Dan.

Re: Field for 'species' data?

Posted by Erick Erickson <er...@gmail.com>.
re: do this in an update processor or in other parts of the pipeline:

whichever is easier, the result will be the same. Personally I like
putting stuff like this in other parts of the pipeline if for no other reason
than the load isn't concentrated on the Solr machine.

In particular if you enrich the document in the pipeline, you can then
scale up indexing by having multiple processes running the pipeline on
multiple clients. Eventually, you'll hit the Solr node's limits, but it'll
be later than if you do all your processing there.
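
A minimal sketch of what enriching the document in the pipeline could look like on the client side, before anything reaches Solr. The field names `species` and `species_path`, the `TAXONOMY` table and both function names are hypothetical; a real pipeline would load the taxonomy from whatever external source defines it.

```python
import json

# Hypothetical lookup table: leaf term -> ancestors, root first.
# Illustrative only; a real pipeline would read this from an external taxonomy.
TAXONOMY = {
    "Hordeum vulgare": ["Poaceae", "Hordeum"],
    "Solanum tuberosum": ["Solanaceae", "Solanum"],
}


def enrich(doc, taxonomy=TAXONOMY):
    """Return a copy of doc with a 'species_path' field when the leaf is known."""
    enriched = dict(doc)
    ancestors = taxonomy.get(doc.get("species"))
    if ancestors is not None:
        enriched["species_path"] = "/".join(ancestors + [doc["species"]])
    return enriched


def build_update_payload(docs):
    """Serialise the enriched documents as a JSON array of documents."""
    return json.dumps([enrich(d) for d in docs])


print(build_update_payload([{"id": "1", "species": "Hordeum vulgare"}]))
```

Because the enrichment is a pure function over one document, several client processes can run it independently on different batches, which is the scaling point made above.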

It may be a little easier to manage since you don't have to worry about
getting your custom Jars to the solr nodes as you would in the update
processor case.

But really, whatever is most convenient and meets your SLA. If you
are _already_ going to have a pipeline, there are fewer moving parts there....

Best,
Erick

On Sat, Jul 5, 2014 at 9:02 AM, Dan Bolser <db...@gmail.com> wrote:
> The latter
> On 5 Jul 2014 16:39, "Jack Krupansky" <ja...@basetechnology.com> wrote:
>
>> So, the immediate question is whether the value in the Solr source
>> document has the full taxonomy path for the species, or just parts, and
>> some external taxonomy definition must be consulted to "fill in" the rest
>> of the hierarchy path for that species.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Dan Bolser
>> Sent: Saturday, July 5, 2014 10:36 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Field for 'species' data?
>>
>> One requirement is that the hierarchical facet implementation matches
>> whatever the Drupal ApacheSolr module does with taxonomy terms.
>>
>> The key thing is to add the taxonomy to the doc which only has one 'leaf'
>> term.
>> On 5 Jul 2014 15:01, "Jack Krupansky" <ja...@basetechnology.com> wrote:
>>
>>  Focus on your data model and queries first, then you can decide on the
>>> implementation.
>>>
>>> Take a semi-complex example and manually break it down into field values
>>> and then write some queries, including filters, in English, that do the
>>> required navigation. Once you have a handle on what fields you need to
>>> populate, the analysis and processing details can be worked out.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Dan Bolser
>>> Sent: Saturday, July 5, 2014 4:49 AM
>>> To: solr-user
>>> Subject: Re: Field for 'species' data?
>>>
>>> I'm super noob... Why choose to write it as a custom update request
>>> processor rather than an analysis pipeline?
>>>
>>> Cheers, Dan.
>>> On 5 Jul 2014 03:45, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:
>>>
>>>  Do that with a custom update request processor.
>>>
>>>>
>>>> Just remember Solr is there to find things not to preserve structure. So
>>>> mangle your data until you can find it.
>>>>
>>>> Also check if SirenDB would fit your requirements if you want to encode the
>>>> information as complex structure.
>>>>
>>>> Regards,
>>>>     Alex
>>>>
>>>>
>>>>
>>>
>>

Re: Field for 'species' data?

Posted by Dan Bolser <db...@gmail.com>.
The latter
On 5 Jul 2014 16:39, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> So, the immediate question is whether the value in the Solr source
> document has the full taxonomy path for the species, or just parts, and
> some external taxonomy definition must be consulted to "fill in" the rest
> of the hierarchy path for that species.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dan Bolser
> Sent: Saturday, July 5, 2014 10:36 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Field for 'species' data?
>
> One requirement is that the hierarchical facet implementation matches
> whatever the Drupal ApacheSolr module does with taxonomy terms.
>
> The key thing is to add the taxonomy to the doc which only has one 'leaf'
> term.
> On 5 Jul 2014 15:01, "Jack Krupansky" <ja...@basetechnology.com> wrote:
>
>  Focus on your data model and queries first, then you can decide on the
>> implementation.
>>
>> Take a semi-complex example and manually break it down into field values
>> and then write some queries, including filters, in English, that do the
>> required navigation. Once you have a handle on what fields you need to
>> populate, the analysis and processing details can be worked out.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Dan Bolser
>> Sent: Saturday, July 5, 2014 4:49 AM
>> To: solr-user
>> Subject: Re: Field for 'species' data?
>>
>> I'm super noob... Why choose to write it as a custom update request
>> processor rather than an analysis pipeline?
>>
>> Cheers, Dan.
>> On 5 Jul 2014 03:45, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:
>>
>>  Do that with a custom update request processor.
>>
>>>
>>> Just remember Solr is there to find things not to preserve structure. So
>>> mangle your data until you can find it.
>>>
>>> Also check if SirenDB would fit your requirements if you want to encode the
>>> information as complex structure.
>>>
>>> Regards,
>>>     Alex
>>>
>>>
>>>
>>
>

Re: Field for 'species' data?

Posted by Jack Krupansky <ja...@basetechnology.com>.
So, the immediate question is whether the value in the Solr source document 
has the full taxonomy path for the species, or just parts, and some external 
taxonomy definition must be consulted to "fill in" the rest of the hierarchy 
path for that species.

-- Jack Krupansky

-----Original Message----- 
From: Dan Bolser
Sent: Saturday, July 5, 2014 10:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Field for 'species' data?

One requirement is that the hierarchical facet implementation matches
whatever the Drupal ApacheSolr module does with taxonomy terms.

The key thing is to add the taxonomy to the doc which only has one 'leaf'
term.
On 5 Jul 2014 15:01, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> Focus on your data model and queries first, then you can decide on the
> implementation.
>
> Take a semi-complex example and manually break it down into field values
> and then write some queries, including filters, in English, that do the
> required navigation. Once you have a handle on what fields you need to
> populate, the analysis and processing details can be worked out.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dan Bolser
> Sent: Saturday, July 5, 2014 4:49 AM
> To: solr-user
> Subject: Re: Field for 'species' data?
>
> I'm super noob... Why choose to write it as a custom update request
> processor rather than an analysis pipeline?
>
> Cheers, Dan.
> On 5 Jul 2014 03:45, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:
>
>  Do that with a custom update request processor.
>>
>> Just remember Solr is there to find things not to preserve structure. So
>> mangle your data until you can find it.
>>
>> Also check if SirenDB would fit your requirements if you want to encode the
>> information as complex structure.
>>
>> Regards,
>>     Alex
>>
>>
> 


Re: Field for 'species' data?

Posted by Dan Bolser <db...@gmail.com>.
One requirement is that the hierarchical facet implementation matches
whatever the Drupal ApacheSolr module does with taxonomy terms.

The key thing is to add the taxonomy to the doc which only has one 'leaf'
term.
On 5 Jul 2014 15:01, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> Focus on your data model and queries first, then you can decide on the
> implementation.
>
> Take a semi-complex example and manually break it down into field values
> and then write some queries, including filters, in English, that do the
> required navigation. Once you have a handle on what fields you need to
> populate, the analysis and processing details can be worked out.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dan Bolser
> Sent: Saturday, July 5, 2014 4:49 AM
> To: solr-user
> Subject: Re: Field for 'species' data?
>
> I'm super noob... Why choose to write it as a custom update request
> processor rather than an analysis pipeline?
>
> Cheers, Dan.
> On 5 Jul 2014 03:45, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:
>
>  Do that with a custom update request processor.
>>
>> Just remember Solr is there to find things not to preserve structure. So
>> mangle your data until you can find it.
>>
>> Also check if SirenDB would fit your requirements if you want to encode the
>> information as complex structure.
>>
>> Regards,
>>     Alex
>>
>>
>

Re: Field for 'species' data?

Posted by Jack Krupansky <ja...@basetechnology.com>.
Focus on your data model and queries first, then you can decide on the 
implementation.

Take a semi-complex example and manually break it down into field values and 
then write some queries, including filters, in English, that do the required 
navigation. Once you have a handle on what fields you need to populate, the 
analysis and processing details can be worked out.

-- Jack Krupansky

-----Original Message----- 
From: Dan Bolser
Sent: Saturday, July 5, 2014 4:49 AM
To: solr-user
Subject: Re: Field for 'species' data?

I'm super noob... Why choose to write it as a custom update request
processor rather than an analysis pipeline?

Cheers, Dan.
On 5 Jul 2014 03:45, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:

> Do that with a custom update request processor.
>
> Just remember Solr is there to find things not to preserve structure. So
> mangle your data until you can find it.
>
> Also check if SirenDB would fit your requirements if you want to encode the
> information as complex structure.
>
> Regards,
>     Alex
> 


Re: Field for 'species' data?

Posted by Dan Bolser <db...@ebi.ac.uk>.
I'm super noob... Why choose to write it as a custom update request
processor rather than an analysis pipeline?

Cheers, Dan.
On 5 Jul 2014 03:45, "Alexandre Rafalovitch" <ar...@gmail.com> wrote:

> Do that with a custom update request processor.
>
> Just remember Solr is there to find things not to preserve structure. So
> mangle your data until you can find it.
>
> Also check if SirenDB would fit your requirements if you want to encode the
> information as complex structure.
>
> Regards,
>     Alex
>

Re: Field for 'species' data?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Do that with a custom update request processor.

Just remember Solr is there to find things not to preserve structure. So
mangle your data until you can find it.

Also check if SirenDB would fit your requirements if you want to encode the
information as complex structure.

Regards,
    Alex

Re: Field for 'species' data?

Posted by Dan Bolser <db...@ebi.ac.uk>.
I think I need to look up the given species value in a taxonomy, build the
'path' and pass the result to the path hierarchy tokenizer or similar. I
figure I'll do this with a field analyzer.
On 4 Jul 2014 22:30, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> I haven't fully digested your species hierarchy requirements, but you can
> do just about anything in a Solr update processor. So you can parse the
> string and then put pieces into different fields to represent portions of
> the hierarchy. Then at query time, your application facet navigation is
> simply using the prefix from the facet selection as a filter on one of
> those fields in which the various hierarchy components are stored.
>
> Alternatively, you may be able to get by using the path hierarchy
> tokenizer:
> http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html
>
> Or maybe a combination of the two approaches.
>
> I think I have some examples of it in my e-book.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dan Bolser
> Sent: Friday, July 4, 2014 11:57 AM
> To: solr-user
> Subject: Re: Field for 'species' data?
>
> The problem is that each document has a single species (or
> super-species, or sub-species), and needs to get information about
> its place in the hierarchy 'elsewhere', i.e. in an externally encoded
> hierarchy.
>
> I don't, for example, have data in this format:
> SPECIES: "Hordeum / Hordeum vulgare / Hordeum vulgare var. hybernum"
>
> but rather
> SPECIES: "Hordeum vulgare".
>
> How can I add in that data at analysis time?
>
>
> Cheers,
> Dan.
>
>
> On 4 July 2014 04:19, Gora Mohanty <go...@mimirtech.com> wrote:
>
>> On 3 July 2014 21:40, Dan Bolser <db...@ebi.ac.uk> wrote:
>>
>>>
>>> Hi,
>>>
>>> Does anyone on the list have experience with hierarchical facets,
>>> specifically for species data?
>>>
>> [...]
>>
>> Maybe not specifically for species data, but hierarchical faceting works
>> pretty well with Solr. Please see
>> http://wiki.apache.org/solr/HierarchicalFaceting
>> For your use case, I would probably use pivot facets:
>> http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets
>>
>> Regards,
>> Gora
>>
>
>

Re: Field for 'species' data?

Posted by Jack Krupansky <ja...@basetechnology.com>.
I haven't fully digested your species hierarchy requirements, but you can do 
just about anything in a Solr update processor. So you can parse the string 
and then put pieces into different fields to represent portions of the 
hierarchy. Then at query time, your application facet navigation is simply 
using the prefix from the facet selection as a filter on one of those fields 
in which the various hierarchy components are stored.
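
A minimal sketch of that query-time step, assuming a hypothetical untokenized string field called `species_path` that holds a '/'-joined path. It uses Solr's prefix query parser for the filter and `facet.prefix` to list only the values below the current selection; the function name and the example path are illustrative.

```python
from urllib.parse import urlencode


def drilldown_params(field, selected_prefix):
    """Build parameters that filter on, and facet below, the selected prefix."""
    return {
        "q": "*:*",
        # Restrict results to documents whose path starts with the selection.
        "fq": "{!prefix f=%s}%s" % (field, selected_prefix),
        "facet": "true",
        "facet.field": field,
        # Only return facet values underneath the selection.
        "facet.prefix": selected_prefix,
    }


params = drilldown_params("species_path", "Poaceae/Hordeum/")
print(urlencode(params))
```

Keeping the trailing delimiter in the prefix stops 'Hordeum' from also matching a sibling whose name merely starts with the same letters.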

Alternatively, you may be able to get by using the path hierarchy tokenizer:
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html
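
To show what that tokenizer does with a delimited path, here is a pure-Python emulation of its basic behaviour: one token per ancestor prefix. This is not the Lucene code, the function name is made up, and edge cases (leading delimiters, replacement characters, reverse mode) are not covered.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every ancestor prefix of path, shortest first."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


for token in path_hierarchy_tokens(
    "Hordeum/Hordeum vulgare/Hordeum vulgare var. hybernum"
):
    print(token)
```

Because every ancestor becomes its own indexed term, a document stored under the deepest level also matches a filter on any level above it.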

Or maybe a combination of the two approaches.

I think I have some examples of it in my e-book.

-- Jack Krupansky

-----Original Message----- 
From: Dan Bolser
Sent: Friday, July 4, 2014 11:57 AM
To: solr-user
Subject: Re: Field for 'species' data?

The problem is that each document has a single species (or
super-species, or sub-species), and needs to get information about
its place in the hierarchy 'elsewhere', i.e. in an externally encoded
hierarchy.

I don't, for example, have data in this format:
SPECIES: "Hordeum / Hordeum vulgare / Hordeum vulgare var. hybernum"

but rather
SPECIES: "Hordeum vulgare".

How can I add in that data at analysis time?


Cheers,
Dan.


On 4 July 2014 04:19, Gora Mohanty <go...@mimirtech.com> wrote:
> On 3 July 2014 21:40, Dan Bolser <db...@ebi.ac.uk> wrote:
>>
>> Hi,
>>
>> Does anyone on the list have experience with hierarchical facets,
>> specifically for species data?
> [...]
>
> Maybe not specifically for species data, but hierarchical faceting works
> pretty well with Solr. Please see
> http://wiki.apache.org/solr/HierarchicalFaceting
> For your use case, I would probably use pivot facets:
> http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets
>
> Regards,
> Gora 


Re: Field for 'species' data?

Posted by Dan Bolser <db...@ebi.ac.uk>.
The problem is that each document has a single species (or
super-species, or sub-species), and needs to get information about
its place in the hierarchy 'elsewhere', i.e. in an externally encoded
hierarchy.

I don't, for example, have data in this format:
SPECIES: "Hordeum / Hordeum vulgare / Hordeum vulgare var. hybernum"

but rather
SPECIES: "Hordeum vulgare".

How can I add in that data at analysis time?


Cheers,
Dan.


On 4 July 2014 04:19, Gora Mohanty <go...@mimirtech.com> wrote:
> On 3 July 2014 21:40, Dan Bolser <db...@ebi.ac.uk> wrote:
>>
>> Hi,
>>
>> Does anyone on the list have experience with hierarchical facets,
>> specifically for species data?
> [...]
>
> Maybe not specifically for species data, but hierarchical faceting works
> pretty well with Solr. Please see
> http://wiki.apache.org/solr/HierarchicalFaceting
> For your use case, I would probably use pivot facets:
> http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets
>
> Regards,
> Gora

Re: Field for 'species' data?

Posted by Gora Mohanty <go...@mimirtech.com>.
On 3 July 2014 21:40, Dan Bolser <db...@ebi.ac.uk> wrote:
>
> Hi,
>
> Does anyone on the list have experience with hierarchical facets,
> specifically for species data?
[...]

Maybe not specifically for species data, but hierarchical faceting works
pretty well with Solr. Please see
http://wiki.apache.org/solr/HierarchicalFaceting
For your use case, I would probably use pivot facets:
http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets
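
A minimal sketch of the request parameters for a pivot facet, assuming the hierarchy levels have already been split into separate fields at index time. The field names `genus` and `species` are hypothetical; `facet.pivot` takes a comma-separated list of fields, outermost level first.

```python
from urllib.parse import urlencode


def pivot_params(*fields):
    """Build parameters for a pivot facet over the given fields, outermost first."""
    return {
        "q": "*:*",
        "rows": "0",  # only the facet counts are wanted here
        "facet": "true",
        "facet.pivot": ",".join(fields),
    }


print(urlencode(pivot_params("genus", "species")))
```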

Regards,
Gora