Posted to solr-user@lucene.apache.org by Christian Köhler - ZFMK <c....@zfmk.de> on 2013/08/05 09:59:24 UTC

Transform data at index time: country -> continent

Hi,

I am indexing data from a MySQL data source. Each record contains the
field "country". I am looking for a suitable way to create a field
"continent" at indexing time. A list with the information country ->
continent is given.

Writing a script and calling it as a transformer in the SQL query would
be my solution of choice right now. A RegexTransformer seems to be less
elegant with 200+ different countries.

Solr indexes 10 million records, so efficiency should be kept in mind.

With my limited knowledge I might be missing the obvious solution. What
would be the best practice? Any thoughts are welcome.

Regards
Chris
--
Zoologisches Forschungsmuseum Alexander Koenig
- Leibniz-Institut für Biodiversität der Tiere -
Adenauerallee 160, 53113 Bonn, Germany
www.zfmk.de

Foundation under public law; Director: Prof. J. Wolfgang Wägele
Registered office: Bonn

Re: Transform data at index time: country -> continent

Posted by omu_negru <ti...@gmail.com>.
Hey, since you're using Solr and have access to the database in question,
did you consider making an extra index on the machine to hold your
country-to-continent mapping? I know it's more trouble than it's worth for
such a small data set, but hey, you get to set up another index :)




Re: Transform data at index time: country -> continent

Posted by Christian Köhler - ZFMK <c....@zfmk.de>.
Hi,

On 06.08.2013 12:56, Raymond Wiker wrote:
> Another option might be to use a pre-existing web service... it should be
> relatively easy to add that to your dataimporthandler configuration (if
> you're using DIH, that is :-)
>
> A quick google search gave me http://www.geonames.org; see
> http://www.geonames.org/export/ for API information.

Interesting approach - thanx! I'll have to test the performance though.
I am indexing millions of records, so the latency of the web service
might be an issue.
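
To put a rough number on that concern: at an assumed 50 ms per request (a made-up figure, not a measurement of any service), 10 million uncached calls would take 10,000,000 x 0.05 s = 500,000 s, or about 5.8 days. Since there are only 200+ distinct countries, caching the answer per country would cut that to a few hundred calls. A minimal sketch of the idea, where `slowLookup` is a hypothetical synchronous stand-in for the remote call:

```javascript
// Wrap a slow lookup so each distinct country is resolved only once.
function cachedLookup(lookup) {
  var cache = {};
  return function (country) {
    if (!cache.hasOwnProperty(country)) {
      cache[country] = lookup(country);
    }
    return cache[country];
  };
}

// Hypothetical stand-in for the web service; counts actual invocations.
var remoteCalls = 0;
function slowLookup(country) {
  remoteCalls++;
  return country === "germany" ? "europe" : "unknown";
}

var lookup = cachedLookup(slowLookup);
var records = ["germany", "germany", "canada", "germany"];
var continents = records.map(lookup);
console.log(continents, remoteCalls);
```

Here four records trigger only two lookups, one per distinct country. A real web service call would be asynchronous, so this only illustrates the caching step.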

Cheers
Chris

Re: Transform data at index time: country -> continent

Posted by Raymond Wiker <rw...@gmail.com>.
Another option might be to use a pre-existing web service... it should be
relatively easy to add that to your dataimporthandler configuration (if
you're using DIH, that is :-)

A quick google search gave me http://www.geonames.org; see
http://www.geonames.org/export/ for API information.



Re: Transform data at index time: country -> continent

Posted by Christian Köhler - ZFMK <c....@zfmk.de>.
Hi,

> One interesting issue: the countries that span continents - Turkey and
> Russia and some of the former USSR republics.
>
> I arbitrarily assigned them a single continent:
>
> // Note: Turkey is mapped to Asia, and Russia to Europe,
> //       Azerbaijan to Asia, Armenia to Asia, Cyprus to Asia,
> //       Georgia to Asia, Kazakhstan to Asia,

I came across the same problem. Not to mention the overseas territories
of France, the Netherlands, ...

> (I hope I don't get too much hate mail from the Greeks for considering
> Cyprus to be part of Asia, but it is closer.)

I'd rather assign them to both continents. A false positive is (in my
case) better than a miss. My data provides a geo coordinate for each
record, which I could use for clarification when in doubt - but this
might be another topic.
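
Assigning both continents could be sketched roughly as follows, assuming `continent` is declared as a multivalued field in the schema. The table entries and the names `CONTINENTS_BY_COUNTRY` and `continentsFor` are hypothetical; `processAdd`, `cmd.solrDoc`, `getFieldValue` and `addField` follow the script update processor's conventions, with `addField` appending one value per call.

```javascript
// Each country maps to a list, so transcontinental countries carry every
// continent they touch (illustrative subset; the assignments are a choice).
var CONTINENTS_BY_COUNTRY = {
  "germany": ["europe"],
  "turkey": ["europe", "asia"],
  "russia": ["europe", "asia"]
};

// Returns an empty list for unknown or missing countries.
function continentsFor(country) {
  var key = String(country).toLowerCase();
  return CONTINENTS_BY_COUNTRY.hasOwnProperty(key) ? CONTINENTS_BY_COUNTRY[key] : [];
}

// Update-processor hook: add one "continent" value per matching continent.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var continents = continentsFor(doc.getFieldValue("country"));
  for (var i = 0; i < continents.length; i++) {
    doc.addField("continent", continents[i]);
  }
}
```

With this, a record from Turkey would match a filter on either continent, which is the false-positive-over-miss trade-off described above.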

>
> I suppose continent could be multivalued or maybe a composite string
> ("eu/as" or "eu+as"), but that has an impact on queries.
>
> But the script also handles multivalued fields (one value at a time),
> and nested multivalued fields are not supported.
>
> Thoughts?
>
> -- Jack Krupansky

Re: Transform data at index time: country -> continent

Posted by Jack Krupansky <ja...@basetechnology.com>.
I've implemented a JavaScript script for the StatelessScriptUpdate processor 
that does country code to continent code mapping. It will appear in the next 
early access of my "Solr 4.x Deep Dive" book (on 8/16.)

One interesting issue: the countries that span continents - Turkey and
Russia and some of the former USSR republics.

I arbitrarily assigned them a single continent:

// Note: Turkey is mapped to Asia, and Russia to Europe,
//       Azerbaijan to Asia, Armenia to Asia, Cyprus to Asia,
//       Georgia to Asia, Kazakhstan to Asia,

(I hope I don't get too much hate mail from the Greeks for considering 
Cyprus to be part of Asia, but it is closer.)

I suppose continent could be multivalued or maybe a composite string 
("eu/as" or "eu+as"), but that has an impact on queries.

But the script also handles multivalued fields (one value at a time), and
nested multivalued fields are not supported.

Thoughts?

-- Jack Krupansky



Re: Transform data at index time: country -> continent

Posted by Walter Underwood <wu...@wunderwood.org>.
SynonymFilter may have a keepOrig flag. If so, that would map countries to continents and not keep the country names.

<filter class="solr.SynonymFilterFactory" synonyms="continents.txt" keepOrig="false" />

wunder
 

--
Walter Underwood
wunder@wunderwood.org




Re: Transform data at index time: country -> continent

Posted by Jack Krupansky <ja...@basetechnology.com>.
(I think you're better off with an update processor script, but...)

The synonym filter supports 2.5 modes:

1. Replace mode

country => continent

2. Expand mode

country, continent

- results in both terms if either is used

2.5. The expand="false" attribute, which means: treat expand mode as replace
mode, with the first term as the replacement.

continent, country

- would be treated as:

country, continent => continent

The expand="true" attribute is simply the normal expand mode.

Expand mode is really just replacement mode with the terms auto-copied to 
the right side of the "=>", so:

country, continent

is equivalent to:

country, continent => country, continent
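
To make the modes concrete, here is a small model of how a single synonyms line would be interpreted under the rules above. This is only an illustration of the described behavior, not Solr's actual parser; the function name `parseRule` and the returned `from`/`to` shape are made up for this sketch.

```javascript
// Model of how one synonyms line is interpreted; not Solr's actual parser.
function parseRule(line, expand) {
  // Split a comma-separated side into trimmed, non-empty terms.
  var split = function (s) {
    return s.split(",")
      .map(function (t) { return t.replace(/^\s+|\s+$/g, ""); })
      .filter(function (t) { return t.length > 0; });
  };
  var arrow = line.indexOf("=>");
  if (arrow >= 0) {
    // Replace mode: explicit left and right sides.
    return { from: split(line.slice(0, arrow)), to: split(line.slice(arrow + 2)) };
  }
  var terms = split(line);
  // Expand mode: every term maps to all terms; with expand=false,
  // every term maps to the first term only.
  return { from: terms, to: expand ? terms.slice() : terms.slice(0, 1) };
}

console.log(parseRule("country, continent", true));
console.log(parseRule("continent, country", false));
```

So for the country-to-continent case, either an explicit "country => continent" line or expand="false" with the continent listed first should leave only the continent term.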

-- Jack Krupansky



Re: Transform data at index time: country -> continent

Posted by Christian Köhler - ZFMK <c....@zfmk.de>.
Hi,

I have thought about synonyms as well. But wouldn't this leave me with a
field that contains both the original expression and additionally the
continent, e.g. "germany, continent-europe"? I am not sure if this might
get in the way at some point.

On the other hand, this would enable me to have a single search field,
where the user could search by country or continent. Interesting - I'll
give it a thought.

Thanx
Chris



Re: Transform data at index time: country -> continent

Posted by Walter Underwood <wu...@wunderwood.org>.
Good point. Copying to a separate field that applies synonyms could help.

Filtering out the original countries could be tricky. The Javadoc mentions a keepOrig flag, but the Solr docs do not. If you could set keepOrig=false, that would do the trick.

wunder


Re: Transform data at index time: country -> continent

Posted by Erick Erickson <er...@gmail.com>.
Walter:

Oooh, nice! One could even use a copyField if one wanted to
keep them separate...

Erick



Re: Transform data at index time: country -> continent

Posted by Walter Underwood <wu...@wunderwood.org>.
Would synonyms help? If you generate the query terms for the continents, you could do something like this:

usa => continent-na
canada => continent-na
germany => continent-europe

and so on.
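
Since a country -> continent list is already given, the synonyms file could be generated rather than typed by hand. A minimal sketch, assuming the list is available as a plain object; the names `CONTINENT_TERM` and `buildSynonymRules` are hypothetical, and the three entries are just the ones from the example above:

```javascript
// Country -> continent term (illustrative subset).
var CONTINENT_TERM = {
  "usa": "continent-na",
  "canada": "continent-na",
  "germany": "continent-europe"
};

// Build one "country => continent" rule per line, sorted for stable output.
function buildSynonymRules(mapping) {
  return Object.keys(mapping).sort().map(function (country) {
    return country + " => " + mapping[country];
  }).join("\n") + "\n";
}

var rules = buildSynonymRules(CONTINENT_TERM);
console.log(rules);
```

The resulting text would then be saved as the synonyms file referenced by the filter.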

wunder


Re: Transform data at index time: country -> continent

Posted by Christian Köhler - ZFMK <c....@zfmk.de>.
On 05.08.2013 15:52, Jack Krupansky wrote:
> You can write a brute force JavaScript script using the StatelessScript
> update processor that hard-codes the mapping.

I'll probably do something like this. Unfortunately I have no influence
on the original db itself, so I have to fix this in Solr.

Cheers
Chris



Re: Transform data at index time: country -> continent

Posted by Jack Krupansky <ja...@basetechnology.com>.
You can write a brute force JavaScript script using the StatelessScript 
update processor that hard-codes the mapping.
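
A minimal sketch of what such a script might look like. The `processAdd` hook and the `cmd.solrDoc`, `getFieldValue` and `setField` calls follow the script update processor's conventions; the field names `country` and `continent`, the lowercase keys, and the three-entry table are assumptions for illustration (a real table would cover all 200+ countries).

```javascript
// Hard-coded country -> continent table (illustrative subset only).
var CONTINENT_BY_COUNTRY = {
  "germany": "europe",
  "canada": "north america",
  "usa": "north america"
};

// Look up a continent; returns null for unknown or missing countries.
function continentFor(country) {
  if (country === null || country === undefined) return null;
  var key = String(country).replace(/^\s+|\s+$/g, "").toLowerCase();
  return CONTINENT_BY_COUNTRY.hasOwnProperty(key) ? CONTINENT_BY_COUNTRY[key] : null;
}

// Called by the script update processor for every added document.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var continent = continentFor(doc.getFieldValue("country"));
  if (continent !== null) {
    doc.setField("continent", continent);
  }
}

// The remaining hooks are expected to exist but have nothing to do here.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

The lookup is a single in-memory property access per document, so it should add little overhead even for millions of records, though that would need to be measured.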

-- Jack Krupansky



Re: Transform data at index time: country -> continent

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/5/2013 3:02 AM, Christian Köhler - ZFMK wrote:
>> to have a database table holding the relationships between countries and
>> continents, and using a join to get the continent.
> 
> I forgot to mention: I only have read access to the database.

Somebody's got to write something.  If you don't have write access to
the data, here are some things the DB admin could do:

1) Add a field to the table for the continent. Write a program that goes
through the records, figures out the continent, and populates that field
for every row.  This would cause at least a little bit of DB downtime.
2) Set up the table that Raymond recommended, so you can do a JOIN in
your SELECT statement.
3) Use DB server-side code (perhaps a stored procedure?) and give you a
database view that uses that code to add a continent field to the results.

It would be very good to have data like the continent in your source
database.  If the DB admin can't or won't do any of these things, then
you'd have to do it yourself.  This likely means one of two things:

1) Write an application to read the data from the database and index the
data to Solr.  In terms of Solr functionality, the Java API (SolrJ) is
the most comprehensive.  This would basically be a rewrite of the
DataImport handler, but unless it's multi-threaded and written very
carefully, it probably won't be as efficient as DIH.

2) Write a custom UpdateProcessor for the Solr server side that does the
mapping, and continue using Solr's DataImport handler.
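
Whichever of the two is chosen, the per-record work is the same: look up the continent and attach it before the document is indexed. For the standalone application, the transform-and-batch step might be sketched like this. SolrJ itself is a Java library, so this JavaScript sketch only illustrates the logic; the names `enrich` and `toBatches`, the row shape, and the batch size are all hypothetical.

```javascript
// Add a "continent" field to a copy of each row; rows with unknown
// countries are copied unchanged.
function enrich(rows, continentByCountry) {
  return rows.map(function (row) {
    var copy = {};
    for (var k in row) {
      if (row.hasOwnProperty(k)) copy[k] = row[k];
    }
    var key = String(row.country).toLowerCase();
    if (continentByCountry.hasOwnProperty(key)) {
      copy.continent = continentByCountry[key];
    }
    return copy;
  });
}

// Split documents into fixed-size batches so they can be sent in bulk.
function toBatches(docs, size) {
  var batches = [];
  for (var i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

var docs = enrich(
  [{ id: 1, country: "Germany" }, { id: 2, country: "Atlantis" }, { id: 3, country: "canada" }],
  { "germany": "europe", "canada": "north america" }
);
console.log(toBatches(docs, 2));
```

Sending documents in batches rather than one at a time is one of the things such an application would have to get right to approach DIH's throughput.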

Thanks,
Shawn


Re: Transform data at index time: country -> continent

Posted by Christian Köhler - ZFMK <c....@zfmk.de>.
Hi,

> to have a database table holding the relationships between countries and
> continents, and using a join to get the continent.

I forgot to mention: I only have read access to the database.

Regards
Chris


Re: Transform data at index time: country -> continent

Posted by Raymond Wiker <rw...@gmail.com>.
Don't know about "best practice", but to me, the obvious solution would be
to have a database table holding the relationships between countries and
continents, and using a join to get the continent.

