Posted to solr-user@lucene.apache.org by scallawa <da...@altrec.com> on 2014/03/27 07:49:18 UTC

multivalued field using DIH

I am using Solr 4.7 and importing data directly from a MySQL database
table using the DIH.  I have a column that looks similar to the one below,
in that it holds multiple values in the database.

material          cotton "polyester blend" rayon

I would like the data to look like the following when imported.

<str name="material">cotton</str>
<str name="material">polyester blend</str>
<str name="material">rayon</str>

In other words, if there are multiple data points for a particular column
and the mapped field is multivalued, create multiple <str name> fields.  If
there are quotes around multiple words, treat them as one token.  Is this
possible?



--
View this message in context: http://lucene.472066.n3.nabble.com/multivalued-field-using-DIH-tp4127297.html

Re: multivalued field using DIH

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/27/2014 12:49 AM, scallawa wrote:
> I am using Solr 4.7 and importing data directly from a MySQL database
> table using the DIH.  I have a column that looks similar to the one below,
> in that it holds multiple values in the database.
> 
> material          cotton "polyester blend" rayon
> 
> I would like the data to look like the following when imported.
> 
> <str name="material">cotton</str>
> <str name="material">polyester blend</str>
> <str name="material">rayon</str>
> 
> In other words, if there are multiple data points for a particular column
> and the mapped field is multivalued, create multiple <str name> fields.  If
> there are quotes around multiple words, treat them as one token.  Is this
> possible?

Not directly, as far as I know.  If the input data were simply space
separated, with no quoted strings that contain a space, you could use the
RegexTransformer in DIH and do a simple 'splitBy' on the field.

If you know how to write a regex that would only match the spaces
outside of the quotes, you could still use that method.  I have no idea
how to do that.
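
For what it's worth, one regex that appears to fit is a lookahead that only
accepts whitespace followed by an even number of double quotes up to the end
of the value. The Java sketch below tries it outside Solr. The class name
QuoteAwareSplit and the method split are made up for illustration, and the
pattern assumes the quotes in the data are always balanced. Note that a
split of this kind leaves the quote characters in the tokens, so the sketch
strips them in a second step; a plain 'splitBy' in DIH would presumably need
a similar cleanup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or DIH.
final class QuoteAwareSplit {
    // Whitespace that is followed by an even number of double quotes up to
    // the end of the input, i.e. whitespace outside any quoted phrase.
    // Assumes quotes are balanced.
    static final String SPLIT_REGEX = "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String raw) {
        List<String> values = new ArrayList<>();
        for (String token : raw.trim().split(SPLIT_REGEX)) {
            // The split keeps the surrounding quotes, so remove them here.
            String value = token.replace("\"", "");
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [cotton, polyester blend, rayon]
        System.out.println(split("cotton \"polyester blend\" rayon"));
    }
}
```

If the same pattern is placed in an XML config attribute, the double quotes
inside it would have to be escaped as &quot;.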

Alternatively, you can write a custom update processor for Solr that
knows how to break up the input, remove the original field, and reinsert
it with the multiple values.  Writing a custom update processor is not
very difficult if you already know how to program, but it's not trivial
either.
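
As a rough sketch of the logic such a processor would need, the Java below
tokenizes the value and then replaces the original field with the list of
tokens. A plain Map stands in for the Solr document so the code runs without
Solr on the classpath; the names MaterialSplitter, tokenize, and splitField
are hypothetical. A real processor would extend UpdateRequestProcessor and
do the equivalent work on the document inside processAdd.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the Map is only a stand-in for a Solr document.
final class MaterialSplitter {
    // Either a double-quoted phrase (group 1, quotes excluded)
    // or a run of non-whitespace characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> tokenize(String raw) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(raw);
        while (m.find()) {
            String token = m.group(1) != null ? m.group(1) : m.group(2);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Removes the original single value and reinserts it as multiple values.
    // Non-string values are put back unchanged.
    static void splitField(Map<String, Object> doc, String field) {
        Object original = doc.remove(field);
        if (original instanceof String) {
            doc.put(field, tokenize((String) original));
        } else if (original != null) {
            doc.put(field, original);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("material", "cotton \"polyester blend\" rayon");
        splitField(doc, "material");
        // prints {material=[cotton, polyester blend, rayon]}
        System.out.println(doc);
    }
}
```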

If the database actually has multiple values in a table rather than the
space separation, there are two possibilities: 1) Use nested DIH
entities, which make a query to the database for every document. 2) Use
a JOIN with GROUP_CONCAT to construct a value with a delimiter other
than space, something that won't ever show up in the actual data.  You
can then use the splitBy method that I already mentioned.

You'd need to consult a database expert for help with JOIN and GROUP_CONCAT.

Thanks,
Shawn