Posted to solr-user@lucene.apache.org by Joe Fitzgerald <jo...@oxfordcorp.com> on 2011/05/27 15:47:26 UTC

Splitting fields

Hello,

 

I am in an odd position.  The application server I use has built-in
integration with Solr.  Unfortunately, its native capabilities are
fairly limited: it only supports a standard, pre-defined set of fields
that can be indexed.  As a result, I've been left kludging how I work
with Solr, doing things like putting what I'd like to be multiple,
separate fields into a single Solr field.

 

As an example, I may put a customer id and name into a single field
called 'custom1'.  Ideally, I'd like this information to be returned in
separate fields, and even better would be for them to be indexed as
separate fields, but I can live without the latter.  Currently, I'm
building out a JSON representation of this information, which makes it
easy for me to deal with when I extract the results, but it all feels
wrong.

 

I do have complete control over the actual Solr installation (just not
the indexing call to Solr), so I was hoping there may be a way to
configure Solr to take my single field and split it up into a different
field for each key in my JSON representation.

 

I don't see anything native to Solr that would do this for me but there
are a few features that I thought sounded similar and was hoping to get
some opinions on how I may be able to move forward with this...

 

Poly fields, such as the spatial location, might help?  Can I build my
own poly field that would split up the main field into subfields?  Do
poly fields let me return the subfields?  I don't quite have my head
around poly fields yet.

 

Another option, although I suspect it won't be considered a good
approach: what about extending the copyField functionality of
schema.xml to support my needs?  It seems not entirely unreasonable
that copyField would provide a means to extract only a portion of the
source field's contents to place in the destination field, no?  I'm
sure people more familiar with Solr's architecture could explain why
this isn't really an appropriate thing for Solr to handle (just because
it could doesn't mean it should)...

The other (and probably best) option would be to use Solr
directly, bypassing the native integration of my application server,
which we've already done for most cases.  I'd love to go this route, but
I'm having a hard time figuring out how to "easily" accomplish the same
functionality provided by my app server integration.  Perhaps someone on
the list could help me with this path forward?  Here is what I'm trying
to accomplish:

 

I'm indexing documents (text, PDF, HTML...) but I need to include fields
in my search results that are only available from a db query.
I know how to have Solr index results from a db query, but I'm having
trouble getting it to index the documents that are associated with each
record of that query (the full path/filename is one of the fields of that
query).

 

I started to try to use the DataImportHandler to do this, by setting up
a FileDataSource in addition to my JDBC data source.  I tried to
use the FileDataSource to populate a sub-entity based on the db
field that contains the full path/filename, but I wasn't sure how to
reference that db field from the root query/entity.  Before I spent too
much time on it, I also realized I wasn't sure how to get Solr to deal
with binary file types this way.  Upon further reading, it seems
I would need to use Tika.  Can that be done within the confines of
DataImportHandler?

 

Any advice is greatly appreciated.  Thanks in advance,

 

Joe


Re: Splitting fields

Posted by Markus Jelsma <ma...@openindex.io>.
I'd go for this option as well. The example update processor could hardly
make it easier, and it's a very flexible approach. Judging from the patch in
SOLR-2105, it should still work with the current 3.2 branch.

https://issues.apache.org/jira/browse/SOLR-2105


> Hi,
> 
> Write a custom UpdateProcessor, which gives you full control of the
> SolrDocument prior to indexing. The best would be if you write a generic
> FieldSplitterProcessor which is configurable on what field to take as
> input, what delimiter or regex to split on and finally what fields to
> write the result to. This way others may reuse your code for their
> splitting needs.
> 
> See http://wiki.apache.org/solr/UpdateRequestProcessor and
> http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com

Re: Splitting fields

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Write a custom UpdateProcessor, which gives you full control of the document (a SolrInputDocument) prior to indexing. Best would be to write a generic FieldSplitterProcessor that is configurable for which field to take as input, which delimiter or regex to split on, and which fields to write the results to. That way others can reuse your code for their own splitting needs.

See http://wiki.apache.org/solr/UpdateRequestProcessor and http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section
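
The configurable splitter described above might look roughly like the following. This is only a minimal sketch of the splitting step, not a working Solr plugin: the class name FieldSplitter, the field names, and the use of a plain Map as a stand-in for the incoming document are all assumptions made for illustration. In a real UpdateRequestProcessor the same logic would run against the document before the add is passed down the chain.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical, Solr-free sketch of the splitting step only.
 * A plain Map stands in for the document being indexed.
 */
final class FieldSplitter {
    private final String sourceField;        // field to read, e.g. "custom1"
    private final Pattern delimiter;         // regex to split on
    private final List<String> targetFields; // fields to write, in order

    FieldSplitter(String sourceField, String delimiterRegex, List<String> targetFields) {
        if (targetFields.isEmpty()) {
            throw new IllegalArgumentException("at least one target field is required");
        }
        this.sourceField = sourceField;
        this.delimiter = Pattern.compile(delimiterRegex);
        this.targetFields = targetFields;
    }

    /** Copies each piece of the source value into the matching target field. */
    void apply(Map<String, Object> doc) {
        Object raw = doc.get(sourceField);
        if (raw == null) {
            return; // nothing to split; leave the document untouched
        }
        // Limit the split to the number of targets, so any extra
        // delimiters stay inside the last value instead of being dropped.
        String[] parts = delimiter.split(raw.toString(), targetFields.size());
        for (int i = 0; i < parts.length; i++) {
            doc.put(targetFields.get(i), parts[i].trim());
        }
    }
}
```

For the customer example, a splitter built with source field "custom1", delimiter regex "\\|" and target fields "customer_id" and "customer_name" would turn a custom1 value of "42|Acme Corp" into two extra fields, assuming both target fields (hypothetical names) are defined in schema.xml.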

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Splitting fields

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, I wonder if a custom Transformer would help here? It can be inserted
into a chain of transformers in DIH (DataImportHandler).

Essentially, you subclass Transformer and implement one method (transformRow),
where you can do anything you want. The input is a Map<String, Object> that is
a simple representation of the Solr document. You can add to, remove from, or
otherwise change that map and then just return it.

The map passed to transformRow already reflects the changes made by any earlier
entries in the transformer chain, and your changes are passed on to the next
transformer in the chain.

The only restriction I know of is that the document has to conform to the
schema when all is said and done.
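
A transformRow along these lines could create one field per key of the JSON held in 'custom1'. This is a rough sketch under several assumptions: the class name and the custom1_ prefix are made up, the JSON is assumed to be a flat object with plain string values, and the class deliberately imports no Solr classes, so it only mirrors the transformRow(Map) shape described above rather than subclassing Transformer. Note also that a DIH transformer only sees rows that are imported through DataImportHandler.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of a DIH-style row transformer.
 * It works on a plain Map and does not depend on Solr.
 */
final class Custom1JsonTransformer {
    // Naive matcher for "key":"value" pairs in a flat JSON object. Assumes
    // string values without escaped quotes; use a real JSON parser otherwise.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    /** Adds one field per JSON key found in 'custom1' and returns the same row. */
    public Object transformRow(Map<String, Object> row) {
        Object raw = row.get("custom1");
        if (raw == null) {
            return row; // pass the row on unchanged
        }
        Matcher m = PAIR.matcher(raw.toString());
        while (m.find()) {
            // Prefix the key so the new fields could be covered by a single
            // dynamic-field rule such as custom1_* (an assumed schema entry).
            row.put("custom1_" + m.group(1), m.group(2));
        }
        return row;
    }
}
```

Prefixing the generated field names is one way to meet the schema restriction mentioned above, since the set of JSON keys may not be known in advance.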

Best
Erick
