Posted to solr-user@lucene.apache.org by P Williams <wi...@gmail.com> on 2013/11/13 19:55:45 UTC

Using data-config.xml from DIH in SolrJ

Hi All,

I'm building a utility (Java jar) to create SolrInputDocuments and send
them to an HttpSolrServer using the SolrJ API.  The intention is to find an
efficient way to build documents from a large directory of files (where
multiple files make up one Solr document) and send them to a remote Solr
instance for update and commit.
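In outline, the SolrJ side looks something like this (a sketch only: the
URL, core name, field names, and helper method names are placeholders, not
from my actual code):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {

    /** Build one Solr document; the field names here are placeholders. */
    public static SolrInputDocument buildDoc(String id, String title) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);
        return doc;
    }

    /** Send a whole batch in one round trip, then commit once. */
    public static void send(SolrServer server, List<SolrInputDocument> batch)
            throws SolrServerException, java.io.IOException {
        server.add(batch);   // one HTTP request per batch, not per document
        server.commit();
    }

    public static void main(String[] args) throws Exception {
        SolrServer server =
            new HttpSolrServer("http://remote-host:8983/solr/collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        batch.add(buildDoc("item-001", "First document"));
        // ... walk the directory and add more documents here ...
        send(server, batch);
        server.shutdown();
    }
}
```

Batching the adds and committing once at the end is what makes this cheaper
than one request per file.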

I've already solved the problem using the DataImportHandler (DIH), so I have
a data-config.xml that describes the templated fields and the cross-walking
of the source(s) to the schema.  The original data won't always be
co-located with the Solr server, which is why I'm looking for another option.
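For illustration, the kind of data-config.xml I mean looks roughly like this
(the paths, entity names, and field names are made up for this sketch; the
templated fields come from TemplateTransformer):

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/source" fileName=".*\.xml" recursive="true"
            rootEntity="false" transformer="TemplateTransformer">
      <!-- templated field: a constant prefix plus a value from the entity -->
      <field column="id" template="item-${files.file}"/>
      <entity name="record" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/record">
        <field column="title" xpath="/record/title"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```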

I've also already solved the problem using ant and xslt to create a
temporary (and, unfortunately, potentially large) document which the
UpdateHandler will accept.  I couldn't think of a solution that took
advantage of the XSLT support in the UpdateHandler because each document is
created from multiple files.  Our current, dated, Java-based solution
significantly outperforms this approach in terms of both disk usage and
time, so I've rejected it and gone back to the drawing board.

Does anyone have any suggestions on how I might be able to reuse my DIH
configuration in the SolrJ context without re-inventing the wheel (or DIH
in this case)?  If I'm doing something ridiculous I hope you'll point that
out too.

Thanks,
Tricia

Re: Using data-config.xml from DIH in SolrJ

Posted by P Williams <wi...@gmail.com>.
Hi,

I just discovered UpdateProcessorFactory
(http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/package-summary.html)
in a big way.  How did this completely slip by me?

I'm working on two ideas.
1. I have used the DIH in a local EmbeddedSolrServer previously.  I could
write a ForwardingUpdateProcessorFactory to take that local update and send
it to an HttpSolrServer.
2. I have code which walks the file system to compose rough documents but
haven't yet written the part that handles the templated fields and
cross-walking of the source(s) to the schema.  I could configure the update
handler on the Solr server side to do this with the RegexReplace
(http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html)
and DefaultValue
(http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html)
UpdateProcessorFactory(ies).
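For idea 2, the server-side chain in solrconfig.xml would look something
like this (the chain name, field names, pattern, and values are placeholders
for this sketch):

```xml
<updateRequestProcessorChain name="crosswalk">
  <!-- fill in a constant field when the client doesn't send one -->
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">collection</str>
    <str name="value">local-history</str>
  </processor>
  <!-- regex cleanup on a field, e.g. collapsing runs of whitespace -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">title</str>
    <str name="pattern">\s+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The client would then select the chain with update.chain=crosswalk on its
update requests.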

Any thoughts on the advantages/disadvantages of these approaches?

Thanks,
Tricia



On Thu, Nov 14, 2013 at 7:49 AM, Erick Erickson <er...@gmail.com> wrote:

> There's nothing that I know of that takes a DIH configuration and
> uses it through SolrJ. You can use Tika directly in SolrJ if you
> need to parse structured documents though, see:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> Yep, you're going to be kind of reinventing the wheel a bit I'm
> afraid.
>
> Best,
> Erick

Re: Using data-config.xml from DIH in SolrJ

Posted by Erick Erickson <er...@gmail.com>.
There's nothing that I know of that takes a DIH configuration and
uses it through SolrJ. You can use Tika directly in SolrJ if you
need to parse structured documents though, see:
http://searchhub.org/2012/02/14/indexing-with-solrj/
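A rough sketch of that approach (Tika 1.x API; the file handling and field
names are just placeholders, not a recommendation):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToSolrDoc {

    /** Parse one file with Tika and map text + metadata to a SolrInputDocument. */
    public static SolrInputDocument parse(File f) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler text = new BodyContentHandler(-1); // -1 = no size limit
        Metadata meta = new Metadata();
        InputStream in = new FileInputStream(f);
        try {
            parser.parse(in, text, meta);
        } finally {
            in.close();
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", f.getName());
        doc.addField("content", text.toString());
        doc.addField("content_type", meta.get(Metadata.CONTENT_TYPE));
        return doc;
    }
}
```

The resulting documents go out through the usual SolrJ add/commit calls.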

Yep, you're going to be kind of reinventing the wheel a bit I'm
afraid.

Best,
Erick

