Posted to solr-user@lucene.apache.org by Anil Cherian <ch...@gmail.com> on 2015/12/15 06:47:44 UTC

Is DIH going to be removed from Solr future versions?

Dear Team,

I use DIH extensively and have even written my own custom transformers in
some situations.
Recently, during an architecture discussion, one of my team members told me
that Solr is going to remove DIH from its future versions.

Is that true?

Also, is using DIH for, say, 2 or 3 million docs a good option for indexing
an XML content data set? I am planning to use it either by calling separate
entities in parallel or by defining multiple /dataimport handlers in
solrconfig.xml.

Could you please reply at your earliest convenience, as whether to continue
with DIH or not is an important decision for us!

Thanks and regards,
Anil.
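
A minimal sketch of the "multiple /dataimport handlers" plan described above, assuming two handlers are already registered in solrconfig.xml. The core URL, the handler paths (/dataimport-a, /dataimport-b) and the entity names are hypothetical placeholders. The request parameters (command, entity, clean, commit) are standard DIH parameters; clean=false matters here because a full-import otherwise clears the index first, which would wipe what the other import has written. To my understanding a single handler processes one import command at a time, which is why the sketch uses one handler per parallel import.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values: adjust the core URL, handler paths and entity names.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"
HANDLER_ENTITIES = {
    "/dataimport-a": "articles",
    "/dataimport-b": "reports",
}


def build_import_url(core_url, handler, entity):
    """Build a full-import request for one entity on one DIH handler."""
    params = urlencode({
        "command": "full-import",
        "entity": entity,
        "clean": "false",  # keep documents written by the other import
        "commit": "true",
        "wt": "json",
    })
    return f"{core_url}{handler}?{params}"


def http_get(url):
    """Default fetcher: issue the request and return the response body."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def start_imports(core_url, handler_entities, fetch=http_get):
    """Start one import per handler concurrently; returns {handler: body}."""
    urls = {
        handler: build_import_url(core_url, handler, entity)
        for handler, entity in handler_entities.items()
    }
    with ThreadPoolExecutor(max_workers=max(len(urls), 1)) as pool:
        futures = {h: pool.submit(fetch, u) for h, u in urls.items()}
        return {h: f.result() for h, f in futures.items()}
```

Calling start_imports(SOLR_CORE_URL, HANDLER_ENTITIES) only starts the imports; progress would still have to be polled with command=status on each handler.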

Re: Is DIH going to be removed from Solr future versions?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Are you suggesting building a local mini-collection and then mirroring the
final result to the real one?

What about deletions? Per-entry cleanup statements and so on? DIH does full
updates, not just additions.

Or did I miss the focus?

Regards,
    Alex
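
To make the deletion concern concrete, here is a small sketch of what an external importer would have to compute so that the live collection mirrors the source rather than only growing. It assumes the importer can obtain the set of ids currently in the index (how it does so is left open) and that documents carry a unique "id" field. The function names are hypothetical; the payload shapes follow the JSON accepted by Solr's /update handler (an array of documents to add, and {"delete": [ids]} to delete by id).

```python
import json


def build_sync_commands(source_docs, indexed_ids, id_field="id"):
    """Return (docs_to_add, ids_to_delete) so the index mirrors the source."""
    source_ids = {doc[id_field] for doc in source_docs}
    # Anything indexed but no longer present in the source must be deleted.
    ids_to_delete = sorted(set(indexed_ids) - source_ids)
    return list(source_docs), ids_to_delete


def to_update_payloads(docs_to_add, ids_to_delete):
    """Serialize to two JSON request bodies for Solr's /update handler."""
    add_body = json.dumps(docs_to_add)
    delete_body = json.dumps({"delete": ids_to_delete})
    return add_body, delete_body


if __name__ == "__main__":
    source = [{"id": "1", "title": "kept"}, {"id": "2", "title": "new"}]
    adds, deletes = build_sync_commands(source, indexed_ids=["1", "3"])
    print(to_update_payloads(adds, deletes))
```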
On 15 Dec 2015 11:46 pm, "Erik Hatcher" <er...@gmail.com> wrote:

> With time shaken loose, IMO ideally what we do (under
> https://issues.apache.org/jira/browse/SOLR-7188 probably) is create an
> update processor that *forwards* to a _real_ Solr collection update
> handler, and fire up EmbeddedSolrServer in a client-side command-line tool
> that can run /update/extract, DIH stuff, etc - does what it does now to
> extract, parse, and build documents and then forwards them via javabin to a
> live Solr collection. I’m not sure that SOLR-7188 currently spells it out
> like that, but it is a nice, clean, straightforward path from DIH and Tika
> embedded inside a real Solr cluster to leveraging and scaling it on its
> own. We’d lose the DIH admin UI, but that’s ok by me.

Re: Is DIH going to be removed from Solr future versions?

Posted by Erik Hatcher <er...@gmail.com>.
With time shaken loose, IMO ideally what we do (under https://issues.apache.org/jira/browse/SOLR-7188 probably) is create an update processor that *forwards* to a _real_ Solr collection update handler, and fire up EmbeddedSolrServer in a client-side command-line tool that can run /update/extract, DIH stuff, etc. The tool does what Solr does now to extract, parse, and build documents, and then forwards them via javabin to a live Solr collection.

I’m not sure that SOLR-7188 currently spells it out like that, but it is a nice, clean, straightforward path from DIH and Tika embedded inside a real Solr cluster to leveraging and scaling it on its own. We’d lose the DIH admin UI, but that’s ok by me.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com
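
The forwarding half of that idea can be sketched without any Solr internals: a client-side tool builds documents and pushes them in batches to the update handler of a live collection. This is only an illustration under stated assumptions. It sends JSON over plain HTTP instead of javabin, it does not involve EmbeddedSolrServer or a real update processor, and the function names and the update URL passed to make_http_sender are hypothetical.

```python
import json
from itertools import islice
from urllib.request import Request, urlopen


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def make_http_sender(update_url):
    """Return a send(batch) function that POSTs a JSON array of documents."""
    def send(batch):
        request = Request(
            update_url,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urlopen(request, timeout=60) as response:
            response.read()
    return send


def forward(documents, send, batch_size=500):
    """Forward documents batch by batch; returns how many were sent."""
    total = 0
    for batch in batched(documents, batch_size):
        send(batch)
        total += len(batch)
    return total
```

A run against a live collection might look like forward(docs, make_http_sender("http://localhost:8983/solr/mycollection/update?commit=true")), where the URL is a placeholder. Keeping send injectable means the extraction side can be exercised without a running Solr.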





RE: Is DIH going to be removed from Solr future versions?

Posted by "Davis, Daniel (NIH/NLM) [C]" <da...@nih.gov>.
I am aware of the problems with the implementation of DIH, but is there any problem with the XML-driven data import capability?
Could it be rewritten (using modern XPath) to run as part of SolrJ?

I've been interested in that, but I just haven't been able to shake loose the time.
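
As a rough illustration of what a path-driven XML import outside Solr could look like, the sketch below maps XPath-style expressions to field names and turns each record element into a document. It is not SolrJ and not the DIH XPathEntityProcessor: it uses Python's xml.etree.ElementTree, which supports only a limited XPath subset, and the record layout, field names and function name are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of Solr field names to paths relative to each record.
FIELD_PATHS = {
    "id": "./id",
    "title": "./title",
    "author": "./authors/author",
}


def extract_documents(xml_text, record_path, field_paths):
    """Turn each element matching record_path into a dict of field values."""
    root = ET.fromstring(xml_text)
    documents = []
    for record in root.findall(record_path):
        doc = {}
        for field, path in field_paths.items():
            values = [
                element.text.strip()
                for element in record.findall(path)
                if element.text and element.text.strip()
            ]
            if len(values) == 1:
                doc[field] = values[0]   # single-valued field
            elif values:
                doc[field] = values      # multi-valued field
        documents.append(doc)
    return documents


SAMPLE = """<records>
  <record><id>1</id><title>First</title>
    <authors><author>A</author><author>B</author></authors></record>
  <record><id>2</id><title>Second</title>
    <authors><author>C</author></authors></record>
</records>"""

if __name__ == "__main__":
    for document in extract_documents(SAMPLE, "./record", FIELD_PATHS):
        print(document)
```

The resulting dictionaries are plain data, so they can be handed to whatever client actually talks to Solr.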


Re: Is DIH going to be removed from Solr future versions?

Posted by Upayavira <uv...@odoko.co.uk>.
I doubt DIH will be "removed". It will more likely be relegated - still
there, but emphasised less.

Another possibility that has been mooted is to extract it, so that it
can run outside of Solr. This strikes me as the best option. Having it
run inside Solr strikes me as architecturally wrong, and also
problematic in a SolrCloud world. If you take the DIH codebase and run it
*outside* Solr, you get the best of DIH without the same set of issues.

Upayavira
