
Posted to solr-user@lucene.apache.org by Marek Ščevlík <ms...@codenameprojects.com> on 2016/11/18 14:29:20 UTC

Data Import Request Handler isolated into its own project - any suggestions?

Hello. My name is Marek Scevlik.

I currently work for a small company where we are interested in
implementing your Solr 6.3 search engine.

We are hoping to extract the Data Import Request Handler from the original
source package into its own project and build a usable .jar file out of it.

It should then serve as a tool that connects to a remote server and returns
data to our other application, which would use the returned data.

What do you think? Would anything like this be possible? Can the Data Import
Request Handler be isolated into its own standalone project?

If we could achieve this, we would not mind sharing this new feature with
the community.

I realize this is a first email and may lead to several hundred more, so
for a start my request is very simple and not very detailed, but I am sure
you realize it may become quite complex.

So I wonder if anyone will reply.

Thanks a lot for any replies and further info or guidance.

Thanks.

Regards, Marek Scevlik

Re: Data Import Request Handler isolated into its own project - any suggestions?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Is your goal to still index into Solr? It was not clear.

If yes, then it has been discussed quite a bit. The challenge is that
DIH is integrated into the Admin UI, which makes it easier to see the
progress and set some flags. Plus the required jars are loaded via
solrconfig.xml, just like all other extra libraries. So, a contribution
back would need to take that into account.

If you are not ready to face that, it may make sense to look at other
libraries first: Apache Camel, Apache NiFi, Cloudera Morphlines, etc.
All of them can send data into Solr, though their version support
differs. For example, Camel still seems to need Solr 3.5. Somebody
updating its implementation to Solr 6.3 and contributing that back
to that project would do a lot of good.

Regards,
    Alex.
----
Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


Re: Data Import Request Handler isolated into its own project - any suggestions?

Posted by Erick Erickson <er...@gmail.com>.

On a quick glance, and not having tried this myself...

This seems wrong. You're setting a URL parameter "db":
params.set("db","/dataimport");

That's equivalent to a URL like
http://localhost:8983/solr?db=/dataimport

You'd want:
http://localhost:8983/solr/db/dataimport?command=full-import

I think you want to set the URL for your HTTPClient to
the full Solr path to the dataimport handler, i.e. something like
...solr/collection_or_core/dataimport
then set the params for the dataimport handler like you are, i.e.:
params.set("command", "full-import");
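The target URL described above can be assembled with nothing but the JDK. This is a minimal sketch, assuming the "db" core and the "/dataimport" handler mentioned in this thread; DihUrlBuilder and its build method are hypothetical names for illustration, not part of Solr or SolrJ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Solr or SolrJ.
final class DihUrlBuilder {

    // Joins the Solr base URL, the core name and the handler path,
    // then appends the URL-encoded query parameters.
    static String build(String solrBase, String core, String handler,
                        Map<String, String> params) {
        StringBuilder url = new StringBuilder(stripTrailingSlash(solrBase));
        url.append('/').append(core);
        url.append(handler.startsWith("/") ? handler : "/" + handler);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(separator)
               .append(encode(e.getKey()))
               .append('=')
               .append(encode(e.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("command", "full-import");
        // Assumed values: local Solr on port 8983, core "db", handler "/dataimport".
        System.out.println(build("http://localhost:8983/solr/", "db", "/dataimport", params));
    }
}
```

The parameters go into the query string, while the core and the handler are part of the path; that is the distinction between the two URLs above.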

Best,
Erick

On Sat, Nov 26, 2016 at 11:03 AM, Marek Ščevlík
<ms...@codenameprojects.com> wrote:
> Actually, to be honest, I realized that I only need to trigger a data
> import handler from a jar file. In earlier versions this was done
> via the SolrServer object. Now I am wondering whether this is OK:
>
> String urlString1 = "http://localhost:8983/solr/";
> SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();
>
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("db","/dataimport");
> params.set("command", "full-import");
> System.out.println(params.toString());
> QueryResponse qresponse1 = solr1.query(params);
>
> System.out.println("response = " + qresponse1);
>
> The output I get from this is: response =
> {responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}
>
> There is a core db which comes with the examples in the Solr 6.3 package. It is
> loaded. From the web admin UI I can operate it and run the DIH reindex process.
>
> I wonder whether this could work. What do you think? I am trying to call
> DIH while Solr is running. This code is in a separate jar file that is run
> alongside the Solr instance.
>
> So far this is not working for me, and I wonder why. What do you think?
> Should this work at all? Or perhaps someone else could help out.
>
>
> Thanks to anyone for any help.

Re: Data Import Request Handler isolated into its own project - any suggestions?

Posted by Marek Ščevlík <ms...@codenameprojects.com>.

I ran my jar application alongside the running Solr instance where I want to
trigger a DIH import.
I tried this approach:

String urlString1 = "http://localhost:8983/solr/db/dataimport";
SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "full-import");
SolrRequest request = new QueryRequest(params);
solr1.request(request);

... and it now returns:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:8983/solr/db/dataimport: Expected mime type
application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/db/dataimport/select. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

So I am still confused ...

What do you think? Any ideas?

I am trying to figure it out. The silly thing is that when I make a plain URL
call in Java with the same URL string used in those Solr request objects,
it does the desired thing.

Weird, I think.

Thanks for any replies or help.
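
One possible reading of the 404 above: the failing path is /solr/db/dataimport/select, which suggests that the client appended a default "/select" request path to the base URL it was given. Below is a minimal sketch of that composition, using only the JDK. RequestUrlDemo, resolve and DEFAULT_PATH are hypothetical names that model the behaviour seen in the error message; this is not SolrJ code.

```java
// Hypothetical model of how a client might join its base URL with a request path.
final class RequestUrlDemo {

    // Assumption: the default request path, inferred from the "/select" in the 404 message.
    static final String DEFAULT_PATH = "/select";

    // Joins the client base URL with the request path, falling back to the
    // default path when the request does not name one.
    static String resolve(String baseUrl, String requestPath) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        String path = (requestPath == null) ? DEFAULT_PATH : requestPath;
        return base + path;
    }

    public static void main(String[] args) {
        // Base URL already contains the handler: the default path is appended after it.
        System.out.println(resolve("http://localhost:8983/solr/db/dataimport", null));
        // Base URL stops at the core: the handler is supplied as the request path.
        System.out.println(resolve("http://localhost:8983/solr/db", "/dataimport"));
    }
}
```

If that reading is right, pointing the client at http://localhost:8983/solr/db and giving the request the path /dataimport should reach the handler. SolrJ's SolrRequest has a setPath method that looks intended for this, though that is untested here.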



Re: Data Import Request Handler isolated into its own project - any suggestions?

Posted by Marek Ščevlík <ms...@codenameprojects.com>.

Actually, to be honest, I realized that I only need to trigger a data
import handler from a jar file. In earlier versions this was done
via the SolrServer object. Now I am wondering whether this is OK:

String urlString1 = "http://localhost:8983/solr/";
SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("db","/dataimport");
params.set("command", "full-import");
System.out.println(params.toString());
QueryResponse qresponse1 = solr1.query(params);

System.out.println("response = " + qresponse1);

The output I get from this is: response =
{responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}

There is a core db which comes with the examples in the Solr 6.3 package. It is
loaded. From the web admin UI I can operate it and run the DIH reindex process.

I wonder whether this could work. What do you think? I am trying to call
DIH while Solr is running. This code is in a separate jar file that is run
alongside the Solr instance.

So far this is not working for me, and I wonder why. What do you think?
Should this work at all? Or perhaps someone else could help out.


Thanks to anyone for any help.
========================

2016-11-25 19:50 GMT+01:00 Marek Ščevlík <ms...@codenameprojects.com>:

> I forgot to mention that I am creating a jar file that runs alongside a
> running Solr 6.3 instance, to which I am hoping to attach from Java via the
> SolrDispatchFilter to get at the cores, so that I can then work with the
> data in code.

Re: Data Import Request Handler isolated into its own project - any suggestions?

Posted by Marek Ščevlík <ms...@codenameprojects.com>.

I forgot to mention that I am creating a jar file that runs alongside a
running Solr 6.3 instance, to which I am hoping to attach from Java via the
SolrDispatchFilter to get at the cores, so that I can then work with the
data in code.


2016-11-25 19:31 GMT+01:00 Marek Ščevlík <ms...@codenameprojects.com>:

> Hi Daniel. Thanks for the reply. I wonder whether it is still possible with
> the release of Solr 6.3 to get hold of a running instance of the Jetty server
> that is part of the solution. I found some code for previous versions where
> it was captured like this, and one could then obtain the cores of a
> running Solr instance ...
>
> SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty
>         .getDispatchFilter().getFilter();
>
> I was trying to implement it this way, but that is not working out very
> well now. I can't seem to get the Jetty server object for the running
> instance. I tried several combinations but none seemed to work.
>
> Can you perhaps point me in the right direction?
>
> Perhaps you know more than I do at the moment.
>
> Any help would be great.
>
> Thanks a lot
> Regards, Marek Scevlik

Re: Data Import Request Handler isolated into its own project - any suggestions?

Posted by Marek Ščevlík <ms...@codenameprojects.com>.

Hi Daniel. Thanks for the reply. I wonder whether it is still possible with
the release of Solr 6.3 to get hold of a running instance of the Jetty server
that is part of the solution. I found some code for previous versions where
it was captured like this, and one could then obtain the cores of a
running Solr instance ...

SolrDispatchFilter solrDispatchFilter = (SolrDispatchFilter) jetty
        .getDispatchFilter().getFilter();

I was trying to implement it this way, but that is not working out very well
now. I can't seem to get the Jetty server object for the running instance. I
tried several combinations but none seemed to work.

Can you perhaps point me in the right direction?

Perhaps you know more than I do at the moment.

Any help would be great.

Thanks a lot
Regards, Marek Scevlik



2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <daniel.davis@nih.gov>:

> Marek,
>
> I've wanted to do something like this in the past as well.  However, a
> rewrite that supports the same XML syntax might be better.   There are
> several problems with the design of the Data Import Handler that make it
> not quite suitable:
>
> - Not designed for Multi-threading
> - Bad implementation of XPath
>
> Another issue is that one of the big advantages of Data Import Handler
> goes away at this point, which is that it is hosted within Solr, and has a
> UI for testing within the Solr admin.
>
> A better open-source Java solution might be to connect Solr with Apache
> Camel - http://camel.apache.org/solr.html.
>
> If you are not tied absolutely to pure open-source, and freemium products
> will do, then you might look at Pentaho Spoon and Kettle.   Although Talend
> is much more established in the market, I find Pentaho's XML-based ETL a
> bit easier to integrate as a developer, and unit test and such.   Talend
> does better when you have a full infrastructure set up, but then the
> attention required to unit tests and Git integration seems over the top.
>
> Another powerful way to get things done, depending on what you are
> indexing, is to use LogStash and couple that with Document processing
> chains.   Many of our projects benefit from having a single RDBMS view,
> perhaps a materialized view, that is used for the index.   LogStash does
> just fine here, pulling from the RDBMS and posting each row to Solr.  The
> hierarchical execution of Data Import Handler is very nice, but this can
> often be handled on the RDBMS side by creating a view, maybe using
> functions to provide some rows.   Many RDBMS systems also support
> federation and the import of XML from files, so that this brings XML
> processing into the picture.
>
> Hoping this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH

RE: Data Import Request Handler isolated into its own project - any suggestions?

Posted by "Davis, Daniel (NIH/NLM) [C]" <da...@nih.gov>.

Marek,

I've wanted to do something like this in the past as well.  However, a rewrite that supports the same XML syntax might be better.   There are several problems with the design of the Data Import Handler that make it not quite suitable:

- Not designed for multi-threading
- Bad implementation of XPath

Another issue is that one of the big advantages of Data Import Handler goes away at this point, which is that it is hosted within Solr, and has a UI for testing within the Solr admin.

A better open-source Java solution might be to connect Solr with Apache Camel - http://camel.apache.org/solr.html.

If you are not absolutely tied to pure open-source, and freemium products will do, then you might look at Pentaho Spoon and Kettle. Although Talend is much more established in the market, I find Pentaho's XML-based ETL a bit easier to integrate, unit test and so on as a developer. Talend does better when you have a full infrastructure set up, but then the attention required for unit tests and Git integration seems over the top.

Another powerful way to get things done, depending on what you are indexing, is to use Logstash and couple that with Document processing chains. Many of our projects benefit from having a single RDBMS view, perhaps a materialized view, that is used for the index. Logstash does just fine here, pulling from the RDBMS and posting each row to Solr. The hierarchical execution of Data Import Handler is very nice, but this can often be handled on the RDBMS side by creating a view, maybe using functions to provide some rows. Many RDBMS systems also support federation and the import of XML from files, which brings XML processing into the picture.
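
To make the "one view, one row per document" idea concrete, here is a minimal sketch in plain Java, not Logstash and not Data Import Handler code. It only shows how rows that have already been read from a view could be turned into the JSON array that Solr's /update handler accepts. The class name RowsToSolrJson, its methods and the sample field names are all made up for illustration, and a real setup would more likely use Logstash or SolrJ than hand-built JSON.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch only: turns rows (e.g. read from a database view) into a JSON
 * array of flat documents. Class and method names are hypothetical and
 * are not part of Solr or the Data Import Handler.
 */
public class RowsToSolrJson {

    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Serializes one row as a flat JSON object; null columns are skipped. */
    static String toJsonDoc(Map<String, Object> row) {
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Object v = e.getValue();
            if (v == null) continue;
            // Numbers and booleans are written bare, everything else as a string.
            String value = (v instanceof Number || v instanceof Boolean)
                    ? v.toString()
                    : "\"" + escape(v.toString()) + "\"";
            fields.add("\"" + escape(e.getKey()) + "\":" + value);
        }
        return "{" + String.join(",", fields) + "}";
    }

    /** Serializes all rows as one JSON array, one object per row. */
    static String toJsonBatch(List<Map<String, Object>> rows) {
        List<String> docs = new ArrayList<>();
        for (Map<String, Object> row : rows) docs.add(toJsonDoc(row));
        return "[" + String.join(",", docs) + "]";
    }

    public static void main(String[] args) {
        // Hypothetical row, as it might come back from a view.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 42);
        row.put("title", "Data \"Import\" Handler");
        row.put("author", null);
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row);
        System.out.println(toJsonBatch(rows));
    }
}
```

The resulting string would be sent as the body of an HTTP POST to the collection's /update endpoint with Content-Type application/json. Anything hierarchical stays on the database side, in the view, so each row maps to exactly one flat document.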

Hoping this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



