Posted to solr-user@lucene.apache.org by Olivier Austina <ol...@gmail.com> on 2014/10/28 22:12:49 UTC

Indexing documents/files for production use

Hi All,

I am reading the Solr documentation. I understand that post.jar
<http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29>
is not meant for production use and that cURL
<https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing>
is not recommended. Is SolrJ better for production? Thank you.
Regards
Olivier

Re: Indexing documents/files for production use

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What is your production use? You have to answer that for yourself.

post.jar makes a couple of things easy. If your production use fits
into those (e.g. no cluster) - great, use it. It is certainly not any
worse than cURL.
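
For reference, post.jar and cURL both end up issuing an HTTP POST to
Solr's update handler. Below is a minimal sketch of such a request in
Python. The host, port and core name (localhost:8983, collection1) and
the sample document are placeholders, not details from this thread, and
the request is only built here, not sent.

```python
import json
import urllib.request

# Placeholder endpoint: adjust host, port and core name to your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"

def build_update_request(docs):
    """Build (but do not send) a JSON update request for a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_update_request([{"id": "doc-1", "title": "Hello"}])
# To actually send it: urllib.request.urlopen(req)
```

For a single node and a handful of files this is about all there is to
it, which is why the simple tools are fine in that situation.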

But if you are running a cluster and have specific requirements, then
yes, use something that's cluster aware. Whether it is a custom client
on top of SolrJ, Spring Data, or a Cloudera pipeline will depend on your
particular use case. Don't make your life over-complicated in advance.

Regards,
   Alex.


Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853



Re: Indexing documents/files for production use

Posted by Olivier Austina <ol...@gmail.com>.
Thank you Alexandre, Jürgen and Erick for your replies. It is clear to me now.

Regards
Olivier



Re: Indexing documents/files for production use

Posted by Erick Erickson <er...@gmail.com>.
And one other consideration in addition to the two excellent responses
so far....

In a SolrCloud environment, SolrJ via CloudSolrServer will automatically
route the documents to the correct shard leader, saving some additional
overhead. Post.jar and cURL send the docs to a node, which in turn
forwards them to the correct shard leader, and that extra hop lowers
throughput.

Best,
Erick
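
The routing point can be illustrated with a small simulation. This is
only a sketch: the node names are made up, and the CRC32-based
leader_for function is a stand-in for Solr's real document router, used
here just to show where the extra hop comes from.

```python
import zlib

# Hypothetical shard leaders, one per shard.
SHARD_LEADERS = ["node-a", "node-b", "node-c"]

def leader_for(doc_id):
    """Pick a shard leader for a document id (stand-in hash, not Solr's router)."""
    return SHARD_LEADERS[zlib.crc32(doc_id.encode("utf-8")) % len(SHARD_LEADERS)]

def hops_cluster_aware(doc_id):
    """A cluster-aware client sends straight to the leader: one network hop."""
    return 1

def hops_via_fixed_node(doc_id, entry_node):
    """A plain HTTP client posts to one fixed node, which forwards if it is not the leader."""
    return 1 if entry_node == leader_for(doc_id) else 2

doc_ids = ["doc-%d" % i for i in range(1000)]
direct = sum(hops_cluster_aware(d) for d in doc_ids)
forwarded = sum(hops_via_fixed_node(d, "node-a") for d in doc_ids)
print("cluster-aware hops:", direct, "fixed-node hops:", forwarded)
```

With several shards, most documents posted to a fixed node belong to a
different leader, so most of them pay for the second hop.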


Re: Indexing documents/files for production use

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.
Hello Olivier,
  for real production use, you won't really want to use any toys like
post.jar or curl. You want a decent connector to whatever data source
there is, that fetches data, possibly massages it a bit, and then feeds
it into Solr - by means of SolrJ or directly into the web service of
Solr via binary protocols. This way, you can properly handle incremental
feeding, processing of data from remote locations (with the connector
being closer to the data source), and also source data security. Also
think about what happens if you do processing of incoming documents in
Solr. What happens if Tika runs out of memory because of PDF problems?
What if this crashes your Solr node? In our Solr projects, we generally
do not do any sizable processing within Solr, as document processing,
document indexing and querying all have different scaling properties.
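
A connector of the kind described above (fetch, massage, feed, with
incremental feeding) could be sketched as follows. Everything here is
hypothetical: the record fields, the integer "modified" timestamp used
as a checkpoint, and the send_batch callback, which stands in for
whatever client (for example SolrJ) actually talks to Solr.

```python
def massage(record):
    """Light clean-up done in the connector, outside the search server."""
    return {
        "id": str(record["id"]),
        "title": str(record.get("title", "")).strip(),
        "modified": record["modified"],
    }

def feed_incrementally(source, send_batch, checkpoint, batch_size=2):
    """Send only records modified after `checkpoint`; return the new checkpoint."""
    batch = []
    newest = checkpoint
    for record in source:
        if record["modified"] <= checkpoint:
            continue  # already fed in an earlier run
        batch.append(massage(record))
        newest = max(newest, record["modified"])
        if len(batch) >= batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)  # flush the last partial batch
    return newest

# Usage with an in-memory sink instead of a real Solr client.
rows = [
    {"id": 1, "title": " Old report ", "modified": 100},
    {"id": 2, "title": "New report", "modified": 205},
    {"id": 3, "title": "Newer report ", "modified": 210},
    {"id": 4, "title": "Newest", "modified": 300},
]
sent = []
new_checkpoint = feed_incrementally(rows, sent.append, checkpoint=200)
```

Keeping the checkpoint in the connector means a second run with the
returned value sends nothing until the source changes again.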

"Production use" most typically is not achieved by deploying a vanilla
Solr, but rather having a bit more glue and wrappage, so the whole will
fit your requirements in terms of functionality, scaling, monitoring and
robustness. Some similar platforms like Elasticsearch try to alleviate
these pains of going to a production-style infrastructure, but that's at
the expense of flexibility and comes with limitations.

For proof-of-concept or demonstrator-style applications, the plain tools
out of the box will be fine. For production applications, you want to
have more robust components.
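
One way to get that robustness for the extraction step (the Tika
out-of-memory concern above) is to run it in a separate process, so a
crash or hang there cannot take the indexing side down with it. The
sketch below assumes nothing about Tika itself: the child process is
just a Python one-liner standing in for a real extractor, and
extract_isolated is a hypothetical helper name.

```python
import subprocess
import sys

def extract_isolated(child_code, timeout_s=10.0):
    """Run extraction code in a child process; return its output, or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # extractor hung; the caller can skip or retry the document
    if result.returncode != 0:
        return None  # extractor crashed; only the child process died
    return result.stdout.strip()

ok = extract_isolated("print('extracted text')")
crashed = extract_isolated("raise MemoryError('simulated parser blow-up')")
```

The caller decides what to do with a None result, for example logging
the document id and moving on, instead of losing the whole feed.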

Best regards,
--Jürgen



-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com, URL: www.devoteam.de

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071