Posted to solr-user@lucene.apache.org by Sabeer Hussain <sh...@del.aithent.com> on 2017/12/08 09:40:49 UTC

How to perform delta-import on SolrCloud mode through a scheduler?

I am using Solr 7.1 version and deployed it in standalone mode. I have
created a scheduler in my application itself to perform delta-import
operation based on a pre-configured frequency. I have used the following
lines of code (in java) to invoke delta-import operation
                // The request must target the dataimport handler on the core.
                URL url = new URL("http://localhost:8080/solr/mycorename/dataimport");
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // A form body is written below, so this is really a POST,
                // not a GET.
                connection.setRequestMethod("POST");
                connection.setDoOutput(true);
                connection.setRequestProperty("Authorization", getBasicAuthorization());
                connection.setConnectTimeout(10000); // 10 seconds

                String data = URLEncoder.encode("command", "UTF-8") + "="
                        + URLEncoder.encode("delta-import", "UTF-8");
                data += "&" + URLEncoder.encode("optimize", "UTF-8") + "="
                        + URLEncoder.encode("true", "UTF-8");
                data += "&" + URLEncoder.encode("clean", "UTF-8") + "="
                        + URLEncoder.encode("false", "UTF-8");
                data += "&" + URLEncoder.encode("commit", "UTF-8") + "="
                        + URLEncoder.encode("true", "UTF-8");

                OutputStreamWriter out = new OutputStreamWriter(connection.getOutputStream());
                out.write(data);
                out.flush();

                BufferedReader in = new BufferedReader(
                        new InputStreamReader(connection.getInputStream()));
                String decodedString;
                while ((decodedString = in.readLine()) != null) {
                    System.out.println(decodedString);
                }
                in.close();
                out.close();
Now, I want to deploy the application in SolrCloud mode, and each core will
have two more replicas. Each replica will run on a separate server (say
localhost:8983, localhost:8984, localhost:8985). So my question is: how can
I invoke the delta-import operation for my cores using the code above? In
standalone mode there are only cores and no replicas, so I can use it
directly. In Cloud mode, do I need to check for the active node and create
the URLConnection based on that? Is there anything from ZooKeeper that I can
use (like a SolrJ client)? Another issue is that my delta-import scheduler
will be available in all three instances of my application, so how can I
ensure that the delta-import operation is performed from one instance only?
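One common way to run a scheduled job from exactly one instance is an
exclusive lock. The following is a minimal, hypothetical sketch (not from
this thread) using a JDK file lock; it assumes the instances can see the
same file, e.g. on one host or a shared mount. For instances on genuinely
separate servers, a coordination service such as ZooKeeper would be needed
instead. The lock-file path is a placeholder.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ImportLeaderLock {

    // Returns the lock if this instance won it, or null if another
    // process already holds it.
    static FileLock tryAcquire(Path lockFile) throws IOException {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.tryLock();
        if (lock == null) {
            channel.close(); // another process holds the lock
        }
        return lock;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path: all instances must resolve the same file.
        Path lockFile = Paths.get(System.getProperty("java.io.tmpdir"),
                "delta-import.lock");
        FileLock lock = tryAcquire(lockFile);
        if (lock != null) {
            System.out.println("acquired");
            // ... trigger the delta-import here, then release ...
            lock.release();
        } else {
            System.out.println("another instance is running the import");
        }
    }
}
```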




Re: How to perform delta-import on SolrCloud mode through a scheduler?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/8/2017 2:40 AM, Sabeer Hussain wrote:
> I am using Solr 7.1 version and deployed it in standalone mode. I have
> created a scheduler in my application itself to perform delta-import
> operation based on a pre-configured frequency. I have used the following
> lines of code (in java) to invoke delta-import operation

When the language is Java, I would use SolrJ.  The code tends to be
easier to write and read than code like yours that uses the HTTP
functionality built into Java.  The response objects have a lot of
sugar methods, and the entire response is available as a Java object
that's easy to use in code -- you don't have to worry about parsing
the response yourself.

> Now, I want to deploy the application in SolrCloud mode and for each core,
> there will be 2 more replicas.

Most things in SolrCloud should be done at the collection level --
replacing "corename" with "collectionname" in the URL you have in your
code.  But DIH (the dataimport handler) is not one of them.

Using DIH at the collection level is possible, but you'll find that the
requests are load-balanced across the cloud, so you are likely to get a
status from a different replica than you sent the import to.  So, even
though most of the time I would recommend using CloudSolrClient from
SolrJ when running SolrCloud, for the dataimport handler, you should
actually use HttpSolrClient.

If the index has only one shard, or you are using the compositeId router
for automatic distribution of data between multiple shards, then running
an import on *ANY* core in the collection will distribute and replicate
data as you would expect across the entire collection.  If you're using
the implicit router and there are multiple shards, then things get a lot
more tricky, but SolrCloud will still do all the replication for you. 
I'm not going to go into detail about shards in this message.

Here's some SolrJ code to start an import and print the response.  The
example code includes a possible core name for the "foo" collection in
SolrCloud.  A specific core should be used for DIH so that you can be
sure that all requests are sent to the same place.  The query I've built
in the example doesn't have all the parameters you included, but you
should be able to see how to add anything you need.  One thing I'm not
clear on is whether the distrib=false parameter is required to disable
the load balancing.

  /*
   * By using a base URL without a core/collection name, one client
   * object can be used for requests to multiple indexes hosted on the
   * server side.
   */
  String baseUrl = "http://host:port/solr";
  String coreName = "foo_shard1_replica_n1";
  SolrClient client = new HttpSolrClient.Builder(baseUrl).build();

  SolrQuery startQuery = new SolrQuery();
  startQuery.setRequestHandler("/dataimport");
  startQuery.set("command", "delta-import");
  startQuery.set("clean", "false");

  try {
    QueryResponse response = client.query(coreName, startQuery);
    System.out.println(response.getResponse().toString());
  } catch (SolrServerException | IOException e) {
    // Handle the failure appropriately for your application.
    e.printStackTrace();
  }

As I mentioned above, for most types of requests against SolrCloud
(other than DIH), you should use CloudSolrClient, not HttpSolrClient,
and send requests to the collection instead of a specific core.  The
cloud client is initialized using ZooKeeper info rather than a URL.  It
is fully aware of the entire cloud at all times.  For DIH though, you
don't want to send things to the collection, because of SolrCloud's
inherent load balancing.

The difficulties of getting a program to deal with a DIH status response
are a whole separate discussion.
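As a rough illustration of the polling side (not something from the thread),
a status request against one specific core can use the same java.net
plumbing as the original question.  The host, port, and core name below are
placeholders, and a real scheduler would still have to parse the response
and decide when the import is finished, which is the hard part mentioned
above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DihStatus {

    // Build the query string for a DIH status request (pure, easy to test).
    static String statusQuery() throws IOException {
        return URLEncoder.encode("command", "UTF-8") + "="
             + URLEncoder.encode("status", "UTF-8");
    }

    // Perform one status request and return the raw response body.
    static String fetchStatus(String coreBaseUrl) throws IOException {
        URL url = new URL(coreBaseUrl + "/dataimport?" + statusQuery());
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder core URL: in SolrCloud, always poll the same core
        // you sent the import to, since DIH status is per-core.
        String core = "http://localhost:8983/solr/foo_shard1_replica_n1";
        System.out.println(core + "/dataimport?" + statusQuery());
        // A real scheduler would call fetchStatus(core) in a loop and
        // parse the response until it reports an idle status.
    }
}
```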

Thanks,
Shawn