Posted to solr-user@lucene.apache.org by Sarah Weissman <sw...@stsci.edu> on 2017/12/05 17:47:34 UTC

Dataimport handler showing idle status with multiple shards

Hi,

I’ve recently been using the dataimport handler to import records from a database into a SolrCloud collection with multiple shards. I have 6 dataimport handlers configured on 6 different paths, all running simultaneously against the same DB. I’ve noticed that when I do this I often get “idle” status from the DIH even when the import is still running. The percentage of the time I get an “idle” response seems proportional to the number of shards. That is, with 1 shard it always shows me non-idle status, with 2 shards I see idle about half the time I check the status, and with 96 shards it seems to show idle almost all the time. I can see the size of each shard increasing, so I’m sure the import is still going.

I recently switched from 6.1 to 7.1 and I don’t remember this happening in 6.1. Does anyone know why the DIH would report idle when it’s running?

e.g.:
curl http://myserver:8983/solr/collection/dataimport6
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "initArgs":[
    "defaults",[
      "config","data-config6.xml"]],
  "status":"idle",
  "importResponse":"",
  "statusMessages":{}}

Thanks,
Sarah

Re: Dataimport handler showing idle status with multiple shards

Posted by Sarah Weissman <sw...@stsci.edu>.

From: Shawn Heisey <el...@elyograg.org>
Reply-To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Date: Tuesday, December 5, 2017 at 1:31 PM
To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
Subject: Re: Dataimport handler showing idle status with multiple shards

On 12/5/2017 10:47 AM, Sarah Weissman wrote:

<snip>

To use DIH with SolrCloud, you should be sending your request directly
to a shard replica core, not the collection, so that you can be
absolutely certain that the import command and the status command are
going to the same place.  You MIGHT need to also have a distrib=false
parameter on the request, but I do not know whether that is required to
prevent the load balancing on the dataimport handler.

<snip>

Thanks for the information, Shawn. I am relatively new to SolrCloud and I am used to running the dataimport from the admin dashboard, where it happens at the collection level, so I find it surprising that the right way to do this is at the core level. So, if I want to be able to check the status of my data import for N cores, I would need to create N different data import configs that manually partition the collection and start each config on a different core? That seems like it could get confusing. And then if I wanted to grow or shrink my shards I’d have to rejigger my data import configs every time. I kind of expect a distributed index to hide these details from me.

I only have one node at the moment, and I don’t understand SolrCloud’s internals well enough to know what it means for the data import to be running on a shard vs. a node. It would be nice if a status query would at least tell you something, like the number of documents last indexed on that core, even if nothing is currently running. That way I could at least extrapolate how much longer the operation will take.
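In the meantime I can at least watch per-core document counts grow via the CoreAdmin STATUS API (`/solr/admin/cores?action=STATUS`), which reports `numDocs` for each core. A sketch of pulling those counts out of the response (the core names and counts below are made up; fetch the real body with curl or urllib):

```python
import json

def num_docs_per_core(core_admin_body: str) -> dict:
    """Map each core name to its numDocs from a CoreAdmin STATUS response."""
    data = json.loads(core_admin_body)
    return {name: info["index"]["numDocs"] for name, info in data["status"].items()}

# Abbreviated, hypothetical example of the STATUS response structure:
sample = json.dumps({
    "status": {
        "collection_shard1_replica_n1": {"index": {"numDocs": 1200}},
        "collection_shard2_replica_n2": {"index": {"numDocs": 1187}},
    }
})
print(num_docs_per_core(sample))
```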


Re: Dataimport handler showing idle status with multiple shards

Posted by Shawn Heisey <el...@elyograg.org>.
On 12/5/2017 10:47 AM, Sarah Weissman wrote:
> I’ve recently been using the dataimport handler to import records from a database into a Solr cloud collection with multiple shards. I have 6 dataimport handlers configured on 6 different paths all running simultaneously against the same DB. I’ve noticed that when I do this I often get “idle” status from the DIH even when the import is still running. The percentage of the time I get an “idle” response seems proportional to the number of shards. I.e., with 1 shard it always shows me non-idle status, with 2 shards I see idle about half the time I check the status, with 96 shards it seems to be showing idle almost all the time. I can see the size of each shard increasing, so I’m sure the import is still going.
>
> I recently switched from 6.1 to 7.1 and I don’t remember this happening in 6.1. Does anyone know why the DIH would report idle when it’s running?
>
> e.g.:
> curl http://myserver:8983/solr/collection/dataimport6

When you send a DIH request to the collection name, SolrCloud is going 
to load balance that request across the cloud, just like it would with 
any other request.  Solr will look at the list of all responding nodes 
that host part of the collection and send multiple such requests to 
different cores (shards/replicas) across the cloud.  If there are four 
cores in the collection and the nodes hosting them are all working, then 
each of those cores would only see requests to /dataimport about one 
fourth of the time.

DIH imports happen at the core level, NOT the collection level, so when 
you start an import on a collection with four cores in the cloud, only 
one of those four cores is actually going to be doing the import; the 
rest of them are idle.
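Put numerically: with one core doing the import and a status poll that lands on each core with equal probability, the expected fraction of "idle" responses is (N-1)/N. A quick illustration (my own arithmetic sketch, not Solr code):

```python
from fractions import Fraction

def expected_idle_fraction(num_cores: int) -> Fraction:
    """One core runs the import; a load-balanced status poll hits each
    core equally often, so all but one core answers 'idle'."""
    return Fraction(num_cores - 1, num_cores)

# Matches the observation in the original message:
for n in (1, 2, 96):
    print(n, float(expected_idle_fraction(n)))
```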

This behavior should happen with any version, so I would expect it in 
6.1 as well as 7.1.

To use DIH with SolrCloud, you should be sending your request directly 
to a shard replica core, not the collection, so that you can be 
absolutely certain that the import command and the status command are 
going to the same place.  You MIGHT need to also have a distrib=false 
parameter on the request, but I do not know whether that is required to 
prevent the load balancing on the dataimport handler.
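A sketch of building such core-targeted requests, so the import command and the status check are guaranteed to hit the same core (the core name below is illustrative; find your actual core names in the admin UI):

```python
from urllib.parse import urlencode

def dih_request_url(host: str, core: str, handler: str, command: str) -> str:
    """Build a DIH URL aimed at one specific core, with distrib=false so
    the request is not load balanced to another replica."""
    qs = urlencode({"command": command, "distrib": "false"})
    return f"http://{host}/solr/{core}/{handler}?{qs}"

# Both the import and the status check go to the SAME core:
core = "collection_shard1_replica_n1"   # illustrative core name
print(dih_request_url("myserver:8983", core, "dataimport6", "full-import"))
print(dih_request_url("myserver:8983", core, "dataimport6", "status"))
```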

A similar question came to this list two days ago, and I replied to that 
one yesterday.

http://lucene.472066.n3.nabble.com/Dataimporter-status-tp4365602p4365879.html

Somebody did open an issue a LONG time ago about this problem:

https://issues.apache.org/jira/browse/SOLR-3666

I just commented on the issue.

Thanks,
Shawn