Posted to users@solr.apache.org by Joshua Hendrickson <jh...@tripadvisor.com> on 2021/08/10 21:05:55 UTC

Duplicate sample errors using prometheus-exporter in Solr 8.9.0

Hello,

Our organization has implemented Solr 8.9.0 for a production use case. We have standardized on Prometheus for metrics collection and storage. We export metrics from our Solr cluster by deploying the public Solr image for version 8.9.0 to an EC2 instance and using Docker to run the exporter binary against Solr (which is running in a container on the same host). Our Prometheus scraper (hosted in Kubernetes and configured via a Helm chart) reports errors like the following on every scrape:

ts=2021-08-10T16:44:13.929Z caller=dedupe.go:112 component=remote level=error remote_name=11d3d0 url=https://our.endpoint/push msg="non-recoverable error" count=500 err="server returned HTTP status 400 Bad Request: user=nnnnn: err: duplicate sample for timestamp. timestamp=2021-08-10T16:44:13.317Z, series={__name__=\"solr_metrics_core_time_seconds_total\", aws_account=\"our-account\", base_url=\"http://fqdn.for.solr.server:32080/solr\", category=\"QUERY\", cluster=\"our-cluster\", collection=\"a-collection\", core=\"a_collection_shard1_replica_t13\", dc=\"aws\", handler=\"/select\", instance=\" fqdn.for.solr.server:8984\", job=\"solr\", replica=\"replica_t13\", shard=\"shard1\"}"

We have confirmed that there are indeed duplicate time series when we query our Prometheus exporter. Here is a sample that shows the duplicate time series (a quick way to check for these duplicates is sketched after the sample):

solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t1",collection="a_collection",shard="shard1",replica="replica_t1",base_url="http://fqdn3.for.solr.server:32080/solr",} 1.533471301599E9
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t1",collection="a_collection",shard="shard1",replica="replica_t1",base_url="http://fqdn3.for.solr.server:32080/solr",} 8.89078653472891E11
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t1",collection="a_collection",shard="shard1",replica="replica_t1",base_url="http://fqdn3.for.solr.server:32080/solr",} 8.9061212477449E11
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t3",collection="a_collection",shard="shard1",replica="replica_t3",base_url="http://fqdn2.for.solr.server:32080/solr",} 1.63796914645E9
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t3",collection="a_collection",shard="shard1",replica="replica_t3",base_url="http://fqdn2.for.solr.server:32080/solr",} 9.05314998357273E11
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t3",collection="a_collection",shard="shard1",replica="replica_t3",base_url="http://fqdn2.for.solr.server:32080/solr",} 9.06952967503723E11
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t5",collection="a_collection",shard="shard1",replica="replica_t5",base_url="http://fqdn1.for.solr.server:32080/solr",} 1.667842814432E9
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t5",collection="a_collection",shard="shard1",replica="replica_t5",base_url="http://fqdn1.for.solr.server:32080/solr",} 9.1289401347629E11
solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t5",collection="a_collection",shard="shard1",replica="replica_t5",base_url="http://fqdn1.for.solr.server:32080/solr",} 9.14561856290722E11

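In case it helps anyone reproduce the check, the duplicates can be spotted directly against the exporter endpoint with a one-liner along these lines (host, port, and metrics path are placeholders for our setup; adjust them to yours):

# Drop the value column, then look for label sets that appear more than
# once in a single scrape.
curl -s http://fqdn.for.solr.server:8984/metrics \
  | grep -v '^#' \
  | sed 's/ [^ ]*$//' \
  | sort | uniq -d
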
This is the systemd unit file that runs the exporter container:

[Unit]
Description=Solr Exporter Docker
After=network.target
Wants=network.target
Requires=docker.service
After=docker.service

[Service]
Type=simple
ExecStart=/usr/bin/docker run --rm \
--name=solr-exporter \
--net=host \
--user=solr \
solr:8.9.0 \
/opt/solr/contrib/prometheus-exporter/bin/solr-exporter \
-p 8984 -z the-various-zookeeper-endpoints -f /opt/solr/contrib/prometheus-exporter/conf/solr-exporter-config.xml -n 4

ExecStop=/usr/bin/docker stop -t 2 solr-exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

I compared the prometheus-exporter XML configuration between 8.6.2 (the version we used previously) and the latest release, and it looks like this area went through a major refactoring at some point. Is there something we are missing? Can anyone reproduce this issue on 8.9?

Thanks in advance,
Joshua Hendrickson 


Re: Duplicate sample errors using prometheus-exporter in Solr 8.9.0

Posted by Mathieu Marie <mm...@salesforce.com.INVALID>.
It happens because you use *-z <zk-url>* to connect to Solr.
When you do that, the prometheus-exporter assumes it is connecting to a
SolrCloud environment and will collect the metrics from all nodes.
Given that you have started 3 prometheus-exporters, each one of them
collects all metrics from the whole cluster, so every series ends up
exported 3 times.

You can fix this in two different ways:
1- use *-h <your-local-solr-url>* instead of *-z <zk-url>* (see the sketch
after the note below)
2- run only one instance of the prometheus-exporter in the cluster

Note that solution 1 will not retrieve the metrics you have configured in
the *<collections>* tag of your configuration, as *-h* assumes a
non-SolrCloud instance.
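
For option 1, the ExecStart line from the unit file above would change
roughly as sketched here. The host, port, and the exact name of the
standalone base-URL flag are assumptions on my part; the 8.x reference
guide documents it as -b/--baseurl, and bin/solr-exporter --help will show
what your image actually accepts.

# Sketch only: point the exporter at the local node instead of ZooKeeper,
# so each exporter scrapes just its own Solr instance.
ExecStart=/usr/bin/docker run --rm \
--name=solr-exporter \
--net=host \
--user=solr \
solr:8.9.0 \
/opt/solr/contrib/prometheus-exporter/bin/solr-exporter \
-p 8984 -b http://localhost:32080/solr \
-f /opt/solr/contrib/prometheus-exporter/conf/solr-exporter-config.xml -n 4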

Regards,
Mathieu

-- 
Mathieu Marie
Software Engineer | Salesforce
Mobile: + 33 6 98 59 62 31