Posted to dev@lucene.apache.org by Uwe Schindler <uw...@thetaphi.de> on 2019/08/31 10:19:52 UTC

NVMe - SSD shredding due to Lucene :-)

Hi all,

I just wanted to inform you that I asked the provider of the Policeman Jenkins Server to replace the first of two NVMe SSDs, because it failed with fatal warnings due to too many writes and no more spare sectors:

> root@serv1 ~ # nvme smart-log /dev/nvme0
> Smart Log for NVME device:nvme0 namespace-id:ffffffff
> critical_warning                    : 0x1
> temperature                         : 76 C
> available_spare                     : 2%
> available_spare_threshold           : 10%
> percentage_used                     : 67%
> data_units_read                     : 62,129,054
> data_units_written                  : 648,788,135
> host_read_commands                  : 6,426,997,226
> host_write_commands                 : 5,582,107,803
> controller_busy_time                : 86,754
> power_cycles                        : 21
> power_on_hours                      : 20,252
> unsafe_shutdowns                    : 16
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 7855
> Critical Composite Temperature Time : 0
> Temperature Sensor 1                : 76 C
> Thermal Management T1 Trans Count   : 0
> Thermal Management T2 Trans Count   : 0
> Thermal Management T1 Total Time    : 0
> Thermal Management T2 Total Time    : 0

The second one looks a bit better, but will be changed later, too. I have no idea what a data unit is (512 bytes, 2048 bytes,... - I think one LBA).
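
For reference, the NVMe specification defines the SMART data-unit counters in thousands of 512-byte blocks, so one data unit is 512,000 bytes regardless of the LBA size. Below is a minimal Python sketch of the conversion, applied to the counters from the log above. The function names are made up for illustration.

```python
# Assumption: per the NVMe specification, one SMART "data unit" is
# 1000 blocks of 512 bytes (512,000 bytes), independent of the LBA format.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_bytes(data_units: int) -> int:
    """Convert an NVMe SMART data-unit counter to bytes."""
    return data_units * BYTES_PER_DATA_UNIT


def average_bytes_per_hour(data_units: int, power_on_hours: int) -> float:
    """Average throughput per power-on hour for a data-unit counter."""
    return data_units_to_bytes(data_units) / power_on_hours


if __name__ == "__main__":
    written = data_units_to_bytes(648_788_135)  # data_units_written
    read = data_units_to_bytes(62_129_054)      # data_units_read
    rate = average_bytes_per_hour(648_788_135, 20_252)
    print(f"written: {written / 1e12:.1f} TB")
    print(f"read:    {read / 1e12:.1f} TB")
    print(f"average: {rate / 1e9:.1f} GB written per power-on hour")
```

With the values above this works out to roughly 332 TB written and 32 TB read over 20,252 power-on hours.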

So we are really shredding SSDs with Lucene tests 😊

Uwe

P.S.: The replacement is currently in progress...
-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: NVMe - SSD shredding due to Lucene :-)

Posted by Đạt Cao Mạnh <ca...@gmail.com>.
Thanks Uwe for keeping the Police up and running!

Best regards,
Cao Mạnh Đạt
E-mail: caomanhdat317@gmail.com

RE: NVMe - SSD shredding due to Lucene :-)

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

NVMe SSD #2 has also been replaced. Both replacements are of course “recycled ones”; that’s how data centers work (when customers stop using a server and cancel the rental agreement, the SSDs are recycled and reused as replacement parts, unless their SMART status is bad). But their lifetime is good for at least 1.5 years.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: NVMe - SSD shredding due to Lucene :-)

Posted by Michael McCandless <lu...@mikemccandless.com>.
SSD vendors should use our tests for QA'ing their new SSDs!

Mike McCandless

http://blog.mikemccandless.com



RE: NVMe - SSD shredding due to Lucene :-)

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

The service to replace those SSDs is included in the rental fee 😊

I am not sure why it writes so much, but I think Solr hammers our SSDs more. Lucene’s tests do not do much I/O. Nevertheless, the SSD survived more than 2 years. The server was installed on 2017-05-19. After some runtime I calculated the approximate lifetime, and my estimate was not bad: I said 2 years 😊

FYI, at the moment they are replacing disk #2 (I rebuilt the RAID array before).

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


RE: NVMe - SSD shredding due to Lucene :-)

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Mike,

You are right: we have the special NIO.2 filesystem that makes fsync a no-op in 90% of all cases. This works fine with Lucene. Solr, however, does not use the virtual filesystem; it just copies the path name of the temp directory as a string and passes it to the default directory factory through its solrconfig.xml file. Solr therefore uses the plain default filesystem, and there is no way to capture its fsyncs.

We should work on a solution for this, as it may speed up tests dramatically.

In the meantime I did “apt install eatmydata” (http://manpages.ubuntu.com/manpages/bionic/man1/eatmydata.1.html). This makes it easy to hide all fsyncs. We can just add this to the Jenkins config for new jobs in the job environment plugin, so that Jenkins jobs no longer fsync:

LD_PRELOAD=libeatmydata.so

This trick may be interesting for others, too. Steve Rowe?
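
For anyone who wants to try the same trick from a script rather than from the Jenkins job environment, here is a minimal Python sketch. It assumes eatmydata is installed so that the dynamic loader can resolve libeatmydata.so. The helper names eatmydata_env and run_without_fsync are hypothetical.

```python
import os
import re
import subprocess


def eatmydata_env(base_env=None, library="libeatmydata.so"):
    """Return a copy of the environment with `library` listed in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    # ld.so accepts spaces or colons as separators in LD_PRELOAD.
    preloads = [p for p in re.split(r"[ :]+", env.get("LD_PRELOAD", "")) if p]
    if library not in preloads:
        preloads.append(library)
    env["LD_PRELOAD"] = " ".join(preloads)
    return env


def run_without_fsync(command):
    """Run `command` (a list of arguments) with libeatmydata preloaded."""
    return subprocess.run(command, env=eatmydata_env(), check=False)
```

A call such as run_without_fsync(["ant", "test"]) (the command is only an example) starts the process with the library preloaded. Child processes inherit the environment, so forked test JVMs are covered as long as nothing resets LD_PRELOAD.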

 

To test the difference, I will now run the Jenkins server for a day, measure the number of reads/writes from the SMART output, and then enable this for the Linux jobs (it’s easy in the Groovy file that selects the random JVM).
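
The before/after comparison could be scripted along these lines. This is only a sketch: it parses the text layout of the smart-log output shown earlier in the thread, assumes that one data unit is 1000 blocks of 512 bytes as defined by the NVMe specification, and uses hypothetical helper names.

```python
import re

# Assumption: one NVMe SMART data unit is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512

# Matches "name : 1,234" lines, optionally prefixed by "> " quote markers.
# Non-decimal values such as "0x1" are deliberately skipped.
FIELD = re.compile(
    r"^[> \t]*([A-Za-z][\w ]*?)[ \t]*:[ \t]*(\d[\d,]*)\b", re.MULTILINE
)


def parse_smart_log(text):
    """Parse `nvme smart-log` text output into a {field: int} dict."""
    return {
        match.group(1): int(match.group(2).replace(",", ""))
        for match in FIELD.finditer(text)
    }


def bytes_written_between(before_text, after_text):
    """Bytes written between two smart-log snapshots of the same device."""
    before = parse_smart_log(before_text)["data_units_written"]
    after = parse_smart_log(after_text)["data_units_written"]
    return (after - before) * BYTES_PER_DATA_UNIT
```

Taking one snapshot before and one after a day of Jenkins runs, with and without the preload, gives two numbers that can be compared directly.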

 

The VMs for Windows, Mac, and Solaris already have their virtual disks configured to ignore any device syncs.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: NVMe - SSD shredding due to Lucene :-)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Nice to know :) Thanks for upgrading, Uwe.

I thought we randomly disable fsync in tests just to protect our precious
SSDs?

Mike McCandless

http://blog.mikemccandless.com

