Posted to derby-dev@db.apache.org by "Bergquist, Brett" <BB...@canoga.com> on 2015/09/04 00:35:22 UTC

Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

A production system with a database of about 400G received this error today, and it appears that Derby shut down parts of itself, because from that point on it was emitting errors saying it could not find the database.   The system was restarted, came up clean, and is working with no issues, inserting data continuously at well over 100 inserts/second and performing queries and other data updates without problems.

ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.store.raw.data.CachedPage.readPage(Unknown Source)
        at org.apache.derby.impl.store.raw.data.CachedPage.setIdentity(Unknown Source)
        at org.apache.derby.impl.services.cache.ConcurrentCache.find(Unknown Source)
        at org.apache.derby.impl.store.raw.data.FileContainer.initPage(Unknown Source)
        at org.apache.derby.impl.store.raw.data.FileContainer.newPage(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseContainer.addPage(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseContainerHandle.addPage(Unknown Source)
        at org.apache.derby.impl.store.access.heap.HeapController.doInsert(Unknown Source)
        at org.apache.derby.impl.store.access.heap.HeapController.insertAndFetchLocation(Unknown Source)
        at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(Unknown Source)
        at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(Unknown Source)
        at org.apache.derby.impl.sql.execute.InsertResultSet.open(Unknown Source)
        at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
        at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(Unknown Source)
        at org.apache.derby.iapi.jdbc.BrokeredPreparedStatement.execute(Unknown Source)
        at org.apache.derby.impl.drda.DRDAStatement.execute(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTTobjects(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTT(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
Caused by: java.io.EOFException: Reached end of file while attempting to read a whole page.
        at org.apache.derby.impl.store.raw.data.RAFContainer4.readFull(Unknown Source)
        at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage0(Unknown Source)
        at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(Unknown Source)
        at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(Unknown Source)

Derby then reported errors like:

Thu Sep 03 16:03:54 GMT 2015 Thread[DRDAConnThread_80,5,main] (DATABASE = csemdb), (DRDAID = ????????.??-1011901099978582906{5640754}),
Thu Sep 03 16:03:54 GMT 2015 :
org.apache.derby.iapi.error.ShutdownException:
        at org.apache.derby.iapi.services.context.ContextManager.checkInterrupt(Unknown Source)
        at org.apache.derby.iapi.services.context.ContextManager.popContext(Unknown Source)
        at org.apache.derby.iapi.services.context.ContextImpl.popMe(Unknown Source)
        at org.apache.derby.jdbc.EmbedXAResource.removeXATransaction(Unknown Source)
        at org.apache.derby.jdbc.EmbedXAResource.returnConnectionToResource(Unknown Source)
        at org.apache.derby.jdbc.EmbedXAResource.commit(Unknown Source)
        at org.apache.derby.impl.drda.DRDAXAProtocol.commitXATransaction(Unknown Source)
        at org.apache.derby.impl.drda.DRDAXAProtocol.commitTransaction(Unknown Source)
        at org.apache.derby.impl.drda.DRDAXAProtocol.parseSYNCCTL(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
Thu Sep 03 16:03:54 GMT 2015 Thread[DRDAConnThread_114,5,main] (DATABASE = csemdb), (DRDAID = ????????.??-940969405847491181{5640766}), null
Thu Sep 03 16:03:54 GMT 2015 : null
Thu Sep 03 16:03:54 GMT 2015 : null
java.lang.NullPointerException
        at org.apache.derby.impl.drda.DRDAConnThread.writePBSD(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
java.lang.NullPointerException
        at org.apache.derby.impl.drda.DRDAConnThread.writePBSD(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
        at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
Thu Sep 03 16:03:54 GMT 2015 Thread[DRDAConnThread_191,5,main] (DATABASE = csemdb), (DRDAID = ????????.??-939843505940648198{5640770}), null
Thu Sep 03 16:03:54 GMT 2015 : null

And

Thu Sep 03 16:03:54 GMT 2015 Thread[DRDAConnThread_185,5,main] Cleanup action starting
java.sql.SQLException: Database 'csemdb' not found.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.newSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.handleDBNotFound(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection40.<init>(Unknown Source)
        at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
        at org.apache.derby.jdbc.EmbeddedBaseDataSource.getConnection(Unknown Source)
        at org.apache.derby.jdbc.EmbedPooledConnection.openRealConnection(Unknown Source)
        at org.apache.derby.jdbc.EmbedXAConnection.getRealConnection(Unknown Source)
        at org.apache.derby.iapi.jdbc.BrokeredConnection.getRealConnection(Unknown Source)
        at org.apache.derby.iapi.jdbc.BrokeredConnection.isClosed(Unknown Source)

The system is an Oracle M5000 Enterprise server with what I believe is a 15TB Sun ZFS Storage 7320 external ZFS storage array connected by Fibre Channel.   This is the first time in over 8 years that we have seen an I/O error like this.

What I am trying to confirm is that this really is low-level Derby code, so that if it reports a "java.io.EOFException" as it did here, it really did hit an I/O error somewhere while reading the page from the container file.   Things like performance, Java heap space, etc., can then pretty much be ruled out as causes.   My gut feeling is that something in the connection to this storage array had a hiccup.   This setup is at the customer site, so I cannot directly access system logs, nor do I have knowledge of how this storage array works or how to inspect it, but just having confirmation that an I/O error really did occur would help.



________________________________
Canoga Perkins
20600 Prairie Street
Chatsworth, CA 91311
(818) 718-6300

This e-mail and any attached document(s) is confidential and is intended only for the review of the party to whom it is addressed. If you have received this transmission in error, please notify the sender immediately and discard the original message and any attachment(s).

Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

Posted by Bryan Pendleton <bp...@gmail.com>.
> ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk.
> Caused by: java.io.EOFException: Reached end of file while attempting to read a whole page.

Does the derby.log have any more detail about this specific exception?

Note that you can use the system tables (SYSCONGLOMERATES, I believe)
to figure out which table corresponds to conglomerate 30832, and you
can also multiply 1325564 by the pagesize of your table to figure out
what the file size was at the instant that this happened.
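For reference, the catalog lookup Bryan describes might look something like this (a sketch against the documented SYS.SYSCONGLOMERATES, SYS.SYSTABLES, and SYS.SYSSCHEMAS catalogs):

```sql
-- Map the container/conglomerate number from the error back to its table.
SELECT s.SCHEMANAME, t.TABLENAME, c.ISINDEX
FROM SYS.SYSCONGLOMERATES c
JOIN SYS.SYSTABLES  t ON c.TABLEID  = t.TABLEID
JOIN SYS.SYSSCHEMAS s ON t.SCHEMAID = s.SCHEMAID
WHERE c.CONGLOMERATENUMBER = 30832;
```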

Assuming your page size was 4096, 1325564 * 4096 is 5,429,510,144, so
that conglomerate should be about 5.4 GB in size.
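That arithmetic, plus the container-to-file mapping, can be sketched as follows (the page size and the `c<hex>.dat` naming under seg0 are my assumptions here; verify them against your own database):

```java
public class PageMath {
    public static void main(String[] args) {
        long pageNumber = 1325564L;  // from Page(1325564, Container(0, 30832))
        long containerId = 30832L;
        long pageSize = 4096L;       // assumption: substitute your table's actual page size

        // Bryan's estimate: the conglomerate should be roughly this many bytes,
        // so a container file shorter than this would explain the EOFException.
        long expectedBytes = pageNumber * pageSize;
        System.out.println("expected size ~ " + expectedBytes + " bytes");

        // Container files live under seg0 and (as I understand the layout) are
        // named c<container-id-in-hex>.dat.
        System.out.println("file: seg0/c" + Long.toHexString(containerId) + ".dat");
    }
}
```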

>
> Derby then reported errors like:
> org.apache.derby.iapi.error.ShutdownException:

This is normal I believe.

> java.lang.NullPointerException
>          at org.apache.derby.impl.drda.DRDAConnThread.writePBSD(Unknown Source)
>          at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
>          at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)

This is scary, but it appears to have happened AFTER the shutdown, and hence
may be some secondary, unrelated bug in the network server code related to
not handling a shutdown correctly. It seems worth investigating separately.


> The system is an Oracle M5000 Enterprise server with what I believe is a 15TB Sun ZFS Storage 7320 external ZFS storage array connected by Fibre Channel.   This is the first time in over 8 years we have seen any I/O error like such.
>
> What I am trying to confirm is that this is really low level derby code that if it reports an “java.io.EOFException” like it did, it really did have an I/O error somewhere in reading the page from the container file.   Things like performance, java heap
> space, etc, can pretty much be ruled out as causing such an error.   My gut feeling is that maybe something in the connection to this storage array had a hiccup.   This setup is at the customer site and I cannot directly access system logs nor do I have
> knowledge on how this storage array works and how to look at such but just having confirmation that an I/O error really did occur would help.

This is good information to have.

My feeling is that you should do a more thorough investigation of the
specific conglomerate in question, to check for errors that might
not be showing up using your regular application access patterns.

Also, if you can find any more information in the derby log, it would
be nice to know.

Thanks for sharing the information that you do have, it is quite
interesting to know what your experience is!

bryan

P.S. I believe this is the code that threw the java.io.EOFException:

     /**
      * Attempts to fill buf completely from start until it's full.
      * <p/>
      * FileChannel has no readFull() method, so we roll our own.
      * <p/>
      * @param dstBuffer buffer to read into
      * @param srcChannel channel to read from
      * @param position file position from where to read
      *
      * @throws IOException if an I/O error occurs while reading
      * @throws StandardException If thread is interrupted.
      */
     private void readFull(ByteBuffer dstBuffer,
                           FileChannel srcChannel,
                           long position)
             throws IOException, StandardException
     {
         while(dstBuffer.remaining() > 0) {
             if (srcChannel.read(dstBuffer,
                                     position + dstBuffer.position()) == -1) {
                 throw new EOFException(
                     "Reached end of file while attempting to read a "
                     + "whole page.");
             }

             // (**) Sun Java NIO is weird: it can close the channel due to an
             // interrupt without throwing if bytes got transferred. Compensate,
             // so we can clean up.  Bug 6979009,
             // http://bugs.sun.com/view_bug.do?bug_id=6979009
             if (Thread.currentThread().isInterrupted() &&
                     !srcChannel.isOpen()) {
                 throw new ClosedByInterruptException();
             }
         }
     }



RE: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

Posted by "Bergquist, Brett" <BB...@canoga.com>.
Answering my own question.   I was able to use ALTER TABLE DROP PRIMARY KEY on the table and then ALTER TABLE ADD PRIMARY KEY to recreate the backing index.
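Spelled out, the statements were presumably along these lines (the key column name here is invented for illustration; Derby rebuilds the backing index when the constraint is re-added):

```sql
-- Drop and recreate the primary key so Derby rebuilds the backing index.
ALTER TABLE PKG_9145E_V1.COS_ED_DROP_PROFILE_PCP_QMAPPING DROP PRIMARY KEY;
ALTER TABLE PKG_9145E_V1.COS_ED_DROP_PROFILE_PCP_QMAPPING ADD PRIMARY KEY (ID);
```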

-----Original Message-----
From: Bergquist, Brett [mailto:BBergquist@canoga.com]
Sent: Thursday, October 22, 2015 9:13 AM
To: derby-dev@db.apache.org
Subject: RE: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

I finally was able to get a copy of the database.  I wrote a utility program to run the consistency check functions on each table in the database, since the database is very large and I did not want to have to start over, excluding the failing table, each time an error occurred.   There was one table/index that reported a problem:

Checking PKG_9145E_V1.COS_ED_DROP_PROFILE_PCP_QMAPPING started at Wed Oct 21 19:55:00 EDT 2015 Failed at Wed Oct 21 19:59:27 EDT 2015 with error: java.sql.SQLException: Inconsistency found between table 'PKG_9145E_V1.COS_ED_DROP_PROFILE_PCP_QMAPPING' and index 'SQL111109192512240'.  Error when trying to retrieve row location '(583738,145)' from the table.  The full index key, including the row location, is '{ 54019451, (583738,145) }'. The suggested corrective action is to recreate the index.

So now my question is how to correct that.   I see the suggested corrective action to recreate the index, but this index is the backing index for the primary key of the table (an INTEGER column).    So how do I go about recreating this backing index?
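The consistency-check utility described above boils down to running Derby's documented SYSCS_CHECK_TABLE function per table, roughly:

```sql
-- Check every user table; SYSCS_CHECK_TABLE returns 1 on success and
-- raises an error describing any inconsistency it finds.
SELECT s.SCHEMANAME, t.TABLENAME,
       SYSCS_UTIL.SYSCS_CHECK_TABLE(s.SCHEMANAME, t.TABLENAME)
FROM SYS.SYSSCHEMAS s, SYS.SYSTABLES t
WHERE s.SCHEMAID = t.SCHEMAID AND t.TABLETYPE = 'T';
```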

-----Original Message-----
From: mike matrigali [mailto:mikemapp1@gmail.com]
Sent: Friday, September 04, 2015 9:35 PM
To: derby-dev@db.apache.org
Subject: Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

If it is not truncated and still available when you get the db, a complete derby.log containing the error would be interesting - if it does not have anything that cannot be shared.

Derby does not handle I/O errors very well, and its shutdown mechanism is definitely not clean.  Often what happens is that the store encounters an I/O error and has no idea what to do at that point.  Its default is to mark some key modules null so that it knows no update action can take place, fail the current transaction, and try to shut down the whole system.  Once the system is shut down, reboot recovery is counted on to fix anything that was encountered, assuming nothing on disk was really corrupted.  This sounds like what happened in your case, but I would always run a consistency check if there is a problem.

/mikem

On 9/4/2015 7:56 AM, Bergquist, Brett wrote:
> Thanks for the input!
>
> There is no possibility of running the consistency check on the customer's database on their system as it needs to be running 24x7 and cannot be taken down.  As far as I can tell at this point, the database came back up ok after the restart and is operating normally.
>
> I am able to get a copy of the database via file system backup that occurs each night.   Using ZFS allows us to do this by freezing the database (using FREEZE derby calls), doing a ZFS snapshot of the file system, unfreezing the database (using UNFREEZE derby call) and then accessing the ZFS snapshot to make the file system backup.   It takes me a couple of days to get all of the database transferred but then I can stage it locally and run a consistency check on the local copy.
>
> I will open a JIRA for the NullPointerExceptions that were reported after Derby did its shutdown, as Bryan suggested.
>
> For some background, the database is used in a telecommunications environment as the persistent storage for the configuration of about 90K pieces of network equipment, and it receives about 10M monitoring updates per day, 24x7.   The database has been around for about 8 years, continually growing, with Derby being upgraded along the way; it is currently at 10.10.2.0.   We also do a poor man's partitioning: we have 53 database tables, one for each week of the year, our 10M daily inserts are directed to the correct table for the week of the year, and queries are built on those weekly tables as well, with a VIEW created as a UNION query across all 53 tables when needed to process queries that span weeks.   We needed to do this because there was no practical way of deleting older data while simultaneously inserting into the table at the rate of 10M/day without database performance issues, locking contention, deletions taking an unreasonable amount of time, and problems recovering and reusing the freed database space.   Now we simply truncate the tables that are to be purged, which is nearly instantaneous.   At some point I may investigate, and contact the group here about, how one might implement a real partitioning scheme that would be more efficient, especially for queries, and add this capability back into Derby - so if anyone has any ideas on this, I am all ears.
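In outline, the weekly-table scheme described in the quoted message might look like this (all table, column, and view names are invented for illustration):

```sql
-- One table per week of the year (MON_WEEK_01 .. MON_WEEK_53).
CREATE TABLE MON_WEEK_01 (DEVICE_ID INTEGER, TS TIMESTAMP, VAL DOUBLE);
CREATE TABLE MON_WEEK_02 (DEVICE_ID INTEGER, TS TIMESTAMP, VAL DOUBLE);
-- ... and so on up to MON_WEEK_53 ...

-- A view unioning the weekly tables serves queries that span weeks.
CREATE VIEW MON_ALL_WEEKS AS
    SELECT * FROM MON_WEEK_01
    UNION ALL
    SELECT * FROM MON_WEEK_02;
    -- ... UNION ALL the remaining weekly tables ...

-- Purging a week is then a near-instant truncate instead of a mass delete.
TRUNCATE TABLE MON_WEEK_01;
```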
>
> Brett
>
> -----Original Message-----
> From: mike matrigali [mailto:mikemapp1@gmail.com]
> Sent: Friday, September 04, 2015 12:37 AM
> To: derby-dev@db.apache.org
> Subject: Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."
>
> I agree with all of bryan's suggestions.  If you can't get access to the actual db there is not much to be done.  My usual customer-support answer to this situation would be to tell you to shut down the db and do a consistency check on it, which would read every page of the table and would certainly, eventually, run into the error you got if there was a persistent problem.
> Given the size of the db, and that derby has no optimizations for dbs of this size, that is likely to take some time.
>
>  From the stack I can tell you that the problem is in a base page, not an index, which is much harder to fix if it is persistent.   In derby dbs, the output Container(0, 30832) is saying: container in segment 0 (the seg0 directory), container id 30832 (impressed by the number of containers that db has gone through).  You will also see the system catalogs talk about conglomerate numbers; in derby there is currently always a 1-1 mapping of conglomerate number to container number.
> Ancient history: in cloudscape we thought we might need the abstraction, and it was a pain to do the map at the lowest level, so we took the opportunity when we redid the arch to make it 1-1 for "now" but allow a map if anyone wanted to add one in the future.
> And here is a note from bryan, minus 6 years, on how to go from that number in the error to a file name and table name:
> http://bryanpendleton.blogspot.com/2009/09/whats-in-those-files-in-my-derby-db.html
>
> A quick check, if you could get an ls -l of the seg0 directory, would be to look at the size of the associated file and do the math bryan mentioned, to see if the file now has a full page.
> Including the page size, if you can figure it out, would also help, as derby page size vs file system page size can be an issue - but usually only on machine crashes.
>
> I would suggest filing a JIRA for this.  If it really is the case that you got the I/O error for a non-persistent problem, it may be that derby can be improved to avoid it.  Before the code was changed to use FileChannels, derby often had retry loops on I/O errors - especially on reads of pages from disk.  In the long past this just avoided some intermittent I/O problems that were in most cases network related (even though we likely did not officially support network disks).  Not sure if the old retry code is still around in the trunk, as it was for running in older JVMs.
>
> Also, I have seen weird timing errors from multiple processes accessing the same file (like backup/virus/... vs the server), but mostly on Windows OS vs unix-based ones.
>
> Getting a partial page read is a very weird error for derby as it goes out of its way to write only full pages.
> On 9/3/2015 5:39 PM, Bryan Pendleton wrote:
>> On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
>>> Reached end of file while attempting to read a whole page
>> You should probably take a close read through all the discussion on
>> this slightly old Derby JIRA Issue:
>>
>> https://issues.apache.org/jira/browse/DERBY-5234
>>
>> There are some suggestions about how to diagnose the conglomerate in
>> question in more detail, and also some observations about possible
>> causes and possible courses of action you can take subsequently.
>>
>> thanks,
>>
>> bryan
>>
>>
>
> --
> email:    Mike Matrigali - mikemapp1@gmail.com
> linkedin: https://www.linkedin.com/in/MikeMatrigali
>
>
>


--
email:    Mike Matrigali - mikemapp1@gmail.com
linkedin: https://www.linkedin.com/in/MikeMatrigali





Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

Posted by mike matrigali <mi...@gmail.com>.
If it is not truncated and still available when you get the db, a complete derby.log containing
the error would be interesting - assuming it does not have anything that cannot be shared.

Derby does not handle I/O errors very well, and its shutdown mechanism is definitely not clean.  Often
what happens is the store encounters an I/O error and has no idea what to do at that point.  Its default is
to mark some key modules null so that it knows no update action can take place, fail the current
transaction, and try to shut down the whole system.  Once the system is shut down, reboot recovery is counted
on to fix anything that was encountered, assuming nothing on disk was really corrupted.  This sounds like
what happened in your case, but I would always run a consistency check if there is a problem.

/mikem

On 9/4/2015 7:56 AM, Bergquist, Brett wrote:
> Thanks for the input!
>
> There is no possibility of running the consistency check on the customer's database on their system as it needs to be running 24x7 and cannot be taken down.  As far as I can tell at this point, the database came back up ok after the restart and is operating normally.
>
> I am able to get a copy of the database via file system backup that occurs each night.   Using ZFS allows us to do this by freezing the database (using FREEZE derby calls), doing a ZFS snapshot of the file system, unfreezing the database (using UNFREEZE derby call) and then accessing the ZFS snapshot to make the file system backup.   It takes me a couple of days to get all of the database transferred but then I can stage it locally and run a consistency check on the local copy.
>
> I will open a JIRA on the NullPointerExceptions that were reported after Derby did its shutdown, as Bryan suggested.
>
> For some background, the database is used in a telecommunications environment, being the persistent storage for the configuration for about 90K pieces of network equipment and receives about 10M monitoring updates per day 24x7.   The database has been around for about 8 years continually growing and having derby being upgraded.  It is currently at 10.10.2.0.   We also do a poor man's partitioning in that we have 53 database tables, one for each week of the year, and our 10M inserts are directed to the correct database table for the week of the year and queries are built upon those weeks as well with a VIEW that is created as a UNION query across all 53 tables when needed to process queries that span weeks.   We needed to do this as there was no practical way of deleting older data while simultaneously inserting data into the table at the rate of 10M/day and not having database performance issues, locking contention, and even getting the deletions done in a reasonable amount of time, and also recovering and reusing the freed database space.   Now we simply truncate the tables that are to be purged which is nearly instantaneous.   At some point I may investigate and contact the group here on how one might implement a real partitioning scheme that would be more efficient especially on the queries and add this capability into derby, so if anyone has any ideas on this, I am all ears.
>
> Brett
>
> -----Original Message-----
> From: mike matrigali [mailto:mikemapp1@gmail.com]
> Sent: Friday, September 04, 2015 12:37 AM
> To: derby-dev@db.apache.org
> Subject: Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."
>
> I agree with all of bryan's suggestions.  If you can't get access to the actual db there is not much to be done.  My usual customer support answer to this situation would be to tell you to shut the db and do a consistency check on it, which would read every page from the table and would certainly run into the error you got eventually if there was a persistent problem.
> Given the size of the db and that derby has no optimizations for db's of this size that is likely to take some time.
>
> From the stack I can tell you that the problem is in a base page, not an index, which is
> much harder to fix if it is persistent.   In derby db's the output
> Container(0, 30832) is saying container in segment 0 (seg0 directory) and container id
> 30832 (impressed by the number of containers that db has gone through).  Also you will see the system catalogs talk about conglomerate numbers.  In derby currently there is always a 1-1 mapping of conglomerate number to container number.
> Ancient history: in cloudscape we thought we might need the abstraction, and it was a pain to do the map at the lowest level, so we took the opportunity when we redid the arch to make it 1-1 for "now" but allow a map if anyone wanted to do one in the future.
> And here is a note from bryan from six years ago on how to go from that number in the error to file name and table name:
> http://bryanpendleton.blogspot.com/2009/09/whats-in-those-files-in-my-derby-db.html
>
> A quick check, if you could get an ls -l of the seg0 directory, would be to look at the size of the associated file and do the math bryan mentioned to see if the file now has a full page.
> Including the page size, if you figure it out, would help, as derby page size vs file system page size can be an issue - but usually only on machine crashes.
>
> I would suggest filing a JIRA for this.  If it really is the case that you got the I/O error for a non-persistent problem, it may be that derby can be improved to avoid it.  Before the code was changed to use FileChannels, derby often had retry loops on I/O errors - especially on reads of pages from disk.  In the long past this just avoided some intermittent I/O problems that were in most cases network related (even though we likely did not support the network disk officially).  Not sure if the old retry code is still around in the trunk, as it was for running in older JVMs.
>
> I have also seen weird timing errors from multiple processes accessing the same file (like backup/virus/... vs the server), but mostly on Windows vs unix-based OSes.
>
> Getting a partial page read is a very weird error for derby as it goes out of its way to write only full pages.
> On 9/3/2015 5:39 PM, Bryan Pendleton wrote:
>> On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
>>> Reached end of file while attempting to read a whole page
>> You should probably take a close read through all the discussion on
>> this slightly old Derby JIRA Issue:
>>
>> https://issues.apache.org/jira/browse/DERBY-5234
>>
>> There are some suggestions about how to diagnose the conglomerate in
>> question in more detail, and also some observations about possible
>> causes and possible courses of action you can take subsequently.
>>
>> thanks,
>>
>> bryan
>>
>>
>
> --
> email:    Mike Matrigali - mikemapp1@gmail.com
> linkedin: https://www.linkedin.com/in/MikeMatrigali
>
>


-- 
email:    Mike Matrigali - mikemapp1@gmail.com
linkedin: https://www.linkedin.com/in/MikeMatrigali



RE: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

Posted by "Bergquist, Brett" <BB...@canoga.com>.
Thanks for the input!

There is no possibility of running the consistency check on the customer's database on their system as it needs to be running 24x7 and cannot be taken down.  As far as I can tell at this point, the database came back up ok after the restart and is operating normally.

I am able to get a copy of the database via file system backup that occurs each night.   Using ZFS allows us to do this by freezing the database (using FREEZE derby calls), doing a ZFS snapshot of the file system, unfreezing the database (using UNFREEZE derby call) and then accessing the ZFS snapshot to make the file system backup.   It takes me a couple of days to get all of the database transferred but then I can stage it locally and run a consistency check on the local copy.
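
For reference, that freeze / snapshot / unfreeze sequence can be sketched as a simple plan. SYSCS_FREEZE_DATABASE and SYSCS_UNFREEZE_DATABASE are the actual Derby system procedures (run through a live JDBC connection); the zfs line is an OS command, and the pool/filesystem and snapshot names below are hypothetical:

```java
import java.util.List;

public class FrozenBackupPlan {
    // Sketch of the freeze / snapshot / unfreeze sequence described above.
    // The two CALL statements are real Derby system procedures; the zfs line
    // is run as a shell command while the database is frozen.
    static List<String> steps(String zfsFilesystem, String snapshotName) {
        return List.of(
            "CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE()",            // JDBC: block writes
            "zfs snapshot " + zfsFilesystem + "@" + snapshotName, // shell: instant snapshot
            "CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE()"           // JDBC: resume writes
        );
    }

    public static void main(String[] args) {
        // Hypothetical pool/filesystem holding the Derby database directory
        FrozenBackupPlan.steps("tank/derbydb", "nightly").forEach(System.out::println);
    }
}
```

The point of the freeze is that Derby holds the files under the database directory in a consistent state until the unfreeze call, so the snapshot taken in between is a valid file-system backup.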

I will open a JIRA on the NullPointerExceptions that were reported after Derby did its shutdown, as Bryan suggested.

For some background, the database is used in a telecommunications environment, being the persistent storage for the configuration of about 90K pieces of network equipment, and receives about 10M monitoring updates per day 24x7.   The database has been around for about 8 years, continually growing and having derby upgraded along the way.  It is currently at 10.10.2.0.

We also do a poor man's partitioning in that we have 53 database tables, one for each week of the year; our 10M inserts are directed to the correct table for the week of the year, and queries are built upon those weeks as well, with a VIEW created as a UNION query across all 53 tables when needed to process queries that span weeks.   We needed to do this because there was no practical way of deleting older data while simultaneously inserting data at the rate of 10M/day without database performance issues, locking contention, deletions taking an unreasonable amount of time, and trouble recovering and reusing the freed database space.   Now we simply truncate the tables that are to be purged, which is nearly instantaneous.

At some point I may investigate and contact the group here on how one might implement a real partitioning scheme that would be more efficient, especially on the queries, and add this capability into derby; so if anyone has any ideas on this, I am all ears.

Brett

-----Original Message-----
From: mike matrigali [mailto:mikemapp1@gmail.com]
Sent: Friday, September 04, 2015 12:37 AM
To: derby-dev@db.apache.org
Subject: Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

I agree with all of bryan's suggestions.  If you can't get access to the actual db there is not much to be done.  My usual customer support answer to this situation would be to tell you to shut the db and do a consistency check on it, which would read every page from the table and would certainly run into the error you got eventually if there was a persistent problem.
Given the size of the db and that derby has no optimizations for db's of this size that is likely to take some time.

From the stack I can tell you that the problem is in a base page, not an index, which is
much harder to fix if it is persistent.   In derby db's the output
Container(0, 30832) is saying container in segment 0 (seg0 directory) and container id
30832 (impressed by the number of containers that db has gone through).  Also you will see the system catalogs talk about conglomerate numbers.  In derby currently there is always a 1-1 mapping of conglomerate number to container number.
Ancient history: in cloudscape we thought we might need the abstraction, and it was a pain to do the map at the lowest level, so we took the opportunity when we redid the arch to make it 1-1 for "now" but allow a map if anyone wanted to do one in the future.
And here is a note from bryan from six years ago on how to go from that number in the error to file name and table name:
http://bryanpendleton.blogspot.com/2009/09/whats-in-those-files-in-my-derby-db.html

A quick check, if you could get an ls -l of the seg0 directory, would be to look at the size of the associated file and do the math bryan mentioned to see if the file now has a full page.
Including the page size, if you figure it out, would help, as derby page size vs file system page size can be an issue - but usually only on machine crashes.

I would suggest filing a JIRA for this.  If it really is the case that you got the I/O error for a non-persistent problem, it may be that derby can be improved to avoid it.  Before the code was changed to use FileChannels, derby often had retry loops on I/O errors - especially on reads of pages from disk.  In the long past this just avoided some intermittent I/O problems that were in most cases network related (even though we likely did not support the network disk officially).  Not sure if the old retry code is still around in the trunk, as it was for running in older JVMs.

I have also seen weird timing errors from multiple processes accessing the same file (like backup/virus/... vs the server), but mostly on Windows vs unix-based OSes.

Getting a partial page read is a very weird error for derby as it goes out of its way to write only full pages.
On 9/3/2015 5:39 PM, Bryan Pendleton wrote:
> On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
>> Reached end of file while attempting to read a whole page
>
> You should probably take a close read through all the discussion on
> this slightly old Derby JIRA Issue:
>
> https://issues.apache.org/jira/browse/DERBY-5234
>
> There are some suggestions about how to diagnose the conglomerate in
> question in more detail, and also some observations about possible
> causes and possible courses of action you can take subsequently.
>
> thanks,
>
> bryan
>
>


--
email:    Mike Matrigali - mikemapp1@gmail.com
linkedin: https://www.linkedin.com/in/MikeMatrigali


Canoga Perkins
20600 Prairie Street
Chatsworth, CA 91311
(818) 718-6300

This e-mail and any attached document(s) is confidential and is intended only for the review of the party to whom it is addressed. If you have received this transmission in error, please notify the sender immediately and discard the original message and any attachment(s).

Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

Posted by mike matrigali <mi...@gmail.com>.
I agree with all of bryan's suggestions.  If you can't get access to the actual db there is not
much to be done.  My usual customer support answer to this situation would be to tell you to
shut the db and do a consistency check on it, which would read every page from the table
and would certainly run into the error you got eventually if there was a persistent problem.
Given the size of the db and that derby has no optimizations for db's of this size that is likely
to take some time.
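
A consistency check like this is usually driven with SYSCS_UTIL.SYSCS_CHECK_TABLE, one call per table. A minimal sketch of building the statement text (the schema/table names below are placeholders; in practice you would iterate over SYS.SYSTABLES and execute each statement through JDBC on an offline copy, where a result of 1 means the table checked out):

```java
public class CheckTables {
    // SYSCS_UTIL.SYSCS_CHECK_TABLE is a Derby system function that returns 1
    // when the table (and its indexes) pass the consistency check. This helper
    // only builds the statement text; running it requires a live connection.
    static String checkSql(String schema, String table) {
        return "VALUES SYSCS_UTIL.SYSCS_CHECK_TABLE('" + schema + "', '" + table + "')";
    }

    public static void main(String[] args) {
        // "APP" / "MYTABLE" are placeholder names
        System.out.println(checkSql("APP", "MYTABLE"));
    }
}
```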

From the stack I can tell you that the problem is in a base page, not an index, which is
much harder to fix if it is persistent.   In derby db's the output
Container(0, 30832) is saying container in segment 0 (seg0 directory) and container id
30832 (impressed by the number of containers that db has gone through).  Also you
will see the system catalogs talk about conglomerate numbers.  In derby currently there is
always a 1-1 mapping of conglomerate number to container number.
Ancient history: in cloudscape we thought we might need the
abstraction, and it was a pain to do the map at the lowest level, so we took the opportunity
when we redid the arch to make it 1-1 for "now" but allow a map if anyone wanted to
do one in the future.
And here is a note from bryan from six years ago on how
to go from that number in the error to file name and table name:
http://bryanpendleton.blogspot.com/2009/09/whats-in-those-files-in-my-derby-db.html
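
As a rough sketch of the mapping that post describes - assuming the convention that a container's data file in seg0 is "c" followed by the container id in hex, with a ".dat" suffix (check the post itself for the authoritative rule):

```java
public class ConglomerateFile {
    // Assumed convention (see the blog post above): the data file for a
    // container lives in seg0 and is named "c" + container id in hex + ".dat".
    static String fileName(long containerId) {
        return "c" + Long.toHexString(containerId) + ".dat";
    }

    public static void main(String[] args) {
        // Container(0, 30832) from the error in this thread
        System.out.println(fileName(30832)); // c7870.dat
    }
}
```

If the assumption holds, Container(0, 30832) would correspond to c7870.dat under seg0.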

A quick check, if you could get an ls -l of the seg0 directory, would be to look at the size of
the associated file and do the math bryan mentioned to see if the file now has a full page.
Including the page size, if you figure it out, would help, as derby page size vs file system
page size can be an issue - but usually only on machine crashes.
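
The arithmetic behind that check can be sketched as follows (the page size here is an assumption - 4096 bytes is Derby's default data page size, but the real value can differ, e.g. via derby.storage.pageSize, so read it from the container before trusting the result):

```java
public class PageMath {
    // A file that ends mid-page has a length that is not a whole multiple
    // of the page size; that is the arithmetic behind the "ls -l" check.
    static boolean hasPartialTrailingPage(long fileLength, int pageSize) {
        return fileLength % pageSize != 0;
    }

    public static void main(String[] args) {
        // 4096 is assumed here; the actual page size may be larger.
        System.out.println(hasPartialTrailingPage(40960L, 4096)); // 10 full pages -> false
        System.out.println(hasPartialTrailingPage(40000L, 4096)); // short last page -> true
    }
}
```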

I would suggest filing a JIRA for this.  If it really is the case that you got the I/O error for a
non-persistent problem, it may be that derby can be improved to avoid it.  Before the code
was changed to use FileChannels, derby often had retry loops on I/O errors - especially on
reads of pages from disk.  In the long past this just avoided some intermittent I/O problems
that were in most cases network related (even though we likely did not support the network
disk officially).  Not sure if the old retry code is still around in the trunk, as it was for running
in older JVMs.

I have also seen weird timing errors from multiple processes accessing the same
file (like backup/virus/... vs the server), but mostly on Windows vs unix-based OSes.

Getting a partial page read is a very weird error for derby as it goes out of its way to write
only full pages.
On 9/3/2015 5:39 PM, Bryan Pendleton wrote:
> On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
>> Reached end of file while attempting to read a whole page
>
> You should probably take a close read through all the
> discussion on this slightly old Derby JIRA Issue:
>
> https://issues.apache.org/jira/browse/DERBY-5234
>
> There are some suggestions about how to diagnose the
> conglomerate in question in more detail, and also some
> observations about possible causes and possible courses
> of action you can take subsequently.
>
> thanks,
>
> bryan
>
>


-- 
email:    Mike Matrigali - mikemapp1@gmail.com
linkedin: https://www.linkedin.com/in/MikeMatrigali


Re: Derby received an error "ERROR XSDG0: Page Page(1325564,Container(0, 30832)) could not be read from disk."

Posted by Bryan Pendleton <bp...@gmail.com>.
On 9/3/2015 3:35 PM, Bergquist, Brett wrote:
> Reached end of file while attempting to read a whole page

You should probably take a close read through all the
discussion on this slightly old Derby JIRA Issue:

https://issues.apache.org/jira/browse/DERBY-5234

There are some suggestions about how to diagnose the
conglomerate in question in more detail, and also some
observations about possible causes and possible courses
of action you can take subsequently.

thanks,

bryan