Posted to issues@spark.apache.org by "Robert Reid (JIRA)" <ji...@apache.org> on 2018/08/16 22:22:00 UTC

[jira] [Updated] (SPARK-25136) unable to use HDFS checkpoint directories after driver restart

     [ https://issues.apache.org/jira/browse/SPARK-25136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Reid updated SPARK-25136:
--------------------------------
    Description: 
I have been attempting to work around the issue first discussed at the end of SPARK-20894. The problem we are encountering is that the Spark driver cannot process additional data after a restart. On restart it reprocesses batches from the initial run, but while processing the first batch it crashes because the checkpoint files are missing.

We have a structured streaming job running on a Spark Standalone cluster. We restart the Spark driver to renew the Kerberos token used to access HDFS. Excerpts of log lines from a run are included below. In short, when the new driver starts it reports that it will use a new checkpoint directory in HDFS. Workers report that they have loaded a StateStoreProvider matching that directory, but then report that they cannot read the delta files there. This causes the driver to crash.
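To make the setup concrete, here is a minimal sketch of how such a query is wired up. The source, sink, paths, and trigger are illustrative placeholders rather than our actual job; only the option names are standard Spark APIs.

{code:scala}
// Minimal sketch of a structured streaming query that checkpoints to HDFS.
// Everything below except the option names is a hypothetical placeholder.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("checkpoint-restart-sketch")
  .getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "events")                    // placeholder topic
  .load()

val query = input.writeStream
  .format("parquet")
  .option("path", "hdfs://securehdfs/projects/example/output")
  // If checkpointLocation is not set per query, Spark derives a fresh
  // UUID-named directory under spark.sql.streaming.checkpointLocation on
  // each start, which matches the per-run directories in the logs below.
  .option("checkpointLocation", "hdfs://securehdfs/projects/example/checkpoints")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()
{code}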

The checkpoint directories are present in HDFS but empty. Further, the permissions on the original and new checkpoint directories are identical. The worker itself never crashes; only the driver does.

As mentioned in SPARK-20894, deleting the _spark_metadata directory makes subsequent restarts succeed.
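For reference, that workaround amounts to removing the directory between runs. A sketch using the Hadoop FileSystem API follows; the output path is a hypothetical placeholder, and _spark_metadata is the FileStreamSink metadata directory that lives under the sink's output path.

{code:scala}
// Sketch of the SPARK-20894 workaround: delete the file sink's
// _spark_metadata directory before restarting the driver.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = "hdfs://securehdfs/projects/example/output" // placeholder
val metadataDir = new Path(outputPath, "_spark_metadata")

val fs = FileSystem.get(new URI(outputPath), new Configuration())
if (fs.exists(metadataDir)) {
  fs.delete(metadataDir, true) // recursive delete
}
{code}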

Here is a timeline of log records from a recent run.

A new run began at 00:29:21. These entries from a worker log look good.

{{18/08/16 00:30:21 INFO HDFSBackedStateStoreProvider: Retrieved version 0 of HDFSStateStoreProvider[id = (op=0,part=0),dir = hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] for update}}

{{18/08/16 00:30:23 INFO HDFSBackedStateStoreProvider: Committed version 1 for HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0] to file hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0/1.delta}}

As the shutdown occurs, the worker reports

{{18/08/16 00:39:11 INFO HDFSBackedStateStoreProvider: Aborted version 29 for HDFSStateStore[id=(op=0,part=0),dir=hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/a82c35a9-9e98-43f4-9f72-c6e72e4d223e/state/0/0]}}

The restart began at 00:39:38.

Driver log entries

{{18/08/16 00:39:51 INFO MicroBatchExecution: Starting [id = e188d15f-e26a-48fd-9ce6-8c57ce53c2c1, runId = b7ee0163-47db-4392-ab66-94d36ce63074]. Use hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040 to store the query checkpoint.}}

{{18/08/16 00:40:26 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 10, 10.251.104.164, executor 3): java.lang.IllegalStateException: Error reading delta file hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0,part=0),dir = hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta does not exist}}

{{Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}

 

Worker log entries

{{18/08/16 00:40:26 INFO StateStore: Reported that the loaded instance StateStoreProviderId(StateStoreId(hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state,0,0,default),b7ee0163-47db-4392-ab66-94d36ce63074) is active}}

{{java.lang.IllegalStateException: Error reading delta file hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0,part=0),dir = hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0]: hdfs://securehdfs/projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta does not exist}}

{{Caused by: java.io.FileNotFoundException: File does not exist: /projects/flowsnake/search_relevance_proto/search_relevance_proto/63cd3c36-eff9-4900-a010-8eb204429034/checkpoints/ead17db8-3660-4bb9-b519-bd7a3599c040/state/0/0/1.delta}}
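The failing path follows the state store layout <checkpoint>/state/<operatorId>/<partitionId>/<version>.delta, so one way to confirm what the executors will look for is to reconstruct that path and probe HDFS directly. A sketch follows; substitute the checkpoint directory reported in the driver's "to store the query checkpoint" log line.

{code:scala}
// Sketch: rebuild the delta-file path the state store will try to read
// and check whether it exists in HDFS.
// Layout: <checkpoint>/state/<operatorId>/<partitionId>/<version>.delta
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder; use the checkpoint directory from the driver log.
val checkpointDir = "hdfs://securehdfs/projects/example/checkpoints/run-uuid"
val (operatorId, partitionId, version) = (0, 0, 1)

val deltaFile = new Path(s"$checkpointDir/state/$operatorId/$partitionId/$version.delta")
val fs = FileSystem.get(new URI(checkpointDir), new Configuration())
println(s"$deltaFile exists: ${fs.exists(deltaFile)}")
{code}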


> unable to use HDFS checkpoint directories after driver restart
> --------------------------------------------------------------
>
>                 Key: SPARK-25136
>                 URL: https://issues.apache.org/jira/browse/SPARK-25136
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Robert Reid
>            Priority: Major


