Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/11/23 07:14:31 UTC
[GitHub] [incubator-uniffle] lixy529 opened a new issue, #352: [Bug] inconsistent blocks number
lixy529 opened a new issue, #352:
URL: https://github.com/apache/incubator-uniffle/issues/352
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
### Search before asking
- [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.
### Describe the bug
Many tasks of our Spark jobs throw an exception reporting an inconsistent number of blocks. The stack trace is as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 9.0 failed 6 times, most recent failure: Lost task 2.5 in stage 9.0 (TID 7653, BJLFRZ-10k-152-228.hadoop.jd.local, executor 159): org.apache.uniffle.common.exception.RssException: Blocks read inconsistent: expected 7 blocks, actual 0 blocks
at org.apache.uniffle.client.impl.ShuffleReadClientImpl.checkProcessedBlockIds(ShuffleReadClientImpl.java:215)
at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:135)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:227)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:768)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:134)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:129)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:467)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1478)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:470)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2083)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2032)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2031)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2031)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:979)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:979)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:979)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2263)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2212)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2201)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.uniffle.common.exception.RssException: Blocks read inconsistent: expected 7 blocks, actual 0 blocks
at org.apache.uniffle.client.impl.ShuffleReadClientImpl.checkProcessedBlockIds(ShuffleReadClientImpl.java:215)
at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:135)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:227)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:768)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:134)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
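For readers unfamiliar with the message, here is a minimal Python sketch of the kind of check that could produce it. It assumes the reader compares the set of block IDs it expects for a partition with the set it actually processed. The function and class names are hypothetical stand-ins; this is not Uniffle's actual implementation, which lives in the Java method `ShuffleReadClientImpl.checkProcessedBlockIds` named in the trace.

```python
class RssException(Exception):
    """Hypothetical stand-in for org.apache.uniffle.common.exception.RssException."""


def check_processed_block_ids(expected_block_ids, processed_block_ids):
    """Raise if any expected block was never processed by the reader."""
    expected = set(expected_block_ids)
    # Only blocks that were both expected and processed count as "actual".
    actual = expected & set(processed_block_ids)
    if actual != expected:
        raise RssException(
            f"Blocks read inconsistent: expected {len(expected)} blocks, "
            f"actual {len(actual)} blocks"
        )


# Seven blocks were reported as written, but the reader got none of them back.
try:
    check_processed_block_ids(range(7), [])
except RssException as e:
    print(e)
```

Under this reading, "expected 7 blocks, actual 0 blocks" would mean that the reader fetched none of the blocks it was told to expect, which points at the storage or read path rather than at a partial read.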
### Affects Version(s)
0.6.0
### Uniffle Server Log Output
_No response_
### Uniffle Engine Log Output
_No response_
### Uniffle Server Configurations
```properties
rss.coordinator.quorum=xxx:19999,xxx:19999,xxx:19999
rss.jetty.http.port=19998
rss.prometheus.push.enabled=true
rss.prometheus.uniffle.cluster.name=test100
rss.rpc.executor.size=2000
rss.rpc.message.max.size=1073741824
rss.rpc.server.port=19999
rss.server.app.expired.withoutHeartbeat=120000
rss.server.buffer.capacity=30g
rss.server.commit.timeout=600000
rss.server.flush.cold.storage.threshold.size=64m
rss.server.flush.thread.alive=5
rss.server.flush.threadPool.size=10
rss.server.heartbeat.interval=10000
rss.server.heartbeat.timeout=60000
rss.server.localstorage.initialize.max.fail.number=6
rss.server.preAllocation.expired=120000
rss.server.read.buffer.capacity=15g
rss.storage.type=MEMORY_HDFS
```
### Uniffle Engine Configurations
_No response_
### Additional context
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [incubator-uniffle] jerqi commented on issue #352: [Bug] inconsistent blocks number
Posted by "jerqi (via GitHub)" <gi...@apache.org>.
jerqi commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1650871094
We haven't verified Hadoop 3.3. We have only verified Hadoop 3.2.
[GitHub] [incubator-uniffle] lixy529 commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
lixy529 commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1326252410
> Could you try to use the master code?
Let me try
[GitHub] [incubator-uniffle] lixy529 commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
lixy529 commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1326415496
> Could you try to use the master code?
The same problem exists with the master version.
[GitHub] [incubator-uniffle] smlHao commented on issue #352: [Bug] inconsistent blocks number
Posted by "smlHao (via GitHub)" <gi...@apache.org>.
smlHao commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1650920570
@jerqi thanks !!!
> rss.coordinator.shuffle.nodes.max is a little small.
I have only deployed them on 3 machines; each machine runs one coordinator server and one shuffle server. If I don't add machines or shuffle server instances, is it useful to increase only the coordinator server instances?
[GitHub] [incubator-uniffle] jerqi commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1326430609
Maybe you should give me the executor's logs, too.
[GitHub] [incubator-uniffle] smlHao commented on issue #352: [Bug] inconsistent blocks number
Posted by "smlHao (via GitHub)" <gi...@apache.org>.
smlHao commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1650888620
> We haven't verified Hadoop 3.3. We have only verified Hadoop 3.2.
@jerqi thanks! I have another 2 questions and need your help:
1. If the Hadoop version is 3.3, regardless of whether it works correctly, should the build command be ./build_distribution.sh --hadoop-profile hadoop3.3?
2. I want to write shuffle data into HDFS. Is my conf correct?
client.conf :
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum=172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
spark.rss.storage.type=MEMORY_LOCALFILE_HDFS
spark.rss.remote.storage.path=hdfs://ns1/rss/data
coordinator.conf:
rss.coordinator.quorum 172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.coordinator.server.heartbeat.timeout 30000
rss.coordinator.app.expired 60000
rss.coordinator.shuffle.nodes.max 3
rss.coordinator.exclude.nodes.file.path file:///app/rss-0.7.1/conf/exclude_nodes
server.conf:
rss.rpc.server.port 20000
rss.jetty.http.port 20001
rss.storage.basePath /app/rss-0.7.1/data
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.coordinator.quorum 172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
[GitHub] [incubator-uniffle] jerqi commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1326421093
Could you give me the shuffle server's log? What's your shuffle data size and job's concurrency?
[GitHub] [incubator-uniffle] lixy529 closed issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
lixy529 closed issue #352: [Bug] inconsistent blocks number
URL: https://github.com/apache/incubator-uniffle/issues/352
[GitHub] [incubator-uniffle] jerqi commented on issue #352: [Bug] inconsistent blocks number
Posted by "jerqi (via GitHub)" <gi...@apache.org>.
jerqi commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1650891219
> > We haven't verified Hadoop 3.3. We have only verified Hadoop 3.2.
>
> @jerqi thanks! I have another 2 questions and need your help: 1. If the Hadoop version is 3.3, regardless of whether it works correctly, should the build command be ./build_distribution.sh --hadoop-profile hadoop3.3?
>
> ```
> 2. I want to write shuffle data into HDFS. Is my conf correct?
> client.conf :
> spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
> spark.rss.coordinator.quorum=172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
> spark.rss.storage.type=MEMORY_LOCALFILE_HDFS
> spark.rss.remote.storage.path=hdfs://ns1/rss/data
>
> coordinator.conf:
> rss.coordinator.quorum 172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
> rss.rpc.server.port 19999
> rss.jetty.http.port 19998
> rss.coordinator.server.heartbeat.timeout 30000
> rss.coordinator.app.expired 60000
> rss.coordinator.shuffle.nodes.max 3
> rss.coordinator.exclude.nodes.file.path file:///app/rss-0.7.1/conf/exclude_nodes
>
> server.conf:
> rss.rpc.server.port 20000
> rss.jetty.http.port 20001
> rss.storage.basePath /app/rss-0.7.1/data
> rss.storage.type MEMORY_LOCALFILE_HDFS
> rss.coordinator.quorum 172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
> ```
1. Yes
2. Looks OK to me, but `rss.coordinator.shuffle.nodes.max` is a little small.
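As a rough illustration of how the profile name in that build command relates to the cluster's Hadoop version, here is a small Python sketch. It assumes the profile is named `hadoop` plus the major.minor version, as in `hadoop3.3` above. The helper names are hypothetical, and whether a profile actually exists for a given version should be checked in the project's build files.

```python
def hadoop_profile(version):
    """Map a full Hadoop version such as '3.3.4' to a profile name such as 'hadoop3.3'.

    Assumption: profiles follow the 'hadoop<major>.<minor>' naming seen in this thread.
    """
    major, minor = version.split(".")[:2]
    return f"hadoop{major}.{minor}"


def build_command(version):
    """Return the build command as an argument list for the given Hadoop version."""
    return ["./build_distribution.sh", "--hadoop-profile", hadoop_profile(version)]


print(" ".join(build_command("3.3.4")))
```

For a cluster on Hadoop 3.3.4 this yields the same command that was asked about in the question.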
[GitHub] [incubator-uniffle] smlHao commented on issue #352: [Bug] inconsistent blocks number
Posted by "smlHao (via GitHub)" <gi...@apache.org>.
smlHao commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1651188770
> > > @jerqi thanks !!!
> > > > rss.coordinator.shuffle.nodes.max is a little small.
> > > > I have only deployed them on 3 machines; each machine runs one coordinator server and one shuffle server. If I don't add machines or shuffle server instances, is it useful to increase only the coordinator server instances?
> >
> >
> > @jerqi Could you help explain this?
>
> We should increase server instances.
@jerqi got it , thanks !!!
[GitHub] [incubator-uniffle] xianjingfeng commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1324661425
Maybe #276 will help you. It's coming soon.
[GitHub] [incubator-uniffle] smlHao commented on issue #352: [Bug] inconsistent blocks number
Posted by "smlHao (via GitHub)" <gi...@apache.org>.
smlHao commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1649311182
> The problem was with our company's Hadoop version. It is now resolved.
@lixy529 Hi, I ran into the same error. Can I add you on WeChat so you can help me solve this?
[GitHub] [incubator-uniffle] jerqi commented on issue #352: [Bug] inconsistent blocks number
Posted by "jerqi (via GitHub)" <gi...@apache.org>.
jerqi commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1651154104
> > @jerqi thanks !!!
> > > rss.coordinator.shuffle.nodes.max is a little small.
> > > I have only deployed them on 3 machines; each machine runs one coordinator server and one shuffle server. If I don't add machines or shuffle server instances, is it useful to increase only the coordinator server instances?
>
> @jerqi Could you help explain this?
We should increase server instances.
[GitHub] [incubator-uniffle] smlHao commented on issue #352: [Bug] inconsistent blocks number
Posted by "smlHao (via GitHub)" <gi...@apache.org>.
smlHao commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1650869396
> The problem was with our company's Hadoop version. It is now resolved.
@lixy529 @jerqi
I ran into the same error. Must the Uniffle build match the Hadoop version?
My company's Hadoop version is 3.3.4. Should the build command be ./build_distribution.sh --hadoop-profile hadoop3.3?
[GitHub] [incubator-uniffle] jerqi commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1324899191
Could you try to use the master code?
[GitHub] [incubator-uniffle] lixy529 commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
lixy529 commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1333605480
The problem was with our company's Hadoop version. It is now resolved.
[GitHub] [incubator-uniffle] lixy529 commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
lixy529 commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1326250971
> Maybe #276 will help you. It's coming soon.
The patch has been updated. The problem still exists.
[GitHub] [incubator-uniffle] lixy529 commented on issue #352: [Bug] inconsistent blocks number
Posted by GitBox <gi...@apache.org>.
lixy529 commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1326250594
> Maybe #276 will help you. It's coming soon.
[GitHub] [incubator-uniffle] smlHao commented on issue #352: [Bug] inconsistent blocks number
Posted by "smlHao (via GitHub)" <gi...@apache.org>.
smlHao commented on issue #352:
URL: https://github.com/apache/incubator-uniffle/issues/352#issuecomment-1651122158
> @jerqi thanks !!!
>
> > rss.coordinator.shuffle.nodes.max is a little small.
> > I have only deployed them on 3 machines; each machine runs one coordinator server and one shuffle server. If I don't add machines or shuffle server instances, is it useful to increase only the coordinator server instances?
@jerqi Could you help explain this?