Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 20:26:32 UTC

[GitHub] [beam] kennknowles opened a new issue, #18796: TFRecord Performance Tests doesn't work on hdfs

kennknowles opened a new issue, #18796:
URL: https://github.com/apache/beam/issues/18796

   TFRecordIO has an issue reading files from HDFS when using a filename pattern of the form _"hdfs://...*"_:
   ```
   TFRecordIO.read().from(filenamePattern).withCompression(AUTO)
   ```
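
   For context, a minimal standalone pipeline that exercises the same read path might look like the sketch below. The class name, the `Count` step, and the exact filename pattern are illustrative rather than the actual integration-test code; it assumes the `beam-sdks-java-io-hadoop-file-system` module is on the classpath and that `--hdfsConfiguration` is passed as in step 5 below.
   ```
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.io.Compression;
   import org.apache.beam.sdk.io.TFRecordIO;
   import org.apache.beam.sdk.options.PipelineOptions;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import org.apache.beam.sdk.transforms.Count;
   import org.apache.beam.sdk.values.PCollection;

   public class TFRecordHdfsReadRepro {
     public static void main(String[] args) {
       // HDFS access is configured from the command line,
       // e.g. the --hdfsConfiguration option shown in step 5 below.
       PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
       Pipeline p = Pipeline.create(options);

       // The failing case: a glob pattern over files previously written to HDFS.
       String filenamePattern = "hdfs://hadoop-xxxxx:9000/TFRecord*";

       PCollection<byte[]> records =
           p.apply("ReadTFRecords",
               TFRecordIO.read().from(filenamePattern).withCompression(Compression.AUTO));

       // Any transform that consumes the records forces the read and hits "Invalid data".
       records.apply(Count.globally());

       p.run().waitUntilFinish();
     }
   }
   ```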
   
   [Link to the code on GitHub](https://github.com/apache/beam/blob/36257aba9054e664ebaafccfefb78bf54a162618/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java#L113). This is a blocker for running the full set of file-based IO tests on HDFS.
   
   Steps to reproduce:
    1. Create a remote Hadoop environment. This step assumes you have a local kubectl configured to use your GCP project.
   ```
   pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./setup-all.sh && popd
   ```
   
   2. Update your /etc/hosts file with the output provided by the script.
   3. Confirm that the setup works and that the Hadoop web interface is accessible at http://hadoop-xxxxx:50070, where xxxxx is the suffix from the /etc/hosts entry added in step 2. Substitute xxxxx the same way in all further usages below.
    4. Tell the runner to use root as the Hadoop user:
   ```
   export HADOOP_USER_NAME=root
   ```
   
   5. Run the TFRecord tests against this environment using the DirectRunner:
   ```
   mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests/ \
     -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT \
     -Dfilesystem=hdfs \
     -DintegrationTestPipelineOptions='["--filenamePrefix=hdfs://hadoop-xxxxx:9000/TFRecord",
       "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://hadoop-xxxxx:9000\", \"dfs.replication\": 1, \"dfs.client.use.datanode.hostname\":\"true\"}]"
     ]' \
     -DforceDirectRunner=true
   ```
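
   For reference, the `--hdfsConfiguration` pipeline option above maps onto Beam's `HadoopFileSystemOptions`. The sketch below shows a programmatic equivalent; the option interface and setter names are given as I understand them and are worth double-checking against the Beam version in use.
   ```
   import java.util.Collections;

   import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import org.apache.hadoop.conf.Configuration;

   // Sketch only: programmatic equivalent of the --hdfsConfiguration flag above.
   public class HdfsOptionsSketch {
     static HadoopFileSystemOptions hdfsOptions() {
       // Same values as passed on the command line in step 5.
       Configuration conf = new Configuration();
       conf.set("fs.defaultFS", "hdfs://hadoop-xxxxx:9000");
       conf.set("dfs.replication", "1");
       conf.set("dfs.client.use.datanode.hostname", "true");

       HadoopFileSystemOptions options =
           PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
       options.setHdfsConfiguration(Collections.singletonList(conf));
       return options;
     }
   }
   ```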
   
   
   The error message is:
   ```
   [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 78.055 s <<< FAILURE! - in org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
   [ERROR] writeThenReadAll(org.apache.beam.sdk.io.tfrecord.TFRecordIOIT)  Time elapsed: 78.055 s  <<< ERROR!
   java.lang.IllegalStateException: Invalid data
           at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
           at org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
           at org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
           at org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
           at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
           at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:267)
           at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:148)
           at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
           at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
           at java.lang.Thread.run(Thread.java:745)
   ```
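
   For background on what "Invalid data" indicates here: a TFRecord file frames every record as an 8-byte little-endian length, a masked CRC32C of that length, the payload bytes, and a masked CRC32C of the payload, so a failed framing/checksum check generally means the reader is seeing bytes it did not expect at the offset it reads from. The sketch below only illustrates that framing check; it is not the Beam `TFRecordCodec` implementation, and the class and method names are made up for illustration.
   ```
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.util.zip.CRC32C; // Java 9+; illustrative only

   // Illustrative sketch of TFRecord framing, NOT the Beam implementation:
   //   uint64 length | masked crc32c(length) | byte[length] payload | masked crc32c(payload)
   // A checksum mismatch is the kind of condition the codec's checkState guards against.
   public class TfRecordFrameSketch {

     // TensorFlow's CRC masking: rotate the checksum and add a constant.
     static int mask(int crc) {
       return ((crc >>> 15) | (crc << 17)) + 0xa282ead8;
     }

     static int maskedCrc32c(byte[] bytes) {
       CRC32C crc = new CRC32C();
       crc.update(bytes, 0, bytes.length);
       return mask((int) crc.getValue());
     }

     static byte[] readFrame(DataInputStream in) throws IOException {
       byte[] lengthBytes = new byte[8];
       in.readFully(lengthBytes);
       // CRCs are stored little-endian; DataInputStream reads big-endian, so reverse.
       int expectedLengthCrc = Integer.reverseBytes(in.readInt());
       if (maskedCrc32c(lengthBytes) != expectedLengthCrc) {
         throw new IllegalStateException("Invalid data"); // corrupted or misaligned header
       }
       long length = ByteBuffer.wrap(lengthBytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
       byte[] payload = new byte[(int) length];
       in.readFully(payload);
       int expectedPayloadCrc = Integer.reverseBytes(in.readInt());
       if (maskedCrc32c(payload) != expectedPayloadCrc) {
         throw new IllegalStateException("Invalid data"); // corrupted payload
       }
       return payload;
     }
   }
   ```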
   
   These results were also observed when running the tests on Jenkins. [Link to the Jenkins build](https://builds.apache.org/view/A-D/view/Beam/job/beam_PerformanceTests_TFRecordIOIT_HDFS/3/console)
   
   When you open http://hadoop-xxxxx:50070/explorer.html#/ you will see the TFRecord files that were created during the write phase; they cannot be processed in the read phase.
   
   **Important note**: if I copy the files produced by the write pipeline from the HDFS directory to a local directory and run the read pipeline over them, everything works fine, so only reading from HDFS is a problem.
   
   You can wipe out the HDFS environment by running:
   ```
   pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./teardown-all.sh && popd
   ```
   
   
   
   Imported from Jira [BEAM-3945](https://issues.apache.org/jira/browse/BEAM-3945). Original Jira may contain additional context.
   Reported by: szewinho.

