Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/03/01 00:18:45 UTC
[jira] [Commented] (BEAM-1569) HDFSFileSource: Unable to read from
filePattern with spaces in path
[ https://issues.apache.org/jira/browse/BEAM-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889179#comment-15889179 ]
ASF GitHub Bot commented on BEAM-1569:
--------------------------------------
GitHub user adude3141 opened a pull request:
https://github.com/apache/beam/pull/2132
BEAM-1569 support file patterns containing spaces
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
- [ ] Make sure the PR title is formatted like:
`[BEAM-<Jira issue #>] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
Travis-CI on your fork and ensure the whole test matrix passes).
- [ ] Replace `<Jira issue #>` in the title with the actual Jira issue
number, if there is one.
- [ ] If this contribution is large, please file an Apache
[Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).
---
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/adude3141/beam BEAM-1569
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/2132.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2132
----
commit 6ce46b93f8476c66d5911fceb35a3804094e6c0f
Author: Michael Luckey <mi...@ext.gfk.com>
Date: 2017-03-01T00:13:11Z
BEAM-1569 support file patterns containing spaces
----
> HDFSFileSource: Unable to read from filePattern with spaces in path
> -------------------------------------------------------------------
>
> Key: BEAM-1569
> URL: https://issues.apache.org/jira/browse/BEAM-1569
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Michael Luckey
> Assignee: Michael Luckey
>
> After the changes introduced with https://issues.apache.org/jira/browse/BEAM-1497 were merged, we are unable to read from files whose path contains spaces. We encounter the following stack trace:
> {noformat}
> java.lang.reflect.UndeclaredThrowableException
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
> at org.apache.beam.sdk.io.hdfs.HDFSFileSource.validate(HDFSFileSource.java:337)
> at org.apache.beam.sdk.io.hdfs.HDFSFileSource.createReader(HDFSFileSource.java:329)
> at org.apache.beam.sdk.testing.SourceTestUtils.readFromSource(SourceTestUtils.java:138)
> Caused by: java.net.URISyntaxException: Illegal character in path at index 77: /var/folders/1t/s9pcmfj50nxbt68h3_2z_5wc0000gn/T/junit6887354597440386901/tmp data.seq
> at java.net.URI$Parser.fail(URI.java:2848)
> at java.net.URI$Parser.checkChars(URI.java:3021)
> at java.net.URI$Parser.parseHierarchical(URI.java:3105)
> at java.net.URI$Parser.parse(URI.java:3063)
> at java.net.URI.<init>(URI.java:588)
> at org.apache.beam.sdk.io.hdfs.HDFSFileSource$7.run(HDFSFileSource.java:340)
> at org.apache.beam.sdk.io.hdfs.HDFSFileSource$7.run(HDFSFileSource.java:337)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> ... 40 more
> {noformat}
> This can be reproduced, for instance, with the following test:
> {noformat}
> // shameless copy of existing test case
> @Test
> public void testFullyReadSingleFileWithSpaces() throws Exception {
>   PipelineOptions options = PipelineOptionsFactory.create();
>   List<KV<IntWritable, Text>> expectedResults = createRandomRecords(3, 10, 0);
>   File file = createFileWithData("tmp data.seq", expectedResults);
>   HDFSFileSource<KV<IntWritable, Text>, IntWritable, Text> source =
>       HDFSFileSource.from(
>           file.toString(), SequenceFileInputFormat.class, IntWritable.class, Text.class);
>   assertEquals(file.length(), source.getEstimatedSizeBytes(null));
>   assertThat(expectedResults, containsInAnyOrder(readFromSource(source, options).toArray()));
> }
> {noformat}
> Changing the implementation slightly to
> {noformat}
> diff --git a/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java b/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java
> index 2a731fb..df72643 100644
> --- a/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java
> +++ b/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HDFSFileSource.java
> @@ -30,7 +30,6 @@ import java.io.ObjectInput;
> import java.io.ObjectOutput;
> import java.lang.reflect.InvocationTargetException;
> import java.lang.reflect.Method;
> -import java.net.URI;
> import java.security.PrivilegedExceptionAction;
> import java.util.List;
> import java.util.ListIterator;
> @@ -337,9 +336,10 @@ public abstract class HDFSFileSource<T, K, V> extends BoundedSource<T> {
> UGIHelper.getBestUGI(username()).doAs(new PrivilegedExceptionAction<Void>() {
> @Override
> public Void run() throws Exception {
> - FileSystem fs = FileSystem.get(new URI(filepattern()),
> + final Path pathPattern = new Path(filepattern());
> + FileSystem fs = FileSystem.get(pathPattern.toUri(),
> SerializableConfiguration.newConfiguration(serializableConfiguration()));
> - FileStatus[] fileStatuses = fs.globStatus(new Path(filepattern()));
> + FileStatus[] fileStatuses = fs.globStatus(pathPattern);
> checkState(
> fileStatuses != null && fileStatuses.length > 0,
> "Unable to find any files matching %s", filepattern());
> {noformat}
> seems to fix the issue for us.
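
The root cause shown in the stack trace can be reproduced without Hadoop or Beam on the classpath. The following is a minimal sketch using only java.net.URI: the single-argument constructor parses its input strictly and rejects a raw space, while the multi-argument constructors quote illegal characters in the path component. The class name SpaceInPathDemo, its helper methods, and the sample path are hypothetical and exist only for this illustration; the sketch does not reproduce what HDFSFileSource or org.apache.hadoop.fs.Path do internally.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of Beam or Hadoop.
class SpaceInPathDemo {

    // Returns true if new URI(String) rejects the given path, as in the
    // "Illegal character in path" failure from the stack trace.
    static boolean singleArgConstructorRejects(String path) {
        try {
            new URI(path);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    // Builds a URI from a bare path component. The four-argument constructor
    // URI(scheme, host, path, fragment) percent-encodes characters that are
    // illegal in a path, such as spaces, instead of rejecting them.
    static URI fromPathComponent(String path) {
        try {
            return new URI(null, null, path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI from " + path, e);
        }
    }

    public static void main(String[] args) {
        String path = "/tmp/tmp data.seq"; // sample path, assumed for illustration

        System.out.println(singleArgConstructorRejects(path)); // true
        URI uri = fromPathComponent(path);
        System.out.println(uri);           // /tmp/tmp%20data.seq
        System.out.println(uri.getPath()); // /tmp/tmp data.seq
    }
}
```

This is consistent with the proposed change: building the Path first and asking it for its URI avoids handing the raw file pattern to new URI(String).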
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)