Posted to dev@metamodel.apache.org by kaspersorensen <gi...@git.apache.org> on 2015/12/10 22:27:04 UTC

[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

GitHub user kaspersorensen opened a pull request:

    https://github.com/apache/metamodel/pull/78

    Read hadoop configuration files (core-site.xml and hdfs-site.xml) in HdfsResource

    Suggested fix for METAMODEL-219.
    
    I did a few minor/additional changes too:
    
     * Moved the HDFS output stream and input stream classes to separate files (HdfsResource was getting too big, IMO). The classes have default (package-private) scope, so they are not visible outside the package.
     * Changed FileHelper to handle Java 7's AutoCloseable instead of distinguishing between Closeable, Connection, Statement, ResultSet, etc.
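
    The AutoCloseable change can be sketched roughly as below. This is an illustrative, self-contained sketch of the idea, not MetaModel's actual FileHelper source: since Java 7, AutoCloseable is a common super-interface of Closeable, java.sql.Connection, Statement and ResultSet, so one branch replaces four.

    ```java
    // Sketch of a "safe close" helper that handles any AutoCloseable in one
    // branch, instead of separate branches for Closeable, Connection,
    // Statement and ResultSet. Exceptions are swallowed on purpose: the point
    // of a safe close is that cleanup failures must not propagate.
    public class SafeCloseSketch {

        public static void safeClose(Object... objects) {
            for (Object obj : objects) {
                if (obj instanceof AutoCloseable) {
                    try {
                        ((AutoCloseable) obj).close();
                    } catch (Exception e) {
                        // intentionally ignored (the real helper logs at debug level)
                    }
                }
            }
        }

        public static void main(String[] args) {
            java.io.StringReader reader = new java.io.StringReader("data");
            // nulls and non-closeable objects are simply skipped
            safeClose(reader, null, "not closeable");
            System.out.println("closed without error");
        }
    }
    ```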

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kaspersorensen/metamodel METAMODEL-219-hadoop-configuration-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metamodel/pull/78.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #78
    
----
commit 14b63cdc262d40b0d6469de6cfc12103aa6239e3
Author: Kasper Sørensen <i....@gmail.com>
Date:   2015-12-10T20:59:29Z

    Moved HDFS resource stream classes to separate files.

commit 1e5dfe34d65cb9880a8ccd0e6adaa3a765640527
Author: Kasper Sørensen <i....@gmail.com>
Date:   2015-12-10T21:06:09Z

    Improved FileHelper.safeClose(...) method by using AutoCloseable
    
    ... which is now a super-interface of Closeable, Connection, Statement,
    ResultSet and more.

commit 7b5fb0c09c5dac7af3a2f2ff0e2aef8e0bbf2012
Author: Kasper Sørensen <i....@gmail.com>
Date:   2015-12-10T21:24:39Z

    METAMODEL-219: Added loading of core-site.xml and hdfs-site.xml

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/78#issuecomment-164190432
  
    OK, as you can now see from the latest commit (0e0a9fb4c76ce396f15c66094f52abf8bb366a0a), I have added a system property ```metamodel.hadoop.use_hadoop_conf_dir``` which can be set to "true" to enable the HADOOP_CONF_DIR and YARN_CONF_DIR checks. That effectively disables the behavior for anyone who hasn't explicitly opted in.
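
    The opt-in gate described here could look roughly like the sketch below. The property name ```metamodel.hadoop.use_hadoop_conf_dir``` comes from the comment above; the surrounding method and class are hypothetical, not MetaModel's actual code.

    ```java
    // Sketch of a system-property opt-in gate: environment variables such as
    // HADOOP_CONF_DIR / YARN_CONF_DIR are only consulted when the user has
    // explicitly enabled it, so a globally set variable cannot silently
    // redirect the configuration.
    public class ConfDirGateSketch {

        static boolean useConfiguredEnvironment() {
            return "true".equals(System.getProperty("metamodel.hadoop.use_hadoop_conf_dir"));
        }

        public static void main(String[] args) {
            // false unless the JVM was started with
            // -Dmetamodel.hadoop.use_hadoop_conf_dir=true
            System.out.println(useConfiguredEnvironment());
        }
    }
    ```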



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/78#discussion_r47408779
  
    --- Diff: hadoop/src/main/java/org/apache/metamodel/util/HdfsDirectoryInputStream.java ---
    @@ -0,0 +1,74 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.metamodel.util;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.util.Arrays;
    +
    +import org.apache.hadoop.fs.FileStatus;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.fs.PathFilter;
    +
    +/**
    + * An {@link InputStream} that represents all the data found in a directory on
    + * HDFS. This {@link InputStream} is used by {@link HdfsResource#read()} when
    + * pointed to a directory.
    + */
    +class HdfsDirectoryInputStream extends AbstractDirectoryInputStream<FileStatus> {
    +
    +    private final Path _hadoopPath;
    +    private final FileSystem _fs;
    +
    +    public HdfsDirectoryInputStream(final Path hadoopPath, final FileSystem fs) {
    +        _hadoopPath = hadoopPath;
    +        _fs = fs;
    +        FileStatus[] fileStatuses;
    +        try {
    +            fileStatuses = _fs.listStatus(_hadoopPath, new PathFilter() {
    +                @Override
    +                public boolean accept(final Path path) {
    +                    try {
    +                        return _fs.isFile(path);
    +                    } catch (IOException e) {
    +                        return false;
    +                    }
    +                }
    +            });
    +            // Natural ordering is the URL
    +            Arrays.sort(fileStatuses);
    +        } catch (IOException e) {
    +            fileStatuses = new FileStatus[0];
    +        }
    +        _files = fileStatuses;
    +    }
    +
    +    @Override
    +    public InputStream openStream(final int index) throws IOException {
    +        final Path nextPath = _files[index].getPath();
    +        return _fs.open(nextPath);
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +        super.close();
    +        FileHelper.safeClose(_fs);
    +    }
    +}
    --- End diff --
    
    Missing EOL



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/78#discussion_r47408874
  
    --- Diff: core/src/main/java/org/apache/metamodel/util/FileHelper.java ---
    @@ -263,40 +259,16 @@ public static void safeClose(Object... objects) {
                         }
                     }
     
    -                if (obj instanceof Closeable) {
    +                if (obj instanceof AutoCloseable) {
                         try {
    -                        ((Closeable) obj).close();
    -                    } catch (IOException e) {
    -                        if (debugEnabled) {
    -                            logger.debug("Closing Closeable failed", e);
    -                        }
    -                    }
    -                } else if (obj instanceof Connection) {
    -                    try {
    -                        ((Connection) obj).close();
    -                    } catch (Exception e) {
    -                        if (debugEnabled) {
    -                            logger.debug("Closing Connection failed", e);
    -                        }
    -                    }
    -                } else if (obj instanceof Statement) {
    -                    try {
    -                        ((Statement) obj).close();
    -                    } catch (Exception e) {
    -                        if (debugEnabled) {
    -                            logger.debug("Closing Statement failed", e);
    -                        }
    -                    }
    -                } else if (obj instanceof ResultSet) {
    -                    try {
    -                        ((ResultSet) obj).close();
    +                        ((AutoCloseable) obj).close();
    --- End diff --
    
    :+1: Nice!



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/78#issuecomment-164189119
  
    Fixed the EOLs.
    
    That's a very good question, actually... You're right that it would override ```fs.defaultFS```, and initially that was also my intention. I see that HADOOP_CONF_DIR and YARN_CONF_DIR have become de facto standards used by libraries such as Apache Spark and others, and I was thinking it would be nice to reuse that convention so that configuring MetaModel is as easy as possible... BUT of course you're right that the environment variable may be unintentionally set up to configure a _different_ Hadoop installation. Hmm, then I guess we shouldn't check environment variables or system properties at all, but rather only use the constructor argument as the way of configuring it. Applications using MetaModel would then have to do their own environment variable checking, if that's applicable in their situation.



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/78#issuecomment-164058639
  
    Isn't it dangerous to reuse existing Hadoop/YARN environment variables? They may be set globally without being intended for whatever is using MetaModel, and they will override the configured ```fs.defaultFS```, as far as I read the [JavaDocs](https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/conf/Configuration.html#addResource(java.io.InputStream, java.lang.String)).
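
    To illustrate the concern: a core-site.xml found via HADOOP_CONF_DIR typically carries an ```fs.defaultFS``` entry like the one below, and per the linked JavaDocs a resource loaded that way can override the host/port the HdfsResource was constructed with. The sketch is a plain-JDK XML lookup for illustration only, not Hadoop's own Configuration parsing, and the cluster address is made up.

    ```java
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Looks up a named property in a Hadoop-style <configuration> XML document.
    public class CoreSiteSketch {

        static String lookup(String xml, String name) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String key = p.getElementsByTagName("name").item(0).getTextContent();
                if (name.equals(key)) {
                    return p.getElementsByTagName("value").item(0).getTextContent();
                }
            }
            return null;
        }

        public static void main(String[] args) throws Exception {
            // A core-site.xml pointing at some *other* cluster (hypothetical address):
            String coreSite = "<configuration><property>"
                    + "<name>fs.defaultFS</name><value>hdfs://other-cluster:8020</value>"
                    + "</property></configuration>";
            System.out.println(lookup(coreSite, "fs.defaultFS"));
        }
    }
    ```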




[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/78#issuecomment-164193164
  
    Good solution. :+1:, then :)



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/78#discussion_r47408806
  
    --- Diff: hadoop/src/main/java/org/apache/metamodel/util/HdfsFileInputStream.java ---
    @@ -0,0 +1,88 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.metamodel.util;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +
    +import org.apache.hadoop.fs.FileSystem;
    +
    +/**
    + * A managed {@link InputStream} for a file on HDFS.
    + * 
    + * The "purpose in life" for this class is to ensure that the {@link FileSystem}
    + * is closed when the stream is closed.
    + */
    +class HdfsFileInputStream extends InputStream {
    +
    +    private final InputStream _in;
    +    private final FileSystem _fs;
    +
    +    public HdfsFileInputStream(final InputStream in, final FileSystem fs) {
    +        _in = in;
    +        _fs = fs;
    +    }
    +
    +    @Override
    +    public int read() throws IOException {
    +        return _in.read();
    +    }
    +
    +    @Override
    +    public int read(byte[] b, int off, int len) throws IOException {
    +        return _in.read(b, off, len);
    +    }
    +
    +    @Override
    +    public int read(byte[] b) throws IOException {
    +        return _in.read(b);
    +    }
    +
    +    @Override
    +    public boolean markSupported() {
    +        return _in.markSupported();
    +    }
    +
    +    @Override
    +    public synchronized void mark(int readLimit) {
    +        _in.mark(readLimit);
    +    }
    +
    +    @Override
    +    public int available() throws IOException {
    +        return _in.available();
    +    }
    +
    +    @Override
    +    public synchronized void reset() throws IOException {
    +        _in.reset();
    +    }
    +
    +    @Override
    +    public long skip(long n) throws IOException {
    +        return _in.skip(n);
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +        super.close();
    +        // need to close 'fs' when input stream is closed
    +        FileHelper.safeClose(_fs);
    +    }
    +}
    --- End diff --
    
    Missing EOL



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/78#discussion_r47408815
  
    --- Diff: hadoop/src/main/java/org/apache/metamodel/util/HdfsFileOutputStream.java ---
    @@ -0,0 +1,68 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.metamodel.util;
    +
    +import java.io.IOException;
    +import java.io.OutputStream;
    +
    +import org.apache.hadoop.fs.FileSystem;
    +
    +/**
    + * A managed {@link OutputStream} for a file on HDFS.
    + * 
    + * The "purpose in life" for this class is to ensure that the {@link FileSystem}
    + * is closed when the stream is closed.
    + */
    +class HdfsFileOutputStream extends OutputStream {
    +
    +    private final OutputStream _out;
    +    private final FileSystem _fs;
    +
    +    public HdfsFileOutputStream(final OutputStream out, final FileSystem fs) {
    +        _out = out;
    +        _fs = fs;
    +    }
    +
    +    @Override
    +    public void write(int b) throws IOException {
    +        _out.write(b);
    +    }
    +
    +    @Override
    +    public void write(byte[] b, int off, int len) throws IOException {
    +        _out.write(b, off, len);
    +    }
    +
    +    @Override
    +    public void write(byte[] b) throws IOException {
    +        _out.write(b);
    +    }
    +
    +    @Override
    +    public void flush() throws IOException {
    +        _out.flush();
    +    }
    +
    +    @Override
    +    public void close() throws IOException {
    +        super.close();
    +        // need to close 'fs' when output stream is closed
    +        FileHelper.safeClose(_fs);
    +    }
    +}
    --- End diff --
    
    Missing EOL



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metamodel/pull/78



[GitHub] metamodel pull request: Read hadoop configuration files (core-site...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/78#discussion_r47408930
  
    --- Diff: hadoop/src/main/java/org/apache/metamodel/util/HdfsResource.java ---
    @@ -369,30 +307,19 @@ public Path getHadoopPath() {
     
         @Override
         public int hashCode() {
    -        return Arrays.hashCode(new Object[] { _filepath, _hostname, _port });
    +        return Objects.hash(_filepath, _hostname, _port, _hadoopConfDir);
         }
     
         @Override
         public boolean equals(Object obj) {
    -        if (this == obj)
    +        if (this == obj) {
                 return true;
    -        if (obj == null)
    -            return false;
    -        if (getClass() != obj.getClass())
    -            return false;
    -        HdfsResource other = (HdfsResource) obj;
    -        if (_filepath == null) {
    -            if (other._filepath != null)
    -                return false;
    -        } else if (!_filepath.equals(other._filepath))
    -            return false;
    -        if (_hostname == null) {
    -            if (other._hostname != null)
    -                return false;
    -        } else if (!_hostname.equals(other._hostname))
    -            return false;
    -        if (_port != other._port)
    -            return false;
    -        return true;
    +        }
    +        if (obj instanceof HdfsResource) {
    +            final HdfsResource other = (HdfsResource) obj;
    +            return Objects.equals(_filepath, other._filepath) && Objects.equals(_hostname, other._hostname)
    +                    && Objects.equals(_port, other._port) && Objects.equals(_hadoopConfDir, other._hadoopConfDir);
    +        }
    +        return false;
    --- End diff --
    
    Also a nice simplification.

