Posted to dev@metamodel.apache.org by ClaudiaPHI <gi...@git.apache.org> on 2015/07/31 11:50:42 UTC

[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

GitHub user ClaudiaPHI opened a pull request:

    https://github.com/apache/metamodel/pull/36

    Changed HdfsResource implementation so that writing to Hadoop is possible.

    This is part of the DataCleaner story: https://github.com/datacleaner/DataCleaner/issues/494.
    
    In Hadoop the default replication factor is 3. After a file is created and closed, it becomes immutable because of the replicas; basically, this means that append is not allowed.
    
    With this change I made it possible to write a CSV file to Hadoop by setting the replication factor to 1 in the configuration.
    
    However, writing is very slow and it can fail at any time if the Hadoop node is leased or other types of failure occur.
    
    The "CsvUpdateCallback" class appends row by row.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ClaudiaPHI/metamodel feature/Hdfs-resourse-impl-for-DC-494

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metamodel/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #36
    
----
commit cb82f181b2bf031f6c6c87c58c4800d06913d092
Author: ClaudiaPHI <cl...@humaninference.com>
Date:   2015-07-31T09:33:26Z

    Changed the implementation so that writing to Hadoop is possible.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on a diff in the pull request:

    https://github.com/apache/metamodel/pull/36#discussion_r35971009
  
    --- Diff: hadoop/src/main/java/org/apache/metamodel/util/HdfsResource.java ---
    @@ -338,4 +386,26 @@ public boolean equals(Object obj) {
                 return false;
             return true;
         }
    +
    +    public boolean getOverwriteIfExists() {
    +        return _overwriteIfExists;
    +    }
    +
    +    public void setOverwriteIfExists(boolean overwriteIfExists) {
    +        _overwriteIfExists = overwriteIfExists;
    +    }
    +
    +    public void setHadoopConfiguration(Configuration hadoopConfiguration) {
    +        _hadoopConfiguration = hadoopConfiguration;
    +    }
    +
    +    public void replicateFile(int replicationFactor) {
    +        try {
    +            final FileSystem fs = getHadoopFileSystem();
    +            fs.setReplication(getHadoopPath(), (short) replicationFactor);
    +        } catch (IOException e) {
    +            throw new MetaModelException("Could not create replicas for HDFS file: " + getHadoopFileSystem() + " "
    +                    + e.getMessage(), e);
    +        }
    +    }
    --- End diff --
    
    Shouldn't this update the Hadoop configuration as well?
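    
    For instance, something along these lines (just a sketch of the idea, not a finished implementation):
    
        // Sketch: also keep the in-memory Hadoop Configuration in sync with the change
        public void replicateFile(int replicationFactor) {
            try {
                final FileSystem fs = getHadoopFileSystem();
                fs.setReplication(getHadoopPath(), (short) replicationFactor);
                if (_hadoopConfiguration != null) {
                    // "dfs.replication" is the standard HDFS replication property
                    _hadoopConfiguration.setInt("dfs.replication", replicationFactor);
                }
            } catch (IOException e) {
                throw new MetaModelException("Could not set replication for HDFS file: " + getHadoopPath(), e);
            }
        }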



[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/36#issuecomment-126834486
  
    I am absolutely with you on that. I did not mean to suggest that the instanceof hack was a beautiful thing. Rather, it hints at an underlying issue that we should maybe fix rather than setting replication levels?



[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/36#issuecomment-126689077
  
    It's not really ideal, but I guess it is the best we can do without close() in Resource (I'm not entirely sure why resources aren't closeable, or bufferable in some other way... Right now we open and close streams for every single write, which seems like a bad thing for most resource types).
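    
    Roughly this shape, in simplified form (the helper names below are made up; this is not the actual CsvUpdateCallback code):
    
        // Every row currently pays for opening and closing a fresh stream
        // (assumes the surrounding method handles IOException):
        for (final String[] row : rows) {
            try (OutputStream out = openAppendStream(resource)) { // hypothetical helper
                out.write(toCsvLine(row).getBytes(StandardCharsets.UTF_8)); // hypothetical helper
            }
        }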



[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/36#issuecomment-126826011
  
    A few thoughts/notes:
    
    I think the CSV update callback actually has some FileResource-specific hacks to avoid appending again and again. Maybe you'll find the better root cause there!
    
    To LosD: Resource is not itself closeable, just like File is not. But the streams that you get by invoking methods on the Resource are obviously closeable.
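    
    For example (a small sketch; the URL and constructor form are made up for the example):
    
        // The Resource itself has no close(), but the stream it hands out does:
        final Resource resource = new HdfsResource("hdfs://namenode:9000/example/data.csv");
        try (InputStream in = resource.read()) {
            // consume the stream; closing it releases the underlying connection
        } catch (IOException e) {
            throw new MetaModelException("Could not read resource", e);
        }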



[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/36#issuecomment-126878797
  
    That seems like a good plan. I'll take a look.



[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by LosD <gi...@git.apache.org>.
Github user LosD commented on the pull request:

    https://github.com/apache/metamodel/pull/36#issuecomment-126828916
  
    In many (most?) cases, I believe it is pretty inefficient to have to close the stream for every single row. In HDFS it is crippling.
    
    It's not a problem for normal File operations, as you can easily buffer before writing a block, but unless I misunderstood @ClaudiaPHI's investigations, we do not have that luxury in MetaModel, as rows come in one at a time, and there's no way to know when they end. We can of course do the instanceof hack in the CSV writer, but do we want users of the Resource interface to be handicapped by default?
    
    I might have misunderstood something, but as far as I can see, HDFS is just an extreme example of a general problem.
    -- 
    Kind regards 
    Dennis Du Krøger 
    




[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by kaspersorensen <gi...@git.apache.org>.
Github user kaspersorensen commented on the pull request:

    https://github.com/apache/metamodel/pull/36#issuecomment-126835721
  
    To be more specific: it seems that for files we save a writer in CsvUpdateCallback. I think that's because we don't have a way to generically create one for any resource. Resource has a read() method that returns an InputStream but not a write() method that returns an OutputStream. Maybe we should simply add that? Then I think we could do the same trick on any resource implementation.
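    
    Something like this, as a sketch of the idea only (the real Resource interface has more methods than shown here, and toCsvLine is a made-up helper):
    
        public interface Resource {
            InputStream read();
            OutputStream write(); // the proposed addition
        }
        
        // Then a CSV update callback could keep one writer open for the whole batch
        // (assumes the surrounding method handles IOException):
        try (Writer writer = new OutputStreamWriter(resource.write(), StandardCharsets.UTF_8)) {
            for (final String[] row : rows) {
                writer.write(toCsvLine(row));
            }
        }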



[GitHub] metamodel pull request: Changed HdfsResource implementation so that ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/metamodel/pull/36

