Posted to issues@hawq.apache.org by xunzhang <gi...@git.apache.org> on 2016/06/10 07:48:17 UTC

[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

GitHub user xunzhang opened a pull request:

    https://github.com/apache/incubator-hawq/pull/701

    HAWQ 795,796,797

    Extend apache/orc to support HDFS read. This pull request contains the first three sub-tasks of [HAWQ-786](https://issues.apache.org/jira/browse/HAWQ-786), which are listed below:
     - [HAWQ-795](https://issues.apache.org/jira/browse/HAWQ-795) Import apache/orc library for further development.
     - [HAWQ-796](https://issues.apache.org/jira/browse/HAWQ-796) Extend orc library to support hdfs located file read.
     - [HAWQ-797](https://issues.apache.org/jira/browse/HAWQ-797) Implement reading tools for testing HDFS located orc file usage.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xunzhang/incubator-hawq HAWQ-795,796,797

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/701.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #701
    
----
commit e2ccc7609bd765f1588668f121347219f664f37f
Author: xunzhang <xu...@gmail.com>
Date:   2016-06-08T17:30:39Z

    HAWQ-795. Import apache/orc into depends folder.

commit c9bbac4988d26200a07181f90a4418c3d3090229
Author: xunzhang <xu...@gmail.com>
Date:   2016-06-09T08:30:55Z

    HAWQ-796. Extend orc library to support hdfs located file read

commit a5ba8747a455c399555d0646469c611ce6d2d247
Author: xunzhang <xu...@gmail.com>
Date:   2016-06-10T07:40:12Z

    HAWQ-797. Implement reading tools for testing HDFS located orc file usage.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    @ictmalili I added option parsing for `--batch` in `HdfsFileScan.cc` and `HdfsFileContents.cc`; the remaining code was already configurable. Please review again.
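
    The `--batch=` handling described above follows the strstr/atoi style used in the quoted tool sources. A minimal, self-contained sketch of that parsing pattern (the function name `parseBatchSize` is illustrative, not from the PR):

    ```cpp
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <string>

    // Scan argv for a "--batch=<n>" option and return its value,
    // falling back to a default when the option is absent.
    uint32_t parseBatchSize(int argc, char* argv[], uint32_t defaultSize) {
      const std::string BATCH_PREFIX = "--batch=";
      for (int i = 1; i < argc; i++) {
        char* param = std::strstr(argv[i], BATCH_PREFIX.c_str());
        if (param) {
          // Skip past "--batch=" and convert the remainder to a number.
          return static_cast<uint32_t>(std::atoi(param + BATCH_PREFIX.length()));
        }
      }
      return defaultSize;
    }

    int main() {
      char prog[] = "hdfs-orc-scan";
      char opt[] = "--batch=500";
      char* argvWithOpt[] = {prog, opt};
      char* argvWithout[] = {prog};
      std::cout << parseBatchSize(2, argvWithOpt, 1000) << "\n";
      std::cout << parseBatchSize(1, argvWithout, 1000) << "\n";
      return 0;
    }
    ```

    The parsed value would then be passed to `reader->createRowBatch(batchSize)` in place of a hard-coded literal.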



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    @rvs That is not quite right: this pull request contains three issues/commits. We added HDFS support for orc read to the apache/orc code...
    As communicated with @changleicn, for now we maintain the apache/orc codebase ourselves for development efficiency, and afterwards we can contribute our code back to apache/orc.
    For further development of orc-related issues in HAWQ, we could wait for the PM's comments in Apache JIRA before making a decision/pull request.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    @liming01 I will modify the usage info inside the code in the third commit.



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by liming01 <gi...@git.apache.org>.
Github user liming01 commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66714877
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileScan.cc ---
    @@ -0,0 +1,65 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <string>
    +#include <memory>
    +#include <iostream>
    +#include <string>
    +
    +#include "orc/ColumnPrinter.hh"
    +
    +#include "Exceptions.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +int main(int argc, char* argv[]) {
    +  if (argc < 2) {
    +    std::cout << "Usage: orc-scan <filename>\n";
    --- End diff --
    
    In order to distinguish HDFS files from local orc files, we need a different command name or an extra parameter for all of these tool commands to specify that the orc file is on HDFS. Alternatively, we could add code to auto-detect the type based on the protocol prefix (hdfs:// or file://) of the URL path.
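    
    The prefix-based auto-detection suggested above could be sketched as follows. This is a hypothetical illustration of the idea, with invented names (`FileSource`, `detectSource`), not code from the PR:

    ```cpp
    #include <iostream>
    #include <string>

    // Choose the reader backend from the URL scheme; paths with a
    // "file://" prefix, or with no scheme at all, are treated as local.
    enum class FileSource { Local, Hdfs };

    FileSource detectSource(const std::string& path) {
      const std::string hdfsPrefix = "hdfs://";
      if (path.compare(0, hdfsPrefix.size(), hdfsPrefix) == 0) {
        return FileSource::Hdfs;
      }
      return FileSource::Local;
    }

    int main() {
      std::cout << (detectSource("hdfs://nn:8020/data/x.orc") == FileSource::Hdfs)
                << "\n";
      std::cout << (detectSource("/tmp/x.orc") == FileSource::Local) << "\n";
      return 0;
    }
    ```

    With this kind of dispatch, one tool binary could serve both local and HDFS paths instead of shipping separate `orc-*` and `hdfs-orc-*` commands.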



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang closed the pull request at:

    https://github.com/apache/incubator-hawq/pull/701



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    @rvs In my previous understanding, HAWQ is an Apache project, so we should keep our discussions open in JIRA and GitHub. Sorry, I will discuss and comment more on the mailing list in the future. Thanks for your feedback!



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66717721
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileMemory.cc ---
    @@ -0,0 +1,151 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <string>
    +#include <memory>
    +#include <iostream>
    +#include <map>
    +#include <exception>
    +
    +#include "orc/orc-config.hh"
    +#include "orc/ColumnPrinter.hh"
    +#include "Exceptions.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +class TestMemoryPool: public orc::MemoryPool {
    +private:
    +  std::map<char*, uint64_t> blocks;
    +  uint64_t totalMemory;
    +  uint64_t maxMemory;
    +
    +public:
    +  char* malloc(uint64_t size) override {
    +    char* p = static_cast<char*>(std::malloc(size));
    +    blocks[p] = size ;
    +    totalMemory += size;
    +    if (maxMemory < totalMemory) {
    +      maxMemory = totalMemory;
    +    }
    +    return p;
    +  }
    +
    +  void free(char* p) override {
    +    std::free(p);
    +    totalMemory -= blocks[p] ;
    +    blocks.erase(p);
    +  }
    +
    +  uint64_t getMaxMemory() {
    +    return maxMemory ;
    +  }
    +
    +  TestMemoryPool(): totalMemory(0), maxMemory(0) {}
    +  ~TestMemoryPool();
    +};
    +
    +TestMemoryPool::~TestMemoryPool() {}
    +
    +void processFile(const char* filename,
    +                 const std::list<uint64_t>& cols,
    +                 uint32_t batchSize) {
    +  orc::ReaderOptions opts;
    +  if (cols.size() > 0) {
    +    opts.include(cols);
    +  }
    +  std::unique_ptr<orc::MemoryPool> pool(new TestMemoryPool());
    +  opts.setMemoryPool(*(pool.get()));
    +    
    +  auto addr = std::getenv("LIBHDFS3_NAMENODE_ADDRESS");
    +  std::string nn = "localhost";
    +  if(addr != NULL) {
    +    nn = std::string(addr);
    +  }
    +  auto portStr = std::getenv("LIBHDFS3_NAMENODE_PORT");
    +  tPort port = 8020;
    +  if (portStr != NULL) {
    +    port = static_cast<tPort>(atoi(portStr));
    +  }
    +  hdfsFS fs = hdfsConnect(nn.c_str(), port);
    +
    +  std::unique_ptr<orc::Reader> reader =
    +    orc::createReader(orc::readHdfsFile(fs, std::string(filename)), opts);
    +
    +  std::unique_ptr<orc::ColumnVectorBatch> batch =
    +      reader->createRowBatch(batchSize);
    +  uint64_t readerMemory = reader->getMemoryUse();
    +  uint64_t batchMemory = batch->getMemoryUsage();
    +  while (reader->next(*batch)) {}
    +  uint64_t actualMemory =
    +      static_cast<TestMemoryPool*>(pool.get())->getMaxMemory();
    +  std::cout << "Reader memory estimate: " << readerMemory
    +            << "\nBatch memory estimate:  " ;
    +  if (batch->hasVariableLength()) {
    +    std::cout << "Cannot estimate because reading ARRAY or MAP columns";
    +  } else {
    +    std::cout << batchMemory
    +              << "\nTotal memory estimate:  " << readerMemory + batchMemory;
    +  }
    +  std::cout << "\nActual max memory used: " << actualMemory << "\n";
    +}
    +
    +int main(int argc, char* argv[]) {
    +  if (argc < 2) {
    +    std::cout << "Usage: hdfs-orc-memory [--columns=column1,column2,...] "
    +        << "[--batch=rows_in_batch] <filename> \n";
    +    return 1;
    +  }
    +
    +  const std::string COLUMNS_PREFIX = "--columns=";
    +  const std::string BATCH_PREFIX = "--batch=";
    +  char* filename = ORC_NULLPTR;
    +
    +  // Default parameters
    +  std::list<uint64_t> cols;
    +  uint32_t batchSize = 1000;
    +
    +  // Read command-line options
    +  char *param, *value;
    +  for (int i = 1; i < argc; i++) {
    +    if ( (param = std::strstr(argv[i], COLUMNS_PREFIX.c_str())) ) {
    +      value = std::strtok(param+COLUMNS_PREFIX.length(), "," );
    +      while (value) {
    +        cols.push_back(static_cast<uint64_t>(std::atoi(value)));
    +        value = std::strtok(ORC_NULLPTR, "," );
    +      }
    +    } else if ( (param=strstr(argv[i], BATCH_PREFIX.c_str())) ) {
    +      batchSize =
    --- End diff --
    
    @ictmalili It is already configurable now; the assignment here uses `BATCH_PREFIX = "--batch="`.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by rvs <gi...@git.apache.org>.
Github user rvs commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    This is a discussion that belongs on the mailing list. Please do follow up.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    Merged into master.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by rvs <gi...@git.apache.org>.
Github user rvs commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    Where was it communicated? I saw no email exchange on the dev@hawq mailing list.
    
    Also, please follow up with your comments on the dev@hawq mailing list thread that I started. It would be much easier to have a more involved discussion there rather than on a JIRA.
    
    I do have a lot of concerns about your approach and I'm still -1 until we clarify some of the issues involved.



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by ictmalili <gi...@git.apache.org>.
Github user ictmalili commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66717132
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileContents.cc ---
    @@ -0,0 +1,101 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <cstdlib>
    +#include <memory>
    +#include <string>
    +#include <iostream>
    +#include <string>
    +#include <stdexcept>
    +
    +#include "orc/orc-config.hh"
    +#include "orc/ColumnPrinter.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +void printContents(const char* filename, const orc::ReaderOptions opts) {
    +  std::unique_ptr<orc::Reader> reader;
    +
    +  auto addr = std::getenv("LIBHDFS3_NAMENODE_ADDRESS");
    +  std::string nn = "localhost";
    +  if(addr != NULL) {
    +    nn = std::string(addr);
    +  }
    +  auto portStr = std::getenv("LIBHDFS3_NAMENODE_PORT");
    +  tPort port = 8020;
    +  if (portStr != NULL) {
    +    port = static_cast<tPort>(atoi(portStr));
    +  }
    +  hdfsFS fs = hdfsConnect(nn.c_str(), port);
    +  reader = orc::createReader(orc::readHdfsFile(fs, std::string(filename)), opts);
    +
    +  std::unique_ptr<orc::ColumnVectorBatch> batch = reader->createRowBatch(1000);
    --- End diff --
    
    Can we extract "1000" out into an input parameter? I think it will be very useful if the utility can print information depending on the input.



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66714919
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileScan.cc ---
    @@ -0,0 +1,65 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <string>
    +#include <memory>
    +#include <iostream>
    +#include <string>
    +
    +#include "orc/ColumnPrinter.hh"
    +
    +#include "Exceptions.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +int main(int argc, char* argv[]) {
    +  if (argc < 2) {
    +    std::cout << "Usage: orc-scan <filename>\n";
    --- End diff --
    
    Good advice! We will use the prefix to auto-detect the protocol. By default, we treat the path as `file://`.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by liming01 <gi...@git.apache.org>.
Github user liming01 commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    @xunzhang, it looks fine to me, except for the third commit. I added a comment in the code. Thanks.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    cc @liming01 @ictmalili @changleicn 



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66717644
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileMemory.cc ---
    @@ -0,0 +1,151 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <string>
    +#include <memory>
    +#include <iostream>
    +#include <map>
    +#include <exception>
    +
    +#include "orc/orc-config.hh"
    +#include "orc/ColumnPrinter.hh"
    +#include "Exceptions.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +class TestMemoryPool: public orc::MemoryPool {
    +private:
    +  std::map<char*, uint64_t> blocks;
    +  uint64_t totalMemory;
    +  uint64_t maxMemory;
    +
    +public:
    +  char* malloc(uint64_t size) override {
    +    char* p = static_cast<char*>(std::malloc(size));
    +    blocks[p] = size ;
    +    totalMemory += size;
    +    if (maxMemory < totalMemory) {
    +      maxMemory = totalMemory;
    +    }
    +    return p;
    +  }
    +
    +  void free(char* p) override {
    +    std::free(p);
    +    totalMemory -= blocks[p] ;
    +    blocks.erase(p);
    +  }
    +
    +  uint64_t getMaxMemory() {
    +    return maxMemory ;
    +  }
    +
    +  TestMemoryPool(): totalMemory(0), maxMemory(0) {}
    +  ~TestMemoryPool();
    +};
    +
    +TestMemoryPool::~TestMemoryPool() {}
    +
    +void processFile(const char* filename,
    +                 const std::list<uint64_t>& cols,
    +                 uint32_t batchSize) {
    +  orc::ReaderOptions opts;
    +  if (cols.size() > 0) {
    +    opts.include(cols);
    +  }
    +  std::unique_ptr<orc::MemoryPool> pool(new TestMemoryPool());
    +  opts.setMemoryPool(*(pool.get()));
    +    
    +  auto addr = std::getenv("LIBHDFS3_NAMENODE_ADDRESS");
    +  std::string nn = "localhost";
    +  if(addr != NULL) {
    +    nn = std::string(addr);
    +  }
    +  auto portStr = std::getenv("LIBHDFS3_NAMENODE_PORT");
    +  tPort port = 8020;
    +  if (portStr != NULL) {
    +    port = static_cast<tPort>(atoi(portStr));
    +  }
    +  hdfsFS fs = hdfsConnect(nn.c_str(), port);
    +
    +  std::unique_ptr<orc::Reader> reader =
    +    orc::createReader(orc::readHdfsFile(fs, std::string(filename)), opts);
    +
    +  std::unique_ptr<orc::ColumnVectorBatch> batch =
    +      reader->createRowBatch(batchSize);
    +  uint64_t readerMemory = reader->getMemoryUse();
    +  uint64_t batchMemory = batch->getMemoryUsage();
    +  while (reader->next(*batch)) {}
    +  uint64_t actualMemory =
    +      static_cast<TestMemoryPool*>(pool.get())->getMaxMemory();
    +  std::cout << "Reader memory estimate: " << readerMemory
    +            << "\nBatch memory estimate:  " ;
    +  if (batch->hasVariableLength()) {
    +    std::cout << "Cannot estimate because reading ARRAY or MAP columns";
    +  } else {
    +    std::cout << batchMemory
    +              << "\nTotal memory estimate:  " << readerMemory + batchMemory;
    +  }
    +  std::cout << "\nActual max memory used: " << actualMemory << "\n";
    +}
    +
    +int main(int argc, char* argv[]) {
    +  if (argc < 2) {
    +    std::cout << "Usage: hdfs-orc-memory [--columns=column1,column2,...] "
    +        << "[--batch=rows_in_batch] <filename> \n";
    +    return 1;
    +  }
    +
    +  const std::string COLUMNS_PREFIX = "--columns=";
    +  const std::string BATCH_PREFIX = "--batch=";
    +  char* filename = ORC_NULLPTR;
    +
    +  // Default parameters
    +  std::list<uint64_t> cols;
    +  uint32_t batchSize = 1000;
    --- End diff --
    
    Ditto



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by liming01 <gi...@git.apache.org>.
Github user liming01 commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
     LGTM. +1.



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by ictmalili <gi...@git.apache.org>.
Github user ictmalili commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66717137
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileMemory.cc ---
    @@ -0,0 +1,151 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <string>
    +#include <memory>
    +#include <iostream>
    +#include <map>
    +#include <exception>
    +
    +#include "orc/orc-config.hh"
    +#include "orc/ColumnPrinter.hh"
    +#include "Exceptions.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +class TestMemoryPool: public orc::MemoryPool {
    +private:
    +  std::map<char*, uint64_t> blocks;
    +  uint64_t totalMemory;
    +  uint64_t maxMemory;
    +
    +public:
    +  char* malloc(uint64_t size) override {
    +    char* p = static_cast<char*>(std::malloc(size));
    +    blocks[p] = size ;
    +    totalMemory += size;
    +    if (maxMemory < totalMemory) {
    +      maxMemory = totalMemory;
    +    }
    +    return p;
    +  }
    +
    +  void free(char* p) override {
    +    std::free(p);
    +    totalMemory -= blocks[p] ;
    +    blocks.erase(p);
    +  }
    +
    +  uint64_t getMaxMemory() {
    +    return maxMemory ;
    +  }
    +
    +  TestMemoryPool(): totalMemory(0), maxMemory(0) {}
    +  ~TestMemoryPool();
    +};
    +
    +TestMemoryPool::~TestMemoryPool() {}
    +
    +void processFile(const char* filename,
    +                 const std::list<uint64_t>& cols,
    +                 uint32_t batchSize) {
    +  orc::ReaderOptions opts;
    +  if (cols.size() > 0) {
    +    opts.include(cols);
    +  }
    +  std::unique_ptr<orc::MemoryPool> pool(new TestMemoryPool());
    +  opts.setMemoryPool(*(pool.get()));
    +    
    +  auto addr = std::getenv("LIBHDFS3_NAMENODE_ADDRESS");
    +  std::string nn = "localhost";
    +  if(addr != NULL) {
    +    nn = std::string(addr);
    +  }
    +  auto portStr = std::getenv("LIBHDFS3_NAMENODE_PORT");
    +  tPort port = 8020;
    +  if (portStr != NULL) {
    +    port = static_cast<tPort>(atoi(portStr));
    +  }
    +  hdfsFS fs = hdfsConnect(nn.c_str(), port);
    +
    +  std::unique_ptr<orc::Reader> reader =
    +    orc::createReader(orc::readHdfsFile(fs, std::string(filename)), opts);
    +
    +  std::unique_ptr<orc::ColumnVectorBatch> batch =
    +      reader->createRowBatch(batchSize);
    +  uint64_t readerMemory = reader->getMemoryUse();
    +  uint64_t batchMemory = batch->getMemoryUsage();
    +  while (reader->next(*batch)) {}
    +  uint64_t actualMemory =
    +      static_cast<TestMemoryPool*>(pool.get())->getMaxMemory();
    +  std::cout << "Reader memory estimate: " << readerMemory
    +            << "\nBatch memory estimate:  " ;
    +  if (batch->hasVariableLength()) {
    +    std::cout << "Cannot estimate because reading ARRAY or MAP columns";
    +  } else {
    +    std::cout << batchMemory
    +              << "\nTotal memory estimate:  " << readerMemory + batchMemory;
    +  }
    +  std::cout << "\nActual max memory used: " << actualMemory << "\n";
    +}
    +
    +int main(int argc, char* argv[]) {
    +  if (argc < 2) {
    +    std::cout << "Usage: hdfs-orc-memory [--columns=column1,column2,...] "
    +        << "[--batch=rows_in_batch] <filename> \n";
    +    return 1;
    +  }
    +
    +  const std::string COLUMNS_PREFIX = "--columns=";
    +  const std::string BATCH_PREFIX = "--batch=";
    +  char* filename = ORC_NULLPTR;
    +
    +  // Default parameters
    +  std::list<uint64_t> cols;
    +  uint32_t batchSize = 1000;
    --- End diff --
    
    Is this variable configurable?



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by rvs <gi...@git.apache.org>.
Github user rvs commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    This appears to be a complete fork of the ORC codebase (do correct me if I'm wrong), so I'm very much -1 on this. Please see the email thread on dev@hawq, because I think we may need to back it out of master and come up with a better approach.



[GitHub] incubator-hawq issue #701: HAWQ 795,796,797

Posted by ictmalili <gi...@git.apache.org>.
Github user ictmalili commented on the issue:

    https://github.com/apache/incubator-hawq/pull/701
  
    LGTM except for the two comments.



[GitHub] incubator-hawq pull request #701: HAWQ 795,796,797

Posted by xunzhang <gi...@git.apache.org>.
Github user xunzhang commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq/pull/701#discussion_r66717642
  
    --- Diff: depends/thirdparty/orc/tools/src/HdfsFileContents.cc ---
    @@ -0,0 +1,101 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +#include <cstdlib>
    +#include <memory>
    +#include <string>
    +#include <iostream>
    +#include <string>
    +#include <stdexcept>
    +
    +#include "orc/orc-config.hh"
    +#include "orc/ColumnPrinter.hh"
    +
    +#include "wrap/libhdfs3-wrapper.h"
    +
    +void printContents(const char* filename, const orc::ReaderOptions opts) {
    +  std::unique_ptr<orc::Reader> reader;
    +
    +  auto addr = std::getenv("LIBHDFS3_NAMENODE_ADDRESS");
    +  std::string nn = "localhost";
    +  if(addr != NULL) {
    +    nn = std::string(addr);
    +  }
    +  auto portStr = std::getenv("LIBHDFS3_NAMENODE_PORT");
    +  tPort port = 8020;
    +  if (portStr != NULL) {
    +    port = static_cast<tPort>(atoi(portStr));
    +  }
    +  hdfsFS fs = hdfsConnect(nn.c_str(), port);
    +  reader = orc::createReader(orc::readHdfsFile(fs, std::string(filename)), opts);
    +
    +  std::unique_ptr<orc::ColumnVectorBatch> batch = reader->createRowBatch(1000);
    --- End diff --
    
    It's OK, though this is not actually a performance test. I will add it to the option-parsing logic.

