Posted to dev@quickstep.apache.org by jianqiao <gi...@git.apache.org> on 2016/06/09 08:44:29 UTC

[GitHub] incubator-quickstep pull request #19: Improve text scan operator

GitHub user jianqiao opened a pull request:

    https://github.com/apache/incubator-quickstep/pull/19

    Improve text scan operator

    This PR updates the `TextScanOperator` to improve its performance.
    
    There are three main changes:
    (1) Pass `text_offset` and `text_segment_size` as parameters to each `TextScanWorkOrder` instead of actually loading the data up front. Each `TextScanWorkOrder` then reads its piece of the data directly from disk.
    (2) Avoid extra string copying by passing `const char **` buffer pointers into `parseRow()` and `extractFieldString()` (see the sketch after this list).
    (3) Use `ColumnVectorsValueAccessor` as the temporary container to store the parsed tuples. Then call `output_destination_->bulkInsertTuples()` to bulk insert the tuples.
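    
    For illustration of (2), below is a minimal sketch of the pointer-based field extraction idea; the signature and variable names here are assumptions for the sketch, not the exact code in this PR:
    ```
    #include <string>
    
    // Hypothetical sketch: advance *buf past one field, copying the field's
    // characters straight out of the shared read buffer into *value, with no
    // intermediate per-row string copy.
    inline bool extractFieldString(const char **buf,
                                   const char field_terminator,
                                   std::string *value) {
      const char *pos = *buf;
      while (*pos != field_terminator && *pos != '\n') {
        ++pos;
      }
      value->assign(*buf, pos - *buf);  // One copy, directly from the buffer.
      const bool has_more_fields = (*pos != '\n');
      *buf = pos + 1;  // Skip past the terminator for the next call.
      return has_more_fields;
    }
    ```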
    
    **Note 1:** This updated version follows the semantics of the old `TextScanOperator`, except that it does not support backslash + newline escaping, e.g.
    (a)
    ```
    aaaa\
    bbbb
    ```
    which is semantically equivalent to
    (b)
    ```
    aaaa\nbbbb
    ```
    The updated version supports (b) but not (a), since (a) requires extra logic that complicates the code. Meanwhile, format (a) seems to be specific to PostgreSQL, and the [documentation](http://www.postgresql.org/docs/9.6/static/sql-copy.html) of PostgreSQL 9.6 says:
    _It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases. They are also highly vulnerable to corruption if the COPY file is transferred across different machines (for example, from Unix to Windows or vice versa)._
    
    **Note 2:** This PR relies on the fix from #18 to work correctly for loading `TYPE compressed_columnstore` tables.
    
    **Note 3:** Using 40 workers, the expected loading times on the CloudLab machines with the current SQL-benchmark settings are ~465s for SSB SF100 and ~1050s for TPCH SF100.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-quickstep improve-text-scan-operator-column-vectors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19
    
----
commit 55b06fab1bd336f2cc7ee4bd557d3328a428e4ab
Author: Jianqiao Zhu <ji...@cs.wisc.edu>
Date:   2016-06-09T08:18:37Z

    Improve text scan operator

----



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by jianqiao <gi...@git.apache.org>.
Github user jianqiao commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66459911
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -274,439 +116,293 @@ TextScanWorkOrder::TextScanWorkOrder(const std::size_t query_id,
     
     void TextScanWorkOrder::execute() {
       const CatalogRelationSchema &relation = output_destination_->getRelation();
    +  std::vector<Tuple> tuples;
     
    -  string current_row_string;
    -  if (is_file_) {
    -    FILE *file = std::fopen(filename_.c_str(), "r");
    -    if (file == nullptr) {
    -      throw TextScanReadError(filename_);
    -    }
    +  constexpr std::size_t kSmallBufferSize = 0x4000;
    --- End diff --
    
    From experiments, this is the best size for both _slowdisk_ and _fastdisk_ on the CloudLab machines. The observation is that _slowdisk_ loading favors a smaller (256K) segment size, while _fastdisk_ loading favors a larger (1M~4M, but 256K is okay) segment size. I haven't dug into the underlying details, though.



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by jianqiao <gi...@git.apache.org>.
Github user jianqiao commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66470954
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
     
    -  if (parallelize_load_) {
    -    // Parallel implementation: Split work orders are generated for each file
    -    // being bulk-loaded. (More than one file can be loaded, because we support
    -    // glob() semantics in file name.) These work orders read the input file,
    -    // and split them in the blobs that can be parsed independently.
    -    if (blocking_dependencies_met_) {
    -      if (!work_generated_) {
    -        // First, generate text-split work orders.
    -        for (const auto &file : files) {
    -          container->addNormalWorkOrder(
    -              new TextSplitWorkOrder(query_id_,
    -                                     file,
    -                                     process_escape_sequences_,
    -                                     storage_manager,
    -                                     op_index_,
    -                                     scheduler_client_id,
    -                                     bus),
    -              op_index_);
    -          ++num_split_work_orders_;
    -        }
    -        work_generated_ = true;
    -        return false;
    -      } else {
    -        // Check if there are blobs to parse.
    -        while (!text_blob_queue_.empty()) {
    -          const TextBlob blob_work = text_blob_queue_.popOne();
    -          container->addNormalWorkOrder(
    -              new TextScanWorkOrder(query_id_,
    -                                    blob_work.blob_id,
    -                                    blob_work.size,
    -                                    field_terminator_,
    -                                    process_escape_sequences_,
    -                                    output_destination,
    -                                    storage_manager),
    -              op_index_);
    -        }
    -        // Done if all split work orders are completed, and no blobs are left to
    -        // process.
    -        return num_done_split_work_orders_.load(std::memory_order_acquire) == num_split_work_orders_ &&
    -               text_blob_queue_.empty();
    -      }
    -    }
    -    return false;
    -  } else {
    -    // Serial implementation.
    -    if (blocking_dependencies_met_ && !work_generated_) {
    -      for (const auto &file : files) {
    +  // Text segment size set to 256KB.
    +  constexpr std::size_t kTextSegmentSize = 0x40000u;
    --- End diff --
    
    Yes, it would be better to have a gflag for this.
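    
    A minimal sketch of such a flag (the flag name, default, and help string are illustrative only, not from this PR):
    ```
    #include "gflags/gflags.h"
    
    // Hypothetical flag standing in for the hard-coded kTextSegmentSize.
    DEFINE_uint64(textscan_text_segment_size, 0x40000,
                  "Size in bytes of each text segment assigned to a "
                  "TextScanWorkOrder.");
    ```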



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by jianqiao <gi...@git.apache.org>.
Github user jianqiao commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66470811
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
     
    [... same deleted block as quoted in full earlier in the thread ...]
    +  // Text segment size set to 256KB.
    +  constexpr std::size_t kTextSegmentSize = 0x40000u;
    +
    +  if (blocking_dependencies_met_ && !work_generated_) {
    +    for (const std::string &file : files) {
    +      // Use standard C library to retrieve the file size.
    +      FILE *fp = std::fopen(file.c_str(), "rb");
    +      std::fseek(fp, 0, SEEK_END);
    +      const std::size_t file_size = std::ftell(fp);
    +      std::fclose(fp);
    +
    +      std::size_t text_offset = 0;
    +      while (text_offset < file_size) {
             container->addNormalWorkOrder(
                 new TextScanWorkOrder(query_id_,
                                       file,
    +                                  text_offset,
    +                                  std::min(kTextSegmentSize, file_size - text_offset),
                                       field_terminator_,
                                       process_escape_sequences_,
                                       output_destination,
                                       storage_manager),
                 op_index_);
    +        text_offset += kTextSegmentSize;
    --- End diff --
    
    Yes, alternatively we can do:
    ```
    std::size_t text_segment_size = std::min(kTextSegmentSize, file_size - text_offset);
    
    ... // addNormalWorkOrder
    
    text_offset += text_segment_size;
    ```



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66457753
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -274,439 +116,293 @@ TextScanWorkOrder::TextScanWorkOrder(const std::size_t query_id,
     
     void TextScanWorkOrder::execute() {
       const CatalogRelationSchema &relation = output_destination_->getRelation();
    +  std::vector<Tuple> tuples;
     
    [... same deleted lines as quoted in full earlier in the thread ...]
    +  constexpr std::size_t kSmallBufferSize = 0x4000;
    --- End diff --
    
    Again, I was curious how we came up with this number. Would a different value change the loading performance?



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by jianqiao <gi...@git.apache.org>.
Github user jianqiao commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66460190
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
     
    [... same deleted block as quoted in full earlier in the thread ...]
    +  // Text segment size set to 256KB.
    +  constexpr std::size_t kTextSegmentSize = 0x40000u;
    --- End diff --
    
    From experiments, this is the best size for both _slowdisk_ and _fastdisk_ on the CloudLab machines. The observation is that _slowdisk_ loading favors a smaller (256K) segment size, while _fastdisk_ loading favors a larger (1M~4M, but 256K is okay) segment size. I haven't dug into the underlying details, though.



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by jianqiao <gi...@git.apache.org>.
Github user jianqiao commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66470302
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -274,439 +116,293 @@ TextScanWorkOrder::TextScanWorkOrder(const std::size_t query_id,
     
     void TextScanWorkOrder::execute() {
       const CatalogRelationSchema &relation = output_destination_->getRelation();
    +  std::vector<Tuple> tuples;
     
    [... same deleted lines as quoted in full earlier in the thread ...]
    +  constexpr std::size_t kSmallBufferSize = 0x4000;
    --- End diff --
    
    This is the buffer size for processing the last row of the text segment.
    
    For each text segment, we first (1) scan from the first newline (`\n`) character in the segment to the last newline character in the segment; and then (2) scan from the _last_ newline character in _this_ text segment to the _first_ newline character in the _next_ text segment (corner cases are also handled).
    
    Considering (2): how much data from the _next_ segment do we want to load from disk? Since it is just one row, in most cases we do not want to load too much. So the load buffer starts at 1024 bytes, and we keep appending the buffer's contents to a `std::string` until a `\n` is found. If this "tail row" is really large, the buffer grows up to 0x4000 bytes.
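    
    A rough sketch of that growth strategy, with assumed names (not the exact code in this PR):
    ```
    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    #include <string>
    
    // Hypothetical sketch: read the "tail row" that starts at tail_offset,
    // doubling the read size from 1K toward the 0x4000 cap, and appending to
    // a std::string until a newline is found or the file ends.
    std::string readTailRow(FILE *file, const long tail_offset) {
      std::string row;
      char buffer[0x4000];
      std::size_t read_size = 1024;
      std::fseek(file, tail_offset, SEEK_SET);
      for (;;) {
        const std::size_t bytes = std::fread(buffer, 1, read_size, file);
        const char *newline =
            static_cast<const char *>(std::memchr(buffer, '\n', bytes));
        if (newline != nullptr) {
          row.append(buffer, newline - buffer + 1);  // Keep up to the newline.
          break;
        }
        row.append(buffer, bytes);
        if (bytes < read_size) {
          break;  // Reached end of file before seeing a newline.
        }
        read_size = std::min<std::size_t>(read_size * 2, sizeof(buffer));
      }
      return row;
    }
    ```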



[GitHub] incubator-quickstep issue #19: Improve text scan operator

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/19
  
    @zuyu -- I think it is perfectly fine for you to go clean it up and open a new PR.



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66461609
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
     
    [... same deleted block as quoted in full earlier in the thread ...]
    +  // Text segment size set to 256KB.
    +  constexpr std::size_t kTextSegmentSize = 0x40000u;
    --- End diff --
    
    Cool. I would suggest using a gflag to define this value, so that we could set any value w/o recompilation.



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66455959
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
     
    [... same deleted block as quoted in full earlier in the thread ...]
    +  // Text segment size set to 256KB.
    +  constexpr std::size_t kTextSegmentSize = 0x40000u;
    --- End diff --
    
    @jianqiao I was wondering if there is a reason to choose this size. Thanks!



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-quickstep/pull/19



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66458585
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -274,439 +116,293 @@ TextScanWorkOrder::TextScanWorkOrder(const std::size_t query_id,
     
     void TextScanWorkOrder::execute() {
       const CatalogRelationSchema &relation = output_destination_->getRelation();
    +  std::vector<Tuple> tuples;
     
    [... same deleted lines as quoted in full earlier in the thread ...]
    +  constexpr std::size_t kSmallBufferSize = 0x4000;
    +  char *buffer = reinterpret_cast<char *>(malloc(std::max(text_segment_size_, kSmallBufferSize)));
     
    -    bool have_row = false;
    -    do {
    -      current_row_string.clear();
    -      have_row = readRowFromFile(file, &current_row_string);
    -      if (have_row) {
    -        Tuple tuple = parseRow(current_row_string, relation);
    -        output_destination_->insertTupleInBatch(tuple);
    -      }
    -    } while (have_row);
    -
    -    std::fclose(file);
    -  } else {
    -    BlobReference blob = storage_manager_->getBlob(text_blob_);
    -    const char *blob_pos = static_cast<const char*>(blob->getMemory());
    -    const char *blob_end = blob_pos + text_size_;
    -    bool have_row = false;
    -    do {
    -      current_row_string.clear();
    -      have_row = readRowFromBlob(&blob_pos, blob_end, &current_row_string);
    -      if (have_row) {
    -        Tuple tuple = parseRow(current_row_string, relation);
    -        output_destination_->insertTupleInBatch(tuple);
    -      }
    -    } while (have_row);
    -
    -    // Drop the consumed blob produced by TextSplitWorkOrder.
    -    blob.release();
    -    storage_manager_->deleteBlockOrBlobFile(text_blob_);
    +  // Read text segment into buffer.
    +  FILE *file = std::fopen(filename_.c_str(), "rb");
    +  std::fseek(file, text_offset_, SEEK_SET);
    +  std::size_t bytes_read = std::fread(buffer, 1, text_segment_size_, file);
    --- End diff --
    
    Please mark `const`.
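    
    That is, for the quoted line:
    ```
    const std::size_t bytes_read = std::fread(buffer, 1, text_segment_size_, file);
    ```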



[GitHub] incubator-quickstep issue #19: Improve text scan operator

Posted by jianqiao <gi...@git.apache.org>.
Github user jianqiao commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/19
  
    From experiments, this is the best size for both _slowdisk_ and _fastdisk_ on the CloudLab machines. The observation is that _slowdisk_ loading favors a smaller (256K) segment size, while _fastdisk_ loading favors a larger (1M~4M, but 256K is okay) segment size. I haven't dug into the underlying details, though.



[GitHub] incubator-quickstep pull request #19: Improve text scan operator

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66457297
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -155,116 +63,50 @@ bool TextScanOperator::getAllWorkOrders(
       InsertDestination *output_destination =
           query_context->getInsertDestination(output_destination_index_);
     
    [... same deleted block as quoted in full earlier in the thread ...]
    +  // Text segment size set to 256KB.
    +  constexpr std::size_t kTextSegmentSize = 0x40000u;
    +
    +  if (blocking_dependencies_met_ && !work_generated_) {
    +    for (const std::string &file : files) {
    +      // Use standard C library to retrieve the file size.
    +      FILE *fp = std::fopen(file.c_str(), "rb");
    +      std::fseek(fp, 0, SEEK_END);
    +      const std::size_t file_size = std::ftell(fp);
    +      std::fclose(fp);
    +
    +      std::size_t text_offset = 0;
    +      while (text_offset < file_size) {
             container->addNormalWorkOrder(
                 new TextScanWorkOrder(query_id_,
                                       file,
    +                                  text_offset,
    +                                  std::min(kTextSegmentSize, file_size - text_offset),
                                       field_terminator_,
                                       process_escape_sequences_,
                                       output_destination,
                                       storage_manager),
                 op_index_);
    +        text_offset += kTextSegmentSize;
    --- End diff --
    
    This won't cause a bug, but I think what we really mean is the following:
    
    ```
      const size_t text_actual_segment_size = std::min(kTextSegmentSize, file_size - text_offset);
      text_offset += text_actual_segment_size;
    ```



[GitHub] incubator-quickstep issue #19: Improve text scan operator

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/19
  
    @jianqiao This is fantastic! This is going to make a huge difference when loading large datasets. LGTM. For perspective, loading SSB used to take about 2800 seconds, so this is a huge ~6X improvement!
    
    Merging.



[GitHub] incubator-quickstep issue #19: Improve text scan operator

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/19
  
    @jianqiao There are some clean-ups that follow from this PR, including removing `TextScanOperator.proto` and deleting `TextSplitWorkOrder` from `WorkOrder.proto`; I could do them.

