You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@quickstep.apache.org by tarunbansal <gi...@git.apache.org> on 2016/09/29 21:59:48 UTC

[GitHub] incubator-quickstep pull request #103: QUICKSTEP-46 Fault tolerance in bulk ...

GitHub user tarunbansal opened a pull request:

    https://github.com/apache/incubator-quickstep/pull/103

    QUICKSTEP-46 Fault tolerance in bulk loading data

    This PR helps add fault tolerance while bulk loading data from a file. Now quickstep will skip a record(a row) if it is faulty and move to the next one. Possible error cases are listed below and further classified as To-do or Done :
    DONE
    1) Record has too few fields.
    2) Record has too many fields.
    3) A field could not be parsed properly as per the attribute type in a schema.
    4) Null literal specified for a column with a not nullable type.
    TO-DO
    5) Null character mixed in with other data for a column.
    6) Backslash line splicing support.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tarunbansal/incubator-quickstep master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #103
    
----
commit 596c783d161fbbfd5f1f987c3aa40d621e142ab9
Author: Tarun Bansal <ta...@rockhopper-09.cs.wisc.edu>
Date:   2016-09-23T14:30:23Z

    changes to fix QUICKSTEP-46

commit b753d5209e11718b39869fce8659611a7c9f6945
Author: Tarun Bansal <ta...@rockhopper-05.cs.wisc.edu>
Date:   2016-09-24T22:37:36Z

    QUICKSTEP-46 fixed

commit daa4a37f1a6cc18c999ed8c8b39fc37d6cd67819
Author: Tarun Bansal <ta...@rockhopper-09.cs.wisc.edu>
Date:   2016-09-25T01:11:43Z

    QUICKSTEP-46 fixed

commit b1f91650aa9806e018a64043b0e35861d78a8e6f
Author: Tarun Bansal <ta...@rockhopper-06.cs.wisc.edu>
Date:   2016-09-28T00:10:57Z

    QUICKSTEP-46 fixed

commit 9c30199658a462cded24c71bf7788f121294f29d
Author: Tarun Bansal <ta...@rockhopper-09.cs.wisc.edu>
Date:   2016-09-23T14:30:23Z

    changes to fix QUICKSTEP-46

commit dfeb999b2818a95837f84eccfd5215f70bee14a8
Author: Tarun Bansal <ta...@rockhopper-05.cs.wisc.edu>
Date:   2016-09-24T22:37:36Z

    QUICKSTEP-46 fixed

commit 1bbaf4121e23aa76041c58d10cd984d86844315b
Author: Tarun Bansal <ta...@gmail.com>
Date:   2016-09-25T01:11:43Z

    QUICKSTEP-46 fixed

commit cd5a4561d91d0b670630dbccf6ebe345f363a6c8
Author: Tarun Bansal <ta...@gmail.com>
Date:   2016-09-28T00:30:29Z

    QUICKSTEP-46 fixed

commit fb0d65841a6b623c3a8c22e4f6767cd87d838c56
Author: Tarun Bansal <ta...@gmail.com>
Date:   2016-09-28T01:03:08Z

    QUICKSTEP-46 fixed

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by tarunbansal <gi...@git.apache.org>.
Github user tarunbansal commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @pateljm Thanks. I will add the suggested test after a discussion with @hbdeshmukh @hakanmemisoglu about the testing infrastructure. 
    @zuyu Thanks for your suggestion. Moving the original faulty tuples at the end of the data file will need additional write permissions as compared to the current read mode we are using for loading the data. Should that be done in a future PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @zuyu I'd suggest move this feature along and not expand the scope unnecessarily. There is a tradeoff -- one can dump the line numbers of the problematic tuples, but that slows down loading. For now, I'd suggest to wrap this feature and not load it with more. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal I can close this. Can you rebase? Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal This feature is great, and I was wondering if we would consider dumping the original faulty tuples at the end, instead of ignoring them all. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal This looks good. May I request adding a test to show this error message? Perhaps by augmenting tests/TextScanOperator_unittest.cpp? Then squash the commits, and this is good to go!
    
    @hakanmemisoglu @hbdeshmukh If @tarunbansal could you walk him through the testing infrastructure and show him the ropes? Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by tarunbansal <gi...@git.apache.org>.
Github user tarunbansal commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @zuyu Thanks. I have now included the test in cmake file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @zuyu IMHO it would be odd to modify the data file. Thanks for looking at this PR. 
    @tarunbansal I think this PR is good as it is. 
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by tarunbansal <gi...@git.apache.org>.
Github user tarunbansal commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    Thanks for demonstrating me the testing infrastructure @hakanmemisoglu and @hbdeshmukh.  As discussed I have added a test case where some faulty records are present. Also logged the error messages to the log file instead of logging them to stdout.
    Created a single commit for all the changes done so far. Please have a look and merge this PR if it is good to go now. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by tarunbansal <gi...@git.apache.org>.
Github user tarunbansal commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @pateljm Rebased it. Please merge this PR if it is good to go now. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by tarunbansal <gi...@git.apache.org>.
Github user tarunbansal commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @pateljm Rebase done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @hbdeshmukh Sg to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by hbdeshmukh <gi...@git.apache.org>.
Github user hbdeshmukh commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    Based on a discussion between @tarunbansal, @hakanmemisoglu and I - we noted two things: 
    1. Logging error messages to std::cout could be unhelpful when there are multiple threads trying to load the data. The error messages will get mixed up with each other. For now, @tarunbansal decided to log the error messages (using GLOG) to the log file. Later, we can summarize the error messages using thread-private buffers and emit them at the end of each work order execution.
     
    2. @tarunbansal will look at how the current testing framework for TextScan operator and add tests with corrupt input tuples. The testing would involve loading an input with some corrupt tuples and verifying if the correct tuples got inserted. 
    
    Once the above comments are addressed and the commits are squashed in one, we will be good to go. Thanks @pateljm @zuyu for your inputs. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal Merged. Thanks for this contribution!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal To clarify, I mean to dump the flaky tuples `at the end` of query execution, either to `stdout` or a new file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep pull request #103: QUICKSTEP-46 Fault tolerance in bulk ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-quickstep/pull/103


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by zuyu <gi...@git.apache.org>.
Github user zuyu commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal I believe we also need to include the new unit test in `CMakeLists.txt` (as [here](https://github.com/apache/incubator-quickstep/blob/master/relational_operators/CMakeLists.txt#L771)) to actually run the faulty one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by pateljm <gi...@git.apache.org>.
Github user pateljm commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    @tarunbansal Can you rebase again? Sorry - with the other commits, now this rebase is off (one of the limitations of git on ASF). Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-quickstep issue #103: QUICKSTEP-46 Fault tolerance in bulk loading...

Posted by tarunbansal <gi...@git.apache.org>.
Github user tarunbansal commented on the issue:

    https://github.com/apache/incubator-quickstep/pull/103
  
    Rebased it again to be in synch with the latest commits @hbdeshmukh @pateljm . Please merge this PR if everything looks okay. Thanks !



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---