Posted to dev@orc.apache.org by moresandeep <gi...@git.apache.org> on 2017/10/28 23:46:07 UTC

[GitHub] orc pull request #184: Orc 256 unmask range option

GitHub user moresandeep opened a pull request:

    https://github.com/apache/orc/pull/184

    Orc 256 unmask range option

    This PR contains changes that enable an unmask range option for the redact mask (ORC-256).
    
    1. The redact mask accepts an additional option (option #3 in this case) that configures which ranges of a string are left unmasked.
    2. The option looks like "0:4,-4:-1".
    3. Range unmasking is available for four types: Long, Double, String, and Decimal.
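
    The option format described above can be sketched as follows. This is a minimal illustration, not the RedactMaskFactory code from this PR: the class UnmaskRangeSketch and its methods parse and isUnmasked are hypothetical names. It assumes that both ends of a range are inclusive and that a negative index counts from the end of the value, which is how the tests quoted in this thread read (for example "0:3,-4:-1" for the first 4 and last 4 digits).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not the RedactMaskFactory code from this PR. */
final class UnmaskRangeSketch {

  /** Parses a spec such as "0:3,-4:-1" into {start, end} pairs (assumed inclusive). */
  static List<int[]> parse(String spec) {
    List<int[]> ranges = new ArrayList<>();
    if (spec == null || spec.trim().isEmpty()) {
      return ranges; // no option given: nothing is left unmasked
    }
    for (String part : spec.split(",")) {
      String[] bounds = part.trim().split(":");
      if (bounds.length != 2) {
        throw new IllegalArgumentException("Bad range: " + part);
      }
      ranges.add(new int[] {
          Integer.parseInt(bounds[0].trim()),
          Integer.parseInt(bounds[1].trim())});
    }
    return ranges;
  }

  /** Returns true if position idx of a value with the given length falls in any range. */
  static boolean isUnmasked(List<int[]> ranges, int idx, int length) {
    for (int[] r : ranges) {
      // Assumption: negative indexes are resolved relative to the end of the value.
      int start = r[0] < 0 ? length + r[0] : r[0];
      int end = r[1] < 0 ? length + r[1] : r[1];
      if (idx >= start && idx <= end) {
        return true;
      }
    }
    return false;
  }
}
```

    Under these assumptions, the spec "0:3,-4:-1" applied to a 16-character value reports positions 0 to 3 and 12 to 15 as unmasked. A reversed range such as "-1:-5" or a range past the end of the value matches no position, so the whole value stays masked.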
    
      


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/moresandeep/orc ORC-256-Unmask_Range_Option

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #184
    
----
commit 2144ca0e360e3885d39ec19cad152ec9d1b6e69e
Author: Sandeep More <mo...@apache.org>
Date:   2017-10-28T14:41:41Z

    ORC-256 - Add unmasked ranges option for redact mask
    
    Signed-off-by: Sandeep More <mo...@apache.org>

commit 7f8443c8182228e1ef000f3fc1ecd65fdd85b69f
Author: Sandeep More <mo...@apache.org>
Date:   2017-10-28T16:08:12Z

    ORC-256 - Minor fixes

----


---

[GitHub] orc issue #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on the issue:

    https://github.com/apache/orc/pull/184
  
    @omalley I updated the PR with your suggestions. Thanks for the review!


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r155629181
  
    --- Diff: java/core/src/test/org/apache/orc/impl/mask/TestUnmaskRange.java ---
    @@ -0,0 +1,165 @@
    +package org.apache.orc.impl.mask;
    +
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with this
    + * work for additional information regarding copyright ownership.  The ASF
    + * licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License.
    + * You may obtain a copy of the License at
    + * <p>
    + * http://www.apache.org/licenses/LICENSE-2.0
    + * <p>
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    + * License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
    +import org.junit.Test;
    +
    +import java.nio.charset.StandardCharsets;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +/**
    + * Test Unmask option
    + */
    +public class TestUnmaskRange {
    +
    +  public TestUnmaskRange() {
    +    super();
    +  }
    +
    +  /* Test for Long */
    +  @Test
    +  public void testSimpleLongRangeMask() {
    +    RedactMaskFactory mask = new RedactMaskFactory("9", "", "0:2");
    +    long result = mask.maskLong(123456);
    +    assertEquals(123_999, result);
    +
    +    // negative index
    +    mask = new RedactMaskFactory("9", "", "-3:-1");
    +    result = mask.maskLong(123456);
    +    assertEquals(999_456, result);
    +
    +    // out of range mask, return the original mask
    +    mask = new RedactMaskFactory("9", "", "7:10");
    +    result = mask.maskLong(123456);
    +    assertEquals(999999, result);
    +
    +  }
    +
    +  @Test
    +  public void testDefaultRangeMask() {
    +    RedactMaskFactory mask = new RedactMaskFactory("9", "", "");
    +    long result = mask.maskLong(123456);
    +    assertEquals(999999, result);
    +
    +    mask = new RedactMaskFactory("9");
    +    result = mask.maskLong(123456);
    +    assertEquals(999999, result);
    +
    +  }
    +
    +  @Test
    +  public void testCCRangeMask() {
    +    long cc = 4716885592186382L;
    +    long maskedCC = 4716_77777777_6382L;
    +    // Range unmask for first 4 and last 4 of credit card number
    +    final RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:3,-4:-1");
    +    long result = mask.maskLong(cc);
    +
    +    assertEquals(String.valueOf(cc).length(), String.valueOf(result).length());
    +    assertEquals(4716_77777777_6382L, result);
    +  }
    +
    +  /* Tests for Double */
    +  @Test
    +  public void testSimpleDoubleRangeMask() {
    +    RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:2");
    +    assertEquals(1237.77, mask.maskDouble(1234.99), 0.000001);
    +    assertEquals(12377.7, mask.maskDouble(12345.9), 0.000001);
    +
    +    mask = new RedactMaskFactory("Xx7", "", "-3:-1");
    +    assertEquals(7774.9, mask.maskDouble(1234.9), 0.000001);
    +
    +  }
    +
    +  /* test for String */
    +  @Test
    +  public void testStringRangeMask() {
    +
    +    BytesColumnVector source = new BytesColumnVector();
    +    BytesColumnVector target = new BytesColumnVector();
    +    target.reset();
    +
    +    byte[] input = "Mary had 1 little lamb!!".getBytes(StandardCharsets.UTF_8);
    +    source.setRef(0, input, 0, input.length);
    +
    +    // Set a 4 byte chinese character (U+2070E), which is letter other
    +    input = "\uD841\uDF0E".getBytes(StandardCharsets.UTF_8);
    +    source.setRef(1, input, 0, input.length);
    +
    +    RedactMaskFactory mask = new RedactMaskFactory("", "", "0:3, -5:-1");
    +    for(int r=0; r < 2; ++r) {
    +      mask.maskString(source, r, target);
    +    }
    +
    +    assertEquals("Mary xxx 9 xxxxxx xamb!!", new String(target.vector[0],
    +        target.start[0], target.length[0], StandardCharsets.UTF_8));
    +    assertEquals("\uD841\uDF0E", new String(target.vector[1],
    +        target.start[1], target.length[1], StandardCharsets.UTF_8));
    +
    +    // test defaults, no-unmask range
    +    mask = new RedactMaskFactory();
    +    for(int r=0; r < 2; ++r) {
    +      mask.maskString(source, r, target);
    +    }
    +
    +    assertEquals("Xxxx xxx 9 xxxxxx xxxx..", new String(target.vector[0],
    +        target.start[0], target.length[0], StandardCharsets.UTF_8));
    +    assertEquals("ª", new String(target.vector[1],
    +        target.start[1], target.length[1], StandardCharsets.UTF_8));
    +
    +
    +    // test out of range string mask
    +    mask = new RedactMaskFactory("", "", "-1:-5");
    +    for(int r=0; r < 2; ++r) {
    +      mask.maskString(source, r, target);
    +    }
    +
    +    assertEquals("Xxxx xxx 9 xxxxxx xxxx..", new String(target.vector[0],
    +        target.start[0], target.length[0], StandardCharsets.UTF_8));
    +    assertEquals("ª", new String(target.vector[1],
    +        target.start[1], target.length[1], StandardCharsets.UTF_8));
    +
    +  }
    +
    +  /* test for Decimal */
    +  @Test
    +  public void testDecimalRangeMask() {
    +
    +    RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:3");
    +    assertEquals(new HiveDecimalWritable("123477.777"),
    +        mask.maskDecimal(new HiveDecimalWritable("123456.789")));
    +
    +    // try with a reverse index
    +    mask = new RedactMaskFactory("Xx7", "", "-3:-1, 0:3");
    +    assertEquals(new HiveDecimalWritable("123477777.777654"),
    +        mask.maskDecimal(new HiveDecimalWritable("123456789.987654")));
    +
    +    // test removal of leading and  trailing zeros.
    +    /*
    +    assertEquals(new HiveDecimalWritable("777777777777777777.7777"),
    +        mask.maskDecimal(new HiveDecimalWritable("0123456789123456789.01230")));
    +        */
    +
    --- End diff --
    
    ok, will do.


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r154761126
  
    --- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
    @@ -114,6 +120,10 @@
       private final boolean maskDate;
       private final boolean maskTimestamp;
     
    +  // index tuples that are not to be masked
    +  private final SortedMap<Integer,Integer> unmaskIndexRanges = Collections.synchronizedSortedMap(new TreeMap());
    --- End diff --
    
    Any particular reason that you need a synchronized map here?


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r155629073
  
    --- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
    @@ -245,8 +271,8 @@ public void maskData(ColumnVector original, ColumnVector masked, int start,
             target.isNull[0] = source.isNull[0];
           } else {
             for(int r = start; r < start + length; ++r) {
    -          target.vector[r] = maskLong(source.vector[r]) & mask;
    -          target.isNull[r] = source.isNull[r];
    +            target.vector[r] = maskLong(source.vector[r]) & mask;
    --- End diff --
    
    Sure 


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r154762146
  
    --- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
    @@ -619,7 +646,7 @@ public double maskDouble(double value) {
         } else if (posn < 0) {
           posn = -posn -2;
         }
    -    return DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn];
    +    return unmaskRangeDoubleValue(value,DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn]);
    --- End diff --
    
    Add space after comma.


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r155628665
  
    --- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
    @@ -114,6 +120,10 @@
       private final boolean maskDate;
       private final boolean maskTimestamp;
     
    +  // index tuples that are not to be masked
    +  private final SortedMap<Integer,Integer> unmaskIndexRanges = Collections.synchronizedSortedMap(new TreeMap());
    --- End diff --
    
    Hello @xndai,
    Thanks for the review. I was trying to be cautious, but I can get rid of it.
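
    One way to drop the synchronized wrapper is sketched below. It assumes the ranges are written only while the object is being constructed and are read-only afterwards, which this thread does not confirm. The class UnmaskRangeHolderSketch and its ranges() accessor are hypothetical names used only for this illustration.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical holder for illustration; not the RedactMaskFactory code from this PR. */
final class UnmaskRangeHolderSketch {

  // Filled once in the constructor and never modified afterwards.
  private final SortedMap<Integer, Integer> unmaskIndexRanges;

  UnmaskRangeHolderSketch(String spec) {
    SortedMap<Integer, Integer> ranges = new TreeMap<>();
    if (spec != null && !spec.trim().isEmpty()) {
      for (String part : spec.split(",")) {
        String[] bounds = part.trim().split(":");
        ranges.put(Integer.parseInt(bounds[0].trim()),
            Integer.parseInt(bounds[1].trim()));
      }
    }
    // The read-only view makes the "no writes after construction" assumption explicit.
    this.unmaskIndexRanges = Collections.unmodifiableSortedMap(ranges);
  }

  SortedMap<Integer, Integer> ranges() {
    return unmaskIndexRanges;
  }
}
```

    Because the field is final and the map is never changed after the constructor returns, readers on other threads do not need a synchronized wrapper, and the parameterized TreeMap<> also avoids the raw type in the quoted line.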


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/184


---

[GitHub] orc issue #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on the issue:

    https://github.com/apache/orc/pull/184
  
    Updated the PR. The changes are as follows:
    1. Fixed the FindBugs issue.
    2. Merged the feature into a single commit.


---

[GitHub] orc issue #184: Orc 256 unmask range option

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/184
  
    I think we should change the processing for numerics (when there are unmasked ranges) to be:
    
    unmasked number -> string -> mask as string -> masked number
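
    A minimal sketch of that pipeline for long values, assuming inclusive ranges with negative indexes counted from the end, as the tests quoted in this thread suggest. The class NumericViaStringSketch and its maskLong signature are hypothetical and are not the RedactMaskFactory API.

```java
/** Hypothetical sketch of "number -> string -> mask as string -> number". */
final class NumericViaStringSketch {

  /**
   * Replaces every decimal digit of value with replacement, except digits whose
   * position falls in one of the keep ranges ({start, end}, assumed inclusive).
   * A masked result that no longer fits in a long is not handled here and
   * surfaces as a NumberFormatException from Long.parseLong.
   */
  static long maskLong(long value, char replacement, int[][] keep) {
    // Step 1: number -> string (the sign is handled separately from the digits).
    String text = Long.toString(value);
    boolean negative = text.startsWith("-");
    String digits = negative ? text.substring(1) : text;
    int n = digits.length();

    // Step 2: mask as a string, position by position.
    StringBuilder out = new StringBuilder(n + 1);
    if (negative) {
      out.append('-');
    }
    for (int i = 0; i < n; i++) {
      out.append(isKept(keep, i, n) ? digits.charAt(i) : replacement);
    }

    // Step 3: string -> masked number.
    return Long.parseLong(out.toString());
  }

  private static boolean isKept(int[][] keep, int idx, int n) {
    for (int[] r : keep) {
      int start = r[0] < 0 ? n + r[0] : r[0];
      int end = r[1] < 0 ? n + r[1] : r[1];
      if (idx >= start && idx <= end) {
        return true;
      }
    }
    return false;
  }
}
```

    For example, maskLong(123456L, '9', new int[][] {{0, 2}}) keeps the first three digits and yields 123999, the value the long test in this PR expects for the range "0:2". Going through the string form lets numerics share the position logic used for strings.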


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r155623157
  
    --- Diff: java/core/src/test/org/apache/orc/impl/mask/TestUnmaskRange.java ---
    @@ -0,0 +1,165 @@
    +package org.apache.orc.impl.mask;
    +
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with this
    + * work for additional information regarding copyright ownership.  The ASF
    + * licenses this file to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License.
    + * You may obtain a copy of the License at
    + * <p>
    + * http://www.apache.org/licenses/LICENSE-2.0
    + * <p>
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    + * License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +
    +import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    +import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
    +import org.junit.Test;
    +
    +import java.nio.charset.StandardCharsets;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +/**
    + * Test Unmask option
    + */
    +public class TestUnmaskRange {
    +
    +  public TestUnmaskRange() {
    +    super();
    +  }
    +
    +  /* Test for Long */
    +  @Test
    +  public void testSimpleLongRangeMask() {
    +    RedactMaskFactory mask = new RedactMaskFactory("9", "", "0:2");
    +    long result = mask.maskLong(123456);
    +    assertEquals(123_999, result);
    +
    +    // negative index
    +    mask = new RedactMaskFactory("9", "", "-3:-1");
    +    result = mask.maskLong(123456);
    +    assertEquals(999_456, result);
    +
    +    // out of range mask, return the original mask
    +    mask = new RedactMaskFactory("9", "", "7:10");
    +    result = mask.maskLong(123456);
    +    assertEquals(999999, result);
    +
    +  }
    +
    +  @Test
    +  public void testDefaultRangeMask() {
    +    RedactMaskFactory mask = new RedactMaskFactory("9", "", "");
    +    long result = mask.maskLong(123456);
    +    assertEquals(999999, result);
    +
    +    mask = new RedactMaskFactory("9");
    +    result = mask.maskLong(123456);
    +    assertEquals(999999, result);
    +
    +  }
    +
    +  @Test
    +  public void testCCRangeMask() {
    +    long cc = 4716885592186382L;
    +    long maskedCC = 4716_77777777_6382L;
    +    // Range unmask for first 4 and last 4 of credit card number
    +    final RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:3,-4:-1");
    +    long result = mask.maskLong(cc);
    +
    +    assertEquals(String.valueOf(cc).length(), String.valueOf(result).length());
    +    assertEquals(4716_77777777_6382L, result);
    +  }
    +
    +  /* Tests for Double */
    +  @Test
    +  public void testSimpleDoubleRangeMask() {
    +    RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:2");
    +    assertEquals(1237.77, mask.maskDouble(1234.99), 0.000001);
    +    assertEquals(12377.7, mask.maskDouble(12345.9), 0.000001);
    +
    +    mask = new RedactMaskFactory("Xx7", "", "-3:-1");
    +    assertEquals(7774.9, mask.maskDouble(1234.9), 0.000001);
    +
    +  }
    +
    +  /* test for String */
    +  @Test
    +  public void testStringRangeMask() {
    +
    +    BytesColumnVector source = new BytesColumnVector();
    +    BytesColumnVector target = new BytesColumnVector();
    +    target.reset();
    +
    +    byte[] input = "Mary had 1 little lamb!!".getBytes(StandardCharsets.UTF_8);
    +    source.setRef(0, input, 0, input.length);
    +
    +    // Set a 4 byte chinese character (U+2070E), which is letter other
    +    input = "\uD841\uDF0E".getBytes(StandardCharsets.UTF_8);
    +    source.setRef(1, input, 0, input.length);
    +
    +    RedactMaskFactory mask = new RedactMaskFactory("", "", "0:3, -5:-1");
    +    for(int r=0; r < 2; ++r) {
    +      mask.maskString(source, r, target);
    +    }
    +
    +    assertEquals("Mary xxx 9 xxxxxx xamb!!", new String(target.vector[0],
    +        target.start[0], target.length[0], StandardCharsets.UTF_8));
    +    assertEquals("\uD841\uDF0E", new String(target.vector[1],
    +        target.start[1], target.length[1], StandardCharsets.UTF_8));
    +
    +    // test defaults, no-unmask range
    +    mask = new RedactMaskFactory();
    +    for(int r=0; r < 2; ++r) {
    +      mask.maskString(source, r, target);
    +    }
    +
    +    assertEquals("Xxxx xxx 9 xxxxxx xxxx..", new String(target.vector[0],
    +        target.start[0], target.length[0], StandardCharsets.UTF_8));
    +    assertEquals("ª", new String(target.vector[1],
    +        target.start[1], target.length[1], StandardCharsets.UTF_8));
    +
    +
    +    // test out of range string mask
    +    mask = new RedactMaskFactory("", "", "-1:-5");
    +    for(int r=0; r < 2; ++r) {
    +      mask.maskString(source, r, target);
    +    }
    +
    +    assertEquals("Xxxx xxx 9 xxxxxx xxxx..", new String(target.vector[0],
    +        target.start[0], target.length[0], StandardCharsets.UTF_8));
    +    assertEquals("ª", new String(target.vector[1],
    +        target.start[1], target.length[1], StandardCharsets.UTF_8));
    +
    +  }
    +
    +  /* test for Decimal */
    +  @Test
    +  public void testDecimalRangeMask() {
    +
    +    RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:3");
    +    assertEquals(new HiveDecimalWritable("123477.777"),
    +        mask.maskDecimal(new HiveDecimalWritable("123456.789")));
    +
    +    // try with a reverse index
    +    mask = new RedactMaskFactory("Xx7", "", "-3:-1, 0:3");
    +    assertEquals(new HiveDecimalWritable("123477777.777654"),
    +        mask.maskDecimal(new HiveDecimalWritable("123456789.987654")));
    +
    +    // test removal of leading and  trailing zeros.
    +    /*
    +    assertEquals(new HiveDecimalWritable("777777777777777777.7777"),
    +        mask.maskDecimal(new HiveDecimalWritable("0123456789123456789.01230")));
    +        */
    +
    --- End diff --
    
    remove empty lines. Same for other places.


---

[GitHub] orc issue #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on the issue:

    https://github.com/apache/orc/pull/184
  
    @xndai @omalley I updated the PR with the suggested changes. Let me know if you have any questions.


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r154761512
  
    --- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
    @@ -245,8 +271,8 @@ public void maskData(ColumnVector original, ColumnVector masked, int start,
             target.isNull[0] = source.isNull[0];
           } else {
             for(int r = start; r < start + length; ++r) {
    -          target.vector[r] = maskLong(source.vector[r]) & mask;
    -          target.isNull[r] = source.isNull[r];
    +            target.vector[r] = maskLong(source.vector[r]) & mask;
    --- End diff --
    
    Remove leading space. Same as below.


---

[GitHub] orc pull request #184: Orc 256 unmask range option

Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:

    https://github.com/apache/orc/pull/184#discussion_r155629140
  
    --- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
    @@ -619,7 +646,7 @@ public double maskDouble(double value) {
         } else if (posn < 0) {
           posn = -posn -2;
         }
    -    return DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn];
    +    return unmaskRangeDoubleValue(value,DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn]);
    --- End diff --
    
    ok, will do.


---