Posted to dev@orc.apache.org by moresandeep <gi...@git.apache.org> on 2017/10/28 23:46:07 UTC
[GitHub] orc pull request #184: Orc 256 unmask range option
GitHub user moresandeep opened a pull request:
https://github.com/apache/orc/pull/184
Orc 256 unmask range option
This PR adds an unmask range option to the redact mask (ORC-256).
1. The redact mask accepts an additional option (the third option in this case) that configures which ranges of a string are left unmasked.
2. The option looks like "0:4,-4:-1".
3. Range unmasking is available for Long, Double, String, and Decimal values.
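As a rough illustration of the option format, here is a minimal sketch of how a range string could be parsed and applied to a string of characters. It is not the code in this PR: the class and method names (UnmaskRangeSketch, parseRanges, maskExcept) are hypothetical, and two behaviors are assumptions inferred from the tests in the diff, namely that both ends of a range are inclusive with -1 meaning the last character, and that a range that is reversed or not fully inside the value is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the ORC code base.
class UnmaskRangeSketch {

  /** Parses a spec such as "0:3,-4:-1" into {start, end} pairs, both inclusive. */
  static List<int[]> parseRanges(String spec) {
    List<int[]> ranges = new ArrayList<>();
    if (spec == null || spec.trim().isEmpty()) {
      return ranges;  // no option given: nothing stays unmasked
    }
    for (String part : spec.split(",")) {
      String[] ends = part.trim().split(":");
      if (ends.length != 2) {
        throw new IllegalArgumentException("Bad range: " + part);
      }
      ranges.add(new int[] {
          Integer.parseInt(ends[0].trim()), Integer.parseInt(ends[1].trim())});
    }
    return ranges;
  }

  /** Replaces every character with the replacement unless a range covers it. */
  static String maskExcept(String value, char replacement, String spec) {
    int length = value.length();
    boolean[] keep = new boolean[length];
    for (int[] range : parseRanges(spec)) {
      // A negative index counts from the end: -1 is the last character.
      int start = range[0] < 0 ? length + range[0] : range[0];
      int end = range[1] < 0 ? length + range[1] : range[1];
      // Assumption: reversed or out-of-bounds ranges are skipped entirely.
      if (start < 0 || end >= length || start > end) {
        continue;
      }
      for (int i = start; i <= end; ++i) {
        keep[i] = true;
      }
    }
    StringBuilder out = new StringBuilder(length);
    for (int i = 0; i < length; ++i) {
      out.append(keep[i] ? value.charAt(i) : replacement);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Keep the first four and last four digits of a card number.
    System.out.println(maskExcept("4716885592186382", '7', "0:3,-4:-1"));
    // A range outside the value leaves everything masked.
    System.out.println(maskExcept("123456", '9', "7:10"));
  }
}
```

The sketch indexes Java chars for brevity. A real string mask would have to index Unicode code points instead, since the tests include a character outside the Basic Multilingual Plane, and it would pick the replacement per character class rather than use a single fixed character.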
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/moresandeep/orc ORC-256-Unmask_Range_Option
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/orc/pull/184.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #184
----
commit 2144ca0e360e3885d39ec19cad152ec9d1b6e69e
Author: Sandeep More <mo...@apache.org>
Date: 2017-10-28T14:41:41Z
ORC-256 - Add unmasked ranges option for redact mask
Signed-off-by: Sandeep More <mo...@apache.org>
commit 7f8443c8182228e1ef000f3fc1ecd65fdd85b69f
Author: Sandeep More <mo...@apache.org>
Date: 2017-10-28T16:08:12Z
ORC-256 - Minor fixes
----
---
[GitHub] orc issue #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on the issue:
https://github.com/apache/orc/pull/184
@omalley I updated the PR with your suggestions. Thanks for the review!
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r155629181
--- Diff: java/core/src/test/org/apache/orc/impl/mask/TestUnmaskRange.java ---
@@ -0,0 +1,165 @@
+package org.apache.orc.impl.mask;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership. The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
+import org.junit.Test;
+
+import java.nio.charset.StandardCharsets;
+
+import static org.junit.Assert.assertEquals;
+
+/**
+ * Test Unmask option
+ */
+public class TestUnmaskRange {
+
+ public TestUnmaskRange() {
+ super();
+ }
+
+ /* Test for Long */
+ @Test
+ public void testSimpleLongRangeMask() {
+ RedactMaskFactory mask = new RedactMaskFactory("9", "", "0:2");
+ long result = mask.maskLong(123456);
+ assertEquals(123_999, result);
+
+ // negative index
+ mask = new RedactMaskFactory("9", "", "-3:-1");
+ result = mask.maskLong(123456);
+ assertEquals(999_456, result);
+
+ // out of range mask, return the original mask
+ mask = new RedactMaskFactory("9", "", "7:10");
+ result = mask.maskLong(123456);
+ assertEquals(999999, result);
+
+ }
+
+ @Test
+ public void testDefaultRangeMask() {
+ RedactMaskFactory mask = new RedactMaskFactory("9", "", "");
+ long result = mask.maskLong(123456);
+ assertEquals(999999, result);
+
+ mask = new RedactMaskFactory("9");
+ result = mask.maskLong(123456);
+ assertEquals(999999, result);
+
+ }
+
+ @Test
+ public void testCCRangeMask() {
+ long cc = 4716885592186382L;
+ long maskedCC = 4716_77777777_6382L;
+ // Range unmask for first 4 and last 4 of credit card number
+ final RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:3,-4:-1");
+ long result = mask.maskLong(cc);
+
+ assertEquals(String.valueOf(cc).length(), String.valueOf(result).length());
+ assertEquals(4716_77777777_6382L, result);
+ }
+
+ /* Tests for Double */
+ @Test
+ public void testSimpleDoubleRangeMask() {
+ RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:2");
+ assertEquals(1237.77, mask.maskDouble(1234.99), 0.000001);
+ assertEquals(12377.7, mask.maskDouble(12345.9), 0.000001);
+
+ mask = new RedactMaskFactory("Xx7", "", "-3:-1");
+ assertEquals(7774.9, mask.maskDouble(1234.9), 0.000001);
+
+ }
+
+ /* test for String */
+ @Test
+ public void testStringRangeMask() {
+
+ BytesColumnVector source = new BytesColumnVector();
+ BytesColumnVector target = new BytesColumnVector();
+ target.reset();
+
+ byte[] input = "Mary had 1 little lamb!!".getBytes(StandardCharsets.UTF_8);
+ source.setRef(0, input, 0, input.length);
+
+ // Set a 4 byte chinese character (U+2070E), which is letter other
+ input = "\uD841\uDF0E".getBytes(StandardCharsets.UTF_8);
+ source.setRef(1, input, 0, input.length);
+
+ RedactMaskFactory mask = new RedactMaskFactory("", "", "0:3, -5:-1");
+ for(int r=0; r < 2; ++r) {
+ mask.maskString(source, r, target);
+ }
+
+ assertEquals("Mary xxx 9 xxxxxx xamb!!", new String(target.vector[0],
+ target.start[0], target.length[0], StandardCharsets.UTF_8));
+ assertEquals("\uD841\uDF0E", new String(target.vector[1],
+ target.start[1], target.length[1], StandardCharsets.UTF_8));
+
+ // test defaults, no-unmask range
+ mask = new RedactMaskFactory();
+ for(int r=0; r < 2; ++r) {
+ mask.maskString(source, r, target);
+ }
+
+ assertEquals("Xxxx xxx 9 xxxxxx xxxx..", new String(target.vector[0],
+ target.start[0], target.length[0], StandardCharsets.UTF_8));
+ assertEquals("ª", new String(target.vector[1],
+ target.start[1], target.length[1], StandardCharsets.UTF_8));
+
+
+ // test out of range string mask
+ mask = new RedactMaskFactory("", "", "-1:-5");
+ for(int r=0; r < 2; ++r) {
+ mask.maskString(source, r, target);
+ }
+
+ assertEquals("Xxxx xxx 9 xxxxxx xxxx..", new String(target.vector[0],
+ target.start[0], target.length[0], StandardCharsets.UTF_8));
+ assertEquals("ª", new String(target.vector[1],
+ target.start[1], target.length[1], StandardCharsets.UTF_8));
+
+ }
+
+ /* test for Decimal */
+ @Test
+ public void testDecimalRangeMask() {
+
+ RedactMaskFactory mask = new RedactMaskFactory("Xx7", "", "0:3");
+ assertEquals(new HiveDecimalWritable("123477.777"),
+ mask.maskDecimal(new HiveDecimalWritable("123456.789")));
+
+ // try with a reverse index
+ mask = new RedactMaskFactory("Xx7", "", "-3:-1, 0:3");
+ assertEquals(new HiveDecimalWritable("123477777.777654"),
+ mask.maskDecimal(new HiveDecimalWritable("123456789.987654")));
+
+ // test removal of leading and trailing zeros.
+ /*
+ assertEquals(new HiveDecimalWritable("777777777777777777.7777"),
+ mask.maskDecimal(new HiveDecimalWritable("0123456789123456789.01230")));
+ */
+
--- End diff --
ok, will do.
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r154761126
--- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -114,6 +120,10 @@
private final boolean maskDate;
private final boolean maskTimestamp;
+ // index tuples that are not to be masked
+ private final SortedMap<Integer,Integer> unmaskIndexRanges = Collections.synchronizedSortedMap(new TreeMap());
--- End diff --
Any particular reason that you need a synchronized map here?
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r155629073
--- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -245,8 +271,8 @@ public void maskData(ColumnVector original, ColumnVector masked, int start,
target.isNull[0] = source.isNull[0];
} else {
for(int r = start; r < start + length; ++r) {
- target.vector[r] = maskLong(source.vector[r]) & mask;
- target.isNull[r] = source.isNull[r];
+ target.vector[r] = maskLong(source.vector[r]) & mask;
--- End diff --
Sure
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r154762146
--- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -619,7 +646,7 @@ public double maskDouble(double value) {
} else if (posn < 0) {
posn = -posn -2;
}
- return DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn];
+ return unmaskRangeDoubleValue(value,DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn]);
--- End diff --
Add space after comma.
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r155628665
--- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -114,6 +120,10 @@
private final boolean maskDate;
private final boolean maskTimestamp;
+ // index tuples that are not to be masked
+ private final SortedMap<Integer,Integer> unmaskIndexRanges = Collections.synchronizedSortedMap(new TreeMap());
--- End diff --
Hello @xndai,
Thanks for the review. I was trying to be cautious, but I can get rid of it.
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/orc/pull/184
---
[GitHub] orc issue #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on the issue:
https://github.com/apache/orc/pull/184
Updated the PR. The changes are as follows:
1. Fixed the FindBugs issue.
2. Merged the feature into a single commit.
---
[GitHub] orc issue #184: Orc 256 unmask range option
Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the issue:
https://github.com/apache/orc/pull/184
I think we should change the processing for numerics (when there are unmasked ranges) to be:
unmasked number -> string -> mask as string -> masked number
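A minimal sketch of that pipeline for a long value, assuming a single inclusive index range and one replacement digit. The class and method names (NumericViaStringSketch, maskLongViaString) are hypothetical and not taken from the PR. Treating the sign separately, so that index 0 is the first digit, is also an assumption.

```java
// Hypothetical helper, not part of the ORC code base.
class NumericViaStringSketch {

  /**
   * Masks a long by going through its decimal string form.
   * Digits at positions keepStart..keepEnd (inclusive, counted from the
   * first digit) are preserved; every other digit becomes maskDigit.
   */
  static long maskLongViaString(long value, char maskDigit,
                                int keepStart, int keepEnd) {
    // unmasked number -> string
    String text = Long.toString(value);
    boolean negative = text.startsWith("-");
    String digits = negative ? text.substring(1) : text;

    // mask as string
    StringBuilder masked = new StringBuilder(text.length());
    if (negative) {
      masked.append('-');  // assumption: the sign is never masked
    }
    for (int i = 0; i < digits.length(); ++i) {
      boolean keep = i >= keepStart && i <= keepEnd;
      masked.append(keep ? digits.charAt(i) : maskDigit);
    }

    // string -> masked number; this throws NumberFormatException if the
    // masked digits no longer fit in a long (e.g. nineteen 9s)
    return Long.parseLong(masked.toString());
  }

  public static void main(String[] args) {
    System.out.println(maskLongViaString(123456L, '9', 0, 2));
    System.out.println(maskLongViaString(-123456L, '9', 0, 2));
  }
}
```

One appeal of this approach is that the index arithmetic is shared with the string mask. The overflow case noted in the comment would need an explicit policy in real code.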
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r155623157
--- Diff: java/core/src/test/org/apache/orc/impl/mask/TestUnmaskRange.java ---
@@ -0,0 +1,165 @@
--- End diff --
Remove empty lines. Same for other places.
---
[GitHub] orc issue #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on the issue:
https://github.com/apache/orc/pull/184
@xndai @omalley I updated the PR with the suggested changes. Let me know if you have any questions.
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r154761512
--- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -245,8 +271,8 @@ public void maskData(ColumnVector original, ColumnVector masked, int start,
target.isNull[0] = source.isNull[0];
} else {
for(int r = start; r < start + length; ++r) {
- target.vector[r] = maskLong(source.vector[r]) & mask;
- target.isNull[r] = source.isNull[r];
+ target.vector[r] = maskLong(source.vector[r]) & mask;
--- End diff --
Remove leading space. Same as below.
---
[GitHub] orc pull request #184: Orc 256 unmask range option
Posted by moresandeep <gi...@git.apache.org>.
Github user moresandeep commented on a diff in the pull request:
https://github.com/apache/orc/pull/184#discussion_r155629140
--- Diff: java/core/src/java/org/apache/orc/impl/mask/RedactMaskFactory.java ---
@@ -619,7 +646,7 @@ public double maskDouble(double value) {
} else if (posn < 0) {
posn = -posn -2;
}
- return DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn];
+ return unmaskRangeDoubleValue(value,DIGIT_REPLACEMENT * base * DOUBLE_POWER_10[posn]);
--- End diff --
ok, will do.
---