You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/09/03 21:10:00 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

RussellSpitzer opened a new pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420


   This is one of two WIP PR's to demonstrate some approaches to parallelizing the job planning phase of Spark Reads.
   
   To add distribtued manifest reading to table scans, we allow for a ManifestProcessor
   to be used in a TableScan. This class is utilzied when ManifestGroup processes
   Manifest files. The default implementation mimics the current code by just wrapping
   the reading code in a CloseableIterable. A distributed Spark implementation is also
   provided which reads all of the manifests remotely before returning valid entries
   for further processing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#issuecomment-702338717


   We decided to go with the other approach, Goodnight sweet prince


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r496776208



##########
File path: api/src/main/java/org/apache/iceberg/TableScan.java
##########
@@ -94,6 +94,11 @@
    */
   TableScan includeColumnStats();
 
+  /**
+   * Doc doc doc
+   */
+  TableScan withManifestProcessor(ManifestProcessor processor);

Review comment:
       Of course :) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r485983344



##########
File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
##########
@@ -131,6 +133,9 @@ public TableScan useSnapshot(long scanSnapshotId) {
         ops, table, schema, context.useSnapshotId(scanSnapshotId));
   }
 
+
+
+

Review comment:
       Nit: unnecessary whitespace change




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r487581440



##########
File path: api/src/main/java/org/apache/iceberg/ManifestEntry.java
##########
@@ -25,7 +20,7 @@
 import static org.apache.iceberg.types.Types.NestedField.optional;
 import static org.apache.iceberg.types.Types.NestedField.required;
 
-interface ManifestEntry<F extends ContentFile<F>> {
+public interface ManifestEntry<F extends ContentFile<F>> {

Review comment:
       Can you expand on why making `ManifestEntry` public makes the API harder to use? Is it just because we then have to account for the fact that this is a public API?
   
   I personally am more fond of this proof of concept as opposed to the other one, so I'd love to get insight into your comment here.
   
   If it's just because we would then have to treat this API public, is it possible something could be done to mark it as private to iceberg, equivalent to scala's `private[iceberg]`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r483959167



##########
File path: api/src/main/java/org/apache/iceberg/ManifestProcessor.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.function.BiFunction;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+
+public abstract class ManifestProcessor implements Serializable {
+    public abstract <T extends ContentFile<T>> Iterable<CloseableIterable<ManifestEntry<T>>> readManifests(final Iterable<ManifestFile> fromIterable,
+        BiFunction<ManifestFile, FileIO, CloseableIterable<ManifestEntry<T>>> reader);
+
+    /**
+     * A Helper interface for making lambdas transform into the correct type for the ManfiestProcessor
+     * @param <T> The ManifestEntry Type being read from Manifest files
+     */
+    public interface Func<T extends ContentFile<T>> extends BiFunction<ManifestFile, FileIO,
+        CloseableIterable<ManifestEntry<T>>>, Serializable {}

Review comment:
       Check out the other PR too :) And if you have another approach I'd be glad to look into that as well.  This is in my opinion definitely a more kludgy approach in Java. The only reason we need this interface is because otherwise you have to do 
   (All of that type info) lambda, this is basically a type alias java will use to covert the lambda into a real function.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r496775847



##########
File path: api/src/main/java/org/apache/iceberg/ManifestEntry.java
##########
@@ -1,20 +1,15 @@
 /*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
  *
- *   http://www.apache.org/licenses/LICENSE-2.0
+ *     http://www.apache.org/licenses/LICENSE-2.0
  *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.

Review comment:
       yep sorry, I was just trying to get to a prototype real fast here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r485983175



##########
File path: api/src/main/java/org/apache/iceberg/ManifestEntry.java
##########
@@ -25,7 +20,7 @@
 import static org.apache.iceberg.types.Types.NestedField.optional;
 import static org.apache.iceberg.types.Types.NestedField.required;
 
-interface ManifestEntry<F extends ContentFile<F>> {
+public interface ManifestEntry<F extends ContentFile<F>> {

Review comment:
       We've really tried to avoid making `ManifestEntry` public because exposing it makes the API much harder to use.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r484609843



##########
File path: api/src/main/java/org/apache/iceberg/TableScan.java
##########
@@ -94,6 +94,11 @@
    */
   TableScan includeColumnStats();
 
+  /**
+   * Doc doc doc
+   */
+  TableScan withManifestProcessor(ManifestProcessor processor);

Review comment:
       I see what you did here to get past the linter ;-)
   
   If we decide to use this approach, let's be sure to not merge in `Doc doc doc` but instead a more descriptive comment or even a `TODO` comment. However since this is a WIP it's not worth your time to explain a pretty self documenting method name for the purpose of appeasing the linter. Leaving this comment so we don't forget to update if we do merge this in. :)

##########
File path: api/src/main/java/org/apache/iceberg/ManifestProcessor.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.function.BiFunction;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+
+public abstract class ManifestProcessor implements Serializable {
+    public abstract <T extends ContentFile<T>> Iterable<CloseableIterable<ManifestEntry<T>>> readManifests(final Iterable<ManifestFile> fromIterable,
+        BiFunction<ManifestFile, FileIO, CloseableIterable<ManifestEntry<T>>> reader);
+
+    /**
+     * A Helper interface for making lambdas transform into the correct type for the ManfiestProcessor
+     * @param <T> The ManifestEntry Type being read from Manifest files
+     */
+    public interface Func<T extends ContentFile<T>> extends BiFunction<ManifestFile, FileIO,
+        CloseableIterable<ManifestEntry<T>>>, Serializable {}

Review comment:
       Ah. Ok. Thank you so much! As I've mentioned, in my day job I'm more of a scala developer so I'm much more used to working with all of the well enriched type stuff that's available there. Thankfully I've finally stood all of this up on my own K8s cluster somewhere so I can continue to get more experienced with practical usage of Iceberg (especially between query systems).
   
   When you explain it that way though, it makes perfect sense. :)
   
   As for the approach being kludgy, if it gets the job done / speeds up manifest reading and makes things like exception stack traces etc more readable than a very large number of lambda functions, I'm not necessarily opposed at all. As a person that spends a lot of time in their current position as a sort of developer advocate helping people with the more esoteric parts of Spark and Flink etc, I can absolutely get behind more simplified type info and less concern with the serializability of functions, especially if the lambdas get raised to real functions so that they have more readable stack traces. :)
   
   I'll be sure to check out the other approach. And if I can come up with any approaches or modifications to this approach that might be cleaner, I'll definitely let you know. I think that java doc comment on the interface is likely sufficient to explain `Func`. It's possible that maybe the relatively generic name `Func` is what tripped me up, but I can't think of a better name that doesn't conflict with other current Iceberg concepts (e.g. Transformers, etc).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r487581213



##########
File path: api/src/main/java/org/apache/iceberg/ManifestEntry.java
##########
@@ -1,20 +1,15 @@
 /*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
  *
- *   http://www.apache.org/licenses/LICENSE-2.0
+ *     http://www.apache.org/licenses/LICENSE-2.0
  *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.

Review comment:
       My understanding is that this is just a proof of concept PR, to compare with the other idea demonstrated here: https://github.com/apache/iceberg/pull/1421
   
   I think once one is decided upon / if either of the versions are decided upon, then @RussellSpitzer will be updating it to be properly formatted etc. Please feel free to correct me if I'm wrong though Russell 🙂 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer closed pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r485983050



##########
File path: api/src/main/java/org/apache/iceberg/ManifestEntry.java
##########
@@ -1,20 +1,15 @@
 /*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
  *
- *   http://www.apache.org/licenses/LICENSE-2.0
+ *     http://www.apache.org/licenses/LICENSE-2.0
  *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.

Review comment:
       Did this not match the rest of the files in the project? I'd prefer not to reformat headers in a different PR if we can avoid it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #1420: WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #1420:
URL: https://github.com/apache/iceberg/pull/1420#discussion_r483934150



##########
File path: api/src/main/java/org/apache/iceberg/ManifestProcessor.java
##########
@@ -0,0 +1,33 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.function.BiFunction;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileIO;
+
+public abstract class ManifestProcessor implements Serializable {
+    public abstract <T extends ContentFile<T>> Iterable<CloseableIterable<ManifestEntry<T>>> readManifests(final Iterable<ManifestFile> fromIterable,
+        BiFunction<ManifestFile, FileIO, CloseableIterable<ManifestEntry<T>>> reader);
+
+    /**
+     * A Helper interface for making lambdas transform into the correct type for the ManfiestProcessor
+     * @param <T> The ManifestEntry Type being read from Manifest files
+     */
+    public interface Func<T extends ContentFile<T>> extends BiFunction<ManifestFile, FileIO,
+        CloseableIterable<ManifestEntry<T>>>, Serializable {}

Review comment:
       I know this is a WIP, so I'm partially commenting to follow along.
   
   But I'm not a huge fan of this interface being named `Func`. Its usage also feels somewhat clunky to me to be honest, but I'm more of a scala developer than a java developer by day so that could possibly just be my own bias showing.
   
   Overall this is shaping up to be a great addition though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org