You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/03 22:46:34 UTC

[GitHub] [incubator-hudi] nsivabalan opened a new pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

nsivabalan opened a new pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   - This PR adds a new FileSystem called InlineFileSystem to support embedding any file format as an InlineFile within a regular file.  InlineFS will be used only in read path. 
   - Have added InMemoryFileSystem as part of the PR for write path.
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
     - *Added tests for InLineFileSystem and InLineFileSystem*
     - *Added tests for testing InlineFS for Parquet and HFile format*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-602287036
 
 
   @nsivabalan Only one issue I see
   
   I would advise against use of special string in parsing.. if someone for e,g has `inline_file` in the `<path_to_outer_file>`, then we are prone to errors? Can you fix it with along these lines?
   
   then once CI parses, please go ahead merge

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-587272207
 
 
   @vinothchandar : Have fixed the build failures. Patch is good to review. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393285575
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
 
 Review comment:
   may be a better example on the second line? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395401242
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##########
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
+
+  private final Configuration inMemoryConf;
+  private final Configuration inlineConf;
+  private final int minBlockSize = 1024;
+  private static final String LOCAL_FORMATTER = "%010d";
+  private int maxRows = 100 + RANDOM.nextInt(1000);
+  private Path generatedPath;
+
+  public TestHFileReadWriteFlow() {
+    inMemoryConf = new Configuration();
+    inMemoryConf.set("fs." + InMemoryFileSystem.SCHEME + ".impl", InMemoryFileSystem.class.getName());
+    inlineConf = new Configuration();
+    inlineConf.set("fs." + InlineFileSystem.SCHEME + ".impl", InlineFileSystem.class.getName());
+  }
+
+  @After
+  public void teardown() throws IOException {
+    if (generatedPath != null) {
+      File filePath = new File(generatedPath.toString().substring(generatedPath.toString().indexOf(':') + 1));
+      if (filePath.exists()) {
+        FileSystemTestUtils.deleteFile(filePath);
+      }
+    }
+  }
+
+  @Test
+  public void testSimpleInlineFileSystem() throws IOException {
+    Path outerInMemFSPath = getRandomOuterInMemPath();
+    Path outerPath = new Path(FILE_SCHEME + outerInMemFSPath.toString().substring(outerInMemFSPath.toString().indexOf(':')));
+    generatedPath = outerPath;
+    CacheConfig cacheConf = new CacheConfig(inMemoryConf);
+    FSDataOutputStream fout = createFSOutput(outerInMemFSPath, inMemoryConf);
+    HFileContext meta = new HFileContextBuilder()
+        .withBlockSize(minBlockSize)
+        .build();
+    HFile.Writer writer = HFile.getWriterFactory(inMemoryConf, cacheConf)
+        .withOutputStream(fout)
+        .withFileContext(meta)
+        .withComparator(new KeyValue.KVComparator())
+        .create();
+
+    writeRecords(writer);
+    fout.close();
+
+    byte[] inlineBytes = getBytesToInline(outerInMemFSPath);
+    long startOffset = generateOuterFile(outerPath, inlineBytes);
+
+    long inlineLength = inlineBytes.length;
+
+    // Generate phantom inline file
+    Path inlinePath = getPhantomFile(outerPath, startOffset, inlineLength);
+
+    InlineFileSystem inlineFileSystem = (InlineFileSystem) inlinePath.getFileSystem(inlineConf);
+    FSDataInputStream fin = inlineFileSystem.open(inlinePath);
+
+    HFile.Reader reader = HFile.createReader(inlineFileSystem, inlinePath, cacheConf, inlineConf);
+    // Load up the index.
+    reader.loadFileInfo();
+    // Get a scanner that caches and that does not use pread.
+    HFileScanner scanner = reader.getScanner(true, false);
+    // Align scanner at start of the file.
+    scanner.seekTo();
+    readAllRecords(scanner);
+
+    Set<Integer> rowIdsToSearch = getRandomValidRowIds(10);
+    for (int rowId : rowIdsToSearch) {
+      Assert.assertTrue("location lookup failed",
+          scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId))) == 0);
+      // read the key and see if it matches
+      ByteBuffer readKey = scanner.getKey();
+      Assert.assertTrue("seeked key does not match", Arrays.equals(getSomeKey(rowId),
+          Bytes.toBytes(readKey)));
+      scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId)));
+      ByteBuffer val1 = scanner.getValue();
+      scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId)));
+      ByteBuffer val2 = scanner.getValue();
+      Assert.assertTrue(Arrays.equals(Bytes.toBytes(val1), Bytes.toBytes(val2)));
+    }
+
+    int[] invalidRowIds = {-4, maxRows, maxRows + 1, maxRows + 120, maxRows + 160, maxRows + 1000};
+    for (int rowId : invalidRowIds) {
+      Assert.assertFalse("location lookup should have failed",
+          scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId))) == 0);
+    }
+    reader.close();
+    fin.close();
+    outerPath.getFileSystem(inMemoryConf).delete(outerPath, true);
+  }
+
+  private Set<Integer> getRandomValidRowIds(int count) {
+    Set<Integer> rowIds = new HashSet<>();
+    while (rowIds.size() < count) {
+      int index = RANDOM.nextInt(maxRows);
+      if (!rowIds.contains(index)) {
+        rowIds.add(index);
+      }
+    }
+    return rowIds;
+  }
+
+  private byte[] getSomeKey(int rowId) {
+    KeyValue kv = new KeyValue(String.format(LOCAL_FORMATTER, Integer.valueOf(rowId)).getBytes(),
+        Bytes.toBytes("family"), Bytes.toBytes("qual"), HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put);
+    return kv.getKey();
+  }
+
+  private FSDataOutputStream createFSOutput(Path name, Configuration conf) throws IOException {
+    //if (fs.exists(name)) fs.delete(name, true);
+    FSDataOutputStream fout = name.getFileSystem(conf).create(name);
+    return fout;
 
 Review comment:
   Sorry had some debug statements before and hence. will fix it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393337986
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##########
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
 
 Review comment:
   this file has a fair amount of test code.. any way to simply/reuse? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-578426325
 
 
   @nsivabalan is this ready to go?  do we want to close #1068 ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-599593502
 
 
   @nsivabalan its on my queue.. while you wait, can you please move the classes to the final locations.. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393289484
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+    return new Path(
+        InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + INLINE_FILE_STR + "/"
+            + "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + LENGTH_STR + EQUALS_STR + inLineLength
+    );
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String outerScheme) {
+    String scheme = inlinePath.getParent().getParent().getName();
+    Path basePath = inlinePath.getParent().getParent().getParent();
+    return new Path(basePath.toString().replaceFirst(outerScheme, scheme));
 
 Review comment:
   Similarly above, could we have just replaced `file:` with `inlinefs`, instead of indexOf(). 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393338934
 
 

 ##########
 File path: pom.xml
 ##########
 @@ -684,6 +684,11 @@
         <artifactId>hbase-client</artifactId>
         <version>${hbase.version}</version>
       </dependency>
+      <dependency>
+        <groupId>org.apache.hbase</groupId>
+        <artifactId>hbase-server</artifactId>
+        <version>${hbase.version}</version>
 
 Review comment:
   scope to be test? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393291524
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
 
 Review comment:
   In any case, please file JIRA for these gaps after we land this 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393338790
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestParquetReadWriteFlow.java
 ##########
 @@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.avro.AvroParquetReader;
+import org.apache.parquet.avro.AvroParquetWriter;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} with Parquet writer and reader.
+ */
+public class TestParquetReadWriteFlow {
 
 Review comment:
   rename `TestParquetInlining`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393292323
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
 
 Review comment:
   I think the InlineFSUtils can assume this scheme and can become lot simpler to read, with fewer args in methods. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393288676
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+    return new Path(
+        InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + INLINE_FILE_STR + "/"
+            + "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + LENGTH_STR + EQUALS_STR + inLineLength
 
 Review comment:
   break to a new line at `"&"` for readability? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395401013
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+    super.initialize(name, conf);
+    this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+    return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+    return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws IOException {
+    Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, getScheme());
+    FileSystem outerFs = outerPath.getFileSystem(conf);
+    FSDataInputStream outerStream = outerFs.open(outerPath, bufferSize);
+    return new InlineFsDataInputStream(InLineFSUtils.startOffset(inlinePath), outerStream, InLineFSUtils.length(inlinePath));
+  }
+
+  @Override
+  public boolean exists(Path f) {
+    try {
+      return getFileStatus(f) != null;
+    } catch (Exception e) {
+      return false;
+    }
+  }
+
+  @Override
+  public FileStatus getFileStatus(Path inlinePath) throws IOException {
+    Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, getScheme());
+    FileSystem outerFs = outerPath.getFileSystem(conf);
+    FileStatus status = outerFs.getFileStatus(outerPath);
+    FileStatus toReturn = new FileStatus(InLineFSUtils.length(inlinePath), status.isDirectory(), status.getReplication(), status.getBlockSize(),
 
 Review comment:
   have commented above on this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395395831
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+    return new Path(
+        InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + INLINE_FILE_STR + "/"
+            + "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + LENGTH_STR + EQUALS_STR + inLineLength
+    );
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String outerScheme) {
+    String scheme = inlinePath.getParent().getParent().getName();
+    Path basePath = inlinePath.getParent().getParent().getParent();
+    return new Path(basePath.toString().replaceFirst(outerScheme, scheme));
 
 Review comment:
   sorry I don't get you. I haven't used indexOf() in this method at all. not sure I get your feedback.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395400942
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+    super.initialize(name, conf);
+    this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+    return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+    return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws IOException {
+    Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, getScheme());
 
 Review comment:
   Within InlineFileSystem, we need to use Path in most places. So even if we introduce a POJO, I feel it may not be of much use. 1: all arguments to this class is Path. So, we have to convert to P
   OJO before we do anything. We can't store it as instance variable as well 2: Introducing a POJO might help in fetching start offset and length which is done only once in this class.  Not sure how much of value it adds to define a POJO. But open to making change if you feel so.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393338152
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##########
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
 
 Review comment:
   rename to `TestHFileInlining`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393289965
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+    return new Path(
+        InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + INLINE_FILE_STR + "/"
+            + "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + LENGTH_STR + EQUALS_STR + inLineLength
+    );
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String outerScheme) {
 
 Review comment:
   above example does not make it easy to understand what `outerScheme` is? is n't `outerScheme` always `inlinefs` if this is an inlinePath? Should we need to pass it in? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393286265
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
 
 Review comment:
   rename: getInlineFilePath()? (Prefer Inline to InLine everywhere for camelcasing it) 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan merged pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-587730864
 
 
   @nsivabalan awesome. will do.. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393337020
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##########
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
+
+  private final Configuration inMemoryConf;
+  private final Configuration inlineConf;
+  private final int minBlockSize = 1024;
+  private static final String LOCAL_FORMATTER = "%010d";
+  private int maxRows = 100 + RANDOM.nextInt(1000);
+  private Path generatedPath;
+
+  public TestHFileReadWriteFlow() {
+    inMemoryConf = new Configuration();
+    inMemoryConf.set("fs." + InMemoryFileSystem.SCHEME + ".impl", InMemoryFileSystem.class.getName());
+    inlineConf = new Configuration();
+    inlineConf.set("fs." + InlineFileSystem.SCHEME + ".impl", InlineFileSystem.class.getName());
+  }
+
+  @After
+  public void teardown() throws IOException {
+    if (generatedPath != null) {
+      File filePath = new File(generatedPath.toString().substring(generatedPath.toString().indexOf(':') + 1));
+      if (filePath.exists()) {
+        FileSystemTestUtils.deleteFile(filePath);
+      }
+    }
+  }
+
+  @Test
+  public void testSimpleInlineFileSystem() throws IOException {
+    Path outerInMemFSPath = getRandomOuterInMemPath();
+    Path outerPath = new Path(FILE_SCHEME + outerInMemFSPath.toString().substring(outerInMemFSPath.toString().indexOf(':')));
+    generatedPath = outerPath;
+    CacheConfig cacheConf = new CacheConfig(inMemoryConf);
+    FSDataOutputStream fout = createFSOutput(outerInMemFSPath, inMemoryConf);
+    HFileContext meta = new HFileContextBuilder()
+        .withBlockSize(minBlockSize)
+        .build();
+    HFile.Writer writer = HFile.getWriterFactory(inMemoryConf, cacheConf)
+        .withOutputStream(fout)
+        .withFileContext(meta)
+        .withComparator(new KeyValue.KVComparator())
+        .create();
+
+    writeRecords(writer);
+    fout.close();
+
+    byte[] inlineBytes = getBytesToInline(outerInMemFSPath);
+    long startOffset = generateOuterFile(outerPath, inlineBytes);
+
+    long inlineLength = inlineBytes.length;
+
+    // Generate phantom inline file
+    Path inlinePath = getPhantomFile(outerPath, startOffset, inlineLength);
+
+    InlineFileSystem inlineFileSystem = (InlineFileSystem) inlinePath.getFileSystem(inlineConf);
+    FSDataInputStream fin = inlineFileSystem.open(inlinePath);
+
+    HFile.Reader reader = HFile.createReader(inlineFileSystem, inlinePath, cacheConf, inlineConf);
+    // Load up the index.
+    reader.loadFileInfo();
+    // Get a scanner that caches and that does not use pread.
+    HFileScanner scanner = reader.getScanner(true, false);
+    // Align scanner at start of the file.
+    scanner.seekTo();
+    readAllRecords(scanner);
+
+    Set<Integer> rowIdsToSearch = getRandomValidRowIds(10);
+    for (int rowId : rowIdsToSearch) {
+      Assert.assertTrue("location lookup failed",
+          scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId))) == 0);
+      // read the key and see if it matches
+      ByteBuffer readKey = scanner.getKey();
+      Assert.assertTrue("seeked key does not match", Arrays.equals(getSomeKey(rowId),
+          Bytes.toBytes(readKey)));
+      scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId)));
+      ByteBuffer val1 = scanner.getValue();
+      scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId)));
+      ByteBuffer val2 = scanner.getValue();
+      Assert.assertTrue(Arrays.equals(Bytes.toBytes(val1), Bytes.toBytes(val2)));
+    }
+
+    int[] invalidRowIds = {-4, maxRows, maxRows + 1, maxRows + 120, maxRows + 160, maxRows + 1000};
+    for (int rowId : invalidRowIds) {
+      Assert.assertFalse("location lookup should have failed",
+          scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId))) == 0);
+    }
+    reader.close();
+    fin.close();
+    outerPath.getFileSystem(inMemoryConf).delete(outerPath, true);
+  }
+
+  private Set<Integer> getRandomValidRowIds(int count) {
+    Set<Integer> rowIds = new HashSet<>();
+    while (rowIds.size() < count) {
+      int index = RANDOM.nextInt(maxRows);
+      if (!rowIds.contains(index)) {
+        rowIds.add(index);
+      }
+    }
+    return rowIds;
+  }
+
+  private byte[] getSomeKey(int rowId) {
+    KeyValue kv = new KeyValue(String.format(LOCAL_FORMATTER, Integer.valueOf(rowId)).getBytes(),
+        Bytes.toBytes("family"), Bytes.toBytes("qual"), HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put);
+    return kv.getKey();
+  }
+
+  private FSDataOutputStream createFSOutput(Path name, Configuration conf) throws IOException {
+    //if (fs.exists(name)) fs.delete(name, true);
+    FSDataOutputStream fout = name.getFileSystem(conf).create(name);
+    return fout;
 
 Review comment:
   single line return? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-599255753
 
 
   @vinothchandar : a reminder to review this patch. Or should I ask someone to pitch in?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-601649921
 
 
   @vinothchandar : have addressed your comments for the most part. I will wait for 2 days for you to have a final look. If not, will merge it myself as per your suggestion. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393286954
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
 
 Review comment:
   ensure that `toString()` will always yield something with the scheme? may be there is a more direct method for this?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-570809268
 
 
   Some tests seem to be failing? 
   
   ```
   Failed tests: 
   10649  TestInMemoryFileSystem.testCreateWriteGetFileAsBytes:55 array lengths differed, expected.length=80 actual.length=17967
   10650Tests in error: 
   10651  TestParquetReadWriteFlow.testSimpleInlineFileSystem:96->readParquetGenericRecords:125 ยป IO
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395395084
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
 
 Review comment:
   Nothing significant. I am just using it to delimit regular path info from inline related attributes.
   Eg: "inlinefs://<path_to_outer_file>/s3a/inline_file/?start_offset=20&length=40". If you feel, its not adding much value, I can remove it.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393290627
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+    return new Path(
+        InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + INLINE_FILE_STR + "/"
+            + "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + LENGTH_STR + EQUALS_STR + inLineLength
+    );
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String outerScheme) {
+    String scheme = inlinePath.getParent().getParent().getName();
+    Path basePath = inlinePath.getParent().getParent().getParent();
+    return new Path(basePath.toString().replaceFirst(outerScheme, scheme));
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * output: 20
+   *
+   * @param inlinePath
+   * @return
+   */
+  public static int startOffset(Path inlinePath) {
+    String pathName = inlinePath.getName();
+    return Integer.parseInt(pathName.substring(pathName.indexOf('=') + 1, pathName.indexOf('&')));
 
 Review comment:
   can we implement this using `split()` and then picking the ith index.. instead of relying on first and last? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393336968
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##########
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
+
+  private final Configuration inMemoryConf;
+  private final Configuration inlineConf;
+  private final int minBlockSize = 1024;
+  private static final String LOCAL_FORMATTER = "%010d";
+  private int maxRows = 100 + RANDOM.nextInt(1000);
+  private Path generatedPath;
+
+  public TestHFileReadWriteFlow() {
+    inMemoryConf = new Configuration();
+    inMemoryConf.set("fs." + InMemoryFileSystem.SCHEME + ".impl", InMemoryFileSystem.class.getName());
+    inlineConf = new Configuration();
+    inlineConf.set("fs." + InlineFileSystem.SCHEME + ".impl", InlineFileSystem.class.getName());
+  }
+
+  @After
+  public void teardown() throws IOException {
+    if (generatedPath != null) {
+      File filePath = new File(generatedPath.toString().substring(generatedPath.toString().indexOf(':') + 1));
+      if (filePath.exists()) {
+        FileSystemTestUtils.deleteFile(filePath);
+      }
+    }
+  }
+
+  @Test
+  public void testSimpleInlineFileSystem() throws IOException {
+    Path outerInMemFSPath = getRandomOuterInMemPath();
+    Path outerPath = new Path(FILE_SCHEME + outerInMemFSPath.toString().substring(outerInMemFSPath.toString().indexOf(':')));
+    generatedPath = outerPath;
+    CacheConfig cacheConf = new CacheConfig(inMemoryConf);
+    FSDataOutputStream fout = createFSOutput(outerInMemFSPath, inMemoryConf);
+    HFileContext meta = new HFileContextBuilder()
+        .withBlockSize(minBlockSize)
+        .build();
+    HFile.Writer writer = HFile.getWriterFactory(inMemoryConf, cacheConf)
+        .withOutputStream(fout)
+        .withFileContext(meta)
+        .withComparator(new KeyValue.KVComparator())
+        .create();
+
+    writeRecords(writer);
+    fout.close();
+
+    byte[] inlineBytes = getBytesToInline(outerInMemFSPath);
+    long startOffset = generateOuterFile(outerPath, inlineBytes);
+
+    long inlineLength = inlineBytes.length;
+
+    // Generate phantom inline file
+    Path inlinePath = getPhantomFile(outerPath, startOffset, inlineLength);
+
+    InlineFileSystem inlineFileSystem = (InlineFileSystem) inlinePath.getFileSystem(inlineConf);
+    FSDataInputStream fin = inlineFileSystem.open(inlinePath);
+
+    HFile.Reader reader = HFile.createReader(inlineFileSystem, inlinePath, cacheConf, inlineConf);
+    // Load up the index.
+    reader.loadFileInfo();
+    // Get a scanner that caches and that does not use pread.
+    HFileScanner scanner = reader.getScanner(true, false);
+    // Align scanner at start of the file.
+    scanner.seekTo();
+    readAllRecords(scanner);
+
+    Set<Integer> rowIdsToSearch = getRandomValidRowIds(10);
+    for (int rowId : rowIdsToSearch) {
+      Assert.assertTrue("location lookup failed",
+          scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId))) == 0);
+      // read the key and see if it matches
+      ByteBuffer readKey = scanner.getKey();
+      Assert.assertTrue("seeked key does not match", Arrays.equals(getSomeKey(rowId),
+          Bytes.toBytes(readKey)));
+      scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId)));
+      ByteBuffer val1 = scanner.getValue();
+      scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId)));
+      ByteBuffer val2 = scanner.getValue();
+      Assert.assertTrue(Arrays.equals(Bytes.toBytes(val1), Bytes.toBytes(val2)));
+    }
+
+    int[] invalidRowIds = {-4, maxRows, maxRows + 1, maxRows + 120, maxRows + 160, maxRows + 1000};
+    for (int rowId : invalidRowIds) {
+      Assert.assertFalse("location lookup should have failed",
+          scanner.seekTo(KeyValue.createKeyValueFromKey(getSomeKey(rowId))) == 0);
+    }
+    reader.close();
+    fin.close();
+    outerPath.getFileSystem(inMemoryConf).delete(outerPath, true);
+  }
+
+  private Set<Integer> getRandomValidRowIds(int count) {
+    Set<Integer> rowIds = new HashSet<>();
+    while (rowIds.size() < count) {
+      int index = RANDOM.nextInt(maxRows);
+      if (!rowIds.contains(index)) {
+        rowIds.add(index);
+      }
+    }
+    return rowIds;
+  }
+
+  private byte[] getSomeKey(int rowId) {
+    KeyValue kv = new KeyValue(String.format(LOCAL_FORMATTER, Integer.valueOf(rowId)).getBytes(),
+        Bytes.toBytes("family"), Bytes.toBytes("qual"), HConstants.LATEST_TIMESTAMP, KeyValue.Type.Put);
+    return kv.getKey();
+  }
+
+  private FSDataOutputStream createFSOutput(Path name, Configuration conf) throws IOException {
+    //if (fs.exists(name)) fs.delete(name, true);
 
 Review comment:
   remove?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395402222
 
 

 ##########
 File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##########
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
 
 Review comment:
   Most of it is supporting cast for HFile read and write. So, not sure how much we re-use we could do here. Not too much value in modularizing it because the TestParquetInLining may have lot of supporting cast for Parquet. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393291311
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
 
 Review comment:
   can you expand on this? supporting parquet summary metadata is a critical thing for the PR IMO

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395399661
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
 
 Review comment:
   yes, I have fixed the scheme issue. Can you check once I have addressed all comments.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395120821
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
 
 Review comment:
   I checked Path, but couldn't find anything. toString() does have scheme if present. 
   
   @Override
     public String toString() {
       // we can't use uri.toString(), which escapes everything, because we want
       // illegal characters unescaped in the string, for glob processing, etc.
       StringBuilder buffer = new StringBuilder();
       if (uri.getScheme() != null) {
         buffer.append(uri.getScheme());
         buffer.append(":");
       }
       if (uri.getAuthority() != null) {
         buffer.append("//");
         buffer.append(uri.getAuthority());
       }
       if (uri.getPath() != null) {
         String path = uri.getPath();
         if (path.indexOf('/')==0 &&
             hasWindowsDrive(path) &&                // has windows drive
             uri.getScheme() == null &&              // but no scheme
             uri.getAuthority() == null)             // or authority
           path = path.substring(1);                 // remove slash before drive
         buffer.append(path);
       }
       if (uri.getFragment() != null) {
         buffer.append("#");
         buffer.append(uri.getFragment());
       }
       return buffer.toString();
     }

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393288343
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
 
 Review comment:
   whats the purpose of this string? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r387111823
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InMemoryFileSystem.java
 ##########
 @@ -0,0 +1,120 @@
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.ByteArrayOutputStream;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.net.URI;
+import java.net.URISyntaxException;
+
+/**
+ * A FileSystem which stores all content in memory and returns a byte[] when {@link #getFileAsBytes()} is called
+ * This FileSystem is used only in write path. Does not support any read apis except {@link #getFileAsBytes()}.
+ */
+public class InMemoryFileSystem extends FileSystem {
 
 Review comment:
   Sure. will address along with other feedback. Guess you are not blocked from your review on this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393287633
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
 
 Review comment:
   use a different scheme like hdfs: or s3a for illustration? :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395399488
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
 
 Review comment:
   actually I wanted to sync up with you on this. Will sycn up offline.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-586061205
 
 
   @nsivabalan are you still blocked by the test failures for this one? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r386132436
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InMemoryFileSystem.java
 ##########
 @@ -0,0 +1,120 @@
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.ByteArrayOutputStream;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.net.URI;
+import java.net.URISyntaxException;
+
+/**
+ * A FileSystem which stores all content in memory and returns a byte[] when {@link #getFileAsBytes()} is called
+ * This FileSystem is used only in write path. Does not support any read apis except {@link #getFileAsBytes()}.
+ */
+public class InMemoryFileSystem extends FileSystem {
 
 Review comment:
   can we move all of this code to `hudi-common` 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395398271
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##########
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format: "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/<start_offset>/<length>"
+ * "inlinefs:/<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String origScheme, long inLineStartOffset, long inLineLength) {
+    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+    return new Path(
+        InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + INLINE_FILE_STR + "/"
+            + "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + LENGTH_STR + EQUALS_STR + inLineLength
+    );
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String outerScheme) {
 
 Review comment:
   actually you are right. will fix it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393336202
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+    super.initialize(name, conf);
+    this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+    return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+    return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws IOException {
+    Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, getScheme());
+    FileSystem outerFs = outerPath.getFileSystem(conf);
+    FSDataInputStream outerStream = outerFs.open(outerPath, bufferSize);
+    return new InlineFsDataInputStream(InLineFSUtils.startOffset(inlinePath), outerStream, InLineFSUtils.length(inlinePath));
+  }
+
+  @Override
+  public boolean exists(Path f) {
+    try {
+      return getFileStatus(f) != null;
+    } catch (Exception e) {
+      return false;
+    }
+  }
+
+  @Override
+  public FileStatus getFileStatus(Path inlinePath) throws IOException {
+    Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, getScheme());
+    FileSystem outerFs = outerPath.getFileSystem(conf);
+    FileStatus status = outerFs.getFileStatus(outerPath);
+    FileStatus toReturn = new FileStatus(InLineFSUtils.length(inlinePath), status.isDirectory(), status.getReplication(), status.getBlockSize(),
 
 Review comment:
   and all of this code will read nicely.. as just getters on that pojo 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-581781882
 
 
   @nsivabalan quick update on this? this is ready for review? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r393336002
 
 

 ##########
 File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##########
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path and does not support
+ * any write apis.
+ * <p>
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs:///path/to/outer/file/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>
+ * <p>
+ * TODO: The reader/writer may try to use relative paths based on the inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+    super.initialize(name, conf);
+    this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+    return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+    return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws IOException {
+    Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, getScheme());
 
 Review comment:
   can we introduce a pojo for this? Something that can keep an outerPath, scheme, startOffset and length together.. we can have this utils method return that.. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services