Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/29 09:17:32 UTC

[GitHub] [arrow] zhangyue19921010 opened a new issue, #13030: [JAVA] Is any way reading partial parquet file into arrow

zhangyue19921010 opened a new issue, #13030:
URL: https://github.com/apache/arrow/issues/13030

   Hi Team,
   
   I am developing HoodieArrowParquetFileFormat, which aims to let Spark SQL query Hudi data through Arrow.
   
   Because of the Spark SQL abstraction, I need to read only part of a Parquet file, the byte range [offset0, offset1]. Is it possible to do this with the newScan API?
   For example, could we pass `start_offset, length, file_format` directly?
   ```
   JNIEXPORT jlong JNICALL
   Java_org_apache_arrow_dataset_file_JniWrapper_makeFileSystemDatasetFactory(
       JNIEnv* env, jobject, jstring uri, jlong file_format_id,
       jlong start_offset, jlong length) {
     JNI_METHOD_START
     std::shared_ptr<arrow::dataset::FileFormat> file_format =
         JniGetOrThrow(GetFileFormat(file_format_id));
     arrow::dataset::FileSystemFactoryOptions options;
     std::shared_ptr<arrow::dataset::DatasetFactory> d =
         JniGetOrThrow(arrow::dataset::FileSystemDatasetFactory::Make(
             JStringToCString(env, uri), start_offset, length, file_format, options));
     return CreateNativeRef(d);
     JNI_METHOD_END(-1L)
   }
   ```
   
   ```
       (file: PartitionedFile) => {
         val allocator = HoodieArrowUtils.getAllocator()
         val factory = HoodieArrowUtils.getDatasetFactory(allocator, file.filePath)
         val dataset = factory.finish(HoodieArrowUtils.toArrowSchema(requiredSchema, HoodieArrowUtils.getLocalTimezoneID()))
   
         val scanOptions = new ScanOptions(batchSize)
         val scanner = dataset.newScan(scanOptions)
   
         Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => {
           scanner.close()
           dataset.close()
           factory.close()
         }))
   
         val itr = scanner.scan().iterator()
           .asScala.toList
           .flatMap(task => task.execute().asScala.toList)
           .map(batch => HoodieArrowUtils.loadBatch(batch, file.partitionValues, partitionSchema, requiredSchema, allocator))
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] vinothchandar commented on issue #13030: [JAVA] Is any way reading partial parquet file into arrow

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.

vinothchandar commented on issue #13030:
URL: https://github.com/apache/arrow/issues/13030#issuecomment-1633517322

   @zhangyue19921010 were you looking for something like what's asked in #35638 ?


[GitHub] [arrow] zhangyue19921010 commented on issue #13030: [Java] Is any way reading partial parquet file into arrow

Posted by "zhangyue19921010 (via GitHub)" <gi...@apache.org>.

zhangyue19921010 commented on issue #13030:
URL: https://github.com/apache/arrow/issues/13030#issuecomment-1637968534

   Also sorry for missing this message ... 


[GitHub] [arrow] westonpace commented on issue #13030: [JAVA] Is any way reading partial parquet file into arrow

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #13030:
URL: https://github.com/apache/arrow/issues/13030#issuecomment-1113692280

   A Parquet file is made up of row groups, columns, and pages.  A page is indivisible because it represents a compressed buffer.  There is no way to read only part of a page, so a page cannot be sliced.
   
   However, it is still a popular idea to partition file access based on file size.  One way to handle this is to return every row group whose first byte falls in the requested range.
   
   For example, if a Parquet file has 10 row groups of 900,000 bytes each and you ask for the range [2000000,3000000], you would get only the 4th row group (index 3 when counting from zero), because it starts at byte 2,700,000 and no other row group starts inside that range.
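   
   The selection rule described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not Arrow's or Parquet's actual implementation: `SelectRowGroups` and `UniformStarts` are hypothetical helper names, the row group start offsets are assumed to be already known (in practice they would come from the file footer metadata), and the range is treated as half-open, `[range_start, range_end)`, so that adjacent splits never return the same row group twice.
   
   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <vector>
   
   // Hypothetical helper: return the zero-based indices of all row groups
   // whose first byte lies in the half-open range [range_start, range_end).
   // `starts` holds the byte offset of the first byte of each row group.
   std::vector<int> SelectRowGroups(const std::vector<std::int64_t>& starts,
                                    std::int64_t range_start,
                                    std::int64_t range_end) {
     std::vector<int> selected;
     for (std::size_t i = 0; i < starts.size(); ++i) {
       if (starts[i] >= range_start && starts[i] < range_end) {
         selected.push_back(static_cast<int>(i));
       }
     }
     return selected;
   }
   
   // Hypothetical helper: start offsets for `count` row groups of equal
   // `size`, ignoring file headers for simplicity (an assumption made only
   // to mirror the example in the text).
   std::vector<std::int64_t> UniformStarts(int count, std::int64_t size) {
     std::vector<std::int64_t> starts;
     for (int i = 0; i < count; ++i) {
       starts.push_back(static_cast<std::int64_t>(i) * size);
     }
     return starts;
   }
   ```
   
   With `UniformStarts(10, 900000)` and the range 2,000,000 to 3,000,000, `SelectRowGroups` returns the single index 3, which is the row group starting at byte 2,700,000. Because every row group has exactly one first byte, a set of non-overlapping ranges that covers the whole file assigns each row group to exactly one range.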


[GitHub] [arrow] zhangyue19921010 commented on issue #13030: [JAVA] Is any way reading partial parquet file into arrow

Posted by GitBox <gi...@apache.org>.

zhangyue19921010 commented on issue #13030:
URL: https://github.com/apache/arrow/issues/13030#issuecomment-1113126932

   Or could we provide this kind of option for users? Thanks a lot!


[GitHub] [arrow] zhangyue19921010 commented on issue #13030: [Java] Is any way reading partial parquet file into arrow

Posted by "zhangyue19921010 (via GitHub)" <gi...@apache.org>.

zhangyue19921010 commented on issue #13030:
URL: https://github.com/apache/arrow/issues/13030#issuecomment-1637966062

   > @zhangyue19921010 were you probably asking if there is a way to avoid loading a parquet file all at once? row group by row group?
   
   Exactly, VC. Unfortunately, there doesn't seem to be a particularly good way to do it.

