Posted to dev@parquet.apache.org by "wxmimperio (Jira)" <ji...@apache.org> on 2020/08/06 08:42:00 UTC

[jira] [Comment Edited] (PARQUET-1559) Add way to manually commit already written data to disk

    [ https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172137#comment-17172137 ] 

wxmimperio edited comment on PARQUET-1559 at 8/6/20, 8:41 AM:
--------------------------------------------------------------

[~gszadovszky]

Thank you for your answer.

I want to know: if I configure relatively small row groups, so that the column store is flushed to the page store frequently and the page store is flushed to the outputStream, will the data actually be written to disk? (I know the data is unreadable at that point, but the column store and page store memory can be released by GC.)
 pageStore.flushToFileWriter(parquetFileWriter);
 This method only flushes the page store to the outputStream, so the data should still be in memory until outputStream.close() is called.

When I reduced rowGroupSize to 8 MB, I saw the debug log LOG.debug("Flushing mem columnStore to file. allocated memory: {}", columnStore.getAllocatedSize()), but the file on HDFS still had no content and zero size. I guess the outputStream did not flush the data out.
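
For reference, here is a simplified sketch of what I am testing (the schema and path are made up, and this assumes parquet-avro's AvroParquetWriter builder and the Hadoop FileSystem API, so treat it as an illustration rather than my exact job):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class SmallRowGroupTest {
  public static void main(String[] args) throws Exception {
    // Hypothetical one-column schema, just to generate enough data.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/small-row-groups.parquet"); // hypothetical path

    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(path)
        .withSchema(schema)
        .withConf(conf)
        .withRowGroupSize(8 * 1024 * 1024) // small row groups -> frequent flushes
        .build()) {
      for (long i = 0; i < 10_000_000L; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", i);
        writer.write(record);
      }
      // Several row groups have been flushed to the outputStream by now,
      // but the length reported for the still-open file on HDFS is 0:
      FileSystem fs = path.getFileSystem(conf);
      System.out.println("visible length before close: "
          + fs.getFileStatus(path).getLen());
    }
    // Only after close() (footer written, stream closed) does the file
    // report its real size and become readable.
  }
}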


> Add way to manually commit already written data to disk
> -------------------------------------------------------
>
>                 Key: PARQUET-1559
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1559
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Victor
>            Priority: Major
>
> I'm not exactly sure this is compliant with the way parquet works, but I have the following need:
>  * I'm using parquet-avro to write to a parquet file during a long-running process
>  * I would like to be able, from time to time, to access the already written data
> So I was expecting to be able to manually flush the file to ensure the data is on disk, and then copy the file for preliminary analysis.
> If that contradicts the way parquet works (for example, because the metadata lives in the footer of the file), what would the alternative be?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr, maybe? It would then write multiple files in that case.
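
For illustration, the file-rolling alternative mentioned above could look roughly like this (a hypothetical helper, not an existing parquet-mr API; the class name and the record-count roll trigger are made up):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class RollingParquetWriter implements AutoCloseable {
  private final Schema schema;
  private final Configuration conf;
  private final String baseDir;
  private final long recordsPerFile; // roll trigger; could also be bytes or time
  private ParquetWriter<GenericRecord> current;
  private long written;
  private int fileIndex;

  public RollingParquetWriter(Schema schema, Configuration conf,
                              String baseDir, long recordsPerFile) throws Exception {
    this.schema = schema;
    this.conf = conf;
    this.baseDir = baseDir;
    this.recordsPerFile = recordsPerFile;
    roll();
  }

  public void write(GenericRecord record) throws Exception {
    if (written >= recordsPerFile) {
      roll(); // closing the old file writes its footer, making it readable
    }
    current.write(record);
    written++;
  }

  private void roll() throws Exception {
    if (current != null) {
      current.close();
    }
    Path path = new Path(baseDir, "part-" + (fileIndex++) + ".parquet");
    current = AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema)
        .withConf(conf)
        .build();
    written = 0;
  }

  @Override
  public void close() throws Exception {
    if (current != null) {
      current.close();
    }
  }
}

Each roll closes the previous file, so its footer is written and that data becomes readable by other processes, at the cost of producing many smaller files.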



--
This message was sent by Atlassian Jira
(v8.3.4#803005)