Posted to dev@parquet.apache.org by Deepak Gangwar <dg...@vmware.com> on 2022/02/17 11:29:35 UTC

Get uncompressed size of parquet file via parquet-cli

Hi folks,

I was using parquet-tools to inspect the data and metadata of Parquet files. I noticed that parquet-tools has been deprecated and removed from the latest branch, and that parquet-cli replaces it. Most of my use cases are covered by parquet-cli, but one thing is missing: I cannot find any way to get the uncompressed size of the data. “parquet-tools size -u” gave the uncompressed size, but there is no equivalent parquet-cli command, and “parquet-cli meta” prints only the compressed size.

I looked around in the codebase and noticed that uncompressedSize is assigned to a variable in the meta command but is never used or printed [1]. I think the use of the variable was simply missed, but I could not find an open issue in Jira, so I might be completely wrong here. Could you please confirm whether this is actually an issue, and whether there is any other way to get the uncompressed size that I am missing?


[1] https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ParquetMetadataCommand.java#L123
--
Thanks & Regards
Deepak Gangwar
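
On the closing question about another way to read the uncompressed size: the Parquet footer records both a compressed and an uncompressed byte count for every column chunk, so any reader that exposes the footer can sum them. The following is a minimal sketch in Python, assuming pyarrow is installed. pyarrow is not mentioned in this thread and is not part of parquet-cli, and size_totals is a hypothetical helper name. It is a workaround under those assumptions, not a description of what “parquet-tools size -u” or any parquet-cli fix does internally.

```python
import sys


def size_totals(metadata):
    """Sum compressed and uncompressed bytes over every column chunk.

    `metadata` is expected to look like pyarrow's FileMetaData: it exposes
    num_row_groups and row_group(i); each row group exposes num_columns and
    column(j); each column chunk exposes total_compressed_size and
    total_uncompressed_size.
    """
    compressed = 0
    uncompressed = 0
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        for j in range(row_group.num_columns):
            chunk = row_group.column(j)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size
    return compressed, uncompressed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: pyarrow is available. It only reads the file footer here,
    # so no column data is decompressed.
    import pyarrow.parquet as pq

    footer = pq.ParquetFile(sys.argv[1]).metadata
    compressed, uncompressed = size_totals(footer)
    print(f"compressed:   {compressed} bytes")
    print(f"uncompressed: {uncompressed} bytes")
```

Saved under a hypothetical name such as sizes.py, it could be run as "python sizes.py data.parquet". The helper only needs the footer, so it should stay cheap even for large files.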


Re: Get uncompressed size of parquet file via parquet-cli

Posted by Deepak Gangwar <dg...@vmware.com>.
Thanks, Xinli, for confirming. It looks like Vinoo already has a fix, so I will look out for that PR.

--
Thanks & Regards
Deepak Gangwar


From: Vinoo Ganesh <vi...@gmail.com>
Date: Monday, 21 February 2022 at 2:42 AM
To: dev@parquet.apache.org <de...@parquet.apache.org>
Subject: Re: Get uncompressed size of parquet file via parquet-cli
Ironically, I've needed this and added it recently on my fork of
parquet. Happy to contribute it back:

https://issues.apache.org/jira/browse/PARQUET-2129
https://github.com/apache/parquet-mr/pull/949

Thanks,
Vinoo Ganesh | vinoo.ganesh@gmail.com



Re: Get uncompressed size of parquet file via parquet-cli

Posted by Vinoo Ganesh <vi...@gmail.com>.
Ironically, I've needed this and added it recently on my fork of
parquet. Happy to contribute it back:

https://issues.apache.org/jira/browse/PARQUET-2129
https://github.com/apache/parquet-mr/pull/949

Thanks,
Vinoo Ganesh | vinoo.ganesh@gmail.com


On Sun, Feb 20, 2022 at 1:18 PM Xinli Shang <sh...@uber.com.invalid> wrote:

> You seem to be right. 'uncompressedSize' holds the value but is not printed
> anywhere. Do you want to make a fix?

Re: Get uncompressed size of parquet file via parquet-cli

Posted by Xinli Shang <sh...@uber.com.INVALID>.
You seem to be right. 'uncompressedSize' holds the value but is not printed
anywhere. Do you want to make a fix?

-- 
Xinli Shang