Posted to dev@parquet.apache.org by Yash Ganthe <ya...@gmail.com> on 2020/07/18 15:19:57 UTC

How to incrementally store timeseries in Parquet files for efficient retrieval?

I would like to store the stock prices of a large number of companies in a
Parquet file as a time series.
If I gather the data at the end of 1 Jul, I would write a file such as:
1 Jul 2020, Company1,35
1 Jul 2020, Company2,46
....

On 2 Jul, I would receive the new prices and write them in "append"
mode as:
2 Jul 2020, Company1,37
2 Jul 2020, Company2,43
...

This will result in 2 partition files being created for the same parquet
file:
stocks.parquet/
    part0_stocks.parquet (written on 1 Jul)
    part1_stocks.parquet (written on 2 Jul)

If this continues for years, I will end up with a large number of partition
files, one per day.
If a client application wants to fetch the time series for 6 months, it will
have to read many files to gather the data, which may be inefficient.

Is there a better way to store timeseries data in parquet?

Re: How to incrementally store timeseries in Parquet files for efficient retrieval?

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
The usual solution is to partition the data based on the criteria you want
to filter by. E.g. for Hive tables, you would partition by date and have a
separate directory per date.
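
To make the directory-per-date idea concrete, here is a rough sketch using only
the Python standard library. It does not write any Parquet data; it only models
the Hive-style "date=YYYY-MM-DD" directory naming and shows how a reader can
choose which directories to open from the path names alone. The root name
"stocks" and the helper names are hypothetical, picked for illustration.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

ROOT = PurePosixPath("stocks")  # hypothetical dataset root


def partition_dir(day):
    """Hive-style directory for one day's data, e.g. stocks/date=2020-07-01."""
    return ROOT / f"date={day.isoformat()}"


def dirs_for_range(start, end):
    """One directory per calendar day for start <= day <= end (inclusive)."""
    n_days = (end - start).days + 1
    return [partition_dir(start + timedelta(days=i)) for i in range(n_days)]


def prune(all_dirs, start, end):
    """Keep only directories whose date= value is in range.

    Only the directory name is inspected; no file is opened.
    """
    kept = []
    for d in all_dirs:
        day = date.fromisoformat(d.name.split("=", 1)[1])
        if start <= day <= end:
            kept.append(d)
    return kept


# A full year of daily partitions, then a query for 1-2 Jul 2020.
year = dirs_for_range(date(2020, 1, 1), date(2020, 12, 31))
wanted = prune(year, date(2020, 7, 1), date(2020, 7, 2))
print([str(p) for p in wanted])
# prints ['stocks/date=2020-07-01', 'stocks/date=2020-07-02']
```

A query for a short date range touches only the matching directories, however
many years of data sit under the root.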

If you have a relatively modern version of Parquet, column statistics and page
indices allow the reader to skip files whose value ranges cannot match the
filter, after reading only the file footers. Reading the footer takes longer
than not reading the file at all, but is much faster than reading the whole file.
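
The range-based filtering can be sketched in a few lines of plain Python. This
is a model of the idea, not the Parquet API: FooterStats is a hypothetical
stand-in for the min/max statistics a footer records for the date column, and
the file names are made up. A file is read only if its [min, max] range
overlaps the range the query asks for.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FooterStats:
    """Hypothetical stand-in for a file's min/max stats on the date column."""
    path: str
    min_date: date
    max_date: date


def may_contain(stats, start, end):
    """True if the file's [min, max] range overlaps the query range [start, end]."""
    return stats.min_date <= end and stats.max_date >= start


def files_to_read(footers, start, end):
    """Paths of files that cannot be ruled out from their footer stats alone."""
    return [f.path for f in footers if may_contain(f, start, end)]


# Three made-up monthly files and a query for 15 Jun to 2 Jul 2020.
footers = [
    FooterStats("stocks/2020-05.parquet", date(2020, 5, 1), date(2020, 5, 31)),
    FooterStats("stocks/2020-06.parquet", date(2020, 6, 1), date(2020, 6, 30)),
    FooterStats("stocks/2020-07.parquet", date(2020, 7, 1), date(2020, 7, 31)),
]
print(files_to_read(footers, date(2020, 6, 15), date(2020, 7, 2)))
# prints ['stocks/2020-06.parquet', 'stocks/2020-07.parquet']
```

The May file is skipped after looking only at its statistics, which is the
saving described above: every footer is still read, but not every file body.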
