Posted to commits@parquet.apache.org by fo...@apache.org on 2019/06/25 20:20:08 UTC
[parquet-format] branch master updated: PARQUET-1610: Minor grammatical fixes (#132)
This is an automated email from the ASF dual-hosted git repository.
fokko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 96a8f31 PARQUET-1610: Minor grammatical fixes (#132)
96a8f31 is described below
commit 96a8f3172a3b895408d2d1b939200dd02ab8300d
Author: Umayah Abdennabi <ab...@gmail.com>
AuthorDate: Tue Jun 25 13:20:03 2019 -0700
PARQUET-1610: Minor grammatical fixes (#132)
---
PageIndex.md | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)
diff --git a/PageIndex.md b/PageIndex.md
index 7ac6e42..551ef0c 100644
--- a/PageIndex.md
+++ b/PageIndex.md
@@ -19,14 +19,14 @@
# ColumnIndex Layout to Support Page Skipping
-This documents describes the format for column index pages in the Parquet
+This document describes the format for column index pages in the Parquet
footer. These pages contain statistics for DataPages and can be used to skip
pages when scanning data in ordered and unordered columns.
## Problem Statement
In previous versions of the format, Statistics are stored for ColumnChunks in
ColumnMetaData and for individual pages inside DataPageHeader structs. When
-reading pages, a reader had to process the page header in order to determine
+reading pages, a reader had to process the page header to determine
whether the page could be skipped based on the statistics. This means the reader
had to access all pages in a column, thus likely reading most of the column
data from disk.
@@ -34,21 +34,21 @@ data from disk.
## Goals
1. Make both range scans and point lookups I/O efficient by allowing direct
access to pages based on their min and max values. In particular:
-2. A single-row lookup in a rowgroup based on the sort column of that rowgroup
- will only read one data page per retrieved column.
- * Range scans on the sort column will only need to read the exact data
+2. A single-row lookup in a row group based on the sort column of that row group
+ will only read one data page per the retrieved column.
+ * Range scans on the sort column will only need to read the exact data
pages that contain relevant data.
* Make other selective scans I/O efficient: if we have a very selective
predicate on a non-sorting column, for the other retrieved columns we
should only need to access data pages that contain matching rows.
3. No additional decoding effort for scans without selective predicates, e.g.,
- full-row group scans. If a reader determines that it does not need to read
+ full-row group scans. If a reader determines that it does not need to read
the index data, it does not incur any overhead.
4. Index pages for sorted columns use minimal storage by storing only the
boundary elements between pages.
## Non-Goals
-* Support for the equivalent of secondary indices, ie, an index structure
+* Support for the equivalent of secondary indices, i.e., an index structure
sorted on the key values over non-sorted data.
@@ -64,9 +64,9 @@ We add two new per-column structures to the row group metadata:
skipped. Hence the OffsetIndexes for each column in a RowGroup are stored
together.
-The new index structures are stored separately from RowGroup, near the footer,
-so that a reader does not have to pay the I/O and deserialization cost for
-reading the them if it is not doing selective scans. The index structures'
+The new index structures are stored separately from RowGroup, near the footer.
+This is done so that a reader does not have to pay the I/O and deserialization
+cost for reading them if it is not doing selective scans. The index structures'
location and length are stored in ColumnChunk.
![Page Index Layout](doc/images/PageIndexLayout.png)
@@ -92,10 +92,7 @@ a binary search in `min_values` and `max_values`. For unordered columns, a
reader can find matching pages by sequentially reading `min_values` and
`max_values`.
-For range scans this approach can be extended to return ranges of rows, page
+For range scans, this approach can be extended to return ranges of rows, page
indices, and page offsets to scan in each column. The reader can then
initialize a scanner for each column and fast forward them to the start row of
the scan.
-
-
-
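
Editorial illustration, not part of the commit: the context lines of the last hunk describe finding matching pages with a binary search over `min_values` and `max_values` for ordered columns, and with a sequential read for unordered columns. A minimal Python sketch of that idea follows. It assumes an ascending column, already-decoded comparable values held in plain lists, and no null-only pages. The function names are hypothetical and do not come from any Parquet library.

```python
from bisect import bisect_left, bisect_right


def pages_for_value_ordered(min_values, max_values, value):
    """Candidate page indices for an ascending column (hypothetical helper).

    Assumes min_values and max_values are both sorted ascending and that
    every page has at least one non-null value.
    """
    # First page whose max is >= value; earlier pages end before the value.
    first = bisect_left(max_values, value)
    # First page whose min is > value; this and later pages start after it.
    last = bisect_right(min_values, value)
    return list(range(first, last))


def pages_for_value_unordered(min_values, max_values, value):
    """Candidate page indices for an unordered column via a sequential scan."""
    return [
        i
        for i, (lo, hi) in enumerate(zip(min_values, max_values))
        if lo <= value <= hi
    ]


def pages_for_range_ordered(min_values, max_values, low, high):
    """Candidate page indices for the closed range [low, high], ascending column."""
    first = bisect_left(max_values, low)
    last = bisect_right(min_values, high)
    return list(range(first, last))
```

The returned indices only identify pages that may contain matches; a reader would still need the page offsets and row ranges from the OffsetIndex to position a scanner in each column.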