Posted to commits@parquet.apache.org by sh...@apache.org on 2022/03/04 17:15:52 UTC

[parquet-mr] branch master updated: PARQUET-2121: Remove descriptions for the removed modules (#947)

This is an automated email from the ASF dual-hosted git repository.

shangxinli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-mr.git


The following commit(s) were added to refs/heads/master by this push:
     new 4d062dc  PARQUET-2121: Remove descriptions for the removed modules (#947)
4d062dc is described below

commit 4d062dc37577e719dcecc666f8e837843e44a9be
Author: Kengo Seki <se...@apache.org>
AuthorDate: Sat Mar 5 02:15:14 2022 +0900

    PARQUET-2121: Remove descriptions for the removed modules (#947)
    
    * PARQUET-2121: Remove descriptions for the removed modules
    
    * Add '(deprecated)' to removed modules in README.md instead of removing their line
---
 .gitignore                                         |   1 -
 README.md                                          |   6 +-
 .../parquet/hadoop/thrift/ThriftReadSupport.java   |   9 +-
 parquet_cascading.md                               | 163 ---------------------
 pom.xml                                            |   3 -
 5 files changed, 6 insertions(+), 176 deletions(-)

diff --git a/.gitignore b/.gitignore
index aa67d3d..2ef152e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,7 +13,6 @@ target
 *.orig
 *.rej
 dependency-reduced-pom.xml
-parquet-scrooge/.cache
 .idea/*
 target/
 .cache
diff --git a/README.md b/README.md
index d897125..91501dd 100644
--- a/README.md
+++ b/README.md
@@ -66,10 +66,10 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration
+* Cascading integration (deprecated)
 * Crunch integration
 * Apache Arrow integration
-* Apache Scrooge integration
+* Scrooge integration (deprecated)
 * Impala integration (non-nested)
 * Java Map/Reduce API
 * Native Avro support
@@ -92,7 +92,7 @@ Note that to use an Input or Output format, you need to implement a WriteSupport
 We've implemented this for 2 popular data formats to provide a clean migration path as well:
 
 ### Thrift
-Thrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-mr/tree/master/parquet-thrift) sub-project. If you are using Thrift through Scala, you may be using Twitter's [Scrooge](https://github.com/twitter/scrooge). If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge) sub-project.
+Thrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-mr/tree/master/parquet-thrift) sub-project.
 
 ### Avro
 Avro conversion is implemented via the [parquet-avro](https://github.com/apache/parquet-mr/tree/master/parquet-avro) sub-project.
diff --git a/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java b/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java
index 6bad970..2375a6d 100644
--- a/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java
+++ b/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java
@@ -67,8 +67,7 @@ public class ThriftReadSupport<T> extends ReadSupport<T> {
   /**
    * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default
    * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such
-   * as <a href="http://github.com/twitter/scrooge">Twitter's Scrooge</a>, a custom converter can be specified using this key
-   * (for example, ScroogeRecordConverter from parquet-scrooge).
+   * as <a href="http://github.com/twitter/scrooge">Twitter's Scrooge</a>, a custom converter can be specified using this key.
    */
   private static final String RECORD_CONVERTER_CLASS_KEY = "parquet.thrift.converter.class";
 
@@ -77,8 +76,7 @@ public class ThriftReadSupport<T> extends ReadSupport<T> {
   /**
    * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default
    * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such
-   * as <a href="http://github.com/twitter/scrooge">Twitter's Scrooge</a>, a custom converter can be specified
-   * (for example, ScroogeRecordConverter from parquet-scrooge).
+   * as <a href="http://github.com/twitter/scrooge">Twitter's Scrooge</a>, a custom converter can be specified.
    *
    * @param conf a mapred jobconf
    * @param klass a thrift class
@@ -93,8 +91,7 @@ public class ThriftReadSupport<T> extends ReadSupport<T> {
   /**
    * A {@link ThriftRecordConverter} builds an object by working with {@link TProtocol}. The default
    * implementation creates standard Apache Thrift {@link TBase} objects; to support alternatives, such
-   * as <a href="http://github.com/twitter/scrooge">Twitter's Scrooge</a>, a custom converter can be specified
-   * (for example, ScroogeRecordConverter from parquet-scrooge).
+   * as <a href="http://github.com/twitter/scrooge">Twitter's Scrooge</a>, a custom converter can be specified.
    *
    * @param conf a configuration
    * @param klass a thrift class
diff --git a/parquet_cascading.md b/parquet_cascading.md
deleted file mode 100644
index 0eeaceb..0000000
--- a/parquet_cascading.md
+++ /dev/null
@@ -1,163 +0,0 @@
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one
-  ~ or more contributor license agreements.  See the NOTICE file
-  ~ distributed with this work for additional information
-  ~ regarding copyright ownership.  The ASF licenses this file
-  ~ to you under the Apache License, Version 2.0 (the
-  ~ "License"); you may not use this file except in compliance
-  ~ with the License.  You may obtain a copy of the License at
-  ~
-  ~   http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing,
-  ~ software distributed under the License is distributed on an
-  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  ~ KIND, either express or implied.  See the License for the
-  ~ specific language governing permissions and limitations
-  ~ under the License.
-  -->
-
-Parquet Cascading Integration
-=============================
-
-This document describes how to read and write the Parquet format from Cascading.
-
-1. Read and Write
-==============
-
-In the [parquet-cascading](https://github.com/apache/parquet-mr/tree/master/parquet-cascading) sub-module, we provide support for reading and writing records of various data structures, including Thrift (TBase), Scrooge, and Cascading Tuples. Please refer to the following sections for each data structure.
-
-1.1 Thrift/TBase
-------------
-### Read Thrift Records from Parquet
-[ParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) is the interface for reading Thrift records in the Parquet format. Providing a ParquetTBaseScheme as a parameter to the constructor of a source enables the program to read Thrift objects (TBase), e.g.:
-
-    Scheme sourceScheme = new ParquetTBaseScheme(Name.class);
-    Tap source = new Hfs(sourceScheme, parquetInputPath);
-
-In the above example, Name is a Thrift class that extends TBase. Under the hood, Parquet generates a schema from the Thrift class to decode the data.
-
-The Thrift class is actually *optional* when initializing a ParquetTBaseScheme, as long as the data was written as Thrift records in Parquet. When writing Thrift records to the Parquet format, the Thrift class of the records is stored as metadata in the footer of the Parquet file. Therefore, when reading the file, if a Thrift class is not explicitly provided, Parquet uses the class name stored in the footer as the Thrift class.
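-
-For instance, reading without naming the class might look like this (a minimal sketch, assuming ParquetTBaseScheme exposes a no-argument constructor for this footer-driven mode):
-
-    // the Thrift class name is resolved from the Parquet file footer
-    Scheme sourceScheme = new ParquetTBaseScheme();
-    Tap source = new Hfs(sourceScheme, parquetInputPath);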
-
-### Write Thrift Records to Parquet
-[ParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) can also be used by a sink. When used as a sink, the Thrift class of the records being written must be *explicitly* provided:
-
-    Scheme sinkScheme = new ParquetTBaseScheme(Name.class);
-    Tap sink = new Hfs(sinkScheme, parquetOutputPath);
-
-For more concrete examples, please refer to [TestParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTBaseScheme.java).
-
-1.2 Scrooge
------------
-### Read Scrooge records from Parquet
-Scrooge support is defined in a separate module called [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge). With [ParquetScroogeScheme](https://github.com/apache/parquet-mr/blob/master/parquet-scrooge/src/main/java/org/apache/parquet/scrooge/ParquetScroogeScheme.java), data can be read in the form of Scrooge objects, which are more Scala-friendly:
-
-    Scheme sourceScheme = new ParquetScroogeScheme(Name.class);
-    Tap source = new Hfs(sourceScheme, parquetInputPath);
-
-### Write Scrooge Records to Parquet (not supported yet)
-
-1.3 Tuples
-----------
-### Read Cascading Tuples from Parquet
-Currently, the support for reading tuples is mainly (but not only) for data written from Pig scripts as Pig tuples. More comprehensive support will be added, but in the meantime there are some limitations to note: nested structures are not supported, so data written as Thrift objects with nested structure cannot currently be read. *Data to read must be in a flat structure*. To read data as tuples, simply use [ParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTupleScheme.java):
-
-    Scheme sourceScheme = new ParquetTupleScheme(new Fields("last_name"));
-    Tap source = new Hfs(sourceScheme, parquetInputPath);
-
-### Write Cascading Tuples to Parquet (coming soon)
-
-For more examples, please refer to [TestParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTupleScheme.java).
-
-2. Projection Pushdown
-======================
-One of the big benefits of using a columnar format is the ability to read only a subset of columns when the full schema is huge. It saves time by not reading unused columns.
-
-Parquet supports projection pushdown for Thrift records and tuples.
-
-### 2.1 Projection Pushdown with Thrift/Scrooge Records
-To read only a subset of columns in a Thrift/Scrooge class, the columns of interest should be specified using a glob syntax.
-
-For example, imagine a Person struct defined as:
-
-    struct Person {
-      1: required string name
-      2: optional int16 age
-      3: optional Address primaryAddress
-      4: required map<string, Address> otherAddresses
-    }
-
-    struct Address {
-      1: required string street
-      2: required string zip
-      3: required PhoneNumber primaryPhone
-      4: required PhoneNumber secondaryPhone
-      5: required list<PhoneNumber> otherPhones
-    }
-
-    struct PhoneNumber {
-      1: required i32 areaCode
-      2: required i32 number
-      3: required bool doNotCall
-    }
-
-A column is specified as the path from the root of the schema down to the field of interest, separated by `.`, just as you would access the field
-in Java or Scala code. For example: `primaryAddress.primaryPhone.doNotCall`.
-This applies for repeated fields as well, for example `primaryAddress.otherPhones.number` selects all the `number`s from all the elements of `otherPhones`.
-Maps are a special case -- the map is split into two columns, the key and the value. All the columns in the key are required, but you can select a subset of the
-columns in the value (or skip the value entirely), for example: `otherAddresses.{key,value.street}` will select only the streets from the
-values of the map, but the entire key will be kept. To select an entire map, you can do: `otherAddresses.{key,value}`, 
-and to select only the keys: `otherAddresses.key`. Similar to map keys, the values in a set cannot be partially projected;
-you must select all the columns of the items in the set, or none of them. This is because materializing the set wouldn't make much sense if the item's
-hashcode is dependent on the dropped columns (as with the key of a map).
-
-When selecting a field that is a struct, for example `primaryAddress.primaryPhone`, 
-it will select the entire struct. So `primaryAddress.primaryPhone.*` is redundant.
-
-Columns can be specified concretely (like `primaryAddress.primaryPhone.doNotCall`), or a restricted glob syntax can be used.
-The glob syntax supports only wildcards (`*`) and glob expressions (`{}`).
-
-For example:
-
-  * `name` will select just the `name` from the Person
-  * `{name,age}` will select both the `name` and `age` from the Person
-  * `primaryAddress` will select the entire `primaryAddress` struct, including all of its children (recursively)
-  * `primaryAddress.*Phone` will select all of `primaryAddress.primaryPhone` and `primaryAddress.secondaryPhone`
-  * `primaryAddress.*Phone*` will select all of `primaryAddress.primaryPhone` and `primaryAddress.secondaryPhone` and `primaryAddress.otherPhones`
-  * `{name,age,primaryAddress.{*Phone,street}}` will select `name`, `age`, `primaryAddress.primaryPhone`, `primaryAddress.secondaryPhone`, and `primaryAddress.street`
-
-Multiple Patterns:
-Multiple glob expressions can be joined together, separated by `;`. E.g. `name;primaryAddress.street` will match only the Person's name and the street of the primary address.
-This is useful if you want to combine a list of patterns without making a giant `{}` group.
-
-Note: all possible glob patterns must match at least one column. For example, if you provide the glob: `a.b.{c,d,e}` but only columns `a.b.c` and `a.b.d` exist, an
-exception will be thrown.
-
-You can provide your projection globs to Parquet by setting `parquet.thrift.column.projection.globs` in the Hadoop config, or by using the methods in the
-scheme builder classes.
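-
-For example, via the Hadoop config (a minimal sketch, assuming a plain org.apache.hadoop.mapred.JobConf; the glob reuses the Person schema above):
-
-    JobConf conf = new JobConf();
-    // keep only the name and the street of the primary address
-    conf.set("parquet.thrift.column.projection.globs", "name;primaryAddress.street");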
-
-### 2.2 Projection Pushdown with Tuples
-When using ParquetTupleScheme, specifying projection pushdown is as simple as passing the fields of interest to the constructor of ParquetTupleScheme:
-
-    Scheme sourceScheme = new ParquetTupleScheme(new Fields("age"));
-
-3. Cascading 2.0 & Cascading 3.0
-================================
-Cascading 3.0 introduced a breaking interface change in the Scheme abstract class, which causes a breaking change in all scheme implementations.
-The parquet-cascading3 directory contains a separate library for use with Cascading 3.0.
-
-A significant part of the code remains identical; this shared part is in the parquet-cascading-common23 directory, which is not a Maven module.
-
-You cannot use both parquet-cascading and parquet-cascading3 in the same Classloader, which should be fine as you cannot use both cascading-core 2.x and cascading-core 3.x in the same Classloader either.
diff --git a/pom.xml b/pom.xml
index d2be611..ad36b71 100644
--- a/pom.xml
+++ b/pom.xml
@@ -77,8 +77,6 @@
     <japicmp.version>0.14.2</japicmp.version>
     <shade.prefix>shaded.parquet</shade.prefix>
     <hadoop.version>2.10.1</hadoop.version>
-    <cascading.version>2.7.1</cascading.version>
-    <cascading3.version>3.1.2</cascading3.version>
     <parquet.format.version>2.9.0</parquet.format.version>
     <previous.version>1.12.0</previous.version>
     <thrift.executable>thrift</thrift.executable>
@@ -461,7 +459,6 @@
             <exclude>**/*.parquet</exclude>
             <exclude>**/*.avro</exclude>
             <exclude>**/*.json</exclude>
-            <exclude>**/names.txt</exclude> <!-- parquet-cascading test data -->
             <exclude>**/*.avsc</exclude>
             <exclude>**/*.iml</exclude>
             <exclude>**/*.log</exclude>