Posted to user@beam.apache.org by Alexey Romanenko <ar...@gmail.com> on 2020/12/01 11:44:45 UTC

Re: Quick question regarding production readiness of ParquetIO

ParquetIO has been in Beam since the 2.5.0 release, so it can be considered quite stable and mature. I'm not aware of any open major issues, and you can check its performance here [1][2].

On the other hand, you are right: it's annotated with @Experimental, like many other Beam Java IOs and components, and that confuses people. There is a long history behind this in Beam, and we have had several related discussions (the latest one is [3]) on how to reduce the number of these "experimental" annotations.

[1] http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?panelId=16&fullscreen&orgId=1
[2] http://metrics.beam.apache.org/d/bnlHKP3Wz/java-io-it-tests-dataflow?panelId=17&fullscreen&orgId=1
[3] https://lists.apache.org/thread.html/0f769736be1cf2fc5227f7a25dd3fdbb9296afe8a071761cb91f588a%40%3Cdev.beam.apache.org%3E

> On 30 Nov 2020, at 22:13, Tao Li <ta...@zillow.com> wrote:
> 
> Hi Beam community,
>  
> According to this link, ParquetIO is still considered experimental: https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
>  
> Does it mean it’s not yet ready for prod usage? If that’s the case, when will it be ready?
>  
> Also, are there any known performance/scalability/reliability issues with ParquetIO?
>  
> Thanks a lot!

Re: Quick question regarding production readiness of ParquetIO

Posted by Kobe Feng <fl...@gmail.com>.

Tao, my experience with ParquetIO has been good (versions 2.11, 2.18, and 2.21).
We mainly use it for a Hadoop sink that converts Avro records to Parquet.
We checked for data loss and data quality and found no problems, and we have
seen no performance issues.

Here is one code snippet. We have our own ParquetIO so that we can remove the
partition fields from the record when users require it: Hive/Spark partitioned
tables already carry those values in the HDFS path and use them for scan
filtering.

// DarwinParquetIO is our own variant of ParquetIO (see above); it is not part of Beam.
def toHadoop(
    basePath: String,
    recordPartition: RecordPartition,
    fileNaming: FileNaming,
    shardNum: Int,
    includePartitionFields: Boolean = false): Unit = {
  val baseDir = HadoopClient.resolve(basePath, env)
  pCollection.apply(
    "darwin.write.hadoop.parquet." + postfix,
    FileIO.writeDynamic[String, GenericRecord]()
      // Each record's destination is the String returned by partitionFunc.
      .by(recordPartition.partitionFunc)
      .withDestinationCoder(StringUtf8Coder.of())
      .via(DarwinParquetIO.sink(
        recordPartition.getOutputSchema(avroSchema, includePartitionFields),
        recordPartition.getPartitionFields(),
        includePartitionFields))
      .to(baseDir)
      .withCompression(Compression.LZO)
      .withNaming((partitionFolder: String) =>
        relativeFileNaming(
          StaticValueProvider.of[String](baseDir + Path.SEPARATOR + partitionFolder),
          fileNaming))
      .withNumShards(shardNum))
}
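
The internals of DarwinParquetIO are not shown in this thread, so the following is only a minimal, dependency-free sketch of the idea described above: derive a Hive-style partition folder from a record, then drop the partition fields before writing. It models records as plain Scala Maps instead of Avro GenericRecords, and the names PartitionSketch, partitionFolder, and outputRecord are hypothetical, not part of Beam or of the code above.

```scala
// Hypothetical sketch: records are plain Maps, not Avro GenericRecords.
object PartitionSketch {
  type Record = Map[String, String]

  // Build a Hive-style relative folder such as "dt=2020-12-01/country=US".
  // Throws NoSuchElementException if a partition field is missing from the record.
  def partitionFolder(record: Record, partitionFields: Seq[String]): String =
    partitionFields.map(field => s"$field=${record(field)}").mkString("/")

  // Drop the partition fields from the record unless the caller asks to keep them,
  // since their values are already encoded in the folder path.
  def outputRecord(
      record: Record,
      partitionFields: Seq[String],
      includePartitionFields: Boolean = false): Record =
    if (includePartitionFields) record
    else record.filter { case (key, _) => !partitionFields.contains(key) }
}
```

For a record Map("id" -> "42", "dt" -> "2020-12-01", "country" -> "US") with partition fields Seq("dt", "country"), partitionFolder returns "dt=2020-12-01/country=US" and outputRecord returns Map("id" -> "42"). In a real pipeline the same split would presumably be applied to the Avro schema as well as to each record, which is what the getOutputSchema call in the snippet above suggests.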



-- 
Yours Sincerely
Kobe Feng