Posted to dev@parquet.apache.org by ALeX Wang <ee...@gmail.com> on 2018/03/14 00:52:04 UTC

Question about my use case.

Hi,

I know this may not be the best place to ask, but I'd like to try anyway, as it is quite hard for me to find a good example of this online.

My use case:

I'd like to convert streaming data (using Scala) into Arrow format in a memory-mapped file, and then have my parquet-cpp program write it to disk as a Parquet file.

My understanding is that Java Parquet only implements an HDFS writer, which does not fit my use case (I'm not using Hadoop), and parquet-cpp is much more succinct.

My question:

Does my use case make sense, or is there a better way?

Thanks,
-- 
Alex Wang,
Open vSwitch developer

Re: Question about my use case.

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
You might consider using Avro with Java classes. That would reduce the
amount of code you need because it can use reflection to work with your
classes. We don’t recommend building to the object model APIs unless you
need tighter integration with an existing processing engine. Here’s an
example of how easy it is to write from Parquet’s tests:

    Schema schema = ReflectData.get().getSchema(Pojo.class);
    ParquetWriter<Pojo> writer = AvroParquetWriter.<Pojo>builder(path)
        .withSchema(schema)
        .withDataModel(ReflectData.get())
        .build();
    for (int i = 0; i < num; i++) {
      writer.write(records.get(i));
    }
    // closing the writer flushes the row group and writes the Parquet footer
    writer.close();

There’s also an example of the read side
<https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestReflectReadWrite.java#L47-L59>
in the tests. That’s probably easier to use and maintain.
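
For comparison, here is a minimal read-side sketch along the same lines (the linked test is the canonical version; Pojo, path, and ReflectData are the same pieces used in the write example above):

    try (ParquetReader<Pojo> reader = AvroParquetReader.<Pojo>builder(path)
        .withDataModel(ReflectData.get())
        .build()) {
      Pojo record;
      while ((record = reader.read()) != null) {
        // process each reflected record
      }
    }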

rb

On Thu, Apr 19, 2018 at 5:39 PM, ALeX Wang <ee...@gmail.com> wrote:

> Sorry for this long delayed reply,
>
> Finally have time to work on this again, and yes, after taking a closer
> study at parquet-hadoop source code, I'm able to simple write a customer
> ParquetWriter using java.io.FileOutputStream for my use case.
>
> We do not use Avro,  All the data is in flat java classes, and we want to
> directly write into parquet file at local filesystem,
>
> Thanks,
> Alex Wang,
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Question about my use case.

Posted by ALeX Wang <ee...@gmail.com>.
Sorry for this long delayed reply,

I finally have time to work on this again and, yes, after taking a closer look at the parquet-hadoop source code, I'm able to simply write a custom ParquetWriter using java.io.FileOutputStream for my use case.

We do not use Avro. All the data is in flat Java classes, and we want to write directly into a Parquet file on the local filesystem.
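
For illustration, here is a rough sketch of one way such a writer can target java.io.FileOutputStream: wrap it in Parquet's OutputFile/PositionOutputStream abstraction (available in parquet-mr 1.10+). The class name and details below are hypothetical; the actual custom ParquetWriter and the WriteSupport for the flat Java classes mentioned above are not shown in this thread.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.parquet.io.OutputFile;
    import org.apache.parquet.io.PositionOutputStream;

    // Minimal local OutputFile; a ParquetWriter.Builder wired to the project's
    // own WriteSupport could then be pointed at new LocalOutputFile(new File(...)).
    public class LocalOutputFile implements OutputFile {
      private final File file;

      public LocalOutputFile(File file) {
        this.file = file;
      }

      @Override
      public PositionOutputStream create(long blockSizeHint) throws IOException {
        final FileOutputStream out = new FileOutputStream(file);
        return new PositionOutputStream() {
          private long pos = 0;

          @Override
          public long getPos() {
            return pos;
          }

          @Override
          public void write(int b) throws IOException {
            out.write(b);
            pos++;
          }

          @Override
          public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            pos += len;
          }

          @Override
          public void close() throws IOException {
            out.close();
          }
        };
      }

      @Override
      public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
        return create(blockSizeHint);
      }

      @Override
      public boolean supportsBlockSize() {
        return false;
      }

      @Override
      public long defaultBlockSize() {
        return 0;
      }
    }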

Thanks,
Alex Wang,

Re: Question about my use case.

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Yeah, sounds like something went wrong. What is your data model? Parquet
can handle Avro records pretty seamlessly if you already have them.
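
For instance, a minimal sketch assuming the records are already Avro GenericRecords (the schema, output path, and records collection are placeholders):

    ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema) // the records' Avro schema
        .build();
    for (GenericRecord record : records) {
      writer.write(record);
    }
    writer.close();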

On Wed, Mar 14, 2018 at 9:20 AM, ALeX Wang <ee...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for the reply,
>
> We are using samza for streaming,
>
> Regarding parquet java, then i must have not used the APIs right,,, since
> last time we tried, we have 7 hadoop processes spawned for writing to a
> single file and it was much slower than our parquet c++ alternative,
>
> Thanks,
>
>
> On 14 March 2018 at 09:06, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Hi Alex,
>>
>> I don't think what you're trying to do makes sense. If you're using Scala,
>> then your data is already in the JVM and it is probably much easier to
>> write it to Parquet using the Java library. While that library depends on
>> Hadoop, you don't have to use it with HDFS. The Hadoop FileSystem
>> interface
>> can be used to write directly to local disk or a number of other stores,
>> like S3. Using the Java library would allow you to write the data
>> directly,
>> instead of translating to Arrow first.
>>
>> Since you want to use Scala, then the easiest way to get this support is
>> probably to write using Spark, which has most of what you need ready to
>> go.
>> If you're using a different streaming system you might not want both. What
>> are you using?
>>
>> rb
>>
>> On Tue, Mar 13, 2018 at 6:11 PM, ALeX Wang <ee...@gmail.com> wrote:
>>
>> > Also could i get a pointer to example that write parquet file from arrow
>> > memory buffer directly?
>> >
>> > The part i'm currently missing is how to derive the repetition level and
>> > definition level@@
>> >
>> > Thanks,
>> >
>> > On 13 March 2018 at 17:52, ALeX Wang <ee...@gmail.com> wrote:
>> >
>> > > hi,
>> > >
>> > > i know it is may not be the best place to ask but would like to try
>> > > anyways, as it is quite hard for me to find good example of this
>> online.
>> > >
>> > > My usecase:
>> > >
>> > > i'd like to generate from streaming data (using Scala) into arrow
>> format
>> > > in memory mapped file and then have my parquet-cpp program writing it
>> as
>> > > parquet file to disk.
>> > >
>> > > my understanding is that java parquet only implements HDFS writer,
>> which
>> > > is not my use case (not using hadoop) and parquet-cpp is much more
>> > > succinct.
>> > >
>> > > My question:
>> > >
>> > > does my usecase make sense? or if there is better way?
>> > >
>> > > Thanks,
>> > > --
>> > > Alex Wang,
>> > > Open vSwitch developer
>> > >
>> >
>> >
>> >
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Question about my use case.

Posted by ALeX Wang <ee...@gmail.com>.
Hi Ryan,

Thanks for the reply,

We are using Samza for streaming.

Regarding Parquet Java, I must not have used the APIs right, since the last
time we tried, we had 7 Hadoop processes spawned for writing to a single file
and it was much slower than our parquet-cpp alternative.

Thanks,


On 14 March 2018 at 09:06, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Alex,
>
> I don't think what you're trying to do makes sense. If you're using Scala,
> then your data is already in the JVM and it is probably much easier to
> write it to Parquet using the Java library. While that library depends on
> Hadoop, you don't have to use it with HDFS. The Hadoop FileSystem interface
> can be used to write directly to local disk or a number of other stores,
> like S3. Using the Java library would allow you to write the data directly,
> instead of translating to Arrow first.
>
> Since you want to use Scala, then the easiest way to get this support is
> probably to write using Spark, which has most of what you need ready to go.
> If you're using a different streaming system you might not want both. What
> are you using?
>
> rb
>
> On Tue, Mar 13, 2018 at 6:11 PM, ALeX Wang <ee...@gmail.com> wrote:
>
> > Also could i get a pointer to example that write parquet file from arrow
> > memory buffer directly?
> >
> > The part i'm currently missing is how to derive the repetition level and
> > definition level@@
> >
> > Thanks,
> >
> > On 13 March 2018 at 17:52, ALeX Wang <ee...@gmail.com> wrote:
> >
> > > hi,
> > >
> > > i know it is may not be the best place to ask but would like to try
> > > anyways, as it is quite hard for me to find good example of this
> online.
> > >
> > > My usecase:
> > >
> > > i'd like to generate from streaming data (using Scala) into arrow
> format
> > > in memory mapped file and then have my parquet-cpp program writing it
> as
> > > parquet file to disk.
> > >
> > > my understanding is that java parquet only implements HDFS writer,
> which
> > > is not my use case (not using hadoop) and parquet-cpp is much more
> > > succinct.
> > >
> > > My question:
> > >
> > > does my usecase make sense? or if there is better way?
> > >
> > > Thanks,
> > > --
> > > Alex Wang,
> > > Open vSwitch developer
> > >
> >
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Alex Wang,
Open vSwitch developer

Re: Question about my use case.

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Alex,

I don't think what you're trying to do makes sense. If you're using Scala,
then your data is already in the JVM and it is probably much easier to
write it to Parquet using the Java library. While that library depends on
Hadoop, you don't have to use it with HDFS. The Hadoop FileSystem interface
can be used to write directly to local disk or a number of other stores,
like S3. Using the Java library would allow you to write the data directly,
instead of translating to Arrow first.
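
To make that concrete, here is a rough sketch of a local-filesystem write through the Hadoop Path layer (Pojo, its Avro reflection schema, and the output location are placeholders, borrowed from the reflection example that appears elsewhere in this thread):

    // file:// resolves to the local filesystem through Hadoop's FileSystem
    // abstraction, so no HDFS cluster is involved.
    Path localPath = new Path("file:///tmp/records.parquet");
    Schema schema = ReflectData.get().getSchema(Pojo.class);
    ParquetWriter<Pojo> writer = AvroParquetWriter.<Pojo>builder(localPath)
        .withSchema(schema)
        .withDataModel(ReflectData.get())
        .build();
    // ... write records, then close() to finalize the footer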

Since you want to use Scala, the easiest way to get this support is probably
to write using Spark, which has most of what you need ready to go. If you're
using a different streaming system, you might not want to add Spark on top of
it. What are you using?

rb

On Tue, Mar 13, 2018 at 6:11 PM, ALeX Wang <ee...@gmail.com> wrote:

> Also could i get a pointer to example that write parquet file from arrow
> memory buffer directly?
>
> The part i'm currently missing is how to derive the repetition level and
> definition level@@
>
> Thanks,
>
> On 13 March 2018 at 17:52, ALeX Wang <ee...@gmail.com> wrote:
>
> > hi,
> >
> > i know it is may not be the best place to ask but would like to try
> > anyways, as it is quite hard for me to find good example of this online.
> >
> > My usecase:
> >
> > i'd like to generate from streaming data (using Scala) into arrow format
> > in memory mapped file and then have my parquet-cpp program writing it as
> > parquet file to disk.
> >
> > my understanding is that java parquet only implements HDFS writer, which
> > is not my use case (not using hadoop) and parquet-cpp is much more
> > succinct.
> >
> > My question:
> >
> > does my usecase make sense? or if there is better way?
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
> >
>
>
>
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Question about my use case.

Posted by ALeX Wang <ee...@gmail.com>.
Also, could I get a pointer to an example that writes a Parquet file from an
Arrow memory buffer directly?

The part I'm currently missing is how to derive the repetition levels and
definition levels.

Thanks,

On 13 March 2018 at 17:52, ALeX Wang <ee...@gmail.com> wrote:

> hi,
>
> i know it is may not be the best place to ask but would like to try
> anyways, as it is quite hard for me to find good example of this online.
>
> My usecase:
>
> i'd like to generate from streaming data (using Scala) into arrow format
> in memory mapped file and then have my parquet-cpp program writing it as
> parquet file to disk.
>
> my understanding is that java parquet only implements HDFS writer, which
> is not my use case (not using hadoop) and parquet-cpp is much more
> succinct.
>
> My question:
>
> does my usecase make sense? or if there is better way?
>
> Thanks,
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer