Posted to dev@crail.apache.org by Julian Hyde <jh...@apache.org> on 2018/09/04 19:59:07 UTC

Crail, Albis and Arrow

I just read the blog post [1] about Crail and file formats. (I have to declare my interests up front: I have been a huge supporter of Apache Arrow, and I am a PMC member. I’m speaking here as an Arrow contributor and enthusiast, not as a mentor of Crail.)

I am a bit troubled about the endorsement of Albis in a Crail blog post. For example, "we have developed a new file format called Albis". Since the blog post is not signed, I take it that "we" means the authors of the paper [2] mentioned in the blog post. But I hope that "we" does not mean "we as Crail committers and PMC members".

I know that there are different forces at play if you work for a corporation, or are a researcher, or are an idealistic open source contributor. As a researcher, you need to invent new stuff and prove that it is better than everything that has been done before.

But I’ve been through the file format wars (ORC vs Parquet), driven in large part by two competing vendors. It was sickening, and a huge waste of effort. Please, please don’t let this happen again. If you want to make Crail successful, you should make it absolutely clear to the Arrow, ORC and Parquet communities that you will help to make Crail work as well as it possibly can.

Also, on paper Albis looks very similar to Arrow, and the performance gap is fairly narrow. If you have found insights that would improve Arrow, I encourage you to share them and make Arrow better. It may be good research practice to accentuate the differences between the two, but it’s good open source practice to find consensus between technologies, and merge communities. There is a lot of work to be done, and too few people to do it.

Lastly, I know I seem to be giving mixed messages here. I do believe that content about Crail will help drive engagement and build community (controversial content even more so). I am delighted that the Crail team is writing blog posts and posting them to Twitter. But be careful not to alienate communities that could help Crail gain widespread adoption.

Julian

[1] http://crail.incubator.apache.org/blog/2018/08/sql-p1.html

[2] https://www.usenix.org/conference/atc18/presentation/trivedi

Re: Crail, Albis and Arrow

Posted by Animesh Trivedi <an...@gmail.com>.

Hi Wes - absolutely! I will start a discussion on the Arrow mailing list.

Cheers,
--
Animesh

On Fri, Sep 7, 2018 at 4:57 PM Wes McKinney <we...@gmail.com> wrote:

> On Fri, Sep 7, 2018 at 8:03 AM Animesh Trivedi
> <an...@gmail.com> wrote:
> >
> > To all,
> >
> > the blog is updated to:
> > 1) point out that this blog is from a user of the Crail project, not an
> > endorsement from the Crail project.
> > 2) clarify that Arrow is not a storage format but an IPC format. And we
> > have evaluated the performance of the Java libraries on HDFS, which has
> > headroom for further performance optimizations.
> >
> > Please have a look and let me know if some further clarification is
> > needed.
> >
> > Coming back to Wes's comments (hi again!) : how should we proceed if I
> > want to benchmark Arrow's performance on Crail/HDFS in Java? I would be
> > happy to have your inputs in this process, collaborating on this
> > investigation and write our findings as a follow-up crail blog about
> > "Arrow on Crail delivering 100 Gbps"? I suppose in the process I/we will
> > look closely at I/O paths of Arrow/Java libraries. Does this sound like
> > an interesting line of work to you?
>
> Yes -- since I'm not a Java developer the best place to discuss this
> further would be on the dev@arrow mailing list.
>
> Thanks!
> Wes

Re: Crail, Albis and Arrow

Posted by Wes McKinney <we...@gmail.com>.

On Fri, Sep 7, 2018 at 8:03 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> To all,
>
> the blog is updated to:
> 1) point out that this blog is from a user of the Crail project, not an
> endorsement from the Crail project.
> 2) clarify that Arrow is not a storage format but an IPC format. And we
> have evaluated the performance of the Java libraries on HDFS, which has
> headroom for further performance optimizations.
>
> Please have a look and let me know if some further clarification is needed.
>
> Coming back to Wes's comments (hi again!) : how should we proceed if I want
> to benchmark Arrow's performance on Crail/HDFS in Java? I would be happy to
> have your inputs in this process, collaborating on this investigation and
> write our findings as a follow-up crail blog about "Arrow on Crail
> delivering 100 Gbps"? I suppose in the process I/we will look closely at
> I/O paths of Arrow/Java libraries. Does this sound like an interesting line
> of work to you?

Yes -- since I'm not a Java developer the best place to discuss this
further would be on the dev@arrow mailing list.

Thanks!
Wes

>
> Cheers,
> --
> Animesh

Re: Crail, Albis and Arrow

Posted by Animesh Trivedi <an...@gmail.com>.

To all,

the blog is updated to:
1) point out that this blog is from a user of the Crail project, not an
endorsement from the Crail project.
2) clarify that Arrow is not a storage format but an IPC format. And we
have evaluated the performance of the Java libraries on HDFS, which has
headroom for further performance optimizations.

Please have a look and let me know if some further clarification is needed.

Coming back to Wes's comments (hi again!): how should we proceed if I want
to benchmark Arrow's performance on Crail/HDFS in Java? I would be happy to
have your input in this process, to collaborate on this investigation, and
to write up our findings as a follow-up Crail blog post about "Arrow on
Crail delivering 100 Gbps". I suppose in the process I/we will look closely
at the I/O paths of the Arrow/Java libraries. Does this sound like an
interesting line of work to you?

Cheers,
--
Animesh

On Thu, Sep 6, 2018 at 5:30 PM Wes McKinney <we...@gmail.com> wrote:

> hi Animesh,
>
> On Thu, Sep 6, 2018 at 12:23 AM Animesh Trivedi
> <an...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > Nice to connect to you too. We are happy to have your input on Albis and
> > Arrow. Specifically:
> >
> > - We understand that Arrow is not a file format, but we chose to evaluate
> > it in a mix with storage formats as Arrow is designed for in-memory
> > columnar storage. The "in-memory" aspect of it is closer to flash/NVMe
> > than disks in terms of performance. And personally I was curious to try
> > out Arrow :) We coded a simple benchmark (how fast one can materialize
> > values) because anything more complicated like relational queries would
> > bring complexity from the underlying SQL engine.
>
> Right, but what you did in your benchmarks was neither in-memory or
> memory-mapping IIUC -- you are accessing the memory through
> synchronous Hadoop protobuf RPCs which deeply conflates the results
> (even if the HDFS nodes are running atop NVMe). Additionally, the
> Arrow Java library does not even yet support memory mapping (we do in
> C++), so the only way to fairly evaluate that code right now is to run
> on RAM-resident data.
>
> - Wes
>
> >
> > - Yes, I will make it clear that the performance of Arrow that is
> > evaluated in the blog is for the less beaten on-heap Java path.
> >
> > Now coming to the interesting bit. Arrow storage performance tuning (HDFS
> > or Crail) that I can help to investigate. This is a good starting point.
> > I will update you all on the Crail and Arrow mailing lists. Beyond
> > performance, the multi-file storage model is where I am most interested.
> > It will help us to explore how different file types (column groups,
> > metadata) can be mapped to different storage (NVMe, DRAM, 3DXP) types
> > that Crail supports. I think this is an interesting avenue to explore.
> >
> > Wes and Julian - thanks for the discussion.
> >
> > Cheers,
> > --
> > Animesh
> >
> > On Wed, Sep 5, 2018 at 8:57 PM Julian Hyde <jh...@apache.org> wrote:
> >
> > > Animesh,
> > >
> > > Thanks for your thoughtful response.
> > >
> > > I think we’re now on the same page about the opportunities for
> > > collaboration. And I saw that Wes posted to this thread too. I hope you
> > > find ways to make Arrow and Crail work well together.
> > >
> > > Julian
> > >
> > >
> > > > On Sep 5, 2018, at 3:49 AM, Animesh Trivedi <an...@gmail.com>
> > > > wrote:
> > > >
> > > > Hi Julian,
> > > >
> > > > Thanks for posting your thoughts.
> > > >
> > > > [As a Crail committer]: We agree that the notion of "we" creates
> > > > confusion. The Crail blog follows the trend in community projects,
> > > > where a blog post falls into one of two categories. In the first
> > > > type, a developer talks about recent improvements, features,
> > > > performance evaluation, etc. In the second type, "a user" presents
> > > > how they used the system for their use case. The Albis blog post
> > > > falls into the second category. We can (and should, for future
> > > > reference) definitely categorize and mark posts clearly that way.
> > > > And we would encourage anyone in the community who tries Crail to
> > > > reach out to us to present their story on the Crail blog. Crail is
> > > > committed to providing the best possible performance to all its
> > > > users, be it Albis, Arrow, ORC, or Parquet.
> > > >
> > > > [As a developer of Albis and user of Crail]: I understand your
> > > > sentiment regarding the format wars, and it is not the aim of Albis
> > > > to establish yet another file format. Albis started as a prototype
> > > > to quickly "explore" various design choices for storing relational
> > > > data for a variety of scenarios with high-performance
> > > > storage/networking devices - the kind of devices Crail targets. This
> > > > is something that I cannot easily do with Arrow, ORC, or Parquet
> > > > with HDFS (or something similar) within a reasonable effort and
> > > > time-frame, as they all have already chosen certain design points
> > > > and trade-offs. Crail and Albis are not tied to each other (nor
> > > > preferred over other choices), though since they come from the same
> > > > set of developers, I can see why the confusion arises. Having said
> > > > this, I will be happy to contribute the findings from Albis back to
> > > > the Arrow community, and would appreciate any help with that. I had
> > > > a brief discussion with Julien Le Dem at the last DataWorks Summit
> > > > in San Jose about Albis as well. I have not done a thorough
> > > > investigation of Arrow over Crail, but perhaps that is something
> > > > that can be picked up now as a starting point.
> > > >
> > > > I hope this clarifies the confusion. We will fix the blog post.
> > > >
> > > > Thanks,
> > > > --
> > > > Animesh

Re: Crail, Albis and Arrow

Posted by Wes McKinney <we...@gmail.com>.

hi Animesh,

On Thu, Sep 6, 2018 at 12:23 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> Hi Wes,
>
> Nice to connect to you too. We are happy to have your input on Albis and
> Arrow. Specifically:
>
> - We understand that Arrow is not a file format, but we chose to evaluate
> it in a mix with storage formats as Arrow is designed for in-memory
> columnar storage. The "in-memory" aspect of it is closer to flash/NVMe than
> disks in terms of performance. And personally I was curious to try out
> Arrow :) We coded a simple benchmark (how fast one can materialize values)
> because anything more complicated like relational queries would bring
> complexity from the underlying SQL engine.

Right, but what you did in your benchmarks was neither in-memory nor
memory-mapped, IIUC -- you are accessing the memory through
synchronous Hadoop protobuf RPCs, which deeply confounds the results
(even if the HDFS nodes are running atop NVMe). Additionally, the
Arrow Java library does not even support memory mapping yet (we do in
C++), so the only way to fairly evaluate that code right now is to run
on RAM-resident data.

- Wes

>
> - Yes, I will make it clear that the performance of Arrow that is evaluated
> in the blog is for the less beaten on-heap Java path.
>
> Now coming to the interesting bit: Arrow storage performance tuning (on HDFS
> or Crail) is something that I can help to investigate. This is a good starting
> point. I will update you all on the Crail and Arrow mailing lists. Beyond
> performance, the multi-file storage model is what I am most interested in. It
> will help us to explore how different file types (column groups, metadata)
> can be mapped to the different storage types (NVMe, DRAM, 3DXP) that Crail
> supports. I think this is an interesting avenue to explore.
>
> Wes and Julian - thanks for the discussion.
>
> Cheers,
> --
> Animesh
>

Re: Crail, Albis and Arrow

Posted by Animesh Trivedi <an...@gmail.com>.

Hi Wes,

Nice to connect with you too. We are happy to have your input on Albis and
Arrow. Specifically:

- We understand that Arrow is not a file format, but we chose to evaluate
it in a mix with storage formats as Arrow is designed for in-memory
columnar storage. The "in-memory" aspect of it is closer to flash/NVMe than
disks in terms of performance. And personally I was curious to try out
Arrow :) We coded a simple benchmark (how fast one can materialize values)
because anything more complicated like relational queries would bring
complexity from the underlying SQL engine.

- Yes, I will make it clear that the performance of Arrow that is evaluated
in the blog is for the less beaten on-heap Java path.

Now coming to the interesting bit: Arrow storage performance tuning (on HDFS
or Crail) is something that I can help to investigate. This is a good starting
point. I will update you all on the Crail and Arrow mailing lists. Beyond
performance, the multi-file storage model is what I am most interested in. It
will help us to explore how different file types (column groups, metadata)
can be mapped to the different storage types (NVMe, DRAM, 3DXP) that Crail
supports. I think this is an interesting avenue to explore.

Wes and Julian - thanks for the discussion.

Cheers,
--
Animesh

On Wed, Sep 5, 2018 at 8:57 PM Julian Hyde <jh...@apache.org> wrote:

> Animesh,
>
> Thanks for your thoughtful response.
>
> I think we’re now on the same page about the opportunities for
> collaboration. And I saw that Wes posted to this thread too. I hope you
> find ways to make Arrow and Crail work well together.
>
> Julian
>
>

Re: Crail, Albis and Arrow

Posted by Julian Hyde <jh...@apache.org>.

Animesh,

Thanks for your thoughtful response.

I think we’re now on the same page about the opportunities for collaboration. And I saw that Wes posted to this thread too. I hope you find ways to make Arrow and Crail work well together.

Julian


> On Sep 5, 2018, at 3:49 AM, Animesh Trivedi <an...@gmail.com> wrote:
> 
> Hi Julian,
> 
> Thanks for posting your thoughts.
> 
> [As a Crail committer]: We agree that the notion of "we" creates confusion.
> The Crail blog follows the trend in community projects, where a blogpost
> falls in one of the two categories. The first type where a developer talks
> about recent improvements, features, performance evaluation, etc. The
> second type is where "a user" presents how they used the system for their
> use-case. The Albis blog post falls into the second category. We can (and
> should for future references) definitely categorize and mark it clear that
> way. And we would encourage the community, whoever tries Crail please reach
> out to us to present your story on the Crail blog. Crail is committed to
> provide the best possible performance to all its users, be it Albis, Arrow,
> ORC, or Parquet.
> 
> [As a developer of Albis and user of Crail]: I understand your sentiment
> regarding the format wars, and it is not the aim of Albis to establish yet
> another file format. Albis started as a prototype to quickly "explore"
> various design choices for storing relational data for a variety of
> scenarios with high-performance storage/networking devices - the kind of
> devices Crail targets. This is something that I cannot easily do with
> Arrow, ORC, or Parquet with HDFS (or something similar) within a reasonable
> effort and time-frame as they all have already chosen certain design points
> and trade-offs. Crail and Albis are not tied (or are preferred over other
> choices) to each other, though since it is coming from a same set of
> developers, I can see why the confusion arises. Having said this, I will be
> happy to contribute back to the Arrow community about the findings from
> Albis, and would appreciate any help with that. I had a brief discussion
> with Julien Le Dem at last DataWorks summit in San Jose about Albis as
> well. I have not done a thorough investigation of Arrow over Crail, but
> perhaps that is something that can be picked up now as a starting point.
> 
> I hope this clarifies the confusion. We will fix the blog post.
> 
> Thanks,
> --
> Animesh
> 

Re: Crail, Albis and Arrow

Posted by Wes McKinney <we...@gmail.com>.

hi Animesh,

(Also wearing my Arrow PMC hat)

I learned about the Crail incubator project yesterday from the blog
post on Twitter -- it's a pleasure to be connected. I look forward to
being able to collaborate with you all on these topics.

Some particular comments re: Arrow that we can discuss more:

* It is a bit misleading to describe Arrow as a file format. You can
use the Arrow IPC protocol to create a file format, but if you do, I
think a lot more detail has to be provided with regards to your
methodology or people may be confused.

* The blog post should make it more clear that you are comparing
_implementations of Java libraries_. Even two implementations of
Parquet in Java can have radically different performance
characteristics depending on how they are used. I am not sure, for
example, that the Arrow Java library is especially optimized for your
use case in the way you are using it ([1]). The Java library has been
focused on interactions with memory that is already in Netty off-heap
memory, so if you venture outside of that in Java, my guess is you're
seeing code paths that have been optimized little.

* Getting the best performance out of a library on top of HDFS can
require some tweaking. As far as Arrow is concerned it would be useful
to research together the best way to store and access data stored as
Arrow IPC messages on HDFS. While we have people actively using
Parquet via Arrow on HDFS in C++, the HDFS aspect has seen little
performance tuning so far.

In any case, it would be great to dig in and see where we have common
ground in our use cases and where we could collaborate. At the lowest
binary representation level, Arrow and Albis do not seem very
different (we don't define a multi-file storage model yet, because
Arrow is not a file format, though we could create a storage scheme
that uses the Arrow memory format), but I would like to understand the
details better.

cheers,
Wes

[1]: https://github.com/zrlio/fileformat-benchmarks/blob/master/src/main/scala/com/github/animeshtrivedi/FileBench/rtests/ArrowReadTest.scala#L28

On Wed, Sep 5, 2018 at 6:49 AM Animesh Trivedi
<an...@gmail.com> wrote:
>
> Hi Julian,
>
> Thanks for posting your thoughts.
>
> [As a Crail committer]: We agree that the notion of "we" creates confusion.
> The Crail blog follows the trend in community projects, where a blogpost
> falls in one of the two categories. The first type where a developer talks
> about recent improvements, features, performance evaluation, etc. The
> second type is where "a user" presents how they used the system for their
> use-case. The Albis blog post falls into the second category. We can (and
> should for future references) definitely categorize and mark it clear that
> way. And we would encourage the community, whoever tries Crail please reach
> out to us to present your story on the Crail blog. Crail is committed to
> provide the best possible performance to all its users, be it Albis, Arrow,
> ORC, or Parquet.
>
> [As a developer of Albis and user of Crail]: I understand your sentiment
> regarding the format wars, and it is not the aim of Albis to establish yet
> another file format. Albis started as a prototype to quickly "explore"
> various design choices for storing relational data for a variety of
> scenarios with high-performance storage/networking devices - the kind of
> devices Crail targets. This is something that I cannot easily do with
> Arrow, ORC, or Parquet with HDFS (or something similar) within a reasonable
> effort and time-frame as they all have already chosen certain design points
> and trade-offs. Crail and Albis are not tied (or are preferred over other
> choices) to each other, though since it is coming from a same set of
> developers, I can see why the confusion arises. Having said this, I will be
> happy to contribute back to the Arrow community about the findings from
> Albis, and would appreciate any help with that. I had a brief discussion
> with Julien Le Dem at last DataWorks summit in San Jose about Albis as
> well. I have not done a thorough investigation of Arrow over Crail, but
> perhaps that is something that can be picked up now as a starting point.
>
> I hope this clarifies the confusion. We will fix the blog post.
>
> Thanks,
> --
> Animesh
>

Re: Crail, Albis and Arrow

Posted by Animesh Trivedi <an...@gmail.com>.

Hi Julian,

Thanks for posting your thoughts.

[As a Crail committer]: We agree that the notion of "we" creates confusion.
The Crail blog follows the trend in community projects, where a blog post
falls into one of two categories. The first type is where a developer talks
about recent improvements, features, performance evaluation, etc. The
second type is where "a user" presents how they used the system for their
use case. The Albis blog post falls into the second category. We can (and
should, for future reference) categorize posts and clearly mark them that
way. We would also encourage anyone in the community who tries Crail to
reach out to us and present their story on the Crail blog. Crail is
committed to providing the best possible performance to all its users, be
it Albis, Arrow, ORC, or Parquet.

[As a developer of Albis and user of Crail]: I understand your sentiment
regarding the format wars, and it is not the aim of Albis to establish yet
another file format. Albis started as a prototype to quickly "explore"
various design choices for storing relational data for a variety of
scenarios with high-performance storage/networking devices - the kind of
devices Crail targets. This is something that I cannot easily do with
Arrow, ORC, or Parquet with HDFS (or something similar) within a reasonable
effort and time-frame as they all have already chosen certain design points
and trade-offs. Crail and Albis are not tied to each other (nor is either
preferred over other choices), though since they come from the same set of
developers, I can see why the confusion arises. Having said this, I will be
happy to contribute the findings from Albis back to the Arrow community,
and would appreciate any help with that. I had a brief discussion with
Julien Le Dem at the last DataWorks Summit in San Jose about Albis as
well. I have not done a thorough investigation of Arrow over Crail, but
perhaps that is something that can be picked up now as a starting point.

I hope this clarifies the confusion. We will fix the blog post.

Thanks,
--
Animesh

On Tue, Sep 4, 2018 at 9:59 PM Julian Hyde <jh...@apache.org> wrote:

> I just read the blog post [1] about Crail and file formats. (I have to
> declare my interests up front: I have been a huge supporter of Apache
> Arrow, and I am a PMC member. I’m speaking here as an Arrow contributor and
> enthusiast, not as a mentor of Crail.)
>
> I am a bit troubled about the endorsement of Albis in a Crail blog post.
> For example, "we have developed a new file format called Albis”. Since the
> blog post is not signed, I take it that “We” means the authors of the paper
> [2] mentioned in the blog post. But I hope that “we” does not mean “we as
> Crail committers and PMC members".
>
> I know that there are different forces at play if you work for a
> corporation, or are a researcher, or are an idealistic open source. As a
> researcher, you need to invent new stuff and prove that it is better than
> everything that has been done before.
>
> But I’ve been through the file format wars — ORC vs Parquet — driven in
> large part by two competing vendors. It was sickening, and a huge waste of
> effort. Please, please don’t let this happen again. If you want to make
> Crail successful, you should make it absolutely clear to the Arrow, ORC and
> Parquet communities that you will help to make Crail work as well as it
> possibly can
>
> Also, on paper Albis looks very similar to Arrow, and the performance gap
> is fairly narrow. If you have found insights that would improve Arrow, I
> encourage you to share them and make Arrow better. It may be good research
> practice to accentuate the differences between the two, but it’s good open
> source practice to find consensus between technologies, and merge
> communities. There is a lot of work to be done, and too few people to do it.
>
> Lastly, I know I seem to be giving mixed messages here. I do believe that
> content about Crail will help drive engagement and build community
> (controversial content even more so). I am delighted that the Crail team is
> writing blog posts and posting them to Twitter. But be careful not to
> alienate communities that could help Crail gain widespread adoption.
>
> Julian
>
> [1] http://crail.incubator.apache.org/blog/2018/08/sql-p1.html <
> http://crail.incubator.apache.org/blog/2018/08/sql-p1.html>
>
> [2] https://www.usenix.org/conference/atc18/presentation/trivedi <
> https://www.usenix.org/conference/atc18/presentation/trivedi>