Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2017/10/21 17:58:26 UTC

[DISCUSS] Updating Arrow's "elevator pitch" on web properties

I believe we would benefit from modified language to describe the
nature and scope of the Arrow project.

Currently, our GitHub project description (and what we use in release
announcements) states:

"Apache Arrow is a columnar in-memory analytics layer designed to
accelerate big data. It houses a set of canonical in-memory
representations of flat and hierarchical data along with multiple
language-bindings for structure manipulation. It also provides IPC and
common algorithm implementations."

I think this could be perhaps restated in the following way:

"Apache Arrow is a cross-language development platform for in-memory
structured data access and analytics. It specifies a standardized
language-independent columnar memory format for flat and hierarchical
data, with support for zero-copy streaming messaging and interprocess
communication. It also provides computational libraries for efficient
in-memory analytics on modern hardware."

It is true that we have been mostly focused on hardening the details
of the Arrow format and related issues around messaging and IPC, which
are necessary for everything else we may contemplate building in the
future. Since I plan to be building a library of computational tools
in C++ for the native code community (Python, Ruby, R, etc.), I think
it would be a good idea to clearly state that building general purpose
analytics implementations (i.e. the sorts of things you find in "data
frame libraries" like pandas) is part of the mission of the project.

Feedback on the above, and on how we could do a better job representing
our past, present, and future community goals, would be appreciated.

Thanks
Wes

Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Julian Hyde <jh...@apache.org>.

Looks great.

The FAQ is the perfect place to expand on these points (and add nuance) without diluting the message.

Column-oriented is good for on-disk data, but bus-efficient structures also need to be organized to fit within cache lines, need to have constant offset multiples to allow using indexed addressing instructions, and probably other features I don’t know about in order to allow SIMD instructions. In other words, we are selling ourselves short if we just say we are column-oriented. Plus, "column-oriented" is rather old hat these days (I first heard the term in ’95).
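
To make the "constant offset multiples" point concrete, here is a minimal sketch using only Python's standard library. It is not Arrow's format or API; the names `rows`, `price_column`, and `value_at` are hypothetical and exist only for illustration. It shows that in a fixed-width column, element i can be found at byte offset i * width, which is the property that indexed addressing relies on.

```python
import struct
from array import array

# Hypothetical records (id, price), used only for illustration.
rows = [(1, 9.5), (2, 12.0), (3, 7.25)]

# Column-oriented: each field lives in its own contiguous, fixed-width buffer.
id_column = array("q", (r[0] for r in rows))      # 8-byte signed integers
price_column = array("d", (r[1] for r in rows))   # 8-byte floats


def value_at(buffer: bytes, index: int, fmt: str = "=d") -> float:
    """Read element `index` with a constant stride: offset = index * width."""
    width = struct.calcsize(fmt)
    return struct.unpack_from(fmt, buffer, index * width)[0]


# Raw bytes of the price column, in native byte order.
price_bytes = price_column.tobytes()
second_price = value_at(price_bytes, 1)

# A scan over one column touches only that column's memory.
total = sum(price_column)
```

A row-oriented layout would interleave `id` and `price`, so a scan over prices would also pull the ids through the cache.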

By the way, it would be interesting to consider what other constraints are imposed by NVRAM and GPU instructions. Maybe Arrow can address these needs, or be extended to address them, or maybe another format is required.

Julian

> On Oct 27, 2017, at 11:47 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> Here's a tweaked version of Julian's edits in 4 bullet points
> 
> 1. Apache Arrow is a cross-language development platform for in-memory
>   data.
> 2. It specifies a standardized language-independent columnar memory format for
>   flat and hierarchical data, organized for efficient analytic operations on
>   modern hardware.
> 3. It also provides computational libraries and zero-copy streaming messaging
>   and interprocess communication.
> 4. Languages currently supported include C, C++, Java, JavaScript,
> Python, and Ruby.
> 
> My comments to these points:
> 
> 1. Arrow's scope as a "hub" for in-memory data is larger than the
> columnar format. I think to lead with "columnar in-memory analytics"
> would weaken the project's position for users who do not exclusively
> work with columnar data, and also may limit the number of people who
> jump to the immediate conclusion that Arrow "is the same as Parquet".
> We obviously need to have a FAQ on the website where we address such
> confusions more directly
> 
> 2. The columnar format specification is one of the keystones of the project
> 
> 3. We are building computation and messaging libraries to be
> companions to the columnar format and memory management
> 
> 4. We support many languages (I added "currently" to imply that we are
> not closed to new languages)
> 
> - Wes
> 
> On Sun, Oct 22, 2017 at 11:04 PM, Julian Hyde <jh...@apache.org> wrote:
>> It's best if a project's (or company's) marketing has several tiers:
>> an "elevator pitch" of 2-3 sentences, a "high concept pitch" which is
>> a phrase, e.g. "book rooms with locals, rather than hotels", and
>> an expanded description.
>> 
>> I think the question of whether this replaces Avro is best handled in an FAQ.
>> 
>> On Sun, Oct 22, 2017 at 5:35 AM, Wes McKinney <we...@gmail.com> wrote:
>>>> But my concern is that I saw some time ago some people questioning "Is
>>>> Arrow a replacement for Avro?" (also Flatbuffers seems to be something
>>>> we get often compared to). For at least these two cases, I see that we
>>>> want to achieve different goals. We want to work with them together to
>>>> build a better data analytics ecosystem but at least from my
>>>> perspective, we don't want to replace all existing serialization
>>>> formats.
>>> 
>>> Indeed, the most common problem I have experienced is that people who
>>> do not build data processing engines professionally sometimes get
>>> confused about the distinction between in-memory formats and
>>> serialization formats (Parquet, Avro, Protocol Buffers, etc.). The
>>> vast majority of developers rarely get this "close to the metal" and
>>> mainly think about storage formats and data access layers in terms of
>>> their high level semantics like "tables" and "records".
>>> 
>>> The distinction between Arrow and zero-copy serialization formats like
>>> Flatbuffers and Cap'n Proto is another thing that I often find myself
>>> explaining. I don't think there's any way we can resolve these
>>> confusions in ~100 words.
>>> 
>>> I would like for us to write some blog posts helping people mentally
>>> classify the technologies since it would help people understand both
>>> how Arrow is different as well as how it is a complementary / not
>>> mutually exclusive technology. I find that programmers are sometimes
>>> prone to dichotomous / binary thinking (which leads to the inclination
>>> to cast one technology as "the same as" another) and it's rare that a
>>> new, category-defining technology like this comes along. People even
>>> hear the "columnar" buzzword and then ask "wait, so is this replacing
>>> Parquet?".
>>> 
>>> The audience for the Arrow project is the developers of data
>>> processing engines. We need to communicate clearly that developers who
>>> work with complex in-memory data sets (especially using shared memory
>>> and memory-mappable devices like GPUs and NVM), even if they are not
>>> always columnar / structured, are welcome and indeed desired members
>>> of our community. As an example, our collaboration with the Ray
>>> project has been a success (and bodes well for use in more machine
>>> learning applications) because we can compose our zero-copy structured
>>> data representation with general buffer memory management to create
>>> richer, memory-efficient data access interfaces.
>>> 
>>> I'll spend a little time tweaking the blurb a bit based on Julian's
>>> edits and post for more feedback.
>>> 
>>> - Wes
>>> 
>>> On Sun, Oct 22, 2017 at 8:01 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>>> I clearly understand that all four layers are important to Arrow (and we
>>>> should mention them, maybe graphically) on the Arrow landing page. But
>>>> my concern is that I saw some time ago some people questioning "Is Arrow
>>>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>>>> often compared to). For at least these two cases, I see that we want to
>>>> achieve different goals. We want to work with them together to build a
>>>> better data analytics ecosystem but at least from my perspective, we
>>>> don't want to replace all existing serialization formats. One of the
>>>> main points that should show people that there is a boundary in Arrow's
>>>> scope is the "in-memory" objective, but I still would like to keep
>>>> "columnar" somewhere in the description. It might be slightly
>>>> de-emphasized but it is still there as one of the focal points. From my
>>>> perspective, 3 of the four layers are still very much focused on
>>>> columnar memory.
>>>> 
>>>> Uwe
>>>> 
>>>> On Sun, Oct 22, 2017, at 01:46 PM, Wes McKinney wrote:
>>>>>> Still, I would like to see "columnar" used in the first sentence as this is the main focus of the project.
>>>>> 
>>>>> It's interesting: slightly de-emphasizing the role of the columnar
>>>>> format is actually one of my objectives of the revisions. It does not
>>>>> mean that the columnar specification is not a critical component of
>>>>> the project: it absolutely is, and it is one of its centerpieces.
>>>>> 
>>>>> But the scope of Arrow has already become larger than that -- as time
>>>>> goes on the project's center of gravity concerns general management of
>>>>> in-memory analytical datasets. These may not be structured (and
>>>>> columnar) 100% of the time -- for example, you could use Arrow to
>>>>> write a collection of simple buffers (without any additional type
>>>>> metadata) to shared memory, then read them back with zero copy. This
>>>>> requires maintaining a general "memory management system" that is
>>>>> necessary for everything else, and the columnar format is built on top
>>>>> of this. It's pretty complex to be able to manage zero-copy memory
>>>>> references for arbitrarily complex
>>>>> 
>>>>> I see the C++ library in 4 distinct layers, for example:
>>>>> 
>>>>> * General zero-copy memory management: Plasma, arrow::Buffer,
>>>>> arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
>>>>> io::BufferReader, etc.)
>>>>> * Columnar memory format / data structures / in-memory metadata :
>>>>> arrow::DataType / Array
>>>>> * Structured data IPC: arrays, record batches, and any other new
>>>>> message types (e.g. tensors)
>>>>> * Columnar in-memory analytics: what we are just beginning to
>>>>> implement in arrow/compute
>>>>> 
>>>>> I think to express to the open source community that in-memory data
>>>>> problems that are not columnar are of no interest to the Arrow
>>>>> community would be needlessly closing off collaboration opportunities.
>>>>> It's important that a larger audience is able to consume Arrow's
>>>>> memory management layer and IPC tools (e.g. they can easily be used
>>>>> for deep learning / ML applications) and use them to create more kinds
>>>>> of applications architected around the mantra of zero-copy. With new
>>>>> architectures designed to leverage non-volatile memory on the horizon,
>>>>> this grows more important with each passing day.
>>>>> 
>>>>> - Wes
>>>>> 
>>>>> On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>>>>> Thank you Wes and Julian for taking the approach to improve the elevator
>>>>>> pitch. I really like the improvements. Still, I would like to see
>>>>>> "columnar" used in the first sentence as this is the main focus of the
>>>>>> project.
>>>>>> 
>>>>>> Uwe
>>>>>> 
>>>>>> On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>>>>>>> Thanks Julian, I like the changes.
>>>>>>> 
>>>>>>> For the last part I agree listing languages is good; we would do well
>>>>>>> to include JavaScript and Ruby in that list. Hopefully the list will
>>>>>>> keep growing longer!
>>>>>>> 
>>>>>>> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>>>>>>>> Your proposed version is definitely an improvement.
>>>>>>>> 
>>>>>>>>> "Apache Arrow is a cross-language development platform for in-memory
>>>>>>>>> structured data access and analytics. It specifies a standardized
>>>>>>>>> language-independent columnar memory format for flat and hierarchical
>>>>>>>>> data, with support for zero-copy streaming messaging and interprocess
>>>>>>>>> communication. It also provides computational libraries for efficient
>>>>>>>>> in-memory analytics on modern hardware."
>>>>>>>> 
>>>>>>>> I propose a few tweaks:
>>>>>>>> 
>>>>>>>> Simplify sentence 1 to
>>>>>>>> 
>>>>>>>>  Apache Arrow is a cross-language development platform for in-memory
>>>>>>>>  data.
>>>>>>>> 
>>>>>>>> This is easier to parse, captures the gist, and the other parts are covered
>>>>>>>> in later sentences.
>>>>>>>> 
>>>>>>>> To me, the cache-efficient format is more fundamentally important than
>>>>>>>> streaming and IPC (you can build the latter). Therefore I’d change
>>>>>>>> sentence 2 to
>>>>>>>> 
>>>>>>>>  It specifies a standardized language-independent columnar memory
>>>>>>>>  format for flat and hierarchical data, organized for efficient analytic
>>>>>>>>  operations on modern hardware.
>>>>>>>> 
>>>>>>>> Which leaves sentence 3 as
>>>>>>>> 
>>>>>>>>  It also provides computational libraries for zero-copy streaming
>>>>>>>>  messaging and interprocess communication.
>>>>>>>> 
>>>>>>>> And add sentence 4,
>>>>>>>> 
>>>>>>>>  Languages supported include C and C++, Java, and Python.
>>>>>>>> 
>>>>>>>> Julian

Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Wes McKinney <we...@gmail.com>.

Here's a tweaked version of Julian's edits in 4 bullet points

1. Apache Arrow is a cross-language development platform for in-memory
   data.
2. It specifies a standardized language-independent columnar memory format for
   flat and hierarchical data, organized for efficient analytic operations on
   modern hardware.
3. It also provides computational libraries and zero-copy streaming messaging
   and interprocess communication.
4. Languages currently supported include C, C++, Java, JavaScript,
Python, and Ruby.

My comments to these points:

1. Arrow's scope as a "hub" for in-memory data is larger than the
columnar format. I think to lead with "columnar in-memory analytics"
would weaken the project's position for users who do not exclusively
work with columnar data, and also may limit the number of people who
jump to the immediate conclusion that Arrow "is the same as Parquet".
We obviously need to have a FAQ on the website where we address such
confusions more directly

2. The columnar format specification is one of the keystones of the project

3. We are building computation and messaging libraries to be
companions to the columnar format and memory management

4. We support many languages (I added "currently" to imply that we are
not closed to new languages)
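
As a rough illustration of what "zero-copy" means in point 3, the sketch below uses Python's standard `memoryview` rather than Arrow's own messaging or IPC code, which is not shown here. The buffer layout (two int columns packed back to back) and the names `payload`, `col_a`, and `col_b` are assumptions made up for this example.

```python
import struct

# Width of a native C int on this platform (commonly 4 bytes).
width = struct.calcsize("i")

# A hypothetical message: one contiguous buffer holding two int columns.
payload = bytearray(struct.pack("6i", 10, 20, 30, 40, 50, 60))

view = memoryview(payload)
# Slicing a memoryview yields another view on the same memory; no bytes are copied.
col_a = view[: 3 * width].cast("i")   # first three ints
col_b = view[3 * width :].cast("i")   # last three ints

# Overwrite the first int in the underlying buffer. The change is visible
# through col_a, which shows that the view shares memory with `payload`.
struct.pack_into("i", payload, 0, 99)
first_a = col_a[0]
```

A consumer holding `col_a` or `col_b` reads the producer's bytes in place, so the cost of handing over a column does not grow with its size.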

- Wes

Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Julian Hyde <jh...@apache.org>.

It's best if a project's (or company's) marketing has several tiers:
an "elevator pitch" of 2-3 sentences, a "high concept pitch" which is
a phrase, e.g. "book rooms with locals, rather than hotels", and
an expanded description.

I think the question of whether this replaces Avro is best handled in an FAQ.

On Sun, Oct 22, 2017 at 5:35 AM, Wes McKinney <we...@gmail.com> wrote:
>> But my concern is that I saw some time ago some people questioning "Is Arrow a replacement for Avro?" (also Flatbuffers seems to be something we get
> often compared to). For at least these two cases, I see that we want
> to achieve different goals. We want to work with them together to
> build a better data analytics ecosystem but at least from my
> perspective, we don't want to replace all existing serialization
> formats.
>
> Indeed, the most common problem I have experienced is that people who
> do not build data processing engines professionally sometimes get
> confused about the distinction between in-memory formats and
> serialization formats (Parquet, Avro, Protocol Buffers, etc.). The
> vast majority of developers rarely get this "close to the metal" and
> mainly think about storage formats and data access layers in terms of
> their high level semantics like "tables" and "records".
>
> The distinction between Arrow and zero-copy serialization formats like
> Flatbuffers and Cap'n Proto is another thing that I often find myself
> explaining. I don't think there's any way we can resolve these
> confusions in ~100 words.
>
> I would like for us to write some blog posts helping people mentally
> classify the technologies since it would help people understand both
> how Arrow is different as well as how it is a complementary / not
> mutually exclusive technology. I find that programmers are sometimes
> prone to dichotomous / binary thinking (which leads to the inclination
> to cast one technology as "the same as" another) and it's rare that a
> new, category-defining technology like this comes along. People even
> hear the "columnar" buzzword and then ask "wait, so is this replacing
> Parquet?".
>
> The audience for the Arrow project are the developers of data
> processing engines. We need to precisely message that developers who
> work with complex in-memory data sets (especially using shared memory
> and memory-mappable devices like GPUs and NVM), even if they are not
> always columnar / structured, are welcome and indeed desired members
> of our community. As an example, our collaboration with the Ray
> project has been a success (and bodes well for use in more machine
> learning applications) because we can compose our zero-copy structured
> data representation with general buffer memory management to create
> richer, memory-efficient data access interfaces.
>
> I'll spend a little time tweaking the blurb a bit based on Julian's
> edits and post for more feedback.
>
> - Wes
>
> On Sun, Oct 22, 2017 at 8:01 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> I clearly understand that all four layers are important to Arrow (and we
>> should mention them, maybe graphically) on the Arrow landing page. But
>> my concern is that I saw some time ago some people questioning "Is Arrow
>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>> often compared to). For at least these two cases, I see that we want to
>> achieve different goals. We want to work with them together to build a
>> better data analytics ecosystem but at least from my perspective, we
>> don't want to replace all existing serialization formats. One of the
>> main points that people should show that there is a boundary in Arrow's
>> scope is the "in-memory" objective but I still would like to keep the
>> "columnar" somewhere in the description. It might be slightly
>> de-emphasized but it is still there as one of the focal point. From my
>> perspective, 3 of the four layers are still very much focused on
>> columnar memory.
>>
>> Uwe
>>
>> On Sun, Oct 22, 2017, at 01:46 PM, Wes McKinney wrote:
>>> > Still, I would like to see "columnar" used in the first sentence as this is the main focus of the project.
>>>
>>> It's interesting: slightly de-emphasizing the role of the columnar
>>> format is actually one of my objectives for the revisions. It does not
>>> mean that the columnar specification is not a critical component of
>>> the project: it absolutely is, and it is one of the centerpieces.
>>>
>>> But the scope of Arrow has already become larger than that -- as time
>>> goes on the project's center of gravity concerns general management of
>>> in-memory analytical datasets. These may not be structured (and
>>> columnar) 100% of the time -- for example, you could use Arrow to
>>> write a collection of simple buffers (without any additional type
>>> metadata) to shared memory, then read them back with zero copy. This
>>> requires maintaining a general "memory management system" that is
>>> necessary for everything else, and the columnar format is built on top
>>> of this. It's pretty complex to be able to manage zero-copy memory
>>> references for arbitrarily complex
>>>
>>> I see the C++ library in 4 distinct layers, for example:
>>>
>>> * General zero-copy memory management: Plasma, arrow::Buffer,
>>> arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
>>> io::BufferReader, etc.)
>>> * Columnar memory format / data structures / in-memory metadata:
>>> arrow::DataType / Array
>>> * Structured data IPC: arrays, record batches, and any other new
>>> message types (e.g. tensors)
>>> * Columnar in-memory analytics: what we are just beginning to
>>> implement in arrow/compute
>>>
>>> I think to express to the open source community that in-memory data
>>> problems that are not columnar are of no interest to the Arrow
>>> community would be needlessly closing off collaboration opportunities.
>>> It's important that a larger audience is able to consume Arrow's
>>> memory management layer and IPC tools (e.g. they can easily be used
>>> for deep learning / ML applications) and use them to create more kinds
>>> of applications architected around the mantra of zero-copy. With new
>>> architectures designed to leverage non-volatile memory on the horizon,
>>> this grows more important with each passing day.
>>>
>>> - Wes
>>>
>>> On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>> > Thank you Wes and Julian for taking the approach to improve the elevator
>>> > pitch. I really like the improvements. Still, I would like to see
>>> > "columnar" used in the first sentence as this is the main focus of the
>>> > project.
>>> >
>>> > Uwe
>>> >
>>> > On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>>> >> Thanks Julian, I like the changes.
>>> >>
>>> >> For the last part I agree listing languages is good; we would do well
>>> >> to include JavaScript and Ruby in that list. Hopefully the list will
>>> >> keep growing longer!
>>> >>
>>> >> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>>> >> > Your proposed version is definitely an improvement.
>>> >> >
>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>> >> >> structured data access and analytics. It specifies a standardized
>>> >> >> language-independent columnar memory format for flat and hierarchical
>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>> >> >> communication. It also provides computational libraries for efficient
>>> >> >> in-memory analytics on modern hardware."
>>> >> >
>>> >> > I propose a few tweaks:
>>> >> >
>>> >> > Simplify sentence 1 to
>>> >> >
>>> >> >   Apache Arrow is a cross-language development platform for in-memory
>>> >> >   data.
>>> >> >
>>> >> > This is easier to parse, captures the gist, and the other parts are covered
>>> >> > in later sentences.
>>> >> >
>>> >> > To me, the cache-efficient format is more fundamentally important than
>>> >> > streaming and IPC (you can build the latter). Therefore I’d change
>>> >> > sentence 2 to
>>> >> >
>>> >> >   It specifies a standardized language-independent columnar memory
>>> >> >   format for flat and hierarchical data, organized for efficient analytic
>>> >> >   operations on modern hardware.
>>> >> >
>>> >> > Which leaves sentence 3 as
>>> >> >
>>> >> >   It also provides computational libraries for zero-copy streaming
>>> >> >   messaging and interprocess communication.
>>> >> >
>>> >> > And add sentence 4,
>>> >> >
>>> >> >   Languages supported include C and C++, Java, and Python.
>>> >> >
>>> >> > Julian
>>> >> >
>>> >> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <we...@gmail.com> wrote:
>>> >> >>
>>> >> >> I believe we would benefit from modified language to describe the
>>> >> >> nature and scope of the Arrow project.
>>> >> >>
>>> >> >> Currently, our GitHub project description (and what we use in release
>>> >> >> announcements) states:
>>> >> >>
>>> >> >> "Apache Arrow is a columnar in-memory analytics layer designed to
>>> >> >> accelerate big data. It houses a set of canonical in-memory
>>> >> >> representations of flat and hierarchical data along with multiple
>>> >> >> language-bindings for structure manipulation. It also provides IPC and
>>> >> >> common algorithm implementations."
>>> >> >>
>>> >> >> I think this could be perhaps restated in the following way:
>>> >> >>
>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>> >> >> structured data access and analytics. It specifies a standardized
>>> >> >> language-independent columnar memory format for flat and hierarchical
>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>> >> >> communication. It also provides computational libraries for efficient
>>> >> >> in-memory analytics on modern hardware."
>>> >> >>
>>> >> >> It is true that we have been mostly focused on hardening the details
>>> >> >> of the Arrow format and related issues around messaging and IPC, which
>>> >> >> are necessary for everything else we may contemplate building in the
>>> >> >> future. Since I plan to be building a library of computational tools
>>> >> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
>>> >> >> it would be a good idea to clearly state that building general purpose
>>> >> >> analytics implementations (i.e. the sorts of things you find in "data
>>> >> >> frame libraries" like pandas) is part of the mission of the project.
>>> >> >>
>>> >> >> Feedback on the above would be appreciated how we could do a better
>>> >> >> job representing our past, present, and future community goals.
>>> >> >>
>>> >> >> Thanks
>>> >> >> Wes
>>> >> >

Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Wes McKinney <we...@gmail.com>.

> But my concern is that I saw some time ago some people questioning "Is
> Arrow a replacement for Avro?" (also Flatbuffers seems to be something
> we get often compared to). For at least these two cases, I see that we
> want to achieve different goals. We want to work with them together to
> build a better data analytics ecosystem but at least from my
> perspective, we don't want to replace all existing serialization
> formats.

Indeed, the most common problem I have experienced is that people who
do not build data processing engines professionally sometimes get
confused about the distinction between in-memory formats and
serialization formats (Parquet, Avro, Protocol Buffers, etc.). The
vast majority of developers rarely get this "close to the metal" and
mainly think about storage formats and data access layers in terms of
their high-level semantics like "tables" and "records".
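
To make the distinction a little more concrete, here is a minimal sketch in
Python using only the standard library. It is not Arrow's API, and the names
`records`, `to_columns`, `serialize`, and `deserialize` are made up for
illustration. The idea it tries to show: an in-memory columnar layout is a set
of contiguous typed buffers that code can compute on directly, while a
serialization format is a byte encoding that has to be decoded into new
objects before anything can be done with it.

```python
import json
from array import array

# Row-oriented records, the way most application code thinks about data.
records = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.25},
    {"id": 3, "price": 7.0},
]


def to_columns(rows):
    """Lay each field out as one contiguous typed buffer (a toy columnar layout)."""
    return {
        "id": array("q", (r["id"] for r in rows)),        # signed 64-bit integers
        "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
    }


def serialize(rows):
    """Encode the rows as bytes (JSON stands in for any serialization format)."""
    return json.dumps(rows).encode("utf-8")


def deserialize(payload):
    """Decode the bytes back into freshly allocated Python objects."""
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    columns = to_columns(records)
    # A memoryview exposes the column's buffer without copying it.
    prices = memoryview(columns["price"])
    print(sum(prices))

    # The serialized form must be fully decoded before it can be used.
    payload = serialize(records)
    print(deserialize(payload) == records)
```

The two are complementary rather than competing: one is about how data sits in
memory while it is being processed, the other about how it is encoded for
storage or transport.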

The distinction between Arrow and zero-copy serialization formats like
Flatbuffers and Cap'n Proto is another thing that I often find myself
explaining. I don't think there's any way we can resolve these
confusions in ~100 words.

I would like for us to write some blog posts helping people mentally
classify the technologies since it would help people understand both
how Arrow is different as well as how it is a complementary / not
mutually exclusive technology. I find that programmers are sometimes
prone to dichotomous / binary thinking (which leads to the inclination
to cast one technology as "the same as" another) and it's rare that a
new, category-defining technology like this comes along. People even
hear the "columnar" buzzword and then ask "wait, so is this replacing
Parquet?".

The audience for the Arrow project is the developers of data
processing engines. We need to communicate clearly that developers who
work with complex in-memory data sets (especially using shared memory
and memory-mappable devices like GPUs and NVM), even if they are not
always columnar / structured, are welcome and indeed desired members
of our community. As an example, our collaboration with the Ray
project has been a success (and bodes well for use in more machine
learning applications) because we can compose our zero-copy structured
data representation with general buffer memory management to create
richer, memory-efficient data access interfaces.

I'll spend a little time tweaking the blurb a bit based on Julian's
edits and post for more feedback.

- Wes


Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

I clearly understand that all four layers are important to Arrow (and we
should mention them, maybe graphically) on the Arrow landing page. But
my concern is that some time ago I saw people asking "Is Arrow
a replacement for Avro?" (Flatbuffers also seems to be something we are
often compared to). For at least these two cases, I see that we want to
achieve different goals. We want to work together with them to build a
better data analytics ecosystem, but at least from my perspective, we
don't want to replace all existing serialization formats. One of the
main points that should show people that there is a boundary to Arrow's
scope is the "in-memory" objective, but I would still like to keep
"columnar" somewhere in the description. It might be slightly
de-emphasized, but it is still there as one of the focal points. From my
perspective, three of the four layers are still very much focused on
columnar memory.

Uwe


Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Wes McKinney <we...@gmail.com>.

> Still, I would like to see "columnar" used in the first sentence as this is the main focus of the project.

It's interesting: slightly de-emphasizing the role of the columnar
format is actually one of my objectives for the revisions. It does not
mean that the columnar specification is not a critical component of
the project: it absolutely is, and it is one of the centerpieces.

But the scope of Arrow has already become larger than that -- as time
goes on the project's center of gravity concerns general management of
in-memory analytical datasets. These may not be structured (and
columnar) 100% of the time -- for example, you could use Arrow to
write a collection of simple buffers (without any additional type
metadata) to shared memory, then read them back with zero copy. This
requires maintaining a general "memory management system" that is
necessary for everything else, and the columnar format is built on top
of this. It's pretty complex to be able to manage zero-copy memory
references for arbitrarily complex
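
As a rough illustration of the "write simple buffers to shared memory, then
read them back with zero copy" idea, here is a minimal sketch using Python's
standard `multiprocessing.shared_memory` module. This is not Arrow or Plasma
code. The helper names `write_buffers` and `read_buffers` and the
(offset, length) index are assumptions invented for this example, and a real
system would also need to share the index and manage lifetimes across
processes.

```python
from multiprocessing import shared_memory


def write_buffers(buffers):
    """Pack raw byte buffers back to back into a new shared memory block.

    Returns the block and a list of (offset, length) pairs locating each buffer.
    """
    total = sum(len(b) for b in buffers)
    shm = shared_memory.SharedMemory(create=True, size=max(total, 1))
    index, offset = [], 0
    for b in buffers:
        shm.buf[offset:offset + len(b)] = b
        index.append((offset, len(b)))
        offset += len(b)
    return shm, index


def read_buffers(name, index):
    """Attach to an existing block and return views onto it, without copying."""
    shm = shared_memory.SharedMemory(name=name)
    views = [shm.buf[off:off + n] for off, n in index]
    return shm, views


if __name__ == "__main__":
    writer, index = write_buffers([b"abc", b"\x01\x02\x03\x04"])
    reader, views = read_buffers(writer.name, index)
    print([bytes(v) for v in views])

    # The views alias the shared block: a write made through the writer's
    # mapping is visible through the reader's views with no copy in between.
    writer.buf[0:1] = b"z"
    print(bytes(views[0]))

    # Views must be released before the mapping can be closed.
    for v in views:
        v.release()
    reader.close()
    writer.close()
    writer.unlink()
```

Note that nothing here knows about types or columns. The buffers are just
bytes, which is the sense in which this layer sits underneath the columnar
format.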

I see the C++ library in 4 distinct layers, for example:

* General zero-copy memory management: Plasma, arrow::Buffer,
arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
io::BufferReader, etc.)
* Columnar memory format / data structures / in-memory metadata:
arrow::DataType / Array
* Structured data IPC: arrays, record batches, and any other new
message types (e.g. tensors)
* Columnar in-memory analytics: what we are just beginning to
implement in arrow/compute

I think that telling the open source community that in-memory data
problems that are not columnar are of no interest to the Arrow
community would needlessly close off collaboration opportunities.
It's important that a larger audience is able to consume Arrow's
memory management layer and IPC tools (e.g. they can easily be used
for deep learning / ML applications) and use them to create more kinds
of applications architected around the mantra of zero-copy. With new
architectures designed to leverage non-volatile memory on the horizon,
this grows more important with each passing day.

- Wes


Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Thank you, Wes and Julian, for working to improve the elevator
pitch. I really like the improvements. Still, I would like to see
"columnar" used in the first sentence, as this is the main focus of the
project.

Uwe

On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
> Thanks Julian, I like the changes.
> 
> For the last part I agree listing languages is good; we would do well
> to include JavaScript and Ruby in that list. Hopefully the list will
> keep growing longer!
> 
> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
> > Your proposed version is definitely an improvement.
> >
> >> "Apache Arrow is a cross-language development platform for in-memory
> >> structured data access and analytics. It specifies a standardized
> >> language-independent columnar memory format for flat and hierarchical
> >> data, with support for zero-copy streaming messaging and interprocess
> >> communication. It also provides computational libraries for efficient
> >> in-memory analytics on modern hardware."
> >
> > I propose a few tweaks:
> >
> > Simplify sentence 1 to
> >
> >   Apache Arrow is a cross-language development platform for in-memory
> >   data.
> >
> > This is easier to parse, captures the gist, and the other parts are covered
> > in later sentences.
> >
> > To me, the cache-efficient format is more fundamentally important than
> > streaming and IPC (you can build the latter). Therefore I’d change
> > sentence 2 to
> >
> >   It specifies a standardized language-independent columnar memory
> >   format for flat and hierarchical data, organized for efficient analytic
> >   operations on modern hardware.
> >
> > Which leaves sentence 3 as
> >
> >   It also provides computational libraries for zero-copy streaming
> >   messaging and interprocess communication.
> >
> > And add sentence 4,
> >
> >   Languages supported include C and C++, Java, and Python.
> >
> > Julian
> >
> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <we...@gmail.com> wrote:
> >>
> >> I believe we would benefit from modified language to describe the
> >> nature and scope of the Arrow project.
> >>
> >> Currently, our GitHub project description (and what we use in release
> >> announcements) states:
> >>
> >> "Apache Arrow is a columnar in-memory analytics layer designed to
> >> accelerate big data. It houses a set of canonical in-memory
> >> representations of flat and hierarchical data along with multiple
> >> language-bindings for structure manipulation. It also provides IPC and
> >> common algorithm implementations."
> >>
> >> I think this could perhaps be restated in the following way:
> >>
> >> "Apache Arrow is a cross-language development platform for in-memory
> >> structured data access and analytics. It specifies a standardized
> >> language-independent columnar memory format for flat and hierarchical
> >> data, with support for zero-copy streaming messaging and interprocess
> >> communication. It also provides computational libraries for efficient
> >> in-memory analytics on modern hardware."
> >>
> >> It is true that we have been mostly focused on hardening the details
> >> of the Arrow format and related issues around messaging and IPC, which
> >> are necessary for everything else we may contemplate building in the
> >> future. Since I plan to be building a library of computational tools
> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
> >> it would be a good idea to clearly state that building general purpose
> >> analytics implementations (i.e. the sorts of things you find in "data
> >> frame libraries" like pandas) is part of the mission of the project.
> >>
> >> Feedback on the above would be appreciated, including how we could do a
> >> better job representing our past, present, and future community goals.
> >>
> >> Thanks
> >> Wes
> >

Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Wes McKinney <we...@gmail.com>.

Thanks Julian, I like the changes.

For the last part I agree listing languages is good; we would do well
to include JavaScript and Ruby in that list. Hopefully the list will
keep growing longer!

On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
> Your proposed version is definitely an improvement.
>
>> "Apache Arrow is a cross-language development platform for in-memory
>> structured data access and analytics. It specifies a standardized
>> language-independent columnar memory format for flat and hierarchical
>> data, with support for zero-copy streaming messaging and interprocess
>> communication. It also provides computational libraries for efficient
>> in-memory analytics on modern hardware."
>
> I propose a few tweaks:
>
> Simplify sentence 1 to
>
>   Apache Arrow is a cross-language development platform for in-memory
>   data.
>
> This is easier to parse, captures the gist, and the other parts are covered
> in later sentences.
>
> To me, the cache-efficient format is more fundamentally important than
> streaming and IPC (you can build the latter). Therefore I’d change
> sentence 2 to
>
>   It specifies a standardized language-independent columnar memory
>   format for flat and hierarchical data, organized for efficient analytic
>   operations on modern hardware.
>
> Which leaves sentence 3 as
>
>   It also provides computational libraries for zero-copy streaming
>   messaging and interprocess communication.
>
> And add sentence 4,
>
>   Languages supported include C and C++, Java, and Python.
>
> Julian
>
>> On Oct 21, 2017, at 10:58 AM, Wes McKinney <we...@gmail.com> wrote:
>>
>> I believe we would benefit from modified language to describe the
>> nature and scope of the Arrow project.
>>
>> Currently, our GitHub project description (and what we use in release
>> announcements) states:
>>
>> "Apache Arrow is a columnar in-memory analytics layer designed to
>> accelerate big data. It houses a set of canonical in-memory
>> representations of flat and hierarchical data along with multiple
>> language-bindings for structure manipulation. It also provides IPC and
>> common algorithm implementations."
>>
>> I think this could perhaps be restated in the following way:
>>
>> "Apache Arrow is a cross-language development platform for in-memory
>> structured data access and analytics. It specifies a standardized
>> language-independent columnar memory format for flat and hierarchical
>> data, with support for zero-copy streaming messaging and interprocess
>> communication. It also provides computational libraries for efficient
>> in-memory analytics on modern hardware."
>>
>> It is true that we have been mostly focused on hardening the details
>> of the Arrow format and related issues around messaging and IPC, which
>> are necessary for everything else we may contemplate building in the
>> future. Since I plan to be building a library of computational tools
>> in C++ for the native code community (Python, Ruby, R, etc.), I think
>> it would be a good idea to clearly state that building general purpose
>> analytics implementations (i.e. the sorts of things you find in "data
>> frame libraries" like pandas) is part of the mission of the project.
>>
>> Feedback on the above would be appreciated, including how we could do a
>> better job representing our past, present, and future community goals.
>>
>> Thanks
>> Wes
>

Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Posted by Julian Hyde <jh...@apache.org>.

Your proposed version is definitely an improvement.

> "Apache Arrow is a cross-language development platform for in-memory
> structured data access and analytics. It specifies a standardized
> language-independent columnar memory format for flat and hierarchical
> data, with support for zero-copy streaming messaging and interprocess
> communication. It also provides computational libraries for efficient
> in-memory analytics on modern hardware."

I propose a few tweaks:

Simplify sentence 1 to

  Apache Arrow is a cross-language development platform for in-memory
  data.

This is easier to parse, captures the gist, and the other parts are covered
in later sentences.

To me, the cache-efficient format is more fundamentally important than
streaming and IPC (you can build the latter). Therefore I’d change
sentence 2 to

  It specifies a standardized language-independent columnar memory
  format for flat and hierarchical data, organized for efficient analytic
  operations on modern hardware.

Which leaves sentence 3 as

  It also provides computational libraries for zero-copy streaming
  messaging and interprocess communication.

And add sentence 4,

  Languages supported include C and C++, Java, and Python.

Julian

> On Oct 21, 2017, at 10:58 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> I believe we would benefit from modified language to describe the
> nature and scope of the Arrow project.
> 
> Currently, our GitHub project description (and what we use in release
> announcements) states:
> 
> "Apache Arrow is a columnar in-memory analytics layer designed to
> accelerate big data. It houses a set of canonical in-memory
> representations of flat and hierarchical data along with multiple
> language-bindings for structure manipulation. It also provides IPC and
> common algorithm implementations."
> 
> I think this could perhaps be restated in the following way:
> 
> "Apache Arrow is a cross-language development platform for in-memory
> structured data access and analytics. It specifies a standardized
> language-independent columnar memory format for flat and hierarchical
> data, with support for zero-copy streaming messaging and interprocess
> communication. It also provides computational libraries for efficient
> in-memory analytics on modern hardware."
> 
> It is true that we have been mostly focused on hardening the details
> of the Arrow format and related issues around messaging and IPC, which
> are necessary for everything else we may contemplate building in the
> future. Since I plan to be building a library of computational tools
> in C++ for the native code community (Python, Ruby, R, etc.), I think
> it would be a good idea to clearly state that building general purpose
> analytics implementations (i.e. the sorts of things you find in "data
> frame libraries" like pandas) is part of the mission of the project.
> 
> Feedback on the above would be appreciated, including how we could do a
> better job representing our past, present, and future community goals.
> 
> Thanks
> Wes