Posted to issues@arrow.apache.org by "Danielle Navarro (Jira)" <ji...@apache.org> on 2022/09/29 07:27:00 UTC

[jira] [Created] (ARROW-17887) [R] [Doc] Improve readability of the "get started" page

Danielle Navarro created ARROW-17887:
----------------------------------------

Summary: [R] [Doc] Improve readability of the "get started" page
Key: ARROW-17887
URL: https://issues.apache.org/jira/browse/ARROW-17887
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Danielle Navarro
Assignee: Danielle Navarro

In their current form, the pkgdown Get Started and README pages are a little hard for new users to follow. I would argue that both pages are written in a way that makes sense to someone who is already familiar with core Arrow concepts, but is potentially intimidating to an R user who is curious about Arrow but has never used it. The issue is perhaps most severe on the main [README page](https://arrow.apache.org/docs/r/index.html) and the [Get Started](https://arrow.apache.org/docs/r/articles/arrow.html) page. A few examples:

- The README page opens with the sentence **"Apache Arrow is a cross-language development platform for in-memory data".** This is a problem for multiple reasons. Firstly, it's not really true anymore, because we encourage users to rely on `Dataset` for on-disk datasets. Secondly, the sentence simply *assumes* the user has a clear mental model of the difference between in-memory and on-disk data. I don't think that's true for data scientists in general. A data engineer likely has a more precise mental model here, but R users are typically focused on analytics. Unless they have extensive experience working with large datasets, this isn't something we can assume. Thirdly, and maybe most importantly, it doesn't explain to the user why they should care about arrow: it doesn't say what the arrow package *does*. It's too vague.

- There are (IMO) too many boldfaced sections in the README page, and it's very cluttered. It gives the page an intensity and feeling of "denseness" that I think we should avoid at all costs. Arrow already has a reputation for being a complicated project (because it is!) but we don't want our documentation to have that feeling. I think we ought to be aiming for something gentler and more welcoming. If that means pushing more details into vignettes, that's totally okay. Readers don't need to be told everything on the very first page: it's probably better to give a simpler description there and leave the details to additional vignettes.

- The "get started" page has some of the same problems as the main README. The "object hierarchy" and "data object" tables only make sense once you already understand core Arrow concepts. What needs to happen in both cases is that the tables are wrapped with some explanatory text that provides the missing context for users, and additional details are pushed out to vignettes that explain them in more depth.

- The data types mapping section on the get started page has the same issue. A novice user doesn't necessarily even have a clear understanding of how fundamental types are represented in R, much less how they are represented in Arrow. A section that simply assumes that these types are meaningful concepts and gives a lookup table with various footnotes isn't at all helpful to that kind of user. I think it makes more sense to again split the work: on the "get started" page we should have something simple, and a longer discussion of these mappings should be pushed to a vignette.

The concrete proposal here is to restructure the content of these two pages to be more novice-friendly: specifically, to add more "Arrow 101" explanatory notes to these pages, and to move more of the technical information to new vignettes (e.g., there should be a new "data types" vignette).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)