Posted to user@arrow.apache.org by Aldrin <ak...@ucsc.edu> on 2021/08/17 21:22:11 UTC

R - how to create a schema with many columns?

Hello!

I am pretty confused by the schema factory function in R, because I think
what I'm doing should work, but it doesn't seem to. I have inlined the code
below, but if there's an alternate way to set the data types of a
schema in R, then I would welcome recommendations for that as well.

Anyway, the brief overview is that I want to create tables from matrices
that will have anywhere from hundreds to thousands of columns, so
specifying the schema inline is not practical. I figure I should
be able to create a named list and then pass it to the schema factory
function, but I always get an error when trying to do so ("Error:
!is.null(nms <- names(.list)) is not TRUE").

I could update to arrow 5.0.0, but I assume that my problem shouldn't be a
problem in arrow 4.0.1.

Thanks for any help!

Working code:

Create an example data frame:
sample_df <- data.frame(
     SRR12=c(0)
    ,SRR20=c(0)
    ,SRR24=c(4)
    ,SRR27=c(223)
    ,row.names=c('ENSG3')
)

sample_df

>       SRR12 SRR20 SRR24   SRR27
> ENSG3     0     0     4     223


Create an arrow table, specify the schema inline:
sample_table <- Table$create(
     sample_df
    ,schema=schema(
          SRR12=uint16()
         ,SRR20=uint16()
         ,SRR24=uint16()
         ,SRR27=uint16()
     )
)

sample_table

> Table
> 1 rows x 4 columns
> $SRR12 <uint16>
> $SRR20 <uint16>
> $SRR24 <uint16>
> $SRR27 <uint16>
>

Create a schema from a list, because we want > 1000 columns sometimes:
schema_fields <- list(SRR12=uint16(), SRR20=uint16(), SRR24=uint16(),
SRR27=uint16())
sample_schema <- schema(schema_fields)

> Error: !is.null(nms <- names(.list)) is not TRUE
>

schema_fields

> $SRR12
> UInt16
> uint16
>
> $SRR20
> UInt16
> uint16
>
> $SRR24
> UInt16
> uint16
>
> $SRR27
> UInt16
> uint16



Package information (system is a MacBook M1):
> brew info apache-arrow

apache-arrow: stable 5.0.0 (bottled), HEAD
Columnar in-memory analytics layer designed to accelerate big data
https://arrow.apache.org/
/opt/homebrew/Cellar/apache-arrow/4.0.1_2 (534 files, 92.9MB) *
  Poured from bottle on 2021-07-07 at 16:10:51
From:
https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/apache-arrow.rb
License: Apache-2.0
==> Dependencies
Build: boost ✔, cmake ✘, llvm ✘
Required: brotli ✔, glog ✔, grpc ✘, lz4 ✔, numpy ✘, openssl@1.1 ✔, protobuf
✔, python@3.9 ✔, rapidjson ✔, re2 ✘, snappy ✔, thrift ✔, utf8proc ✔, zstd ✔
==> Options
--HEAD
        Install HEAD version
==> Analytics
install: 1,715 (30 days), 5,687 (90 days), 18,191 (365 days)
install-on-request: 994 (30 days), 3,232 (90 days), 10,314 (365 days)
build-error: 0 (30 days)


> arrow::arrow_info()

Arrow package version: 4.0.1

Capabilities:

dataset    TRUE
parquet    TRUE
s3        FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc  FALSE

Memory:

Allocator  jemalloc
Current   256 bytes
Max         2.31 Kb

Runtime:

SIMD Level          none
Detected SIMD Level none



Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

Re: R - how to create a schema with many columns?

Posted by Aldrin <ak...@ucsc.edu>.
Wow, that works! I really appreciate the help!

🎉🎉🎉

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Tue, Aug 17, 2021 at 3:17 PM Ian Cook <ia...@ursacomputing.com> wrote:

> Hi Aldrin,
>
> Please try this:
>
> sample_schema <- schema(!!!schema_fields)
>
> The schema() function now uses rlang functions to evaluate its arguments,
> so variable names need to be unquoted and spliced with !!!
>
> Ian

Re: R - how to create a schema with many columns?

Posted by Ian Cook <ia...@ursacomputing.com>.
Hi Aldrin,

Please try this:

sample_schema <- schema(!!!schema_fields)

The schema() function now uses rlang functions to evaluate its arguments,
so variable names need to be unquoted and spliced with !!!

Ian
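
For the many-column case in the original question, the splice suggested above can be combined with a list built from the column names. The following is only a minimal sketch: column_names and sample_matrix are hypothetical stand-ins for the real data, and it assumes every column should be uint16, as in the example from the original post.

```r
library(arrow)

# Hypothetical column names; in practice these might come from
# colnames() of the real matrix.
column_names <- c("SRR12", "SRR20", "SRR24", "SRR27")

# Build a named list holding one uint16 type per column.
schema_fields <- setNames(
  lapply(column_names, function(name) uint16()),
  column_names
)

# Splice the list so each element reaches schema() as its own
# named argument, equivalent to writing the fields inline.
sample_schema <- schema(!!!schema_fields)

# A one-row matrix shaped like the data frame in the original post.
sample_matrix <- matrix(
  c(0, 0, 4, 223),
  nrow = 1,
  dimnames = list("ENSG3", column_names)
)

# Convert to a data frame and create the table with the spliced schema.
sample_table <- Table$create(
  as.data.frame(sample_matrix),
  schema = sample_schema
)
```

Because the list is generated from column_names, the same code should scale to thousands of columns without writing any field by hand.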

