Posted to user@arrow.apache.org by Aldrin <ak...@ucsc.edu> on 2021/08/17 21:22:11 UTC
R - how to create a schema with many columns?
Hello!
I am pretty confused by the schema factory function in R, because I think
what I'm doing should work, but it doesn't seem to. I have inlined the code
below, but if there's an alternative way to set the data types of a
schema in R, then I would welcome recommendations for that as well.
Anyway, the brief overview is that I want to create tables from matrices
that will have anywhere from hundreds to thousands of columns, so
specifying the schema inline is not going to be practical. I figure I should
be able to create a named list and then pass it to the schema factory
function, but I always get an error when trying to do so ("Error:
!is.null(nms <- names(.list)) is not TRUE").
I could update to arrow 5.0.0, but I assume that my problem shouldn't be a
problem in arrow 4.0.1.
Thanks for any help!
Working code:
Create an example data frame:
sample_df <- data.frame(
     SRR12=c(0)
    ,SRR20=c(0)
    ,SRR24=c(4)
    ,SRR27=c(223)
    ,row.names=c('ENSG3')
)
sample_df
>       SRR12 SRR20 SRR24 SRR27
> ENSG3     0     0     4   223
Create an arrow table, specify the schema inline:
sample_table <- Table$create(
     sample_df
    ,schema=schema(
         SRR12=uint16()
        ,SRR20=uint16()
        ,SRR24=uint16()
        ,SRR27=uint16()
    )
)
sample_table
> Table
> 1 rows x 4 columns
> $SRR12 <uint16>
> $SRR20 <uint16>
> $SRR24 <uint16>
> $SRR27 <uint16>
>
Create a schema from a list, because we want > 1000 columns sometimes:
schema_fields <- list(SRR12=uint16(), SRR20=uint16(), SRR24=uint16(),
                      SRR27=uint16())
sample_schema <- schema(schema_fields)
> Error: !is.null(nms <- names(.list)) is not TRUE
>
schema_fields
> $SRR12
> UInt16
> uint16
>
> $SRR20
> UInt16
> uint16
>
> $SRR24
> UInt16
> uint16
>
> $SRR27
> UInt16
> uint16
Package information (system is a MacBook M1):
> brew info apache-arrow
apache-arrow: stable 5.0.0 (bottled), HEAD
Columnar in-memory analytics layer designed to accelerate big data
https://arrow.apache.org/
/opt/homebrew/Cellar/apache-arrow/4.0.1_2 (534 files, 92.9MB) *
Poured from bottle on 2021-07-07 at 16:10:51
From:
https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/apache-arrow.rb
License: Apache-2.0
==> Dependencies
Build: boost ✔, cmake ✘, llvm ✘
Required: brotli ✔, glog ✔, grpc ✘, lz4 ✔, numpy ✘, openssl@1.1 ✔, protobuf
✔, python@3.9 ✔, rapidjson ✔, re2 ✘, snappy ✔, thrift ✔, utf8proc ✔, zstd ✔
==> Options
--HEAD
Install HEAD version
==> Analytics
install: 1,715 (30 days), 5,687 (90 days), 18,191 (365 days)
install-on-request: 994 (30 days), 3,232 (90 days), 10,314 (365 days)
build-error: 0 (30 days)
> arrow::arrow_info()
Arrow package version: 4.0.1
Capabilities:
dataset TRUE
parquet TRUE
s3 FALSE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE
Memory:
Allocator jemalloc
Current 256 bytes
Max 2.31 Kb
Runtime:
SIMD Level none
Detected SIMD Level none
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
Re: R - how to create a schema with many columns?
Posted by Aldrin <ak...@ucsc.edu>.
Wow, that works! I really appreciate the help!
🎉🎉🎉
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
On Tue, Aug 17, 2021 at 3:17 PM Ian Cook <ia...@ursacomputing.com> wrote:
> Hi Aldrin,
>
> Please try this:
>
> sample_schema <- schema(!!!schema_fields)
>
> The schema() function now uses rlang functions to evaluate its arguments,
> so variable names need to be unquoted and spliced with !!!
>
> Ian
>
>
Re: R - how to create a schema with many columns?
Posted by Ian Cook <ia...@ursacomputing.com>.
Hi Aldrin,
Please try this:
sample_schema <- schema(!!!schema_fields)
The schema() function now uses rlang functions to evaluate its arguments,
so variable names need to be unquoted and spliced with !!!
Ian
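For the many-column case described in the question, the same splicing approach can be combined with a programmatically built list. The following is only a minimal sketch, not something taken from the thread: the column names, the 1000-column count, and the all-zero matrix are made up for illustration, and it assumes every column should be uint16. The do.call() line is a base R alternative that passes the list elements as named arguments; it should give the same result, but it was not suggested in this thread.

```r
library(arrow)

# Hypothetical example data: one row of counts across 1000 samples.
col_names <- paste0("SRR", seq_len(1000))
sample_mat <- matrix(
  0,
  nrow = 1,
  ncol = length(col_names),
  dimnames = list("ENSG3", col_names)
)
sample_df <- as.data.frame(sample_mat)

# Build a named list with one data type per column.
schema_fields <- setNames(
  lapply(col_names, function(name) uint16()),
  col_names
)

# Splice the list into schema() so that each element is passed
# as its own named argument.
sample_schema <- schema(!!!schema_fields)

# Base R alternative (an assumption, not from the thread):
# do.call() also passes the list elements as named arguments.
alt_schema <- do.call(schema, schema_fields)

sample_table <- Table$create(sample_df, schema = sample_schema)
```

Passing schema_fields directly, as in the original question, hands schema() a single unnamed argument, which would explain the names(.list) error reported above.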