Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2017/10/03 00:59:06 UTC

Re: Arrow for Redshift Spectrum

hey Colin,

This is cool, thanks for sharing! I am not able to look too closely at
the moment, but Tom Augspurger or Phillip Cloud may be able to take a
closer look since they've dealt with similar issues around this kind
of ETL problem. We definitely would like the pyarrow libraries to work
well for this use case, so please open JIRA issues if you run into
rough edges or things that seem more difficult than they should be.

Thanks
Wes

On Thu, Sep 28, 2017 at 6:35 PM, Colin Nichols <co...@narrativ.com> wrote:
> Hi all,
>
> Would love to get some feedback on a little project I put together. I paired my company's parquet conversion routines (a wrapper around pyarrow) with SQLAlchemy's table reflection capabilities to make an "easy mode" Redshift --> Redshift Spectrum converter.
>
> You can find it here: https://github.com/hellonarrativ/spectrify
>
> I would be curious to hear impressions about the project (is it obvious what it does? Would you find it useful?) and also the parquet conversion more specifically.
>
> I ended up not using numpy/pandas to avoid issues with null values. Performance-wise it's obviously not the best choice, but for this application (occasional conversion of data to parquet) performance is not critical.
>
> I thought this project might be useful for people evaluating Redshift Spectrum, or for those without an existing setup for converting to parquet.
>
> Thanks for reading!
>
> Best,
> Colin
>
> -- sent from my phone --