You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@phoenix.apache.org by Ken Hampson <ha...@gmail.com> on 2016/01/08 02:28:38 UTC

Importing Postgres-generated CSVs into Phoenix

Hi, all:

I am in the process of importing some data from Postgres into Phoenix as
part of a larger project. I noticed after doing so and cross-checking the
results of a fairly complex JOIN query (of 6 tables of so) that the number
of rows returned by Postgres and Phoenix differed -- Phoenix had a
significant number more rows. I investigated this further, and ultimately
determined that the reason was that all *boolean *values were imported as
being false, regardless of their actual value.

When I investigated that aspect further, I discovered that Phoenix appears
to require the full strings "true" and "false" in the CSV in order to
properly register their actual values. As a workaround, I am currently
using a full SELECT in the "\copy" statement in psql to select each column
individually, and for each boolean column, I am using a CASE statement to
emit the full strings. I have since also learned that casting the boolean
to the "text" datatype will also cause Postgres to emit the full string.

In either case, though, the pain point is that this discrepancy requires
that I examine each table closely for boolean columns, and if there are
any, I then need to enumerate all the columns out explicitly and use either
the CASE statement or the cast method on the boolean columns in order to
get the values Phoenix expects. In my case, I didn't need to transform any
of the data otherwise, so this forced a much more complex procedure.

While psql provides many options to change various values -- delimiters and
what not -- there is no option to change the boolean values. Likewise, as
near as I can tell, there are no options to change what Phoenix expects,
either.

I am wondering:

- Is there any way from within Phoenix to change the boolean values that it
expects in the CSV?
- Are there any other methods from a Phoenix perspective that you can think
of that would obviate the need for the column-by-column enumeration
procedure I outlined above?
- Since Postgres is a major open source RDBMS which I suspect many people
-- myself included -- will want to export data from into Phoenix, could
there be an enhancement to Phoenix to allow an option to specify a default
boolean value, or perhaps an option to specify the CSV source, from which
the proper value could be defined and utilized automatically?

Thanks,
- Ken

Re: Importing Postgres-generated CSVs into Phoenix

Posted by Ken Hampson <ha...@gmail.com>.

Sure, no problem. I've added a comment indicating my interest to SQOOP-2649.

Thanks,
- Ken

On Fri, Jan 8, 2016 at 8:33 PM James Taylor <ja...@apache.org> wrote:

> Sounds good, Ken. If you could comment on SQOOP-2649 that you'd be
> interested in using that functionality if it were available, that'd be much
> appreciated.
>
> Thanks,
> James
>
> On Fri, Jan 8, 2016 at 5:04 PM, Ken Hampson <ha...@gmail.com> wrote:
>
>> Hi, James:
>>
>> Thanks for the reply. I took a look at the patch attached to
>> PHOENIX-2434, and it does look like it has a good chance to solve the
>> problem, as Postgres exports boolean values as 't' and 'f', which the patch
>> now checks for. Once we're able to get to Phoenix 4.7.0, I'll definitely
>> try it out to confirm.
>>
>> We don't and won't be able to use Flyway due to various architectural
>> considerations, but we do use Sqoop to get data from Postgres into Hive. I
>> had skimmed  SQOOP-2649 when originally looking at how to bulk load into
>> Phoenix, and figured CSV was the way to go for now since the  SQOOP-2649
>> enhancement was still in patch form. It's definitely something I will keep
>> an eye on going forward.
>>
>> Thanks again,
>> - Ken
>>
>>
>> On Fri, Jan 8, 2016 at 1:13 PM James Taylor <ja...@apache.org>
>> wrote:
>>
>>> Hi Ken,
>>> PHOENIX-2434 improved our CSV handling of booleans and will appear in
>>> our upcoming 4.7.0 release. It'd be good if you can confirm whether or not
>>> this is what you need. We definitely want to support ingest of CSVs from
>>> other RDBMSs.
>>>
>>> There are a couple of other avenues of ingest into Phoenix:
>>> - using Flyway DB (http://flywaydb.org/) where Phoenix support has been
>>> added (not sure if this has appeared in a release yet or not, though)
>>> - through a pull request to add support for Phoenix in Scoop (
>>> http://sqoop.apache.org/). Unfortunately, this isn't getting much
>>> attention in that community (if this is useful for you, you may want to
>>> chime in on SQOOP-2649 or on their mailing list).
>>>
>>> Thanks,
>>> James
>>>
>>> On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <ha...@gmail.com> wrote:
>>>
>>>> Hi, all:
>>>>
>>>> I am in the process of importing some data from Postgres into Phoenix
>>>> as part of a larger project. I noticed after doing so and cross-checking
>>>> the results of a fairly complex JOIN query (of 6 tables of so) that the
>>>> number of rows returned by Postgres and Phoenix differed -- Phoenix had a
>>>> significant number more rows. I investigated this further, and ultimately
>>>> determined that the reason was that all *boolean *values were imported
>>>> as being false, regardless of their actual value.
>>>>
>>>> When I investigated that aspect further, I discovered that Phoenix
>>>> appears to require the full strings "true" and "false" in the CSV in order
>>>> to properly register their actual values. As a workaround, I am currently
>>>> using a full SELECT in the "\copy" statement in psql to select each column
>>>> individually, and for each boolean column, I am using a CASE statement to
>>>> emit the full strings. I have since also learned that casting the boolean
>>>> to the "text" datatype will also cause Postgres to emit the full string.
>>>>
>>>> In either case, though, the pain point is that this discrepancy
>>>> requires that I examine each table closely for boolean columns, and if
>>>> there are any, I then need to enumerate all the columns out explicitly and
>>>> use either the CASE statement or the cast method on the boolean columns in
>>>> order to get the values Phoenix expects. In my case, I didn't need to
>>>> transform any of the data otherwise, so this forced a much more complex
>>>> procedure.
>>>>
>>>> While psql provides many options to change various values -- delimiters
>>>> and what not -- there is no option to change the boolean values. Likewise,
>>>> as near as I can tell, there are no options to change what Phoenix expects,
>>>> either.
>>>>
>>>> I am wondering:
>>>>
>>>> - Is there any way from within Phoenix to change the boolean values
>>>> that it expects in the CSV?
>>>> - Are there any other methods from a Phoenix perspective that you can
>>>> think of that would obviate the need for the column-by-column enumeration
>>>> procedure I outlined above?
>>>> - Since Postgres is a major open source RDBMS which I suspect many
>>>> people -- myself included -- will want to export data from into Phoenix,
>>>> could there be an enhancement to Phoenix to allow an option to specify a
>>>> default boolean value, or perhaps an option to specify the CSV source, from
>>>> which the proper value could be defined and utilized automatically?
>>>>
>>>> Thanks,
>>>> - Ken
>>>>
>>>
>>>
>

Re: Importing Postgres-generated CSVs into Phoenix

Posted by James Taylor <ja...@apache.org>.

Sounds good, Ken. If you could comment on SQOOP-2649 that you'd be
interested in using that functionality if it were available, that'd be much
appreciated.

Thanks,
James

On Fri, Jan 8, 2016 at 5:04 PM, Ken Hampson <ha...@gmail.com> wrote:

> Hi, James:
>
> Thanks for the reply. I took a look at the patch attached to PHOENIX-2434,
> and it does look like it has a good chance to solve the problem, as
> Postgres exports boolean values as 't' and 'f', which the patch now checks
> for. Once we're able to get to Phoenix 4.7.0, I'll definitely try it out to
> confirm.
>
> We don't and won't be able to use Flyway due to various architectural
> considerations, but we do use Sqoop to get data from Postgres into Hive. I
> had skimmed  SQOOP-2649 when originally looking at how to bulk load into
> Phoenix, and figured CSV was the way to go for now since the  SQOOP-2649
> enhancement was still in patch form. It's definitely something I will keep
> an eye on going forward.
>
> Thanks again,
> - Ken
>
>
> On Fri, Jan 8, 2016 at 1:13 PM James Taylor <ja...@apache.org>
> wrote:
>
>> Hi Ken,
>> PHOENIX-2434 improved our CSV handling of booleans and will appear in our
>> upcoming 4.7.0 release. It'd be good if you can confirm whether or not this
>> is what you need. We definitely want to support ingest of CSVs from other
>> RDBMSs.
>>
>> There are a couple of other avenues of ingest into Phoenix:
>> - using Flyway DB (http://flywaydb.org/) where Phoenix support has been
>> added (not sure if this has appeared in a release yet or not, though)
>> - through a pull request to add support for Phoenix in Scoop (
>> http://sqoop.apache.org/). Unfortunately, this isn't getting much
>> attention in that community (if this is useful for you, you may want to
>> chime in on SQOOP-2649 or on their mailing list).
>>
>> Thanks,
>> James
>>
>> On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <ha...@gmail.com> wrote:
>>
>>> Hi, all:
>>>
>>> I am in the process of importing some data from Postgres into Phoenix as
>>> part of a larger project. I noticed after doing so and cross-checking the
>>> results of a fairly complex JOIN query (of 6 tables of so) that the number
>>> of rows returned by Postgres and Phoenix differed -- Phoenix had a
>>> significant number more rows. I investigated this further, and ultimately
>>> determined that the reason was that all *boolean *values were imported
>>> as being false, regardless of their actual value.
>>>
>>> When I investigated that aspect further, I discovered that Phoenix
>>> appears to require the full strings "true" and "false" in the CSV in order
>>> to properly register their actual values. As a workaround, I am currently
>>> using a full SELECT in the "\copy" statement in psql to select each column
>>> individually, and for each boolean column, I am using a CASE statement to
>>> emit the full strings. I have since also learned that casting the boolean
>>> to the "text" datatype will also cause Postgres to emit the full string.
>>>
>>> In either case, though, the pain point is that this discrepancy requires
>>> that I examine each table closely for boolean columns, and if there are
>>> any, I then need to enumerate all the columns out explicitly and use either
>>> the CASE statement or the cast method on the boolean columns in order to
>>> get the values Phoenix expects. In my case, I didn't need to transform any
>>> of the data otherwise, so this forced a much more complex procedure.
>>>
>>> While psql provides many options to change various values -- delimiters
>>> and what not -- there is no option to change the boolean values. Likewise,
>>> as near as I can tell, there are no options to change what Phoenix expects,
>>> either.
>>>
>>> I am wondering:
>>>
>>> - Is there any way from within Phoenix to change the boolean values that
>>> it expects in the CSV?
>>> - Are there any other methods from a Phoenix perspective that you can
>>> think of that would obviate the need for the column-by-column enumeration
>>> procedure I outlined above?
>>> - Since Postgres is a major open source RDBMS which I suspect many
>>> people -- myself included -- will want to export data from into Phoenix,
>>> could there be an enhancement to Phoenix to allow an option to specify a
>>> default boolean value, or perhaps an option to specify the CSV source, from
>>> which the proper value could be defined and utilized automatically?
>>>
>>> Thanks,
>>> - Ken
>>>
>>
>>

Re: Importing Postgres-generated CSVs into Phoenix

Posted by Ken Hampson <ha...@gmail.com>.

Hi, James:

Thanks for the reply. I took a look at the patch attached to PHOENIX-2434,
and it does look like it has a good chance to solve the problem, as
Postgres exports boolean values as 't' and 'f', which the patch now checks
for. Once we're able to get to Phoenix 4.7.0, I'll definitely try it out to
confirm.

We don't and won't be able to use Flyway due to various architectural
considerations, but we do use Sqoop to get data from Postgres into Hive. I
had skimmed  SQOOP-2649 when originally looking at how to bulk load into
Phoenix, and figured CSV was the way to go for now since the  SQOOP-2649
enhancement was still in patch form. It's definitely something I will keep
an eye on going forward.

Thanks again,
- Ken

On Fri, Jan 8, 2016 at 1:13 PM James Taylor <ja...@apache.org> wrote:

> Hi Ken,
> PHOENIX-2434 improved our CSV handling of booleans and will appear in our
> upcoming 4.7.0 release. It'd be good if you can confirm whether or not this
> is what you need. We definitely want to support ingest of CSVs from other
> RDBMSs.
>
> There are a couple of other avenues of ingest into Phoenix:
> - using Flyway DB (http://flywaydb.org/) where Phoenix support has been
> added (not sure if this has appeared in a release yet or not, though)
> - through a pull request to add support for Phoenix in Scoop (
> http://sqoop.apache.org/). Unfortunately, this isn't getting much
> attention in that community (if this is useful for you, you may want to
> chime in on SQOOP-2649 or on their mailing list).
>
> Thanks,
> James
>
> On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <ha...@gmail.com> wrote:
>
>> Hi, all:
>>
>> I am in the process of importing some data from Postgres into Phoenix as
>> part of a larger project. I noticed after doing so and cross-checking the
>> results of a fairly complex JOIN query (of 6 tables of so) that the number
>> of rows returned by Postgres and Phoenix differed -- Phoenix had a
>> significant number more rows. I investigated this further, and ultimately
>> determined that the reason was that all *boolean *values were imported
>> as being false, regardless of their actual value.
>>
>> When I investigated that aspect further, I discovered that Phoenix
>> appears to require the full strings "true" and "false" in the CSV in order
>> to properly register their actual values. As a workaround, I am currently
>> using a full SELECT in the "\copy" statement in psql to select each column
>> individually, and for each boolean column, I am using a CASE statement to
>> emit the full strings. I have since also learned that casting the boolean
>> to the "text" datatype will also cause Postgres to emit the full string.
>>
>> In either case, though, the pain point is that this discrepancy requires
>> that I examine each table closely for boolean columns, and if there are
>> any, I then need to enumerate all the columns out explicitly and use either
>> the CASE statement or the cast method on the boolean columns in order to
>> get the values Phoenix expects. In my case, I didn't need to transform any
>> of the data otherwise, so this forced a much more complex procedure.
>>
>> While psql provides many options to change various values -- delimiters
>> and what not -- there is no option to change the boolean values. Likewise,
>> as near as I can tell, there are no options to change what Phoenix expects,
>> either.
>>
>> I am wondering:
>>
>> - Is there any way from within Phoenix to change the boolean values that
>> it expects in the CSV?
>> - Are there any other methods from a Phoenix perspective that you can
>> think of that would obviate the need for the column-by-column enumeration
>> procedure I outlined above?
>> - Since Postgres is a major open source RDBMS which I suspect many people
>> -- myself included -- will want to export data from into Phoenix, could
>> there be an enhancement to Phoenix to allow an option to specify a default
>> boolean value, or perhaps an option to specify the CSV source, from which
>> the proper value could be defined and utilized automatically?
>>
>> Thanks,
>> - Ken
>>
>
>

Re: Importing Postgres-generated CSVs into Phoenix

Posted by James Taylor <ja...@apache.org>.

Hi Ken,
PHOENIX-2434 improved our CSV handling of booleans and will appear in our
upcoming 4.7.0 release. It'd be good if you can confirm whether or not this
is what you need. We definitely want to support ingest of CSVs from other
RDBMSs.

There are a couple of other avenues of ingest into Phoenix:
- using Flyway DB (http://flywaydb.org/) where Phoenix support has been
added (not sure if this has appeared in a release yet or not, though)
- through a pull request to add support for Phoenix in Scoop (
http://sqoop.apache.org/). Unfortunately, this isn't getting much attention
in that community (if this is useful for you, you may want to chime in
on SQOOP-2649 or on their mailing list).

Thanks,
James

On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <ha...@gmail.com> wrote:

> Hi, all:
>
> I am in the process of importing some data from Postgres into Phoenix as
> part of a larger project. I noticed after doing so and cross-checking the
> results of a fairly complex JOIN query (of 6 tables of so) that the number
> of rows returned by Postgres and Phoenix differed -- Phoenix had a
> significant number more rows. I investigated this further, and ultimately
> determined that the reason was that all *boolean *values were imported as
> being false, regardless of their actual value.
>
> When I investigated that aspect further, I discovered that Phoenix appears
> to require the full strings "true" and "false" in the CSV in order to
> properly register their actual values. As a workaround, I am currently
> using a full SELECT in the "\copy" statement in psql to select each column
> individually, and for each boolean column, I am using a CASE statement to
> emit the full strings. I have since also learned that casting the boolean
> to the "text" datatype will also cause Postgres to emit the full string.
>
> In either case, though, the pain point is that this discrepancy requires
> that I examine each table closely for boolean columns, and if there are
> any, I then need to enumerate all the columns out explicitly and use either
> the CASE statement or the cast method on the boolean columns in order to
> get the values Phoenix expects. In my case, I didn't need to transform any
> of the data otherwise, so this forced a much more complex procedure.
>
> While psql provides many options to change various values -- delimiters
> and what not -- there is no option to change the boolean values. Likewise,
> as near as I can tell, there are no options to change what Phoenix expects,
> either.
>
> I am wondering:
>
> - Is there any way from within Phoenix to change the boolean values that
> it expects in the CSV?
> - Are there any other methods from a Phoenix perspective that you can
> think of that would obviate the need for the column-by-column enumeration
> procedure I outlined above?
> - Since Postgres is a major open source RDBMS which I suspect many people
> -- myself included -- will want to export data from into Phoenix, could
> there be an enhancement to Phoenix to allow an option to specify a default
> boolean value, or perhaps an option to specify the CSV source, from which
> the proper value could be defined and utilized automatically?
>
> Thanks,
> - Ken
>