Posted to user@phoenix.apache.org by James Heather <ja...@mendeley.com> on 2015/09/08 20:26:04 UTC

missing rows after using performance.py

I've had another go running the performance.py script to upsert 
100,000,000 rows into a Phoenix table, and again I've ended up with 
around 500 rows missing.

Can anyone explain this, or reproduce it?

It is rather concerning: I'm reluctant to use Phoenix if I'm not sure 
whether rows will be silently dropped.
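
In case it helps anyone trying to reproduce this, the quickest check is to count the rows that actually landed in the table. A minimal sketch, assuming the Phoenix Query Server is running on its default port with the phoenixdb Python driver installed, and that performance.py created a table named PERFORMANCE_100000000 (adjust both to your setup):

    # Count the rows that actually landed after the load, and report the gap.
    # Assumptions: the Phoenix Query Server is on its default port (8765) and
    # performance.py created a table named PERFORMANCE_100000000.
    import phoenixdb

    conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute('SELECT COUNT(*) FROM PERFORMANCE_100000000')
        loaded = cursor.fetchone()[0]
        print('rows loaded: %d, missing: %d' % (loaded, 100000000 - loaded))
    finally:
        conn.close()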

James

Re: missing rows after using performance.py

Posted by Mujtaba Chohan <mu...@apache.org>.
Thanks James. Filed https://issues.apache.org/jira/browse/PHOENIX-2240.

On Tue, Sep 8, 2015 at 12:38 PM, James Heather <ja...@mendeley.com>
wrote:

> Thanks.
>
> I've discovered that the cause is even simpler. With 100M rows, you get
> collisions in the primary key in the CSV file. An experiment (capturing the
> CSV file, and counting the rows with a unique primary key) reveals that the
> number of unique primary keys is about 500 short of the full 100M. So the
> upserting is working as it should!
>
> I don't know if there's a way round this, because it does produce rather
> suspicious-looking results. It might be worth having the program emit a
> warning to this effect if the parameter size is large, or finding a way to
> increase the entropy in the primary keys that are generated, to ensure that
> there won't be collisions.
>
> It's a bit surprising no one has run into this before! Hopefully this
> script has been run on that many rows before... it seems a reasonable
> number for testing performance of a scalable database... (in fact I was
> planning to increase the row count somewhat).
>
> James
>
>
> On 08/09/15 20:16, James Taylor wrote:
>
> Hi James,
> Looks like currently you'll get an error log message generated if a row is
> attempted to be imported but cannot be (usually due to the data not being
> compatible with the schema). For psql.py, this would be the client side log
> and messages would look like this:
>             LOG.error("Error upserting record {}: {}", csvRecord,
> errorMessage);
>
> FWIW, we have a "strict" option for CSV loading (using the -s or --strict
> option) which is meant to cause the load to abort if bad data is found, but
> it doesn't look like this is currently checked (when bad data is
> encountered). I've filed PHOENIX-2239 for this.
>
> Thanks,
> James
>
> On Tue, Sep 8, 2015 at 11:26 AM, James Heather <james.heather@mendeley.com>
> wrote:
>
>> I've had another go running the performance.py script to upsert
>> 100,000,000 rows into a Phoenix table, and again I've ended up with around
>> 500 rows missing.
>>
>> Can anyone explain this, or reproduce it?
>>
>> It is rather concerning: I'm reluctant to use Phoenix if I'm not sure
>> whether rows will be silently dropped.
>>
>> James
>>
>
>
>

Re: missing rows after using performance.py

Posted by James Heather <ja...@mendeley.com>.
Thanks.

I've discovered that the cause is even simpler. With 100M rows, you get 
collisions in the primary key in the CSV file. An experiment (capturing 
the CSV file, and counting the rows with a unique primary key) reveals 
that the number of unique primary keys is about 500 short of the full 
100M. So the upserting is working as it should!
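
A minimal sketch of that counting experiment (the number of leading columns forming the primary key is an assumption here; adjust KEY_COLUMNS to match the table's actual key):

    # Count distinct primary keys in the generated CSV and report collisions.
    # KEY_COLUMNS is a guess at how many leading columns form the primary key.
    import csv
    import sys

    KEY_COLUMNS = 4  # hypothetical; set to the table's actual key width

    seen = set()
    total = 0
    with open(sys.argv[1], newline='') as f:
        for row in csv.reader(f):
            seen.add(tuple(row[:KEY_COLUMNS]))
            total += 1

    print('total rows:  %d' % total)
    print('unique keys: %d' % len(seen))
    print('duplicates:  %d' % (total - len(seen)))

For 100M rows that set needs a lot of memory; sorting just the key columns on disk (cut them out, then sort -u and wc -l) gives the same count without holding everything in RAM.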

I don't know if there's a way round this, because it does produce rather 
suspicious-looking results. It might be worth having the program emit a 
warning to this effect if the parameter size is large, or finding a way 
to increase the entropy in the primary keys that are generated, to 
ensure that there won't be collisions.
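
For a sense of scale, the birthday-problem arithmetic below shows why a shortfall of a few hundred is about what uniformly random keys give you at this volume. The keyspace sizes are illustrative only, not taken from performance.py's actual key generator:

    # Expected rows lost to duplicate keys when n keys are drawn uniformly at
    # random from a keyspace of size N: n - N*(1 - (1 - 1/N)**n), which is
    # roughly n*n/(2*N) when n is much smaller than N. Keyspace sizes below
    # are made up for illustration.
    def expected_duplicates(n, keyspace):
        return n - keyspace * (1.0 - (1.0 - 1.0 / keyspace) ** n)

    n = 10**8
    for keyspace in (10**12, 10**13, 10**14):
        print('keyspace %.0e: ~%.0f duplicate keys expected'
              % (keyspace, expected_duplicates(n, keyspace)))

A keyspace of around 10^13 puts the expected loss at about the 500 rows seen here; appending a strictly increasing component (such as the row index) to each generated key would remove the collisions entirely.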

It's a bit surprising no one has run into this before! Hopefully this 
script has been run on that many rows before... it seems a reasonable 
number for testing performance of a scalable database... (in fact I was 
planning to increase the row count somewhat).

James

On 08/09/15 20:16, James Taylor wrote:
> Hi James,
> Looks like currently you'll get an error log message generated if a row 
> is attempted to be imported but cannot be (usually due to the data not 
> being compatible with the schema). For psql.py, this would be the 
> client side log and messages would look like this:
>             LOG.error("Error upserting record {}: {}", csvRecord, 
> errorMessage);
>
> FWIW, we have a "strict" option for CSV loading (using the -s or 
> --strict option) which is meant to cause the load to abort if bad data 
> is found, but it doesn't look like this is currently checked (when bad 
> data is encountered). I've filed PHOENIX-2239 for this.
>
> Thanks,
> James
>
> On Tue, Sep 8, 2015 at 11:26 AM, James Heather 
> <james.heather@mendeley.com> wrote:
>
>     I've had another go running the performance.py script to upsert
>     100,000,000 rows into a Phoenix table, and again I've ended up
>     with around 500 rows missing.
>
>     Can anyone explain this, or reproduce it?
>
>     It is rather concerning: I'm reluctant to use Phoenix if I'm not
>     sure whether rows will be silently dropped.
>
>     James
>
>


Re: missing rows after using performance.py

Posted by James Taylor <ja...@apache.org>.
Hi James,
Looks like currently you'll get an error log message generated if a row is
attempted to be imported but cannot be (usually due to the data not being
compatible with the schema). For psql.py, this would be the client side log
and messages would look like this:
            LOG.error("Error upserting record {}: {}", csvRecord,
errorMessage);

FWIW, we have a "strict" option for CSV loading (using the -s or --strict
option) which is meant to cause the load to abort if bad data is found, but
it doesn't look like this is currently checked (when bad data is
encountered). I've filed PHOENIX-2239 for this.
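
For illustration only (this is not Phoenix's CSV loader), here is a sketch of the behaviour PHOENIX-2239 asks for: in strict mode a bad record aborts the load instead of only being logged.

    # Sketch of a strict-mode check: log bad records, and abort the whole load
    # on the first one when strict=True. All names here are hypothetical.
    import logging

    log = logging.getLogger('csv-loader')

    def load_csv(records, upsert, strict=False):
        errors = 0
        for record in records:
            try:
                upsert(record)
            except ValueError as err:  # e.g. value incompatible with the schema
                errors += 1
                log.error('Error upserting record %s: %s', record, err)
                if strict:
                    raise RuntimeError('aborting load on bad record: %s' % (record,))
        return errors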

Thanks,
James

On Tue, Sep 8, 2015 at 11:26 AM, James Heather <ja...@mendeley.com>
wrote:

> I've had another go running the performance.py script to upsert
> 100,000,000 rows into a Phoenix table, and again I've ended up with around
> 500 rows missing.
>
> Can anyone explain this, or reproduce it?
>
> It is rather concerning: I'm reluctant to use Phoenix if I'm not sure
> whether rows will be silently dropped.
>
> James
>