Posted to dev@phoenix.apache.org by "James Taylor (JIRA)" <ji...@apache.org> on 2016/01/13 02:05:39 UTC
[jira] [Comment Edited] (PHOENIX-2446) Immutable index - Index vs base table row count does not match when index is created during data load

    [ https://issues.apache.org/jira/browse/PHOENIX-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095360#comment-15095360 ] 

James Taylor edited comment on PHOENIX-2446 at 1/13/16 1:05 AM:
----------------------------------------------------------------

I'm not sure of the exact processing that occurs during a flush, but logically it writes the data that's in the memstore to disk. How long does the flush take, and if you sleep for that amount of time (instead of doing the flush), do your tests pass?

Have you tried running the UPSERT/SELECT at a lower priority?

I suppose we need to understand the timing of everything to understand why it's failing. Is the data in flight when the index population runs? Or does the data arrive at the server while the UPSERT/SELECT is running?

Would be good to make a timeline like this:
* Compile CREATE INDEX IDX ON T statement
** Resolve table T, getting back a server timestamp of t0.
* Execute CREATE INDEX statement
** Create HBase metadata for new IDX table (index on view or local index may not create any new metadata)
** Insert Phoenix metadata for new index
** Execute UPSERT SELECT to populate index using t0 for scan and put timestamp
*** Execute n scans in parallel chunk-by-chunk for each guidepost and submit batch mutation for initial index population.
** Mark new index as active (i.e. updating Phoenix metadata through coprocessor call)

The above is time ordered, so the UPSERT SELECT is running as of an earlier timestamp. When does the other client's batch of mutations have to hit the server for the problem to occur?
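
To make the question concrete, here is a toy Python model of the timeline above. It is only a sketch and not Phoenix code: it assumes (and this is exactly the assumption to verify) that the concurrent writer starts maintaining the index only from some later point, called writer_sees_index_at here, while the initial population scans as of t0. The function and parameter names are made up for illustration.

```python
def simulate(write_timestamps, t0, writer_sees_index_at):
    """Return (base_row_count, index_row_count) for a toy immutable table.

    Each write is identified by its timestamp. All names are hypothetical.
    """
    base, index = set(), set()

    # Concurrent client: always writes the base row, but writes the index row
    # only once its view of the metadata includes the index (ASSUMPTION).
    for ts in write_timestamps:
        base.add(ts)
        if ts >= writer_sees_index_at:
            index.add(ts)

    # Initial index population: the scan runs as of t0, so only base rows
    # with ts < t0 are visible to the UPSERT SELECT.
    for ts in base:
        if ts < t0:
            index.add(ts)

    return len(base), len(index)


if __name__ == "__main__":
    # Writes at ts 1..10, table resolved at t0=4, writer notices index at ts 7.
    print(simulate(range(1, 11), t0=4, writer_sees_index_at=7))  # (10, 7)
```

In this model, any write landing in the window [t0, writer_sees_index_at) is in the base table but in neither the initial population nor the incremental index writes. If the real failure looks like this, the row count gap should shrink to zero as that window closes.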



was (Author: jamestaylor):
I'm not sure of the exact processing that occurs during a flush, but logically it writes the data that's in the memstore to disk. How long does the flush take, and if you sleep for that amount of time (instead of doing the flush), do your tests pass?

Have you tried running the UPSERT/SELECT at a lower priority?

I suppose we need to understand the timing of everything to understand why it's failing. Is the data in flight when the index population runs? Or does the data arrive at the server while the UPSERT/SELECT is running?

Would be good to make a timeline like this:
* CREATE INDEX IDX ON T(x) statement compiled
** Table T is resolved at t0
* CREATE INDEX executed
** HBase metadata created for new IDX table (index on view or local index may not create any new metadata)
** Phoenix metadata inserted for new index
** UPSERT SELECT run to populate index using t0 for scan and put timestamp
*** Execute n scans in parallel chunk-by-chunk for each guidepost and submit batch mutation for initial index population.
** Mark new index as active (i.e. updating Phoenix metadata through coprocessor call)

The above is time ordered, so the UPSERT SELECT is running as of an earlier timestamp. When does the other client's batch of mutations have to hit the server for the problem to occur?


> Immutable index - Index vs base table row count does not match when index is created during data load
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-2446
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2446
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.6.0
>            Reporter: Mujtaba Chohan
>            Assignee: Thomas D'Silva
>             Fix For: 4.7.0
>
>         Attachments: PHOENIX-2446-wip.patch, PHOENIX-2446.patch
>
>
> I'll add more details later but here's the scenario that consistently produces wrong row count for index table vs base table for immutable async index.
> 1. Start data upsert
> 2. Create async index
> 3. Trigger M/R index build
> 4. Keep data upsert going in background during steps 2 and 3 and for a while after the M/R index build finishes.
> 5. End data upsert. 
> Now count with index enabled vs count with hint to not use index is off by a large factor. Will get a cleaner repro for this issue soon.


