
Posted to user@phoenix.apache.org by Tongzhou Wang <to...@gmail.com> on 2016/06/24 00:45:07 UTC

Bulk loading and index

Hi all,

I am writing to ask if there is a way to disable an index, then update it
through the MapReduce job (IndexTool). I want to bulk load a huge amount of
data, but index maintenance makes it very slow. It would be great if I could
disable an index, load the data, then use a MapReduce job to bring it back to
a usable state.

Also, does Phoenix's secondary index maintenance take TTL into account?

Thanks,
Tongzhou

Re: Bulk loading and index

Posted by Simon Wang <si...@airbnb.com>.

Thanks, James. 

I filed a JIRA at PHOENIX-3032 <https://issues.apache.org/jira/browse/PHOENIX-3032>. I am currently looking into the code to see if I can make this change. How would you suggest the logic should work? Having spent a few hours reading the code, I am considering a workflow like this:

1. Record the timestamps at `ALTER INDEX .. REBUILD ASYNC`.
2. In `PhoenixIndexImportMapper.map`, process iff the record’s timestamp is newer than the recorded timestamp.
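
A minimal sketch of the check in step 2, written in Java because that is what Phoenix is written in. `RebuildTimestampFilter` and `shouldProcess` are hypothetical names and not part of Phoenix; how the recorded timestamp reaches the mapper, and where the record's timestamp comes from, are left open here:

```java
/**
 * Hypothetical helper illustrating step 2; this is not a Phoenix class.
 * It only captures the comparison, not how the mapper obtains either value.
 */
final class RebuildTimestampFilter {
    // Timestamp recorded when ALTER INDEX .. REBUILD ASYNC was issued (step 1).
    private final long recordedTimestamp;

    RebuildTimestampFilter(long recordedTimestamp) {
        this.recordedTimestamp = recordedTimestamp;
    }

    /** Returns true iff the record is strictly newer than the recorded timestamp. */
    boolean shouldProcess(long recordTimestamp) {
        return recordTimestamp > recordedTimestamp;
    }
}
```

Inside `PhoenixIndexImportMapper.map` this would amount to returning early whenever `shouldProcess` is false. Whether the comparison should be strict or inclusive is a design choice the sketch does not settle.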

Could you share some thoughts on whether this is the correct approach? If so, which other classes should I look into (to add the `REBUILD ASYNC` token to the parser, etc.)?

Best,
Tongzhou



Re: Bulk loading and index

Posted by James Taylor <ja...@apache.org>.

Tongzhou,
Please file a JIRA for supporting ALTER INDEX .... REBUILD ASYNC. This
would be a good addition and not very difficult to implement. Contributions
are, of course, always welcome.
Regards,
James


Re: Bulk loading and index

Posted by Ankit Singhal <an...@gmail.com>.

Hi Tongzhou,

Maybe you can try dropping the current index and, after your upload is
completed, creating an async index. Then you can use IndexTool to
rebuild your index from scratch.

Source: https://phoenix.apache.org/secondary_indexing.html

CREATE INDEX async_index ON my_schema.my_table (v) ASYNC


But if you are only using CSVBulkLoadTool for the bulk load, then it will
automatically prepare and bulk load the index data as well, so index
maintenance would not be required.
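
The drop-and-recreate route above could be sketched as follows. This is only an illustration: `AsyncIndexRebuildPlan` is a hypothetical helper that assembles the statements and the IndexTool command line as strings and does not connect to Phoenix. The IndexTool class name and its `--schema`, `--data-table`, `--index-table` and `--output-path` options are the ones shown on the secondary indexing page linked above; please check them against your Phoenix version.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; it only builds strings and never talks to Phoenix. */
final class AsyncIndexRebuildPlan {
    private AsyncIndexRebuildPlan() {}

    /** Step 1: run before the bulk load so the load does no index maintenance. */
    static String dropIndexSql(String schema, String table, String index) {
        return "DROP INDEX IF EXISTS " + index + " ON " + schema + "." + table;
    }

    /** Step 2, after the load: ASYNC creates the index without populating it. */
    static String createAsyncIndexSql(String schema, String table,
                                      String index, String column) {
        return "CREATE INDEX " + index + " ON " + schema + "." + table
                + " (" + column + ") ASYNC";
    }

    /** Step 3: command line for the IndexTool MapReduce job that fills the index. */
    static List<String> indexToolCommand(String schema, String table,
                                         String index, String outputPath) {
        return Arrays.asList(
                "hbase", "org.apache.phoenix.mapreduce.index.IndexTool",
                "--schema", schema,
                "--data-table", table,
                "--index-table", index,
                "--output-path", outputPath);
    }
}
```

For example, `createAsyncIndexSql("my_schema", "my_table", "async_index", "v")` yields the same statement as above. The output path passed to `indexToolCommand` is a directory of your choosing for the generated HFiles.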

Regards,
Ankit Singhal


Re: Bulk loading and index

Posted by "Tongzhou Wang (Simon)" <to...@gmail.com>.

Hi Josh,

First, thanks for the response. 

As far as I can tell, a disabled index cannot be directly changed to USABLE. It must be rebuilt first. I am aware that I can do ALTER INDEX .... REBUILD. But, if I understand correctly, this is single-threaded and slow. I'm wondering if I can use the IndexTool MapReduce job in this case.

About TTL, I did some experiments. It turns out that Phoenix does not automatically remove an index entry when the corresponding table entry expires under the TTL setting. However, it is possible to set the same TTL on the index table so that the index stays in sync.

Best,
Tongzhou


Re: Bulk loading and index

Posted by Josh Elser <jo...@gmail.com>.

Hi Tongzhou,

Maybe you can try `ALTER INDEX index ON table DISABLE`, and then the
same command with USABLE after you update the index. Are you attempting
to do this incrementally? That is, a bulk load of data, then a bulk load of
index data, and repeat?

Regarding the TTL, I assume so, but I'm not certain.
