
Posted to commits@harmony.apache.org by "Xiaoming Gu (JIRA)" <ji...@apache.org> on 2008/09/02 05:38:44 UTC

[jira] Created: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

[drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32
---------------------------------------------------------------

                 Key: HARMONY-5965
                 URL: https://issues.apache.org/jira/browse/HARMONY-5965
             Project: Harmony
          Issue Type: Improvement
          Components: DRLVM
            Reporter: Xiaoming Gu


On IA32 there is a fast (1 cycle) LEA instruction for loading an effective address. LEA combines a shift-left and an addition. For example, LEA dst, src, 2, 4 computes dst = (src << 2) + 4. It is usually used for array element address calculation, but it is not limited to that.

In the current Ia32InstCodeSelector.cpp, the function that translates Op_Shladd HIR generates shl and add. Since LEA has the same semantics, we could use it to improve performance.
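
As a rough illustration of why the two forms are interchangeable, the sketch below models the value LEA computes for an operand of the form [base + index*scale + disp] and compares it with a shift-and-add. The class and method names (LeaSketch, lea, shladd) are made up for this example and are not DRLVM code. It also shows why the parentheses matter: in Java, as in C++, + binds tighter than <<.

```java
public class LeaSketch {
    // Hypothetical model of the value LEA produces for [base + index*scale + disp].
    // IA32 addressing only allows a scale of 1, 2, 4 or 8.
    static int lea(int base, int index, int scale, int disp) {
        if (scale != 1 && scale != 2 && scale != 4 && scale != 8) {
            throw new IllegalArgumentException("LEA scale must be 1, 2, 4 or 8");
        }
        return base + index * scale + disp;
    }

    // The shift-and-add pair that the issue proposes to replace: (value << shift) + addend.
    static int shladd(int value, int shift, int addend) {
        return (value << shift) + addend;
    }

    public static void main(String[] args) {
        int src = 5;
        // The example from the issue: dst = (src << 2) + 4
        System.out.println(shladd(src, 2, 4));       // 24
        System.out.println(lea(0, src, 1 << 2, 4));  // 24
        // Without parentheses, src << 2 + 4 parses as src << (2 + 4)
        System.out.println(src << 2 + 4);            // 320
    }
}
```

A shift of 1, 2 or 3 maps to a scale of 2, 4 or 8, which is the range an LEA operand can express.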

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Created: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by xiaoming gu <xi...@gmail.com>.

The work will be done in the HIR2LIR pass. Thanks.

Xiaoming

On Tue, Sep 2, 2008 at 2:56 PM, Xiao-Feng Li <xi...@gmail.com> wrote:

> On Tue, Sep 2, 2008 at 11:38 AM, Xiaoming Gu (JIRA) <ji...@apache.org>
> wrote:
> > [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32
> > ---------------------------------------------------------------
> >
> >                 Key: HARMONY-5965
> >                 URL: https://issues.apache.org/jira/browse/HARMONY-5965
> >             Project: Harmony
> >          Issue Type: Improvement
> >          Components: DRLVM
> >            Reporter: Xiaoming Gu
> >
> >
> > In IA32 there is a quick (1 cycle) LEA instruction for loading effective
> address. The function of LEA is a combination of shift-left and addition.
> For example LEA dst, src, 2, 4 does dst=src<<2+4. It's usually used but not
> limited in element address calculation for array.
> >
> > In current Ia32InstCodeSelector.cpp, the function for translating
> Op_Shladd HIR generates shl and add. Since LEA has the same semantic, we
> could deploy it to improve performance.
>
> Thanks for the proposal. Which phase do you suggest to implement this
> optimization, IR lowering?
>
> Thanks,
> xiaofeng
>
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
>
> --
> http://xiao-feng.blogspot.com
>

Re: [jira] Created: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by Xiao-Feng Li <xi...@gmail.com>.

On Tue, Sep 2, 2008 at 11:38 AM, Xiaoming Gu (JIRA) <ji...@apache.org> wrote:
> [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32
> ---------------------------------------------------------------
>
>                 Key: HARMONY-5965
>                 URL: https://issues.apache.org/jira/browse/HARMONY-5965
>             Project: Harmony
>          Issue Type: Improvement
>          Components: DRLVM
>            Reporter: Xiaoming Gu
>
>
> In IA32 there is a quick (1 cycle) LEA instruction for loading effective address. The function of LEA is a combination of shift-left and addition. For example LEA dst, src, 2, 4 does dst=src<<2+4. It's usually used but not limited in element address calculation for array.
>
> In current Ia32InstCodeSelector.cpp, the function for translating Op_Shladd HIR generates shl and add. Since LEA has the same semantic, we could deploy it to improve performance.

Thanks for the proposal. In which phase do you suggest implementing this
optimization: IR lowering?

Thanks,
xiaofeng

> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>



-- 
http://xiao-feng.blogspot.com

Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by xiaoming gu <xi...@gmail.com>.

I did some testing with Stefan Krause's benchmark set. You can see the
results in the H5901 thread; the benefits come from the DIV strength
reduction that is included in the latest patch for H5901.
I'm planning to merge H5965 into H5901 when I finish tuning the MUL strength
reduction. Thanks.

Xiaoming

On Fri, Sep 19, 2008 at 4:13 PM, Aleksey Shipilev <
aleksey.shipilev@gmail.com> wrote:

> Thanks for making steady progress, Xiaoming!
> Do you have composite scores for Stefan Krause's benchmark set with your
> patch included?
>
> Thanks,
> Aleksey.

Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by Aleksey Shipilev <al...@gmail.com>.

Thanks for making steady progress, Xiaoming!
Do you have composite scores for Stefan Krause's benchmark set with your
patch included?

Thanks,
Aleksey.


Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by xiaoming gu <xi...@gmail.com>.


Hi, Ian. Thanks for your helpful information.

X*100 is reduced to (((((X<<2)+X)<<2)+0)<<2) + ((((X<<2)+X)<<2)+0), and the
subexpression (((X<<2)+X)<<2)+0 is reused. I now understand what you mean by
better ILP and did some testing with X*10. The time is 5750 ms, which is no
improvement on my desktop, an Intel 4-core Q9550. I think the reason might be
the advanced micro-architecture of modern processors. The MUL strength
reduction strategy in Harmony does not take ILP into account; it always
consumes the produced operands from a stack as soon as possible.

There are SUB and NEG operations in the reduction, and I'm going to study the
details next week. You may refer to multiplybyconstant.cpp for the MUL
reduction; the DIV reduction is in simplifyTauDiv() and simplifyTauRem() in
simplifier.cpp. I hope they are helpful. Thanks.
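
To make the X*100 chain above easier to check, here is a small sketch that evaluates it step by step and computes the reused subexpression only once. The names (Mul100Sketch, shladd, times100) are invented for this illustration; this is not the code in multiplybyconstant.cpp.

```java
public class Mul100Sketch {
    // One shladd step: (value << shift) + addend.
    static int shladd(int value, int shift, int addend) {
        return (value << shift) + addend;
    }

    // x*100 following the chain described above.
    static int times100(int x) {
        int t5 = shladd(x, 2, x);     // (x << 2) + x = 5x
        int t20 = shladd(t5, 2, 0);   // (5x << 2) + 0 = 20x, the reused subexpression
        return shladd(t20, 2, t20);   // (20x << 2) + 20x = 100x
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -3, 12345, Integer.MAX_VALUE};
        for (int x : samples) {
            // Both sides wrap around identically in 32-bit int arithmetic.
            System.out.println(x + ": " + times100(x) + " vs " + (x * 100));
        }
    }
}
```

Each step depends on the previous one, which matches the "consume operands as soon as possible" strategy: the chain is short but fully serial.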

Xiaoming

Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by Ian Rogers <ro...@gmail.com>.

xiaoming gu wrote:
> Hi, all. I did something more for shladd=>LEA today. With the available MUL
> strength reduction,
> X*10 is reduced to (X<<2+X) <<1+0 and 0 is generated by a self XOR
> instruction (CASE 3).
> Actually this XOR is not necessary and could be eliminated in HIR2LIR pass.
> Following is the
> better instructions generated with the improve patch. Comparing with
> previous CASE 3, you may
> find XOR gone.
>
>
> CASE 4: MUL strength reduction - using LEA and taking care of 0
>
> I22: LEA t48(EDI):I_32,t47[v434(EBP)+v434(EBP)*t46(4)]:I_32 bcOff: 42 \l\
> I23: LEA t52(EDI):I_32,t51[t48(EDI)*t50(2)+t49(0)]:I_32 bcOff: 42 \l\
> I861: MOV v533[v521(ESP)+t532(-24)]:I_32,t52(EDI):I_32 bcOff: 43 \l\
> I860: MOV v535[v521(ESP)+t534(-28)]:I_32,t53(1):I_32 bcOff: 45 \l\
> I26: EmptyPseudoInst bcOff: 48 \l\
>
>                     CASE1      CASE2     CASE3      CASE4
> Time (msec)   6234         7688          5734          5704
> Normalized     1              1.233         0.920         0.915
>
>
> I'm going to submit the patch though it only brings small performance
> improvement (0.5%). Any
> comment is welcome. Thanks.
>
> Xiaoming
>   
Hi Xiaoming,

Inspired by all this discussion, I reworked some of this code in Jikes
RVM; the results are here [1]. We have a BURS instruction selector, so we
don't generate ShiftAdd operations. Instead, the tree pattern matcher
combines a shift and an add into an address [2], which we then turn into an
LEA [3]. Anyway, for a multiply by 10 we now produce:

3           int_move                t3si(I) = 0
3           int_shl                 t4si(I) = l0si(I,d), 1
3           int_add                 t5si(I) = t3si(I), t4si(I)
3           int_shl                 t6si(I) = l0si(I,d), 3
3           int_add                 t7si(I) = t5si(I), t6si(I)
3           int_move                t2si(I) = t7si(I)

that becomes:

3           ia32_lea                t6si(I) = <0+[l0si(I,d)*8]>DW
3           ia32_lea                t7si(I) = <[t6si(I)]+[l0si(I,d)*2]>DW

Rather than ((x<<2)+x)<<1 we produce (x<<3)+(x<<1), as I believe this
approach gives better opportunity for ILP. A better example is 100, for
which we generate:

3           ia32_lea                EDX(I) = <0+[EAX(I,d)*4]>DW
3           ia32_lea                EDX(I) = <[EDX(I)]+[EDX(I)*8]>DW
3           ia32_shl                EAX(I) AF CF OF PF SF ZF <-- 6
3           ia32_add                EAX(I) AF CF OF PF SF ZF <-- EDX(I)

That is, ((x*4)*(1+8))+x*64. If we generated the shift and adds for the
*64 from the preceding shift, we would end up losing the add in the 2nd
LEA and having to generate an extra add to compensate, i.e.:

3           ia32_lea               t1 = l0*4
3           ia32_lea               t2 = t1*8
3           ia32_lea               t3 = t2*2 + t2
3           ia32_add              t1 = t3 + t1

I believe we might as well do the shift by 6 and the add, as at least that
can be done in parallel with the LEAs. I mention this because I think this
may be what your multiply by 100 should look like. It would be nice if Jikes
RVM simplified constant division, and of course for multiplication we're
not considering subtractions to create the desired result
(help welcome :-) ).
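
The two multiply-by-100 sequences above can be checked with a short sketch. It only mirrors the arithmetic of the listings; the class and method names are made up for illustration and nothing here is Jikes RVM code.

```java
public class JikesMul100Sketch {
    // First listing: two LEAs, plus a shift by 6 that does not depend on them.
    static int parallelForm(int x) {
        int edx = 0 + x * 4;      // lea: 4x
        edx = edx + edx * 8;      // lea: 4x + 32x = 36x
        int eax = x << 6;         // shl: 64x, independent of both LEAs
        return eax + edx;         // add: 64x + 36x = 100x
    }

    // Second listing: everything derived from the preceding shift.
    static int chainedForm(int x) {
        int t1 = x * 4;           // 4x
        int t2 = t1 * 8;          // 32x
        int t3 = t2 * 2 + t2;     // 96x
        return t3 + t1;           // 96x + 4x = 100x
    }

    public static void main(String[] args) {
        for (int x : new int[] {1, 7, -42}) {
            System.out.println(x + ": " + parallelForm(x) + " " + chainedForm(x));
        }
    }
}
```

Both forms take four instructions. In the first, the shift by 6 has no dependency on the LEAs, so it can overlap with them. In the second, every step waits on the one before it.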

Regards,
Ian

[1] http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/rvmroot/trunk/rvm/src/org/jikesrvm/compilers/opt/Simplifier.java?revision=15001&view=markup#l_1379
[2] http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/rvmroot/trunk/rvm/src-generated/opt-burs/ia32/IA32.rules?view=markup#l_229
[3] http://jikesrvm.svn.sourceforge.net/viewvc/jikesrvm/rvmroot/trunk/rvm/src-generated/opt-burs/ia32/IA32.rules?view=markup#l_297

Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by xiaoming gu <xi...@gmail.com>.

Hi, all. I did some more work on shladd=>LEA today. With the available MUL
strength reduction, X*10 is reduced to (((X<<2)+X)<<1)+0, and the 0 is
generated by a self-XOR instruction (CASE 3). This XOR is not necessary and
can be eliminated in the HIR2LIR pass. Below are the better instructions
generated with the improved patch. Compared with the previous CASE 3, the
XOR is gone.


CASE 4: MUL strength reduction - using LEA and taking care of 0

I22: LEA t48(EDI):I_32,t47[v434(EBP)+v434(EBP)*t46(4)]:I_32 bcOff: 42 \l\
I23: LEA t52(EDI):I_32,t51[t48(EDI)*t50(2)+t49(0)]:I_32 bcOff: 42 \l\
I861: MOV v533[v521(ESP)+t532(-24)]:I_32,t52(EDI):I_32 bcOff: 43 \l\
I860: MOV v535[v521(ESP)+t534(-28)]:I_32,t53(1):I_32 bcOff: 45 \l\
I26: EmptyPseudoInst bcOff: 48 \l\

               CASE1    CASE2    CASE3    CASE4
Time (msec)    6234     7688     5734     5704
Normalized     1        1.233    0.920    0.915


I'm going to submit the patch, though it brings only a small performance
improvement (0.5%). Any comments are welcome. Thanks.
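
For reference, the "Normalized" row is each time divided by the CASE 1 time, and the 0.5% figure is the gain of CASE 4 over CASE 3. The sketch below redoes that arithmetic from the numbers in the table; the class and method names are made up for this example.

```java
import java.util.Locale;

public class NormalizeSketch {
    // Time relative to the baseline (CASE 1).
    static double normalized(int timeMs, int baselineMs) {
        return (double) timeMs / baselineMs;
    }

    // Relative gain of a new time over an old one, in percent.
    static double gainPercent(int oldMs, int newMs) {
        return 100.0 * (oldMs - newMs) / oldMs;
    }

    public static void main(String[] args) {
        int[] timesMs = {6234, 7688, 5734, 5704};  // CASE 1..4 from the table
        for (int i = 0; i < timesMs.length; i++) {
            System.out.println(String.format(Locale.ROOT, "CASE%d %.3f",
                    i + 1, normalized(timesMs[i], timesMs[0])));
        }
        System.out.println(String.format(Locale.ROOT, "CASE4 vs CASE3: %.1f%%",
                gainPercent(timesMs[2], timesMs[3])));
    }
}
```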

Xiaoming


On Wed, Sep 17, 2008 at 4:13 PM, Xiao-Feng Li <xi...@gmail.com> wrote:

> Xiaoming, Thanks for the explanation.
>
> Thanks,
> xiaofeng
>
> On Wed, Sep 17, 2008 at 3:35 PM, xiaoming gu <xi...@gmail.com>
> wrote:
> >  The 7.9% improvement comes from LEA's combined function (shift left + add)
> > and its quick execution (1 cycle) with special hardware optimizations. On
> > IA32, LEA was originally designed for computing addresses, but it is not
> > limited to that purpose. So we may use LEA LIR for shladd HIR in common
> > arithmetic calculations.
> >
> > Also, in the available MUL strength reduction (multiplybyconstant.cpp), some
> > of the code implies that LEA LIR is to be used for shladd HIR. But in the
> > later HIR2LIR pass, shladd HIR is transformed to SAL and ADD LIRs, so MUL
> > strength reduction never brings an improvement.
> >
> > Thanks. -Xiaoming
> >
> > On Wed, Sep 17, 2008 at 11:16 AM, Xiao-Feng Li <xiaofeng.li@gmail.com> wrote:
> >
> >> On Wed, Sep 17, 2008 at 10:29 AM, Xiaoming Gu (JIRA) <ji...@apache.org> wrote:
> >> >
> >> >     [ https://issues.apache.org/jira/browse/HARMONY-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >> >
> >> > Xiaoming Gu updated HARMONY-5965:
> >> > ---------------------------------
> >> >
> >> >    Attachment: H5965-V1.patch
> >> >
> >> > With this patch, shladd HIR can generate LEA LIR when the data is of I4
> >> > type and the shift-left amount is 1, 2, or 3.
> >> >
> >> > Note: A new MemOpndKind, "MemOpndKind_LEA", is created because we use the
> >> > memory operand in LEA LIR only for common arithmetic calculation, not for
> >> > real memory address computation. If we still use MemOpndKind_Heap, some
> >> > verifications fail in the debug version.
> >> >
> >> > Then I turned on MUL strength reduction and got the following improvement
> >> > with a synthetic example.
> >> >
> >> > hotspot of source code:
> >> >    for(int i=0;i<times;i++) // times=2,000,000,000
> >> >        result = result*multiplier; // multiplier=10; x*10 is transformed to (((x<<2)+x)<<1)+0
> >> >
> >> > Following is the code generated for "result = result*multiplier;".
> >> >
> >> > CASE 1: No MUL strength reduction - using IMUL
> >> > I868: MOV s47(EDI):I_32,v426(ESI):I_32 \l\
> >> > I867: MOV t351(EBP):I_32,t46(10):I_32 \l\
> >> > I22: (ID:s16(EFLGS):U_32) =IMUL s47(EDI):I_32,t351(EBP):I_32  bcOff: 42 \l\
> >> > I866: MOV v527[v513(ESP)+t526(-28)]:I_32,s47(EDI):I_32  bcOff: 43 \l\
> >> > I865: MOV v529[v513(ESP)+t528(-32)]:I_32,t48(1):I_32  bcOff: 45 \l\
> >> > I25: EmptyPseudoInst  bcOff: 48 \l\
> >> >
> >> > CASE 2: MUL strength reduction - using SAL and ADD
> >> > I884: MOV s47(EBP):I_32,v438(ESI):I_32 \l\
> >> > I23: (ID:s16(EFLGS):U_32) =SAL s47(EBP):I_32,t46(2):U_8  bcOff: 42 \l\
> >> > I883: MOV s54(EDI):I_32,v438(ESI):I_32 \l\
> >> > I24: (ID:s16(EFLGS):U_32) =ADD s54(EDI):I_32,s47(EBP):I_32  bcOff: 42 \l\
> >> > I116: (AD:s54(EDI):I_32) =CopyPseudoInst/MOV (AU:s54(EDI):I_32) \l\
> >> > I26: (ID:s16(EFLGS):U_32) =SAL s54(EDI):I_32,t51(1):U_8  bcOff: 42 \l\
> >> > I117: (AD:s54(EDI):I_32) =CopyPseudoInst/MOV (AU:s54(EDI):I_32) \l\
> >> > I27: (ID:s16(EFLGS):U_32) =ADD s54(EDI):I_32,t50(0):I_32  bcOff: 42 \l\
> >> > I882: MOV v539[v525(ESP)+t538(-28)]:I_32,s54(EDI):I_32  bcOff: 43 \l\
> >> > I881: MOV v541[v525(ESP)+t540(-32)]:I_32,t55(1):I_32  bcOff: 45 \l\
> >> > I30: EmptyPseudoInst  bcOff: 48 \l\
> >> >
> >> > CASE 3: MUL strength reduction - using LEA
> >> > I22: LEA t48(EBP):I_32,t47[v436(ESI)+v436(ESI)*t46(4)]:I_32  bcOff: 42 \l\
> >> > I868: (ID:s16(EFLGS):U_32) =XOR t361(EDI):I_32,t361(EDI):I_32 \l\
> >> > I23: LEA t52(EDI):I_32,t51[t361(EDI)+t48(EBP)*t50(2)]:I_32  bcOff: 42 \l\
> >> > I867: MOV v537[v523(ESP)+t536(-28)]:I_32,t52(EDI):I_32  bcOff: 43 \l\
> >> > I866: MOV v539[v523(ESP)+t538(-32)]:I_32,t53(1):I_32  bcOff: 45 \l\
> >> > I26: EmptyPseudoInst  bcOff: 48 \l\
> >> >
> >> >                CASE1    CASE2    CASE3
> >> > Time (msec)    6234     7688     5734
> >>
> >> Good job!  The improvement looks good. It is about 7.9%. Thanks.
> >>
> >> Thanks,
> >> xiaofeng
> >>
> >> > I'm going to spend more time for H5901 to adjust MUL strength
> reduction.
> >> >
> >> >> [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32
> >> >> ---------------------------------------------------------------
> >> >>
> >> >>                 Key: HARMONY-5965
> >> >>                 URL:
> https://issues.apache.org/jira/browse/HARMONY-5965
> >> >>             Project: Harmony
> >> >>          Issue Type: Improvement
> >> >>          Components: DRLVM
> >> >>            Reporter: Xiaoming Gu
> >> >>         Attachments: H5965-V1.patch
> >> >>
> >> >>
> >> >> In IA32 there is a quick (1 cycle) LEA instruction for loading
> effective
> >> address. The function of LEA is a combination of shift-left and
> addition.
> >> For example LEA dst, src, 2, 4 does dst=src<<2+4. It's usually used but
> not
> >> limited in element address calculation for array.
> >> >> In current Ia32InstCodeSelector.cpp, the function for translating
> >> Op_Shladd HIR generates shl and add. Since LEA has the same semantic, we
> >> could deploy it to improve performance.
> >> >
> >> > --
> >> > This message is automatically generated by JIRA.
> >> > -
> >> > You can reply to this email to add a comment to the issue online.
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> http://xiao-feng.blogspot.com
> >>
> >
>
>
>
> --
> http://xiao-feng.blogspot.com
>

Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by Xiao-Feng Li <xi...@gmail.com>.

Xiaoming, Thanks for the explanation.

Thanks,
xiaofeng




-- 
http://xiao-feng.blogspot.com

Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by xiaoming gu <xi...@gmail.com>.

The 7.9% improvement comes from LEA combining two operations (shift left
and add) in one instruction that executes quickly (1 cycle) with special
hardware optimizations. In IA32, LEA was originally designed for address
computation, but it is not limited to that purpose, so we can use the LEA
LIR for the shladd HIR in ordinary arithmetic as well.

The existing MUL strength reduction (multiplybyconstant.cpp) already
contains code that assumes shladd HIR will be lowered to LEA LIR. However,
the later HIR2LIR pass translates shladd HIR into SAL and ADD LIRs, so MUL
strength reduction never yields an improvement.

Thanks. -Xiaoming


Re: [jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by Xiao-Feng Li <xi...@gmail.com>.

On Wed, Sep 17, 2008 at 10:29 AM, Xiaoming Gu (JIRA) <ji...@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/HARMONY-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Xiaoming Gu updated HARMONY-5965:
> ---------------------------------
>
>    Attachment: H5965-V1.patch
>
> With this patch, shladd HIR could generate LEA LIR when the data is I4 type and shift-left bit is 1/2/3.
>
> Note: A new MemOpndKind "MemOpndKind_LEA" is created because we just use the memory operand in LEA LIR to do common arithmetic calculation not for real memory address computation. If we still use MemOpndKind_Heap, there are some failed verifications in debug version.
>
> Then I turned on MUL strength reduction and get the following improvement with a synthetic example.
>
> hotspot of source code:
>    for(int i=0;i<times;i++) // times=2,000,000,000
>        result = result*multiplier; //multiplier=10, which is transformed from x*10 to (((x<<2)+x)<<1)+0
>
> Following is the binary code generated for "result = result*multiplier;".
>
> CASE 1: No MUL strength reduction - using IMUL
> I868: MOV s47(EDI):I_32,v426(ESI):I_32 \l\
> I867: MOV t351(EBP):I_32,t46(10):I_32 \l\
> I22: (ID:s16(EFLGS):U_32) =IMUL s47(EDI):I_32,t351(EBP):I_32  bcOff: 42 \l\
> I866: MOV v527[v513(ESP)+t526(-28)]:I_32,s47(EDI):I_32  bcOff: 43 \l\
> I865: MOV v529[v513(ESP)+t528(-32)]:I_32,t48(1):I_32  bcOff: 45 \l\
> I25: EmptyPseudoInst  bcOff: 48 \l\
>
> CASE 2: MUL strength reduction - using SAL and ADD
> I884: MOV s47(EBP):I_32,v438(ESI):I_32 \l\
> I23: (ID:s16(EFLGS):U_32) =SAL s47(EBP):I_32,t46(2):U_8  bcOff: 42 \l\
> I883: MOV s54(EDI):I_32,v438(ESI):I_32 \l\
> I24: (ID:s16(EFLGS):U_32) =ADD s54(EDI):I_32,s47(EBP):I_32  bcOff: 42 \l\
> I116: (AD:s54(EDI):I_32) =CopyPseudoInst/MOV (AU:s54(EDI):I_32) \l\
> I26: (ID:s16(EFLGS):U_32) =SAL s54(EDI):I_32,t51(1):U_8  bcOff: 42 \l\
> I117: (AD:s54(EDI):I_32) =CopyPseudoInst/MOV (AU:s54(EDI):I_32) \l\
> I27: (ID:s16(EFLGS):U_32) =ADD s54(EDI):I_32,t50(0):I_32  bcOff: 42 \l\
> I882: MOV v539[v525(ESP)+t538(-28)]:I_32,s54(EDI):I_32  bcOff: 43 \l\
> I881: MOV v541[v525(ESP)+t540(-32)]:I_32,t55(1):I_32  bcOff: 45 \l\
> I30: EmptyPseudoInst  bcOff: 48 \l\
>
> CASE 3: MUL strength reduction - using LEA
> I22: LEA t48(EBP):I_32,t47[v436(ESI)+v436(ESI)*t46(4)]:I_32  bcOff: 42 \l\
> I868: (ID:s16(EFLGS):U_32) =XOR t361(EDI):I_32,t361(EDI):I_32 \l\
> I23: LEA t52(EDI):I_32,t51[t361(EDI)+t48(EBP)*t50(2)]:I_32  bcOff: 42 \l\
> I867: MOV v537[v523(ESP)+t536(-28)]:I_32,t52(EDI):I_32  bcOff: 43 \l\
> I866: MOV v539[v523(ESP)+t538(-32)]:I_32,t53(1):I_32  bcOff: 45 \l\
> I26: EmptyPseudoInst  bcOff: 48 \l\
>
>                               CASE1         CASE2           CASE3
> Time (msec)        6234             7688                5734

Good job!  The improvement looks good. It is about 7.9%. Thanks.

Thanks,
xiaofeng

> I'm going to spend more time for H5901 to adjust MUL strength reduction.
>



-- 
http://xiao-feng.blogspot.com

[jira] Closed: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by "Xiaoming Gu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HARMONY-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoming Gu closed HARMONY-5965.
--------------------------------

    Resolution: Fixed

Merged to H5901.


[jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by "Xiaoming Gu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HARMONY-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoming Gu updated HARMONY-5965:
---------------------------------

    Attachment: H5965-V2.patch

In the existing MUL strength reduction, X*10 is reduced to (((X<<2)+X)<<1)+0, and the 0 is generated by a self-XOR instruction (CASE 3). This XOR is not necessary, and the improved patch eliminates it in the HIR2LIR pass. The instructions generated with the improved patch are shown below; compared with CASE 3, the XOR is gone.


CASE 4: MUL strength reduction - using LEA and taking care of 0

I22: LEA t48(EDI):I_32,t47[v434(EBP)+v434(EBP)*t46(4)]:I_32 bcOff: 42
I23: LEA t52(EDI):I_32,t51[t48(EDI)*t50(2)+t49(0)]:I_32 bcOff: 42
I861: MOV v533[v521(ESP)+t532(-24)]:I_32,t52(EDI):I_32 bcOff: 43
I860: MOV v535[v521(ESP)+t534(-28)]:I_32,t53(1):I_32 bcOff: 45
I26: EmptyPseudoInst bcOff: 48

               CASE1    CASE2    CASE3    CASE4
Time (msec)    6234     7688     5734     5704
Normalized     1        1.233    0.920    0.915

We get a further 0.5% improvement on the synthetic example.
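
To illustrate the difference between CASE 3 and CASE 4, here is a minimal Java sketch that models the two LEA sequences in 32-bit arithmetic. It is not the code in the patch; the class and method names are hypothetical, and the self-XOR is modeled with an ordinary local variable.

```java
public class LeaZeroOperand {
    /** Models LEA dst, [base + index*scale + disp] with wrapping 32-bit arithmetic. */
    static int lea(int base, int index, int scale, int disp) {
        return base + index * scale + disp;
    }

    /** Models LEA dst, [index*scale + disp], i.e. an address form with no base register. */
    static int leaNoBase(int index, int scale, int disp) {
        return index * scale + disp;
    }

    /** CASE 3 shape: a zero is materialized in a register and used as the base. */
    static int case3(int x) {
        int t48 = lea(x, x, 4, 0);      // x + x*4
        int zeroReg = x;
        zeroReg ^= zeroReg;             // self-XOR, stands in for the XOR instruction
        return lea(zeroReg, t48, 2, 0); // 0 + (x*5)*2
    }

    /** CASE 4 shape: the zero is folded into the address form, so no XOR is needed. */
    static int case4(int x) {
        int t48 = lea(x, x, 4, 0);      // x + x*4
        return leaNoBase(t48, 2, 0);    // (x*5)*2 + 0
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (case3(x) != x * 10 || case4(x) != x * 10) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("CASE 3 and CASE 4 shapes both compute x*10");
    }
}
```

Both shapes compute the same value; CASE 4 simply needs one instruction fewer per iteration.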


[jira] Updated: (HARMONY-5965) [drlvm][jit]generate Mnemonic_LEA LIR for Op_Shladd HIR in IA32

Posted by "Xiaoming Gu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HARMONY-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoming Gu updated HARMONY-5965:
---------------------------------

    Attachment: H5965-V1.patch

With this patch, shladd HIR generates LEA LIR when the data type is I4 and the shift-left amount is 1, 2, or 3.

Note: A new MemOpndKind, "MemOpndKind_LEA", is introduced because the memory operand in the LEA LIR is used here for ordinary arithmetic, not for real memory address computation. If MemOpndKind_Heap is still used, some verifications fail in the debug build.
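
As an illustration of the condition above (this is not the code in the patch), the following Java sketch models when a shladd can be folded into one LEA: the shift amount has to correspond to an IA32 index scale of 2, 4, or 8. The class and method names are hypothetical.

```java
public class ShladdToLea {
    /** Returns the LEA scale for a shift amount, or -1 if the shift is not 1, 2, or 3. */
    static int leaScale(int shiftAmount) {
        // IA32 addressing allows index scales 1, 2, 4, and 8; the patch described
        // above only handles shifts of 1, 2, and 3, i.e. scales 2, 4, and 8.
        if (shiftAmount < 1 || shiftAmount > 3) {
            return -1;
        }
        return 1 << shiftAmount;
    }

    /** Models shladd: (value << shiftAmount) + addend, in 32-bit arithmetic. */
    static int shladd(int value, int shiftAmount, int addend) {
        return (value << shiftAmount) + addend;
    }

    /** Models LEA dst, [base + index*scale]. */
    static int lea(int base, int index, int scale) {
        return base + index * scale;
    }

    public static void main(String[] args) {
        for (int shift = 1; shift <= 3; shift++) {
            int scale = leaScale(shift);
            boolean same = shladd(7, shift, 5) == lea(5, 7, scale);
            System.out.println("shift " + shift + " -> scale " + scale + ", equal: " + same);
        }
    }
}
```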

Then I turned on MUL strength reduction and got the following improvement on a synthetic example.

Hotspot of the source code:
    for (int i = 0; i < times; i++) // times = 2,000,000,000
        result = result * multiplier; // multiplier = 10; x*10 is transformed to (((x<<2)+x)<<1)+0
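
The transformation in the comment can be checked with a small Java sketch. It assumes plain 32-bit int arithmetic, where both forms wrap identically; the class and method names are hypothetical and are not taken from the benchmark.

```java
public class MulByTen {
    /** Plain multiply, corresponding to the IMUL form. */
    static int viaMultiply(int x) {
        return x * 10;
    }

    /** Strength-reduced form: (((x << 2) + x) << 1) + 0. */
    static int viaShiftAdd(int x) {
        int timesFive = (x << 2) + x;  // first shladd: x*4 + x
        return (timesFive << 1) + 0;   // second shladd: (x*5)*2 + 0
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (viaMultiply(x) != viaShiftAdd(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("all samples agree");
    }
}
```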

The following is the code generated for "result = result*multiplier;".

CASE 1: No MUL strength reduction - using IMUL
I868: MOV s47(EDI):I_32,v426(ESI):I_32
I867: MOV t351(EBP):I_32,t46(10):I_32
I22: (ID:s16(EFLGS):U_32) =IMUL s47(EDI):I_32,t351(EBP):I_32  bcOff: 42
I866: MOV v527[v513(ESP)+t526(-28)]:I_32,s47(EDI):I_32  bcOff: 43
I865: MOV v529[v513(ESP)+t528(-32)]:I_32,t48(1):I_32  bcOff: 45
I25: EmptyPseudoInst  bcOff: 48

CASE 2: MUL strength reduction - using SAL and ADD
I884: MOV s47(EBP):I_32,v438(ESI):I_32
I23: (ID:s16(EFLGS):U_32) =SAL s47(EBP):I_32,t46(2):U_8  bcOff: 42
I883: MOV s54(EDI):I_32,v438(ESI):I_32
I24: (ID:s16(EFLGS):U_32) =ADD s54(EDI):I_32,s47(EBP):I_32  bcOff: 42
I116: (AD:s54(EDI):I_32) =CopyPseudoInst/MOV (AU:s54(EDI):I_32)
I26: (ID:s16(EFLGS):U_32) =SAL s54(EDI):I_32,t51(1):U_8  bcOff: 42
I117: (AD:s54(EDI):I_32) =CopyPseudoInst/MOV (AU:s54(EDI):I_32)
I27: (ID:s16(EFLGS):U_32) =ADD s54(EDI):I_32,t50(0):I_32  bcOff: 42
I882: MOV v539[v525(ESP)+t538(-28)]:I_32,s54(EDI):I_32  bcOff: 43
I881: MOV v541[v525(ESP)+t540(-32)]:I_32,t55(1):I_32  bcOff: 45
I30: EmptyPseudoInst  bcOff: 48

CASE 3: MUL strength reduction - using LEA
I22: LEA t48(EBP):I_32,t47[v436(ESI)+v436(ESI)*t46(4)]:I_32  bcOff: 42
I868: (ID:s16(EFLGS):U_32) =XOR t361(EDI):I_32,t361(EDI):I_32
I23: LEA t52(EDI):I_32,t51[t361(EDI)+t48(EBP)*t50(2)]:I_32  bcOff: 42
I867: MOV v537[v523(ESP)+t536(-28)]:I_32,t52(EDI):I_32  bcOff: 43
I866: MOV v539[v523(ESP)+t538(-32)]:I_32,t53(1):I_32  bcOff: 45
I26: EmptyPseudoInst  bcOff: 48

               CASE1    CASE2    CASE3
Time (msec)    6234     7688     5734

I'm going to spend more time on H5901 to adjust MUL strength reduction.
