You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@harmony.apache.org by xiaoming gu <xi...@gmail.com> on 2008/10/11 06:44:19 UTC

discussion for H5826

Hi, all. I find the benefits of this patch come from changing a 64-bit MUL
(3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the
optimization is done as a magic replacement, which is not a common way to
generate code. (https://issues.apache.org/jira/browse/HARMONY-5826)

Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit
machine, the MUL operation is usually translated to (High 32-bit of A)*(Low
32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low
32-bit of A). But when we know High 32-bit of A and B are both 0, only (Low
32-bit of A)*(Low 32-bit of A) needed.

Following are the HIR and LIR for result = (a & ffffffffL) * (b & ffffffffL)
+ (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the
optimization in HIR simplifier or LIR peephole. I'm not sure whether
changing int64 operation to int32 operation will bring overhead for 64-bit
machine. I think maybe peephole is a better place. I find there is no
peephole optimization for XOR. If JIT could find out the result of XOR is 0,
then propagates the 0 to MUL and related MUL is eliminated. The left problem
is whether there is sufficient data-flow analysis in LIR to do the
propagation and elimination.
=====HIR=====
  I42:ldci8     #4294967295 -) t38:int64
  I43:and       t37, t38 -) t39:int64
  I44:convi8  g23 -) t40:int64
  I45:and       t40, t38 -) t41:int64
  I46:mul   t39, t41 -) t42:int64
=====LIR=====
    238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32
    238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32
    238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32
    238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32
    238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL
s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32
    238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32
    238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32
    238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32
    238B02BE I73: (ID:s8(EFLGS):U_32) =MUL
s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32
    238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32
    238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32
    DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV (AU:v19(EAX):I_32)

    238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL
s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32
    238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32
    238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32

Any comments? Thanks.

Xiaoming

Re: discussion for H5826

Posted by xiaoming gu <xi...@gmail.com>.

Hi, guys. I'm debugging a basic version of global propagation in LIR for
IA32. Now a simple test case passed but I encounter a problem when running
SPECjvm2008. If we do not use early_prop in server_static or opt (by editing
emconf files), then some error happens when running startup.helloworld. I
counldn't continue the debugging for global propagation since I don't know
whether the bug comes from it.

In my opinion, the optimizations should be as independent as possible.
Early_prop only does some propagations. I have no idea why it could not be
bypassed. I'm going to work on this problem. Any comment or idea? Thanks.

Xiaoming

On Wed, Oct 15, 2008 at 5:20 PM, xiaoming gu <xi...@gmail.com> wrote:

> Hi, guys. I made some progress today. Now we could get partly benefits by
> improving peephole in LIR.
>
> In current peephole, there is no actual optimization for "dst1, dst2 = MUL
> src1, src2(0)". After add this, one 32-bit MUL is eliminated and we got
> almost half part benefits (from about 200 to about 222). You may see
> optimization process with the following LIR.
>
> =======before early_prop=======
> BB_19
>   PersistentId = 20
>   ExecCnt = 9999.98
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_18
>   Successors:  BB_33 [Prob=1]
>     I41: o87:U_32 =MOV t19:I_32
>     I60: (AD:o89:U_32) =CopyPseudoInst/MOV (AU:o87:U_32)
>     I61: (AD:o90:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32)
>     I45: o93:U_32 =MOV t35:I_32
>     I62: (AD:o95:U_32) =CopyPseudoInst/MOV (AU:o93:U_32)
>     I63: (AD:o96:I_32) =CopyPseudoInst/MOV (AU:o104(0):I_32)
>
>
> BB_33
>   PersistentId = 0
>   ExecCnt = 0
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_19
>   Successors:  BB_34 [Prob=1]
>     I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o89:U_32)
>     I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o96:I_32
>     I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32)
>     I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o90:I_32)
>     I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o95:U_32
>     I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32
>     I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o89:U_32)
>     I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o95:U_32
>     I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32
>
> =======after early_prop=======
> BB_33
>   PersistentId = 0
>   ExecCnt = 0
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_19
>   Successors:  BB_34 [Prob=1]
>     I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o104(0):I_32
>     I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32)
>     I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32)
>     I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32   //o99
> is NOT propagate from o103(0)
>     I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32
>     I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32
>     I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32
>
> =======before 1st peephole=======
> BB_33
>   PersistentId = 0
>   ExecCnt = 0
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_19
>   Successors:  BB_34 [Prob=1]
>     I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o104(0):I_32
>     I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32)
>     I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32)
>     I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32
>     I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32
>     I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32
>     I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32
>
> =======after 1st peephole=======
> BB_33
>   PersistentId = 0
>   ExecCnt = 0
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_19
>   Successors:  BB_34 [Prob=1]
>     I49: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
>     I64: (AD:s100:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
>     I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32)
>     I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32)
>     I53: s100:I_32,s99:I_32 (ID:v8(EFLGS):U_32) =MUL s99:I_32,t35:I_32
>     I54: s101:I_32 (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32
>     I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I56: s100:I_32,s99:I_32 (ID:v8(EFLGS):U_32) =MUL s99:I_32,t35:I_32
>     I57: s100:I_32 (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32
>
> =======before 2nd cg_dce=======
> BB_33
>   PersistentId = 0
>   ExecCnt = 0
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_19
>   Successors:  BB_34 [Prob=1]
>     I49: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
>     I64: (AD:s100:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
>     I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32)
>     I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32)
>     I53: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,t35:I_32
>     I54: (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32
>     I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
>     I56: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,t35:I_32
>     I57: (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32
>
> =======after 2nd cg_dce=======
> BB_33
>   PersistentId = 0
>   ExecCnt = 0
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_18
>   Successors:  BB_30 [Prob=1]
>     I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) // 0 should
> be propagated to s101 in I54
>     I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32)    // then I65
> and I51 could be eliminated
>     I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32)
>     I53: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,v35:I_32
>     I54: (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32
>     I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:v19:I_32)
>     I56: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,v35:I_32
>     I57: (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32
>
> =======finally=======
> BB_33
>   PersistentId = 0
>   ExecCnt = 9999.99
>   Loop: Depth=0, !hdr, hdr=NULL
>   Predcessors: BB_18
>   Successors:  BB_30 [Prob=1]
> Layout Succ: BB_30
> Block code address: 238B0128
>     238B0128 I188: (ID:s8(EFLGS):U_32) =XOR t99(EDI):I_32,t99(EDI):I_32
>     DEADBEEF I51: (AD:s101(EDI):I_32) =CopyPseudoInst/MOV
> (AU:t99(EDI):I_32)
>     238B012A I187: (ID:s8(EFLGS):U_32) =XOR s130(EAX):I_32,s130(EAX):I_32
>     238B012C I53: (ID:s8(EFLGS):U_32) =MUL
> t100(EDX):I_32,s130(EAX):I_32,t35(EBP):I_32
>     238B012E I54: (ID:s8(EFLGS):U_32) =ADD s101(EDI):I_32,s130(EAX):I_32
>     238B0130 I186: MOV s153(EAX):I_32,t19(EBX):I_32
>     238B0132 I56: (ID:s8(EFLGS):U_32) =MUL
> s154(EDX):I_32,s153(EAX):I_32,t35(EBP):I_32
>     238B0134 I57: (ID:s8(EFLGS):U_32) =ADD s154(EDX):I_32,s101(EDI):I_32
>
> You may see this optimization is related to early_prop (propagate 0 to
> MUL), peephole (change MUL with 0 to MOV 0) and cg_dce (eliminate unused MOV
> 0 or other LIR). Now we have problems with early_prop. For early_prop the
> problem is that it only do propagation for the operands with unique
> definition. That's why 0 is not propagated to the 2nd unnecessary MUL (even
> local propagation could do). We need to use better data-flow analysis to do
> the propagation. For the left unnecessary instructions after cg_dce, I'm
> going to add another early_prop pass before it. Hope it works.
>
> Xiaoming
>
>
> On Mon, Oct 13, 2008 at 9:23 AM, Xiao-Feng Li <xi...@gmail.com>wrote:
>
>> On Sat, Oct 11, 2008 at 12:44 PM, xiaoming gu <xi...@gmail.com>
>> wrote:
>> > Hi, all. I find the benefits of this patch come from changing a 64-bit
>> MUL
>> > (3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the
>> > optimization is done as a magic replacement, which is not a common way
>> to
>> > generate code. (https://issues.apache.org/jira/browse/HARMONY-5826)
>>
>> Thanks for identifying the root cause of improvement.
>>
>> > Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit
>> > machine, the MUL operation is usually translated to (High 32-bit of
>> A)*(Low
>> > 32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low
>> > 32-bit of A). But when we know High 32-bit of A and B are both 0, only
>> (Low
>> > 32-bit of A)*(Low 32-bit of A) needed.
>> >
>> > Following are the HIR and LIR for result = (a & ffffffffL) * (b &
>> ffffffffL)
>> > + (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the
>> > optimization in HIR simplifier or LIR peephole. I'm not sure whether
>> > changing int64 operation to int32 operation will bring overhead for
>> 64-bit
>> > machine.
>>
>> This needs performance evaluation to answer. This patch looks like
>> specific to 32-bit machine, since original sequence of 3 MULs were for
>> 32-bit machine.
>>
>> > I think maybe peephole is a better place. I find there is no
>> > peephole optimization for XOR. If JIT could find out the result of XOR
>> is 0,
>> > then propagates the 0 to MUL and related MUL is eliminated. The left
>> problem
>> > is whether there is sufficient data-flow analysis in LIR to do the
>> > propagation and elimination.
>>
>> I guess the best place choice depends on the semantics of the
>> optimization. Since this is for instruction sequence optimization
>> rather than operation logic optimization, I agree it is better to be
>> done in LIR peephole.
>>
>> Thanks,
>> xiaofeng
>>
>> >
>> > =====HIR=====
>> >  I42:ldci8     #4294967295 -) t38:int64
>> >  I43:and       t37, t38 -) t39:int64
>> >  I44:convi8  g23 -) t40:int64
>> >  I45:and       t40, t38 -) t41:int64
>> >  I46:mul   t39, t41 -) t42:int64
>> > =====LIR=====
>> >    238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32
>> >    238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32
>> >    238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32
>> >    238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32
>> >    238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL
>> > s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32
>> >    238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32
>> >    238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32
>> >    238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32
>> >    238B02BE I73: (ID:s8(EFLGS):U_32) =MUL
>> > s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32
>> >    238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32
>> >    238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32
>> >    DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV
>> (AU:v19(EAX):I_32)
>> >
>> >    238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL
>> > s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32
>> >    238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32
>> >    238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32
>> >
>> > Any comments? Thanks.
>> >
>> > Xiaoming
>> >
>>
>>
>>
>> --
>> http://xiao-feng.blogspot.com
>>
>
>

Re: discussion for H5826

Posted by xiaoming gu <xi...@gmail.com>.

Hi, guys. I made some progress today. Now we could get partly benefits by
improving peephole in LIR.

In current peephole, there is no actual optimization for "dst1, dst2 = MUL
src1, src2(0)". After add this, one 32-bit MUL is eliminated and we got
almost half part benefits (from about 200 to about 222). You may see
optimization process with the following LIR.

=======before early_prop=======
BB_19
  PersistentId = 20
  ExecCnt = 9999.98
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_18
  Successors:  BB_33 [Prob=1]
    I41: o87:U_32 =MOV t19:I_32
    I60: (AD:o89:U_32) =CopyPseudoInst/MOV (AU:o87:U_32)
    I61: (AD:o90:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32)
    I45: o93:U_32 =MOV t35:I_32
    I62: (AD:o95:U_32) =CopyPseudoInst/MOV (AU:o93:U_32)
    I63: (AD:o96:I_32) =CopyPseudoInst/MOV (AU:o104(0):I_32)


BB_33
  PersistentId = 0
  ExecCnt = 0
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_19
  Successors:  BB_34 [Prob=1]
    I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o89:U_32)
    I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o96:I_32
    I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32)
    I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o90:I_32)
    I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o95:U_32
    I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32
    I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o89:U_32)
    I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o95:U_32
    I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32

=======after early_prop=======
BB_33
  PersistentId = 0
  ExecCnt = 0
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_19
  Successors:  BB_34 [Prob=1]
    I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o104(0):I_32
    I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32)
    I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32)
    I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32   //o99
is NOT propagate from o103(0)
    I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32
    I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32
    I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32

=======before 1st peephole=======
BB_33
  PersistentId = 0
  ExecCnt = 0
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_19
  Successors:  BB_34 [Prob=1]
    I49: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I50: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,o104(0):I_32
    I51: (AD:o101:I_32) =CopyPseudoInst/MOV (AU:o99:I_32)
    I52: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:o103(0):I_32)
    I53: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32
    I54: o101:I_32 (ID:v8(EFLGS):U_32) =ADD o101:I_32,o99:I_32
    I55: (AD:o99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I56: o100:I_32,o99:I_32 (ID:v8(EFLGS):U_32) =MUL o99:I_32,t35:I_32
    I57: o100:I_32 (ID:v8(EFLGS):U_32) =ADD o100:I_32,o101:I_32

=======after 1st peephole=======
BB_33
  PersistentId = 0
  ExecCnt = 0
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_19
  Successors:  BB_34 [Prob=1]
    I49: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
    I64: (AD:s100:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
    I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32)
    I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32)
    I53: s100:I_32,s99:I_32 (ID:v8(EFLGS):U_32) =MUL s99:I_32,t35:I_32
    I54: s101:I_32 (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32
    I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I56: s100:I_32,s99:I_32 (ID:v8(EFLGS):U_32) =MUL s99:I_32,t35:I_32
    I57: s100:I_32 (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32

=======before 2nd cg_dce=======
BB_33
  PersistentId = 0
  ExecCnt = 0
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_19
  Successors:  BB_34 [Prob=1]
    I49: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
    I64: (AD:s100:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32)
    I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32)
    I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32)
    I53: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,t35:I_32
    I54: (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32
    I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t19:I_32)
    I56: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,t35:I_32
    I57: (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32

=======after 2nd cg_dce=======
BB_33
  PersistentId = 0
  ExecCnt = 0
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_18
  Successors:  BB_30 [Prob=1]
    I65: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t104(0):I_32) // 0 should be
propagated to s101 in I54
    I51: (AD:s101:I_32) =CopyPseudoInst/MOV (AU:s99:I_32)    // then I65 and
I51 could be eliminated
    I52: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:t103(0):I_32)
    I53: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,v35:I_32
    I54: (ID:v8(EFLGS):U_32) =ADD s101:I_32,s99:I_32
    I55: (AD:s99:I_32) =CopyPseudoInst/MOV (AU:v19:I_32)
    I56: (ID:v8(EFLGS):U_32) =MUL s100:I_32,s99:I_32,v35:I_32
    I57: (ID:v8(EFLGS):U_32) =ADD s100:I_32,s101:I_32

=======finally=======
BB_33
  PersistentId = 0
  ExecCnt = 9999.99
  Loop: Depth=0, !hdr, hdr=NULL
  Predcessors: BB_18
  Successors:  BB_30 [Prob=1]
Layout Succ: BB_30
Block code address: 238B0128
    238B0128 I188: (ID:s8(EFLGS):U_32) =XOR t99(EDI):I_32,t99(EDI):I_32
    DEADBEEF I51: (AD:s101(EDI):I_32) =CopyPseudoInst/MOV (AU:t99(EDI):I_32)

    238B012A I187: (ID:s8(EFLGS):U_32) =XOR s130(EAX):I_32,s130(EAX):I_32
    238B012C I53: (ID:s8(EFLGS):U_32) =MUL
t100(EDX):I_32,s130(EAX):I_32,t35(EBP):I_32
    238B012E I54: (ID:s8(EFLGS):U_32) =ADD s101(EDI):I_32,s130(EAX):I_32
    238B0130 I186: MOV s153(EAX):I_32,t19(EBX):I_32
    238B0132 I56: (ID:s8(EFLGS):U_32) =MUL
s154(EDX):I_32,s153(EAX):I_32,t35(EBP):I_32
    238B0134 I57: (ID:s8(EFLGS):U_32) =ADD s154(EDX):I_32,s101(EDI):I_32

You may see this optimization is related to early_prop (propagate 0 to MUL),
peephole (change MUL with 0 to MOV 0) and cg_dce (eliminate unused MOV 0 or
other LIR). Now we have problems with early_prop. For early_prop the problem
is that it only do propagation for the operands with unique definition.
That's why 0 is not propagated to the 2nd unnecessary MUL (even local
propagation could do). We need to use better data-flow analysis to do the
propagation. For the left unnecessary instructions after cg_dce, I'm going
to add another early_prop pass before it. Hope it works.

Xiaoming

On Mon, Oct 13, 2008 at 9:23 AM, Xiao-Feng Li <xi...@gmail.com> wrote:

> On Sat, Oct 11, 2008 at 12:44 PM, xiaoming gu <xi...@gmail.com>
> wrote:
> > Hi, all. I find the benefits of this patch come from changing a 64-bit
> MUL
> > (3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the
> > optimization is done as a magic replacement, which is not a common way to
> > generate code. (https://issues.apache.org/jira/browse/HARMONY-5826)
>
> Thanks for identifying the root cause of improvement.
>
> > Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit
> > machine, the MUL operation is usually translated to (High 32-bit of
> A)*(Low
> > 32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low
> > 32-bit of A). But when we know High 32-bit of A and B are both 0, only
> (Low
> > 32-bit of A)*(Low 32-bit of A) needed.
> >
> > Following are the HIR and LIR for result = (a & ffffffffL) * (b &
> ffffffffL)
> > + (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the
> > optimization in HIR simplifier or LIR peephole. I'm not sure whether
> > changing int64 operation to int32 operation will bring overhead for
> 64-bit
> > machine.
>
> This needs performance evaluation to answer. This patch looks like
> specific to 32-bit machine, since original sequence of 3 MULs were for
> 32-bit machine.
>
> > I think maybe peephole is a better place. I find there is no
> > peephole optimization for XOR. If JIT could find out the result of XOR is
> 0,
> > then propagates the 0 to MUL and related MUL is eliminated. The left
> problem
> > is whether there is sufficient data-flow analysis in LIR to do the
> > propagation and elimination.
>
> I guess the best place choice depends on the semantics of the
> optimization. Since this is for instruction sequence optimization
> rather than operation logic optimization, I agree it is better to be
> done in LIR peephole.
>
> Thanks,
> xiaofeng
>
> >
> > =====HIR=====
> >  I42:ldci8     #4294967295 -) t38:int64
> >  I43:and       t37, t38 -) t39:int64
> >  I44:convi8  g23 -) t40:int64
> >  I45:and       t40, t38 -) t41:int64
> >  I46:mul   t39, t41 -) t42:int64
> > =====LIR=====
> >    238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32
> >    238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32
> >    238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32
> >    238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32
> >    238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL
> > s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32
> >    238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32
> >    238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32
> >    238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32
> >    238B02BE I73: (ID:s8(EFLGS):U_32) =MUL
> > s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32
> >    238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32
> >    238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32
> >    DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV
> (AU:v19(EAX):I_32)
> >
> >    238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL
> > s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32
> >    238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32
> >    238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32
> >
> > Any comments? Thanks.
> >
> > Xiaoming
> >
>
>
>
> --
> http://xiao-feng.blogspot.com
>

Re: discussion for H5826

Posted by Xiao-Feng Li <xi...@gmail.com>.

On Sat, Oct 11, 2008 at 12:44 PM, xiaoming gu <xi...@gmail.com> wrote:
> Hi, all. I find the benefits of this patch come from changing a 64-bit MUL
> (3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the
> optimization is done as a magic replacement, which is not a common way to
> generate code. (https://issues.apache.org/jira/browse/HARMONY-5826)

Thanks for identifying the root cause of improvement.

> Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit
> machine, the MUL operation is usually translated to (High 32-bit of A)*(Low
> 32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low
> 32-bit of A). But when we know High 32-bit of A and B are both 0, only (Low
> 32-bit of A)*(Low 32-bit of A) needed.
>
> Following are the HIR and LIR for result = (a & ffffffffL) * (b & ffffffffL)
> + (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the
> optimization in HIR simplifier or LIR peephole. I'm not sure whether
> changing int64 operation to int32 operation will bring overhead for 64-bit
> machine.

This needs performance evaluation to answer. This patch looks like
specific to 32-bit machine, since original sequence of 3 MULs were for
32-bit machine.

> I think maybe peephole is a better place. I find there is no
> peephole optimization for XOR. If JIT could find out the result of XOR is 0,
> then propagates the 0 to MUL and related MUL is eliminated. The left problem
> is whether there is sufficient data-flow analysis in LIR to do the
> propagation and elimination.

I guess the best place choice depends on the semantics of the
optimization. Since this is for instruction sequence optimization
rather than operation logic optimization, I agree it is better to be
done in LIR peephole.

Thanks,
xiaofeng

>
> =====HIR=====
>  I42:ldci8     #4294967295 -) t38:int64
>  I43:and       t37, t38 -) t39:int64
>  I44:convi8  g23 -) t40:int64
>  I45:and       t40, t38 -) t41:int64
>  I46:mul   t39, t41 -) t42:int64
> =====LIR=====
>    238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32
>    238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32
>    238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32
>    238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32
>    238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL
> s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32
>    238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32
>    238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32
>    238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32
>    238B02BE I73: (ID:s8(EFLGS):U_32) =MUL
> s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32
>    238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32
>    238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32
>    DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV (AU:v19(EAX):I_32)
>
>    238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL
> s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32
>    238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32
>    238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32
>
> Any comments? Thanks.
>
> Xiaoming
>



-- 
http://xiao-feng.blogspot.com

Re: discussion for H5826

Posted by xiaoming gu <xi...@gmail.com>.

Aleksey, please see the attachments which are the source code, the HIR
before HIR2LIR and the final LIR.

With more thoughts, I plan to do the optimization in HIR simplifier first
because it looks much easier than in LIR peephole. When I finish that step,
I'll try LIR peephole later.

Xiaoming

On Mon, Oct 13, 2008 at 4:02 AM, Aleksey Shipilev <
aleksey.shipilev@gmail.com> wrote:

> Hi, Xiaoming!
>
> I see you're deeply interested in Jitrino :) I'm actually surprised
> that LIR is so clean... Can you show full LIR dump for
> unsignedMultAddAdd? But now, let's see the refined version of your LIR
> dump:
>
> mov [esp-100], eax
> mov [esp-104], eax
> xor edx, edx
> mov eax, [esp-100]
> mul edx, eax, edx
> mov ebx, eax
> xor eax, eax
> mov edi, [esp-24]
> mul edx, eax, edi
> add ebx, eax
> mov eax, [esp-104]
> mul edx, eax, edi
> mov [esp-100], eax
> add edx, ebx
>
> ...after noticing that "xor reg,reg" === "reg = 0" and propagating zero:
>
> mov [esp-100], eax
> mov [esp-104], eax
> mov eax, [esp-100]
> mov edi, [esp-24]
> mov eax, [esp-104]
> mul edx, eax, edi
> mov [esp-100], eax
>
> ...and some more sweep:
>
> mov edi, [esp-24]
> mul edx, eax, edi
> mov [esp-100], eax
>
> Looks pretty good for me.
>
> Thanks,
> Aleksey.
>
> On Sat, Oct 11, 2008 at 8:44 AM, xiaoming gu <xi...@gmail.com>
> wrote:
> > Hi, all. I find the benefits of this patch come from changing a 64-bit
> MUL
> > (3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the
> > optimization is done as a magic replacement, which is not a common way to
> > generate code. (https://issues.apache.org/jira/browse/HARMONY-5826)
> >
> > Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit
> > machine, the MUL operation is usually translated to (High 32-bit of
> A)*(Low
> > 32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low
> > 32-bit of A). But when we know High 32-bit of A and B are both 0, only
> (Low
> > 32-bit of A)*(Low 32-bit of A) needed.
> >
> > Following are the HIR and LIR for result = (a & ffffffffL) * (b &
> ffffffffL)
> > + (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the
> > optimization in HIR simplifier or LIR peephole. I'm not sure whether
> > changing int64 operation to int32 operation will bring overhead for
> 64-bit
> > machine. I think maybe peephole is a better place. I find there is no
> > peephole optimization for XOR. If JIT could find out the result of XOR is
> 0,
> > then propagates the 0 to MUL and related MUL is eliminated. The left
> problem
> > is whether there is sufficient data-flow analysis in LIR to do the
> > propagation and elimination.
> > =====HIR=====
> >  I42:ldci8     #4294967295 -) t38:int64
> >  I43:and       t37, t38 -) t39:int64
> >  I44:convi8  g23 -) t40:int64
> >  I45:and       t40, t38 -) t41:int64
> >  I46:mul   t39, t41 -) t42:int64
> > =====LIR=====
> >    238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32
> >    238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32
> >    238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32
> >    238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32
> >    238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL
> > s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32
> >    238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32
> >    238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32
> >    238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32
> >    238B02BE I73: (ID:s8(EFLGS):U_32) =MUL
> > s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32
> >    238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32
> >    238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32
> >    DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV
> (AU:v19(EAX):I_32)
> >
> >    238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL
> > s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32
> >    238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32
> >    238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32
> >
> > Any comments? Thanks.
> >
> > Xiaoming
> >
>

Re: discussion for H5826

Posted by Aleksey Shipilev <al...@gmail.com>.

Hi, Xiaoming!

I see you're deeply interested in Jitrino :) I'm actually surprised
that LIR is so clean... Can you show full LIR dump for
unsignedMultAddAdd? But now, let's see the refined version of your LIR
dump:

mov [esp-100], eax
mov [esp-104], eax
xor edx, edx
mov eax, [esp-100]
mul edx, eax, edx
mov ebx, eax
xor eax, eax
mov edi, [esp-24]
mul edx, eax, edi
add ebx, eax
mov eax, [esp-104]
mul edx, eax, edi
mov [esp-100], eax
add edx, ebx

...after noticing that "xor reg,reg" === "reg = 0" and propagating zero:

mov [esp-100], eax
mov [esp-104], eax
mov eax, [esp-100]
mov edi, [esp-24]
mov eax, [esp-104]
mul edx, eax, edi
mov [esp-100], eax

...and some more sweep:

mov edi, [esp-24]
mul edx, eax, edi
mov [esp-100], eax

Looks pretty good for me.

Thanks,
Aleksey.

On Sat, Oct 11, 2008 at 8:44 AM, xiaoming gu <xi...@gmail.com> wrote:
> Hi, all. I find the benefits of this patch come from changing a 64-bit MUL
> (3 32-bit MUL and 2 32-bit ADD) to a 32-bit MUL. At present, the
> optimization is done as a magic replacement, which is not a common way to
> generate code. (https://issues.apache.org/jira/browse/HARMONY-5826)
>
> Assume A and B are both 64-bit operands and we are doing A*B. In 32-bit
> machine, the MUL operation is usually translated to (High 32-bit of A)*(Low
> 32-bit of b)+(Low 32-bit of A)*(High 32-bit of B)+(Low 32-bit of A)*(Low
> 32-bit of A). But when we know High 32-bit of A and B are both 0, only (Low
> 32-bit of A)*(Low 32-bit of A) needed.
>
> Following are the HIR and LIR for result = (a & ffffffffL) * (b & ffffffffL)
> + (c & ffffffffL) + (d & ffffffffL) without this patch. We can do the
> optimization in HIR simplifier or LIR peephole. I'm not sure whether
> changing int64 operation to int32 operation will bring overhead for 64-bit
> machine. I think maybe peephole is a better place. I find there is no
> peephole optimization for XOR. If JIT could find out the result of XOR is 0,
> then propagates the 0 to MUL and related MUL is eliminated. The left problem
> is whether there is sufficient data-flow analysis in LIR to do the
> propagation and elimination.
> =====HIR=====
>  I42:ldci8     #4294967295 -) t38:int64
>  I43:and       t37, t38 -) t39:int64
>  I44:convi8  g23 -) t40:int64
>  I45:and       t40, t38 -) t41:int64
>  I46:mul   t39, t41 -) t42:int64
> =====LIR=====
>    238B02A6 I329: MOV s286[v208(ESP)+t285(-100)]:I_32,v19(EAX):I_32
>    238B02AA I328: MOV t291[v208(ESP)+t290(-104)]:I_32,v19(EAX):I_32
>    238B02AE I327: (ID:s8(EFLGS):U_32) =XOR t206(EDX):I_32,t206(EDX):I_32
>    238B02B0 I326: MOV s292(EAX):I_32,s286[v208(ESP)+t285(-100)]:I_32
>    238B02B4 I70: (ID:s8(EFLGS):U_32) =MUL
> s139(EDX):I_32,s292(EAX):I_32,t206(EDX):I_32
>    238B02B6 I325: MOV s140(EBX):I_32,s292(EAX):I_32
>    238B02B8 I324: (ID:s8(EFLGS):U_32) =XOR s292(EAX):I_32,s292(EAX):I_32
>    238B02BA I323: MOV t289(EDI):I_32,v217[v208(ESP)+t216(-24)]:I_32
>    238B02BE I73: (ID:s8(EFLGS):U_32) =MUL
> s139(EDX):I_32,s292(EAX):I_32,t289(EDI):I_32
>    238B02C0 I74: (ID:s8(EFLGS):U_32) =ADD s140(EBX):I_32,s292(EAX):I_32
>    238B02C2 I322: MOV v19(EAX):I_32,t291[v208(ESP)+t290(-104)]:I_32
>    DEADBEEF I75: (AD:s293(EAX):I_32) =CopyPseudoInst/MOV (AU:v19(EAX):I_32)
>
>    238B02C6 I76: (ID:s8(EFLGS):U_32) =MUL
> s139(EDX):I_32,s293(EAX):I_32,t289(EDI):I_32
>    238B02C8 I321: MOV s286[v208(ESP)+t285(-100)]:I_32,s293(EAX):I_32
>    238B02CC I77: (ID:s8(EFLGS):U_32) =ADD s139(EDX):I_32,s140(EBX):I_32
>
> Any comments? Thanks.
>
> Xiaoming
>