Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/06/04 17:47:56 UTC

[GitHub] [tvm-vta] adavare opened a new pull request #27: Chisel Pipelined ALU

adavare opened a new pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27


   This PR adds a pipelined implementation of the Chisel "TensorAlu" module, along with more flexible scratchpad read/write interfaces and Chisel unit tests for the modified ALU. Roughly a 4x speedup is seen in the cycle count reported by the "ALU Unit test" portion of "tvm/vta/tests/python/integration/test_benchmark_gemm.py", with more modest speedups across the other tests under the "tvm/vta/tests/python/integration" and "tvm/vta/tests/python/unittest" directories.
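
   For readers skimming the diffs below, here is a minimal software model of the index sequence that the new "TensorAluIndexGenerator" walks per ALU instruction. The field names follow the AluDecode bundle quoted in the diffs; this is an illustrative sketch, not the PR's hardware code:

   ```scala
   // Software model of the two-level loop nest the index generator implements.
   // dst0/dst1 and src0/src1 mirror dec.dst_0/dst_1 and dec.src_0/src_1 (the
   // outer/inner strides); the per-uop u0/u1/u2 offsets are applied later in
   // the pipeline, so they do not appear here.
   case class AluDecodeModel(
       uopBegin: Int, uopEnd: Int,
       lp0: Int, lp1: Int,
       dst0: Int, dst1: Int,
       src0: Int, src1: Int)

   def indexSequence(dec: AluDecodeModel): Seq[(Int, Int, Int)] =
     for {
       cntO <- 0 until dec.lp0
       cntI <- 0 until dec.lp1
       uop  <- dec.uopBegin until dec.uopEnd
     } yield (uop,                                // uop_idx
              cntO * dec.dst0 + cntI * dec.dst1,  // dst_idx
              cntO * dec.src0 + cntI * dec.src1)  // src_idx
   ```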
   
   Code contributions to this PR were made by the following individuals (in alphabetical order): @suvadeep89, @stevenmburns, @pasqoc, @adavare, @sjain12intel, @aasorokiin, and @zhenkuny.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-vta] tmoreau89 merged pull request #27: Chisel Pipelined ALU

tmoreau89 merged pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27


   





[GitHub] [tvm-vta] tmoreau89 commented on pull request #27: Chisel Pipelined ALU

tmoreau89 commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-856368425


   Let's get this PR merged in first; then we can enable the unit tests in a separate PR. Thank you, @vegaluisjose, for your help in reviewing the PR and ensuring the unit tests are passing.





[GitHub] [tvm-vta] tmoreau89 commented on pull request #27: Chisel Pipelined ALU

tmoreau89 commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-856363593


   > @tmoreau89, @vegaluisjose: I see that "tests/scripts/task_python_vta_tsim.sh" is being skipped in the pr-merge CI. Do you know why this is happening?
   > 
   > The two pytest commands in this test run successfully for me in about 12 minutes. I was considering extending the tsim test with Chisel unit tests ("sbt test"), but perhaps it is being skipped because the runtime is already too high?
   
   Thank you for the extensive PR, @adavare and team. On the testing: I think it would be good to run some more extensive unit tests in this repo (the TVM submodule), but we'll want to limit the overall runtime of the Chisel tests in the mainline repo, where CI testing is already getting very long.

   I'm not too worried about lengthening the CI testing time on the tvm-vta repo, given that PRs don't get submitted too often. Therefore I'm in favor of running longer, more extensive tests here.





[GitHub] [tvm-vta] vegaluisjose commented on a change in pull request #27: Chisel Pipelined ALU

vegaluisjose commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646699023



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)

Review comment:
       Hey @adavare, pretty nice PR. By the way, is it possible to remove the space in the `Reg` declarations in this file, i.e., `Reg(false.B)` instead of `Reg( false.B)`?







[GitHub] [tvm-vta] vegaluisjose commented on pull request #27: Chisel Pipelined ALU

vegaluisjose commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-856245092


   Nice, @adavare. Could you please rebase to incorporate #28? We added linting to the unit tests (it was disabled before). After the rebase, you can try `make lint` to check that everything goes fine, and then we can merge this. Thank you!





[GitHub] [tvm-vta] adavare commented on pull request #27: Chisel Pipelined ALU

adavare commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-854999719


   @stevenmburns: You're right, that bugfix was not previously included in this PR, but I've just added it.





[GitHub] [tvm-vta] adavare commented on a change in pull request #27: Chisel Pipelined ALU

adavare commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646841052



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)
+
+  val uop_data_001 = ShiftRegister(io.uop.data, aluDataReadPipeDelay)
+
+  val dst_offset = uop_data_001.bits.u0
+
+  val w = dst_offset.getWidth
+  val u2 = uop_data_001.bits.u2.asTypeOf(UInt(w.W))
+  val s = log2Ceil(p(CoreKey).inpMemDepth)
+  val u1 = uop_data_001.bits.u1.asTypeOf(UInt(w.W))
+  val src_offset = (u2 << s) | u1
+
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002
+  val accRdIdxValid = valid_001 || src_valid_001
+  for (idx <- 0 until dataSplitFactor) {
+    //io.acc.rd(idx).idx.valid := accRdIdxValid
+    io.acc.rd(idx).idx.valid := RegNext(accRdIdxValid)
+  }
+
+  val new_src_idx_001 = src_idx_001 + src_offset
+  val src_idx_002 = RegNext( new_src_idx_001)
+  val src_idx_003 = RegNext( src_idx_002)
+
+  val new_dst_idx_001 = dst_idx_001 + dst_offset
+  val dst_idx_002 = RegNext( new_dst_idx_001)
+  val dst_idx_003 = RegNext( dst_idx_002)
+  val dst_idx_004 = RegNext( dst_idx_003)

Review comment:
       Changed to _r# throughout

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)
+
+  val uop_data_001 = ShiftRegister(io.uop.data, aluDataReadPipeDelay)
+
+  val dst_offset = uop_data_001.bits.u0
+
+  val w = dst_offset.getWidth
+  val u2 = uop_data_001.bits.u2.asTypeOf(UInt(w.W))
+  val s = log2Ceil(p(CoreKey).inpMemDepth)
+  val u1 = uop_data_001.bits.u1.asTypeOf(UInt(w.W))
+  val src_offset = (u2 << s) | u1
+
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002
+  val accRdIdxValid = valid_001 || src_valid_001
+  for (idx <- 0 until dataSplitFactor) {
+    //io.acc.rd(idx).idx.valid := accRdIdxValid

Review comment:
       deleted

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)
+
+  val uop_data_001 = ShiftRegister(io.uop.data, aluDataReadPipeDelay)
+
+  val dst_offset = uop_data_001.bits.u0
+
+  val w = dst_offset.getWidth
+  val u2 = uop_data_001.bits.u2.asTypeOf(UInt(w.W))
+  val s = log2Ceil(p(CoreKey).inpMemDepth)
+  val u1 = uop_data_001.bits.u1.asTypeOf(UInt(w.W))
+  val src_offset = (u2 << s) | u1
+
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002

Review comment:
       deleted







[GitHub] [tvm-vta] adavare commented on a change in pull request #27: Chisel Pipelined ALU

adavare commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646844048



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)
+
+  val uop_data_001 = ShiftRegister(io.uop.data, aluDataReadPipeDelay)
+
+  val dst_offset = uop_data_001.bits.u0
+
+  val w = dst_offset.getWidth
+  val u2 = uop_data_001.bits.u2.asTypeOf(UInt(w.W))
+  val s = log2Ceil(p(CoreKey).inpMemDepth)
+  val u1 = uop_data_001.bits.u1.asTypeOf(UInt(w.W))
+  val src_offset = (u2 << s) | u1
+
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002
+  val accRdIdxValid = valid_001 || src_valid_001
+  for (idx <- 0 until dataSplitFactor) {
+    //io.acc.rd(idx).idx.valid := accRdIdxValid
+    io.acc.rd(idx).idx.valid := RegNext(accRdIdxValid)
+  }
+
+  val new_src_idx_001 = src_idx_001 + src_offset
+  val src_idx_002 = RegNext( new_src_idx_001)
+  val src_idx_003 = RegNext( src_idx_002)
+
+  val new_dst_idx_001 = dst_idx_001 + dst_offset
+  val dst_idx_002 = RegNext( new_dst_idx_001)
+  val dst_idx_003 = RegNext( dst_idx_002)
+  val dst_idx_004 = RegNext( dst_idx_003)
+
+  // split registers of stage 2 by data groups
+  val accRdIdxBits = Mux( src_valid_001 || io.dec.alu_use_imm, new_src_idx_001, new_dst_idx_001)
+  for (idx <- 0 until dataSplitFactor) {
+    io.acc.rd(idx).idx.bits := RegNext(accRdIdxBits)
+    assert( io.acc.rd(idx).data.valid === (valid_003 || src_valid_003))
+  }
+
+  require(io.out.splitWidth == 1 && io.out.splitLength == 1, "-F- Out split write is not supported")
+  val numVecUnits = dataSplitFactor
+  val outData = Wire(io.out.wr(0).bits.data.cloneType)
+  val dataRemapB = Wire(Vec(numVecUnits, io.acc.rd(0).data.bits.cloneType))
+  val dataRemapA = Wire(Vec(numVecUnits, io.acc.rd(0).data.bits.cloneType))
+  // numVecUnits is a pow of 2
+  // split dec bits pipe further if there are many vecUnits
+  val decSplitNb0 =  if (numVecUnits < 8) 1 else 2
+  val decSplit0 = Wire(Vec(decSplitNb0, io.dec.cloneType))
+  for (idx <- 0 until decSplitNb0) {
+    decSplit0(idx) := ShiftRegister(io.dec, if(aluDataReadPipeDelay < 2) 0 else 1)
+  }
+
+  for (idx <- 0 until numVecUnits) {
+    val alu = Module(new AluVector)
+
+    for(aluLenIdx <- 0 until alu.io.acc_b.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_b.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_b.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        dataRemapB(idx)(aluLenIdx)(aluWdtIdx) :=
+          io.acc.rd(accGrpIdx).data.bits(accLenIdx)(accWdtIdx)
+      }
+    }
+    val save_src = RegNext(dataRemapB(idx))
+    val tensorImm = Wire(new TensorClientData(tensorType = "acc"))
+    tensorImm.data.valid := RegNext(valid_002) //valid_003 split
+    val tensorImmBits_piped = ShiftRegister(
+      decSplit0(idx/(numVecUnits/decSplitNb0)).alu_imm,
+      if(aluDataReadPipeDelay < 2) aluDataReadPipeDelay else aluDataReadPipeDelay -1)
+    tensorImm.data.bits.foreach { b =>
+      b.foreach { c =>
+        c := Mux(tensorImmBits_piped(C_ALU_IMM_BITS - 1),
+          Cat(-1.S((aluBits - C_ALU_IMM_BITS).W), tensorImmBits_piped), tensorImmBits_piped)
+      }
+    }
+
+    // alu
+    val tensorOpBits_piped = ShiftRegister(
+    decSplit0(idx/(numVecUnits/decSplitNb0)).alu_op,
+    if(aluDataReadPipeDelay < 2) aluDataReadPipeDelay else aluDataReadPipeDelay -1)
+    val isSHR = (tensorOpBits_piped === ALU_OP(3))
+    val neg_shift = isSHR & tensorImmBits_piped(C_ALU_IMM_BITS - 1)
+    val fixme_alu_op = Mux(
+      neg_shift,
+      ALU_OP(4), // use opcode = 4 for left shift
+      tensorOpBits_piped)
+    alu.io.opcode := fixme_alu_op
+
+    assert( !valid_003 || io.acc.rd(idx).data.valid)
+
+    alu.io.acc_a.data.valid := RegNext(valid_002) //valid_003 split
+
+    for(aluLenIdx <- 0 until alu.io.acc_a.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_a.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_a.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        dataRemapA(idx)(aluLenIdx)(aluWdtIdx) :=
+          io.acc.rd(accGrpIdx).data.bits(accLenIdx)(accWdtIdx)
+        alu.io.acc_a.data.bits := dataRemapA(idx)
+      }
+    }
+    val tensorUseImmBits_piped = ShiftRegister(
+    decSplit0(idx/(numVecUnits/decSplitNb0)).alu_use_imm,
+    if(aluDataReadPipeDelay < 2) aluDataReadPipeDelay else aluDataReadPipeDelay -1)
+    alu.io.acc_b.data.valid := Mux(tensorUseImmBits_piped,
+      tensorImm.data.valid,
+      valid_003)
+    alu.io.acc_b.data.bits := Mux(tensorUseImmBits_piped,
+      tensorImm.data.bits,
+      save_src)
+
+    assert( alu.io.acc_y.data.valid === valid_004)
+    io.acc.wr(idx).valid := RegNext(valid_003) //valid_004 split
+    io.acc.wr(idx).bits.idx := RegNext(dst_idx_003)//dst_idx_004 split
+
+    for(aluLenIdx <- 0 until alu.io.acc_y.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_y.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_y.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        io.acc.wr(accGrpIdx).bits.data(accLenIdx)(accWdtIdx) :=
+          alu.io.acc_y.data.bits(aluLenIdx)(aluWdtIdx)
+      }
+    }
+
+    assert( alu.io.out.data.valid === valid_004)
+    for (idx1 <- 0 until io.out.tensorLength) {
+      for (idx2 <- 0 until io.out.tensorWidth/numVecUnits) {
+        outData(idx1)(idx*io.out.tensorWidth/numVecUnits + idx2) := alu.io.out.data.bits(idx1)(idx2)
+      }
+    }
+  }
+
+// comment for split write
+  io.out.wr(0).valid := valid_004
+  io.out.wr(0).bits.idx := dst_idx_004
+  io.out.wr(0).bits.data := outData
+  io.out.tieoffRead()
+
+  val bypass_dst = valid_003 && valid_004 && ( dst_idx_004 === dst_idx_003)
+  val bypass_src = src_valid_003 && valid_004 && ( dst_idx_004 === src_idx_003)
+
+  // Do we need a bypass
+  when ( bypass_dst) {
+    printf( "Bypass required on dst_idx read %x RAW with write %x\n", dst_idx_003, dst_idx_004)
+    assert( false.B, "DST bypass required")
+  }
+  when ( bypass_src) {
+    printf( "Bypass required on src_idx read %x RAW with write %x\n", src_idx_003, dst_idx_004)
+    assert( false.B, "SRC bypass required")
+  }
+}

Review comment:
       Extracted the asserts as suggested.
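
       For the record, a sketch of what the extracted form might look like, reusing the signals from the diff above (the assert messages and exact structure here are illustrative, not necessarily the PR's final code):

       ```scala
       // Sketch only: fold each printf + assert(false.B, ...) pair into a
       // single assert with a formatted message; signal names come from the
       // diff above, message wording is hypothetical.
       assert(!bypass_dst, "DST bypass required: dst read %x RAW with write %x",
         dst_idx_003, dst_idx_004)
       assert(!bypass_src, "SRC bypass required: src read %x RAW with write %x",
         src_idx_003, dst_idx_004)
       ```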







[GitHub] [tvm-vta] vegaluisjose commented on a change in pull request #27: Chisel Pipelined ALU

vegaluisjose commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646700789



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)

Review comment:
       remove extra space







[GitHub] [tvm-vta] vegaluisjose commented on a change in pull request #27: Chisel Pipelined ALU

vegaluisjose commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646700419



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)

Review comment:
       remove extra space here as well







[GitHub] [tvm-vta] vegaluisjose commented on a change in pull request #27: Chisel Pipelined ALU

vegaluisjose commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646699491



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {

Review comment:
       Same here







[GitHub] [tvm-vta] stevenmburns commented on pull request #27: Chisel Pipelined ALU

stevenmburns commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-854920693


   @adavare I'm not completely following what you are doing here, but if you aren't changing the scratchpad interface from the original design, you'll need to make one modification in LoadUop (delay the LSB of the index by the same amount as the synchronous read).
   
   ```
     when(io.vme_rd.data.fire()) {
       mem.write(waddr, wdata, wmask)
     }
   
     // read-from-sram
     io.uop.data.valid := RegNext(io.uop.idx.valid)
   
     // delay LSB of idx by a cycle because of the one-cycle memory read latency
     val sIdx = RegNext(io.uop.idx.bits % numUop.U)
     val rIdx = io.uop.idx.bits >> log2Ceil(numUop)
     val memRead = mem.read(rIdx, io.uop.idx.valid)
     val sWord = memRead.asUInt.asTypeOf(wdata)
     val sUop = sWord(sIdx).asTypeOf(io.uop.data.bits)
   
     io.uop.data.bits <> sUop
   
   ```
   If you are inserting our new TensorLoad implementation in this PR, then we already corrected that there.





[GitHub] [tvm-vta] adavare commented on pull request #27: Chisel Pipelined ALU

adavare commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-856318101


   @vegaluisjose: I've pulled in #28 and fixed a few linting errors that came up.





[GitHub] [tvm-vta] adavare commented on pull request #27: Chisel Pipelined ALU

adavare commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-855022981


   @tmoreau89, @vegaluisjose: I see that "tests/scripts/task_python_vta_tsim.sh" is being skipped in the pr-merge CI. Do you know why this is happening?
   
   The two pytest commands in this test run successfully for me in about 12 minutes. I was considering extending the tsim test with Chisel unit tests ("sbt test"), but perhaps it is being skipped because the runtime is already too high?





[GitHub] [tvm-vta] adavare commented on a change in pull request #27: Chisel Pipelined ALU

adavare commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646840562



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)

Review comment:
       All "( " occurrences replaced throughout PR




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-vta] adavare commented on a change in pull request #27: Chisel Pipelined ALU

Posted by GitBox <gi...@apache.org>.
adavare commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646844433



##########
File path: hardware/chisel/src/main/scala/core/TensorStore.scala
##########
@@ -123,24 +128,28 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
             state := sWriteCmd
             xfer_bytes := xfer_stride_bytes
             when(xsize < xfer_stride_pulses) {
-              xlen := xsize
+              assert(xsize > 0.U)
+              xlen := xsize - 1.U
               xrem := 0.U
             }.otherwise {
               xlen := xfer_stride_pulses - 1.U
+              assert(xsize >= xfer_stride_pulses)
               xrem := xsize - xfer_stride_pulses
             }
           }
         } // split
         .elsewhen(xrem < xfer_split_pulses) {
           state := sWriteCmd
           xfer_bytes := xfer_split_bytes
-          xlen := xrem
+          assert(xrem > 0.U)
+          xlen := xrem - 1.U
           xrem := 0.U
         }
         .otherwise {
           state := sWriteCmd
           xfer_bytes := xfer_split_bytes
           xlen := xfer_split_pulses - 1.U
+          assert(xrem >= xfer_split_pulses)

Review comment:
       Added diagnostic prints to all asserts in the state-machine code.
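
       For reference, chisel3's `assert` accepts an optional message that is printed when simulation trips the check, which is a natural way to attach those prints. A minimal sketch (the module and the message text are hypothetical; only the `-F-` prefix convention is borrowed from this PR):

       ```
       import chisel3._

       class XferLen extends Module {
         val io = IO(new Bundle {
           val xsize = Input(UInt(16.W))
           val xlen  = Output(UInt(16.W))
         })
         // AXI-style length encoding: xlen is beats minus one, hence the guard.
         assert(io.xsize > 0.U, "-F- xsize must be nonzero before computing xlen")
         io.xlen := io.xsize - 1.U
       }
       ```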




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-vta] adavare commented on a change in pull request #27: Chisel Pipelined ALU

Posted by GitBox <gi...@apache.org>.
adavare commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646843812



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State machine to compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)
+
+  val uop_data_001 = ShiftRegister(io.uop.data, aluDataReadPipeDelay)
+
+  val dst_offset = uop_data_001.bits.u0
+
+  val w = dst_offset.getWidth
+  val u2 = uop_data_001.bits.u2.asTypeOf(UInt(w.W))
+  val s = log2Ceil(p(CoreKey).inpMemDepth)
+  val u1 = uop_data_001.bits.u1.asTypeOf(UInt(w.W))
+  val src_offset = (u2 << s) | u1
+
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002
+  val accRdIdxValid = valid_001 || src_valid_001
+  for (idx <- 0 until dataSplitFactor) {
+    //io.acc.rd(idx).idx.valid := accRdIdxValid
+    io.acc.rd(idx).idx.valid := RegNext(accRdIdxValid)
+  }
+
+  val new_src_idx_001 = src_idx_001 + src_offset
+  val src_idx_002 = RegNext( new_src_idx_001)
+  val src_idx_003 = RegNext( src_idx_002)
+
+  val new_dst_idx_001 = dst_idx_001 + dst_offset
+  val dst_idx_002 = RegNext( new_dst_idx_001)
+  val dst_idx_003 = RegNext( dst_idx_002)
+  val dst_idx_004 = RegNext( dst_idx_003)
+
+  // split registers of stage 2 by data groups
+  val accRdIdxBits = Mux( src_valid_001 || io.dec.alu_use_imm, new_src_idx_001, new_dst_idx_001)
+  for (idx <- 0 until dataSplitFactor) {
+    io.acc.rd(idx).idx.bits := RegNext(accRdIdxBits)
+    assert( io.acc.rd(idx).data.valid === (valid_003 || src_valid_003))
+  }
+
+  require(io.out.splitWidth == 1 && io.out.splitLength == 1, "-F- Out split write is not supported")
+  val numVecUnits = dataSplitFactor
+  val outData = Wire(io.out.wr(0).bits.data.cloneType)
+  val dataRemapB = Wire(Vec(numVecUnits, io.acc.rd(0).data.bits.cloneType))
+  val dataRemapA = Wire(Vec(numVecUnits, io.acc.rd(0).data.bits.cloneType))
+  // numVecUnits is a pow of 2
+  // split dec bits pipe further if there are many vecUnits
+  val decSplitNb0 =  if (numVecUnits < 8) 1 else 2
+  val decSplit0 = Wire(Vec(decSplitNb0, io.dec.cloneType))
+  for (idx <- 0 until decSplitNb0) {
+    decSplit0(idx) := ShiftRegister(io.dec, if(aluDataReadPipeDelay < 2) 0 else 1)
+  }
+
+  for (idx <- 0 until numVecUnits) {
+    val alu = Module(new AluVector)
+
+    for(aluLenIdx <- 0 until alu.io.acc_b.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_b.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_b.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        dataRemapB(idx)(aluLenIdx)(aluWdtIdx) :=
+          io.acc.rd(accGrpIdx).data.bits(accLenIdx)(accWdtIdx)
+      }
+    }
+    val save_src = RegNext(dataRemapB(idx))
+    val tensorImm = Wire(new TensorClientData(tensorType = "acc"))
+    tensorImm.data.valid := RegNext(valid_002) //valid_003 split

Review comment:
       Reusing the existing node.
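
       For orientation, the `TensorAluIndexGenerator` above is a hardware rendering of a three-deep loop nest over the `AluDecode` fields. A hedged software model of its traversal order (the field values below are hypothetical; the per-uop offsets are added later in the pipeline, as the code shows):

       ```
       // Software model of TensorAluIndexGenerator's traversal order.
       val (lp0, lp1) = (2, 4)                       // dec.lp_0, dec.lp_1
       val (uopBegin, uopEnd) = (0, 8)               // dec.uop_begin, dec.uop_end
       val (dst0, dst1, src0, src1) = (32, 8, 32, 8) // per-level strides

       for (cntO <- 0 until lp0; cntI <- 0 until lp1; uopIdx <- uopBegin until uopEnd) {
         val dstIdx = cntO * dst0 + cntI * dst1 // the uop's u0 offset is added downstream
         val srcIdx = cntO * src0 + cntI * src1 // likewise (u2 << s) | u1 for the source
         println(s"issue: uop=$uopIdx dst=$dstIdx src=$srcIdx")
       }
       ```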




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-vta] vegaluisjose commented on a change in pull request #27: Chisel Pipelined ALU

Posted by GitBox <gi...@apache.org>.
vegaluisjose commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646703504



##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+  val new_src_idx_001 = src_idx_001 + src_offset
+  val src_idx_002 = RegNext( new_src_idx_001)
+  val src_idx_003 = RegNext( src_idx_002)
+
+  val new_dst_idx_001 = dst_idx_001 + dst_offset
+  val dst_idx_002 = RegNext( new_dst_idx_001)
+  val dst_idx_003 = RegNext( dst_idx_002)
+  val dst_idx_004 = RegNext( dst_idx_003)

Review comment:
       What would you think of using `_r1` instead of `_001`, or some other format that starts with a letter followed by a number?
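
       A self-contained illustration of that convention (names here are hypothetical):

       ```
       import chisel3._

       class StageNames extends Module {
         val io = IO(new Bundle {
           val in  = Input(UInt(8.W))
           val out = Output(UInt(8.W))
         })
         // Stage-suffixed bindings that start with a letter, per the suggestion.
         val in_r1 = RegNext(io.in) // rather than in_001
         val in_r2 = RegNext(in_r1) // rather than in_002
         io.out := in_r2
       }
       ```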

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002
+  val accRdIdxValid = valid_001 || src_valid_001
+  for (idx <- 0 until dataSplitFactor) {
+    //io.acc.rd(idx).idx.valid := accRdIdxValid

Review comment:
       Delete comment

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002

Review comment:
       Delete comment

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+    assert( alu.io.acc_y.data.valid === valid_004)
+    io.acc.wr(idx).valid := RegNext(valid_003) //valid_004 split
+    io.acc.wr(idx).bits.idx := RegNext(dst_idx_003)//dst_idx_004 split

Review comment:
       Maybe create a binding with an `_r4` suffix instead of noting the `_004` stage in a comment?

##########
File path: hardware/chisel/src/main/scala/core/Compute.scala
##########
@@ -118,44 +123,102 @@ class Compute(debug: Boolean = false)(implicit p: Parameters) extends Module {
   loadUop.io.baddr := io.uop_baddr
   io.vme_rd(0) <> loadUop.io.vme_rd
   loadUop.io.uop.idx <> Mux(dec.io.isGemm, tensorGemm.io.uop.idx, tensorAlu.io.uop.idx)
+  assert( !tensorGemm.io.uop.idx.valid || !tensorAlu.io.uop.idx.valid)
 
   // acc
   tensorAcc.io.start := state === sIdle & start & dec.io.isLoadAcc
   tensorAcc.io.inst := inst_q.io.deq.bits
   tensorAcc.io.baddr := io.acc_baddr
-  tensorAcc.io.tensor.rd.idx <> Mux(dec.io.isGemm, tensorGemm.io.acc.rd.idx, tensorAlu.io.acc.rd.idx)
-  tensorAcc.io.tensor.wr <> Mux(dec.io.isGemm, tensorGemm.io.acc.wr, tensorAlu.io.acc.wr)
+  require(tensorAcc.io.tensor.lenSplit ==
+    tensorAcc.io.tensor.tensorLength, "-F- Expecting a whole batch in acc group")
+
+  // split factor of isGemm for many groups
+  val splitFactorL0 = pow(2,log2Ceil(tensorAcc.io.tensor.splitWidth) / 2).toInt
+  val splitFactorL1 = pow(2,log2Ceil(tensorAcc.io.tensor.splitWidth)
+    - log2Ceil(tensorAcc.io.tensor.splitWidth) / 2).toInt
+  require(splitFactorL0 * splitFactorL1 == tensorAcc.io.tensor.splitWidth)
+  val accRdSelectL0 = for (idx <- 0 until splitFactorL1) yield {
+    // can save 1 stage on small design
+    if (splitFactorL1 > 1) RegNext(dec.io.isGemm, init = false.B) else dec.io.isGemm
+  }
+
+  for (idx <- 0 until tensorAcc.io.tensor.splitWidth) {
+    tensorAcc.io.tensor.rd(idx).idx <> Mux(
+      RegNext(accRdSelectL0(idx/splitFactorL0), init = false.B),
+      tensorGemm.io.acc.rd(idx).idx,
+      tensorAlu.io.acc.rd(idx).idx)
+    tensorAcc.io.tensor.wr(idx) <> Mux(
+      RegNext(accRdSelectL0(idx/splitFactorL0), init = false.B),
+      tensorGemm.io.acc.wr(idx),
+      tensorAlu.io.acc.wr(idx))
+  }
   io.vme_rd(1) <> tensorAcc.io.vme_rd
-  io.acc_wr_event := tensorAcc.io.tensor.wr.valid
+  io.acc_wr_event := tensorAcc.io.tensor.wr(topAccGrpIdx).valid
 
   // gemm
-  tensorGemm.io.start := state === sIdle & start & dec.io.isGemm
-  tensorGemm.io.inst := inst_q.io.deq.bits
+  tensorGemm.io.start := RegNext(state === sIdle & start & dec.io.isGemm, init = false.B)
+  tensorGemm.io.dec := inst_q.io.deq.bits.asTypeOf(new GemmDecode)
   tensorGemm.io.uop.data.valid := loadUop.io.uop.data.valid & dec.io.isGemm
   tensorGemm.io.uop.data.bits <> loadUop.io.uop.data.bits
   tensorGemm.io.inp <> io.inp
   tensorGemm.io.wgt <> io.wgt
-  tensorGemm.io.acc.rd.data.valid := tensorAcc.io.tensor.rd.data.valid & dec.io.isGemm
-  tensorGemm.io.acc.rd.data.bits <> tensorAcc.io.tensor.rd.data.bits
-  tensorGemm.io.out.rd.data.valid := io.out.rd.data.valid & dec.io.isGemm
-  tensorGemm.io.out.rd.data.bits <> io.out.rd.data.bits
+  for (idx <- 0 until tensorGemm.io.acc.splitWidth) {
+    tensorGemm.io.acc.rd(idx).data.valid :=
+      tensorAcc.io.tensor.rd(idx).data.valid & RegNext(dec.io.isGemm, init = false.B)
+    tensorGemm.io.acc.rd(idx).data.bits <> tensorAcc.io.tensor.rd(idx).data.bits
+  }
+  for (idx <- 0 until tensorGemm.io.out.splitWidth) {
+    tensorGemm.io.out.rd(idx).data.valid :=
+      io.out.rd(idx).data.valid & RegNext(dec.io.isGemm, init = false.B)
+    tensorGemm.io.out.rd(idx).data.bits <> io.out.rd(idx).data.bits
+  }
 
   // alu
-  tensorAlu.io.start := state === sIdle & start & dec.io.isAlu
-  tensorAlu.io.inst := inst_q.io.deq.bits
+  tensorAlu.io.start := RegNext(state === sIdle & start & dec.io.isAlu, init = false.B)
+  tensorAlu.io.dec := inst_q.io.deq.bits.asTypeOf(new AluDecode)
   tensorAlu.io.uop.data.valid := loadUop.io.uop.data.valid & dec.io.isAlu
   tensorAlu.io.uop.data.bits <> loadUop.io.uop.data.bits
-  tensorAlu.io.acc.rd.data.valid := tensorAcc.io.tensor.rd.data.valid & dec.io.isAlu
-  tensorAlu.io.acc.rd.data.bits <> tensorAcc.io.tensor.rd.data.bits
-  tensorAlu.io.out.rd.data.valid := io.out.rd.data.valid & dec.io.isAlu
-  tensorAlu.io.out.rd.data.bits <> io.out.rd.data.bits
+  for (idx <- 0 until tensorAlu.io.acc.splitWidth) {
+    tensorAlu.io.acc.rd(idx).data.valid :=
+      tensorAcc.io.tensor.rd(idx).data.valid & RegNext(dec.io.isAlu, init = false.B)
+    tensorAlu.io.acc.rd(idx).data.bits <> tensorAcc.io.tensor.rd(idx).data.bits
+  }
+  for (idx <- 0 until tensorAlu.io.out.splitWidth) {
+    tensorAlu.io.out.rd(idx).data.valid :=
+      io.out.rd(idx).data.valid & RegNext(dec.io.isAlu, init = false.B)
+    tensorAlu.io.out.rd(idx).data.bits <> io.out.rd(idx).data.bits
+  }
 
   // out
-  io.out.rd.idx <> Mux(dec.io.isGemm,
-    tensorGemm.io.out.rd.idx,
-    tensorAlu.io.out.rd.idx)
-  io.out.wr <> Mux(dec.io.isGemm, tensorGemm.io.out.wr, tensorAlu.io.out.wr)
+  for (idx <- 0 until tensorGemm.io.out.splitWidth) {
+    io.out.rd(idx).idx <> Mux(dec.io.isGemm,
+      tensorGemm.io.out.rd(idx).idx,
+      tensorAlu.io.out.rd(idx).idx)
+    assert( !tensorGemm.io.out.rd(idx).idx.valid || !tensorAlu.io.out.rd(idx).idx.valid)
+    assert( !tensorGemm.io.out.rd(idx).data.valid || !tensorAlu.io.out.rd(idx).data.valid)
 
+    assert( !tensorGemm.io.out.wr(idx).valid || !tensorAlu.io.out.wr(idx).valid)
+  }
+  require (tensorGemm.io.out.splitWidth == 1)
+  require (tensorAlu.io.out.splitWidth == 1)
+  io.out.wr(0).valid := Mux(
+    RegNext(dec.io.isGemm, init = false.B), tensorGemm.io.out.wr(0).valid, tensorAlu.io.out.wr(0).valid)
+  io.out.wr(0).bits.idx := Mux(
+    RegNext(dec.io.isGemm, init = false.B), tensorGemm.io.out.wr(0).bits.idx, tensorAlu.io.out.wr(0).bits.idx)
+  //put mux/Reg into every gemm group to build pipe (for Mux select) tree over distance

Review comment:
       Could you add a space after the comment marker, i.e. `// put` instead of `//put`?
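
       Separately, a side note on the split-factor arithmetic quoted above: the two factors are chosen so their product recovers `splitWidth`. A worked example in plain Scala (the `splitWidth` value is hypothetical):

       ```
       import scala.math.pow

       val splitWidth = 8
       val l2 = 32 - Integer.numberOfLeadingZeros(splitWidth - 1) // log2Ceil(8) = 3
       val splitFactorL0 = pow(2, l2 / 2).toInt                   // 2^1 = 2
       val splitFactorL1 = pow(2, l2 - l2 / 2).toInt              // 2^2 = 4
       assert(splitFactorL0 * splitFactorL1 == splitWidth)        // 2 * 4 == 8
       ```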

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+    val save_src = RegNext(dataRemapB(idx))
+    val tensorImm = Wire(new TensorClientData(tensorType = "acc"))
+    tensorImm.data.valid := RegNext(valid_002) //valid_003 split

Review comment:
       Maybe create a `valid_r3` binding and assign it to `tensorImm.data.valid`, instead of relying on the comment?

##########
File path: hardware/chisel/src/main/scala/core/TensorStore.scala
##########
@@ -123,24 +128,28 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
             state := sWriteCmd
             xfer_bytes := xfer_stride_bytes
             when(xsize < xfer_stride_pulses) {
-              xlen := xsize
+              assert(xsize > 0.U)
+              xlen := xsize - 1.U
               xrem := 0.U
             }.otherwise {
               xlen := xfer_stride_pulses - 1.U
+              assert(xsize >= xfer_stride_pulses)
               xrem := xsize - xfer_stride_pulses
             }
           }
         } // split
         .elsewhen(xrem < xfer_split_pulses) {
           state := sWriteCmd
           xfer_bytes := xfer_split_bytes
-          xlen := xrem
+          assert(xrem > 0.U)
+          xlen := xrem - 1.U
           xrem := 0.U
         }
         .otherwise {
           state := sWriteCmd
           xfer_bytes := xfer_split_bytes
           xlen := xfer_split_pulses - 1.U
+          assert(xrem >= xfer_split_pulses)

Review comment:
       Maybe add some comments to these `assert`s?

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
   io.out.data.valid := valid.asUInt.andR
 }
 
-/** TensorAlu.
- *
- * This unit instantiate the ALU vector unit (AluVector) and go over the
- * micro-ops (uops) which are used to read the source operands (vectors)
- * from the acc-scratchpad and then they are written back the same
- * acc-scratchpad.
- */
-class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
+class TensorAluIndexGenerator(debug: Boolean = false)(implicit p: Parameters) extends Module {
+  val cnt_o_width = (new AluDecode).lp_0.getWidth
+  val cnt_i_width = (new AluDecode).lp_1.getWidth
+
+  val io = IO(new Bundle {
+    val start = Input(Bool())
+    val last = Output(Bool())
+    val dec = Input(new AluDecode)
+    val valid = Output(Bool())
+    val src_valid = Output(Bool())
+    val dst_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val src_idx = Output(UInt(new TensorParams(tensorType="acc").memAddrBits.W))
+    val uop_idx = Output(UInt(log2Ceil(p(CoreKey).uopMemDepth).W))
+    val cnt_o = Output(UInt(cnt_o_width.W))
+    val cnt_i = Output(UInt(cnt_i_width.W))
+  })
+
+  io.last := false.B
+
+  val running = RegInit( false.B)
+  val stutter = RegInit( false.B)
+
+  val advance = io.dec.alu_use_imm || stutter
+
+  when( !running && io.start) {
+    running := true.B
+  } .elsewhen( running && !advance) {
+    stutter := true.B
+  } .elsewhen( running && advance) {
+    when ( io.last) {
+      running := false.B
+    }
+    stutter := false.B
+  }
+
+  val cnt_i = Reg( chiselTypeOf(io.dec.lp_1))
+  val dst_i = Reg( chiselTypeOf(io.dst_idx))
+  val src_i = Reg( chiselTypeOf(io.src_idx))
+
+  val cnt_o = Reg( chiselTypeOf(io.dec.lp_0))
+  val dst_o = Reg( chiselTypeOf(io.dst_idx))
+  val src_o = Reg( chiselTypeOf(io.src_idx))
+
+  val uop_idx = Reg( chiselTypeOf(io.dec.uop_end))
+
+  io.valid := running && advance
+  io.src_valid := running && !advance
+  io.dst_idx := dst_i
+  io.src_idx := src_i
+  io.uop_idx := uop_idx
+  io.cnt_o := cnt_o
+  io.cnt_i := cnt_i
+
+  when( !running) {
+    cnt_i := 0.U; dst_i := 0.U; src_i := 0.U;
+    cnt_o := 0.U; dst_o := 0.U; src_o := 0.U;
+    uop_idx := io.dec.uop_begin
+  } .elsewhen (advance) {
+    when (uop_idx =/= io.dec.uop_end - 1.U) {
+      uop_idx := uop_idx + 1.U
+    }.otherwise {
+      uop_idx := io.dec.uop_begin
+      when ( cnt_i =/= io.dec.lp_1 - 1.U) {
+        cnt_i := cnt_i + 1.U
+        dst_i := dst_i + io.dec.dst_1
+        src_i := src_i + io.dec.src_1
+      }.otherwise {
+        when ( cnt_o =/= io.dec.lp_0 - 1.U) {
+          val dst_tmp = dst_o + io.dec.dst_0
+          val src_tmp = src_o + io.dec.src_0
+          cnt_o := cnt_o + 1.U
+          dst_o := dst_tmp
+          src_o := src_tmp
+          cnt_i := 0.U
+          dst_i := dst_tmp
+          src_i := src_tmp
+        } .otherwise {
+          io.last := true.B
+        }
+      }
+    }
+  }
+}
+
+class TensorAluIfc(implicit p: Parameters) extends Module {
   val aluBits = p(CoreKey).accBits
   val io = IO(new Bundle {
     val start = Input(Bool())
     val done = Output(Bool())
-    val inst = Input(UInt(INST_BITS.W))
+    val dec = Input(new AluDecode)
     val uop = new UopMaster
     val acc = new TensorMaster(tensorType = "acc")
     val out = new TensorMaster(tensorType = "out")
   })
+}
+
+class TensorAluPipelined(debug: Boolean = false)(implicit p: Parameters) extends TensorAluIfc {
+  val stateBits = 2
+  val inflightBits = 4
+  val dataSplitFactor = p(CoreKey).blockOutFactor
+
+  val sIdle::sRun::sWait::Nil = Enum(3)
+  val state = RegInit(init=sIdle)
+  val inflight = RegInit(0.U(inflightBits.W))
+
+  val index_generator = Module(new TensorAluIndexGenerator)
+  val aluDataReadPipeDelay = 0 // available for pipelining
+
+  // State Machine for compute io.done correctly
+  io.done := false.B
+  when( state === sIdle && io.start) {
+    state := sRun
+  }.elsewhen( state === sRun && index_generator.io.last) {
+    state := sWait
+  }.elsewhen( state === sWait && inflight === 0.U) {
+    state := sIdle
+    io.done := true.B
+  }
+
+  index_generator.io.start := io.start
+  index_generator.io.dec := io.dec
+
+  // second term works around funny clearing in uop register file flopped output
+  io.uop.idx.valid := index_generator.io.valid || index_generator.io.src_valid
+  io.uop.idx.bits := index_generator.io.uop_idx
+
+  val valid_001 = ShiftRegister( index_generator.io.valid, aluDataReadPipeDelay + 1, resetData=false.B, en = true.B)
+  val valid_002 = RegNext( valid_001, init=false.B)
+  val valid_003 = RegNext( valid_002, init=false.B)
+  val valid_004 = RegNext( valid_003, init=false.B)
+
+  when( index_generator.io.valid && valid_004) {
+  }.elsewhen( index_generator.io.valid) {
+    assert( inflight =/= ((1<<inflightBits)-1).U)
+    inflight := inflight + 1.U
+  }.elsewhen( valid_004) {
+    assert( inflight =/= 0.U)
+    inflight := inflight - 1.U
+  }
+  when( state === sIdle) {
+    assert( inflight === 0.U)
+    inflight := 0.U
+  }
+
+  val src_valid_001 = ShiftRegister(
+    index_generator.io.src_valid,
+    aluDataReadPipeDelay + 1,
+    resetData=false.B, en = true.B)
+  val src_valid_002 = RegNext( src_valid_001, init=false.B)
+  val src_valid_003 = RegNext( src_valid_002, init=false.B)
+  val src_valid_004 = RegNext( src_valid_003, init=false.B)
+
+  val dst_idx_001 = ShiftRegister( index_generator.io.dst_idx, aluDataReadPipeDelay + 1)
+  val src_idx_001 = ShiftRegister( index_generator.io.src_idx, aluDataReadPipeDelay + 1)
+
+  val uop_data_001 = ShiftRegister(io.uop.data, aluDataReadPipeDelay)
+
+  val dst_offset = uop_data_001.bits.u0
+
+  val w = dst_offset.getWidth
+  val u2 = uop_data_001.bits.u2.asTypeOf(UInt(w.W))
+  val s = log2Ceil(p(CoreKey).inpMemDepth)
+  val u1 = uop_data_001.bits.u1.asTypeOf(UInt(w.W))
+  val src_offset = (u2 << s) | u1
+
+  // split registers of stage 2 by data groups
+  //val accRdIdxValid = valid_002 || src_valid_002
+  val accRdIdxValid = valid_001 || src_valid_001
+  for (idx <- 0 until dataSplitFactor) {
+    //io.acc.rd(idx).idx.valid := accRdIdxValid
+    io.acc.rd(idx).idx.valid := RegNext(accRdIdxValid)
+  }
+
+  val new_src_idx_001 = src_idx_001 + src_offset
+  val src_idx_002 = RegNext( new_src_idx_001)
+  val src_idx_003 = RegNext( src_idx_002)
+
+  val new_dst_idx_001 = dst_idx_001 + dst_offset
+  val dst_idx_002 = RegNext( new_dst_idx_001)
+  val dst_idx_003 = RegNext( dst_idx_002)
+  val dst_idx_004 = RegNext( dst_idx_003)
+
+  // split registers of stage 2 by data groups
+  val accRdIdxBits = Mux( src_valid_001 || io.dec.alu_use_imm, new_src_idx_001, new_dst_idx_001)
+  for (idx <- 0 until dataSplitFactor) {
+    io.acc.rd(idx).idx.bits := RegNext(accRdIdxBits)
+    assert( io.acc.rd(idx).data.valid === (valid_003 || src_valid_003))
+  }
+
+  require(io.out.splitWidth == 1 && io.out.splitLength == 1, "-F- Out split write is not supported")
+  val numVecUnits = dataSplitFactor
+  val outData = Wire(io.out.wr(0).bits.data.cloneType)
+  val dataRemapB = Wire(Vec(numVecUnits, io.acc.rd(0).data.bits.cloneType))
+  val dataRemapA = Wire(Vec(numVecUnits, io.acc.rd(0).data.bits.cloneType))
+  // numVecUnits is a pow of 2
+  // split dec bits pipe further if there are many vecUnits
+  val decSplitNb0 =  if (numVecUnits < 8) 1 else 2
+  val decSplit0 = Wire(Vec(decSplitNb0, io.dec.cloneType))
+  for (idx <- 0 until decSplitNb0) {
+    decSplit0(idx) := ShiftRegister(io.dec, if(aluDataReadPipeDelay < 2) 0 else 1)
+  }
+
+  for (idx <- 0 until numVecUnits) {
+    val alu = Module(new AluVector)
+
+    for(aluLenIdx <- 0 until alu.io.acc_b.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_b.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_b.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        dataRemapB(idx)(aluLenIdx)(aluWdtIdx) :=
+          io.acc.rd(accGrpIdx).data.bits(accLenIdx)(accWdtIdx)
+      }
+    }
+    val save_src = RegNext(dataRemapB(idx))
+    val tensorImm = Wire(new TensorClientData(tensorType = "acc"))
+    tensorImm.data.valid := RegNext(valid_002) //valid_003 split
+    val tensorImmBits_piped = ShiftRegister(
+      decSplit0(idx/(numVecUnits/decSplitNb0)).alu_imm,
+      if(aluDataReadPipeDelay < 2) aluDataReadPipeDelay else aluDataReadPipeDelay -1)
+    tensorImm.data.bits.foreach { b =>
+      b.foreach { c =>
+        c := Mux(tensorImmBits_piped(C_ALU_IMM_BITS - 1),
+          Cat(-1.S((aluBits - C_ALU_IMM_BITS).W), tensorImmBits_piped), tensorImmBits_piped)
+      }
+    }
+
+    // alu
+    val tensorOpBits_piped = ShiftRegister(
+    decSplit0(idx/(numVecUnits/decSplitNb0)).alu_op,
+    if(aluDataReadPipeDelay < 2) aluDataReadPipeDelay else aluDataReadPipeDelay -1)
+    val isSHR = (tensorOpBits_piped === ALU_OP(3))
+    val neg_shift = isSHR & tensorImmBits_piped(C_ALU_IMM_BITS - 1)
+    val fixme_alu_op = Mux(
+      neg_shift,
+      ALU_OP(4), // use opcode = 4 for left shift
+      tensorOpBits_piped)
+    alu.io.opcode := fixme_alu_op
+
+    assert( !valid_003 || io.acc.rd(idx).data.valid)
+
+    alu.io.acc_a.data.valid := RegNext(valid_002) //valid_003 split
+
+    for(aluLenIdx <- 0 until alu.io.acc_a.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_a.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_a.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        dataRemapA(idx)(aluLenIdx)(aluWdtIdx) :=
+          io.acc.rd(accGrpIdx).data.bits(accLenIdx)(accWdtIdx)
+        alu.io.acc_a.data.bits := dataRemapA(idx)
+      }
+    }
+    val tensorUseImmBits_piped = ShiftRegister(
+    decSplit0(idx/(numVecUnits/decSplitNb0)).alu_use_imm,
+    if(aluDataReadPipeDelay < 2) aluDataReadPipeDelay else aluDataReadPipeDelay -1)
+    alu.io.acc_b.data.valid := Mux(tensorUseImmBits_piped,
+      tensorImm.data.valid,
+      valid_003)
+    alu.io.acc_b.data.bits := Mux(tensorUseImmBits_piped,
+      tensorImm.data.bits,
+      save_src)
+
+    assert( alu.io.acc_y.data.valid === valid_004)
+    io.acc.wr(idx).valid := RegNext(valid_003) //valid_004 split
+    io.acc.wr(idx).bits.idx := RegNext(dst_idx_003)//dst_idx_004 split
+
+    for(aluLenIdx <- 0 until alu.io.acc_y.lenSplit) {
+      for(aluWdtIdx <- 0 until alu.io.acc_y.widthSplit) {
+        val (accGrpIdx, accLenIdx, accWdtIdx) =
+          alu.io.acc_y.reindexDataFromGroup(idx, aluLenIdx, aluWdtIdx)
+        io.acc.wr(accGrpIdx).bits.data(accLenIdx)(accWdtIdx) :=
+          alu.io.acc_y.data.bits(aluLenIdx)(aluWdtIdx)
+      }
+    }
+
+    assert( alu.io.out.data.valid === valid_004)
+    for (idx1 <- 0 until io.out.tensorLength) {
+      for (idx2 <- 0 until io.out.tensorWidth/numVecUnits) {
+        outData(idx1)(idx*io.out.tensorWidth/numVecUnits + idx2) := alu.io.out.data.bits(idx1)(idx2)
+      }
+    }
+  }
+
+// comment for split write
+  io.out.wr(0).valid := valid_004
+  io.out.wr(0).bits.idx := dst_idx_004
+  io.out.wr(0).bits.data := outData
+  io.out.tieoffRead()
+
+  val bypass_dst = valid_003 && valid_004 && ( dst_idx_004 === dst_idx_003)
+  val bypass_src = src_valid_003 && valid_004 && ( dst_idx_004 === src_idx_003)
+
+  // Do we need a bypass
+  when ( bypass_dst) {
+    printf( "Bypass required on dst_idx read %x RAW with write %x\n", dst_idx_003, dst_idx_004)
+    assert( false.B, "DST bypass required")
+  }
+  when ( bypass_src) {
+    printf( "Bypass required on src_idx read %x RAW with write %x\n", src_idx_003, dst_idx_004)
+    assert( false.B, "SRC bypass required")
+  }
+}

Review comment:
       why not do `assert(!bypass_src, ...)` instead?
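       That is, a sketch of the suggested single-statement form:

       ```scala
       assert(!bypass_dst, "DST bypass required")
       assert(!bypass_src, "SRC bypass required")
       ```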

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+    assert( !valid_003 || io.acc.rd(idx).data.valid)

Review comment:
       remove extra space

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+  val bypass_dst = valid_003 && valid_004 && ( dst_idx_004 === dst_idx_003)
+  val bypass_src = src_valid_003 && valid_004 && ( dst_idx_004 === src_idx_003)

Review comment:
       Extra space in parens `( )`

##########
File path: hardware/chisel/src/main/scala/core/Compute.scala
##########
@@ -58,6 +60,9 @@ class Compute(debug: Boolean = false)(implicit p: Parameters) extends Module {
   val tensorGemm = Module(new TensorGemm)
   val tensorAlu = Module(new TensorAlu)
 
+  //try to use the acc closest to top IO

Review comment:
       add a space after the comment marker: `// try` instead of `//try`

##########
File path: hardware/chisel/src/main/scala/core/TensorGemm.scala
##########
@@ -240,10 +250,7 @@ class TensorGemm(debug: Boolean = false)(implicit p: Parameters) extends Module
       state := sExe
     }
     is(sExe) {
-      when(
-        (cnt_o === dec.lp_0 - 1.U) &&
-          (cnt_i === dec.lp_1 - 1.U) &&
-          (uop_idx === uop_end - 1.U)) {
+      when(cond) {

Review comment:
       maybe choose a more descriptive name for `cond`?
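       For example, a name that states what the condition detects (illustrative only):

       ```scala
       // all three loop counters have reached their final value
       val lastIteration = (cnt_o === dec.lp_0 - 1.U) &&
         (cnt_i === dec.lp_1 - 1.U) &&
         (uop_idx === uop_end - 1.U)
       ```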

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -308,3 +605,6 @@ class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends Module {
     }
   }
 }
+
+class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends TensorAluPipelined(debug)
+//class TensorAlu(debug: Boolean = false)(implicit p: Parameters) extends TensorAluOrig(debug)

Review comment:
       delete the commented-out line

##########
File path: hardware/chisel/src/main/scala/core/TensorAlu.scala
##########
@@ -97,38 +97,330 @@ class AluVector(implicit p: Parameters) extends Module {
+  // Do we need a bypass
+  when ( bypass_dst) {
+    printf( "Bypass required on dst_idx read %x RAW with write %x\n", dst_idx_003, dst_idx_004)

Review comment:
       extra space

##########
File path: hardware/chisel/src/test/scala/unittest/TensorAluTest.scala
##########
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package unittest
+
+import chisel3._
+import chisel3.util._
+import chisel3.iotesters.{ChiselFlatSpec, Driver, PeekPokeTester}
+import scala.util.Random
+import unittest.util._
+import vta.core._
+import vta.util.config._
+
+class TensorAluIndexGeneratorTester(c: TensorAluIndexGenerator, alu_use_imm : Int = 0) extends PeekPokeTester(c) {
+
+
+  val uop_begin = 0
+  val uop_end = 2
+  assert( uop_begin < uop_end)
+
+  val lp_0 = 2
+  val lp_1 = 3
+  val dst_0 = 1*lp_1
+  val src_0 = 2*lp_1
+  val dst_1 = 1
+  val src_1 = 2
+
+  poke( c.io.dec.reset, 0)
+  poke( c.io.dec.alu_use_imm, alu_use_imm)
+  poke( c.io.dec.uop_begin, uop_begin)
+  poke( c.io.dec.uop_end, uop_end)
+  poke( c.io.dec.lp_0, lp_0)
+  poke( c.io.dec.lp_1, lp_1)
+  poke( c.io.dec.dst_0, dst_0)
+  poke( c.io.dec.dst_1, dst_1)
+  poke( c.io.dec.src_0, src_0)
+  poke( c.io.dec.src_1, src_1)
+  // Don't need empty_0,{push,pop}_{next,prev},op
+
+
+  class Mocks {
+    val uop_indices = new scala.collection.mutable.Queue[BigInt]
+    val dst_indices = new scala.collection.mutable.Queue[BigInt]
+    val src_indices = new scala.collection.mutable.Queue[BigInt]
+
+    def logical_step() {
+      step(1)
+      if ( peek( c.io.valid) == 1) {
+        expect( c.io.uop_idx, uop_indices.dequeue())
+        expect( c.io.dst_idx, dst_indices.dequeue())
+      }
+      if ( peek( c.io.src_valid) == 1) {
+        expect( c.io.src_idx, src_indices.dequeue())
+      }
+    }
+
+    def test_if_done() {
+      println( s"uop_indices remaining: ${uop_indices.size}")
+      println( s"dst_indices remaining: ${dst_indices.size}")
+      println( s"src_indices remaining: ${src_indices.size}")
+      assert( uop_indices.isEmpty)
+      assert( dst_indices.isEmpty)
+      assert( src_indices.isEmpty)
+    }
+  }
+
+  val mocks = new Mocks
+  for { cnt_o <- 0 until lp_0
+        cnt_i <- 0 until lp_1
+        uop_idx <- uop_begin until uop_end} {
+    mocks.uop_indices.enqueue( uop_idx)
+    mocks.dst_indices.enqueue( dst_0*cnt_o + dst_1*cnt_i)
+    if (alu_use_imm == 0) {
+      mocks.src_indices.enqueue( src_0*cnt_o + src_1*cnt_i)
+    }
+  }
+
+  poke( c.io.start, 1)
+  mocks.logical_step()
+  poke( c.io.start, 0)
+
+  val end = (uop_end-uop_begin)*lp_0*lp_1
+  var count = 0
+  while( peek( c.io.last) == 0 && count < 10*end + 100) { 
+    mocks.logical_step()
+    count += 1
+  }
+  mocks.test_if_done()
+  step(1)
+}
+
+class TensorAluIndexGenerator_0_Test extends GenericTest( "TensorAluIndexGenerator_0", (p:Parameters) => new TensorAluIndexGenerator()(p), (c:TensorAluIndexGenerator) => new TensorAluIndexGeneratorTester(c, 0))
+
+class TensorAluIndexGenerator_1_Test extends GenericTest( "TensorAluIndexGenerator_1", (p:Parameters) => new TensorAluIndexGenerator()(p), (c:TensorAluIndexGenerator) => new TensorAluIndexGeneratorTester(c, 1))
+
+
+class TensorAluPipelinedTester(c: TensorAlu) extends PeekPokeTester(c) {
+  poke( c.io.start, 0)
+
+  val uop_begin = 0
+  val uop_end = 1
+  assert( uop_begin < uop_end)
+  val alu_use_imm = 1
+  val lp_0 = 2
+  val lp_1 = 3
+  val dst_0 = 1*lp_1
+  val src_0 = 2*lp_1
+  val dst_1 = 1
+  val src_1 = 2
+
+  val dst_offset = BigInt( "000", 16)
+  val src_offset = BigInt( "100", 16)
+
+  val u0 = dst_offset
+  val u1 = src_offset
+  val u2 = 0 // if src_offset is big, some bits go here
+
+  poke( c.io.dec.reset, 0)
+  poke( c.io.dec.alu_op, 2) // ADD or ADDI 1
+  poke( c.io.dec.alu_imm, 1)
+  poke( c.io.dec.alu_use_imm, alu_use_imm)
+  poke( c.io.dec.uop_begin, uop_begin)
+  poke( c.io.dec.uop_end, uop_end)
+  poke( c.io.dec.lp_0, lp_0)
+  poke( c.io.dec.lp_1, lp_1)
+  poke( c.io.dec.dst_0, dst_0)
+  poke( c.io.dec.dst_1, dst_1)
+  poke( c.io.dec.src_0, src_0)
+  poke( c.io.dec.src_1, src_1)
+
+  // Don't need empty_0,{push,pop}_{next,prev},op
+
+  poke( c.io.uop.data.bits.u0, u0)
+  poke( c.io.uop.data.bits.u1, u1)
+  poke( c.io.uop.data.bits.u2, u2)
+  
+  require(c.io.acc.splitWidth == 1, "-F- Test doesnt support acc data access split")
+  require(c.io.acc.splitLength == 1, "-F- Test doesnt support acc data access split")
+  
+  val acc = IndexedSeq.tabulate(c.io.acc.rd(0).data.bits(0).size){ i => BigInt(i) }
+  for { lhs <- c.io.acc.rd(0).data.bits} {
+    poke( lhs, acc.reverse)
+  }
+
+  class TensorMasterMock( tm: TensorMaster) {
+    poke( tm.rd(0).data.valid, 0)
+    var valid = peek(tm.rd(0).idx.valid)
+    def logical_step( v: Option[BigInt]) {
+      poke( tm.rd(0).data.valid, valid)
+      valid = peek( tm.rd(0).idx.valid)
+      for { x <- v} expect( tm.rd(0).idx.valid, x)
+    }
+  }
+
+  class UopMasterMock( um: UopMaster) {
+    poke( um.data.valid, 0)
+    var valid = peek( um.idx.valid)
+    def logical_step( v: Option[BigInt]) {
+      poke( um.data.valid, valid)
+      valid = peek( um.idx.valid)
+      for { x <- v} expect( um.idx.valid, x)
+    }
+  }
+
+  class Mocks {
+    val uop_mock = new UopMasterMock( c.io.uop)
+    val acc_mock = new TensorMasterMock( c.io.acc)
+
+    val uop_indices = new scala.collection.mutable.Queue[BigInt]
+    val acc_indices = new scala.collection.mutable.Queue[BigInt]
+    val accout_indices = new scala.collection.mutable.Queue[BigInt]
+    val out_indices = new scala.collection.mutable.Queue[BigInt]
+
+    def logical_step() {
+      step(1)
+      uop_mock.logical_step( None)
+      acc_mock.logical_step( None)
+      if ( peek( c.io.uop.idx.valid) == 1) {
+        expect( c.io.uop.idx.bits, uop_indices.dequeue())
+      }
+      if ( peek( c.io.acc.rd(0).idx.valid) == 1) {
+        expect( c.io.acc.rd(0).idx.bits, acc_indices.dequeue())
+      }
+      if ( peek( c.io.acc.wr(0).valid) == 1) {
+        expect( c.io.acc.wr(0).bits.idx, accout_indices.dequeue())
+      }
+      if ( peek( c.io.out.wr(0).valid) == 1) {
+        expect( c.io.out.wr(0).bits.idx, out_indices.dequeue())
+      }
+    }
+
+    def test_if_done() {
+      println( s"uop_indices remaining: ${uop_indices.size}")
+      println( s"acc_indices remaining: ${acc_indices.size}")
+      println( s"accout_indices remaining: ${accout_indices.size}")
+      println( s"out_indices remaining: ${out_indices.size}")
+      assert( uop_indices.isEmpty)
+      assert( acc_indices.isEmpty)
+      assert( accout_indices.isEmpty)
+      assert( out_indices.isEmpty)
+    }
+  }
+
+  val mocks = new Mocks
+  for { cnt_o <- 0 until lp_0
+        cnt_i <- 0 until lp_1
+        uop_idx <- uop_begin until uop_end} {
+    mocks.uop_indices.enqueue( uop_idx)
+    // if ( alu_use_imm == 0) {
+    //   mocks.acc_indices.enqueue( src_offset + src_0*cnt_o + src_1*cnt_i)
+    // }

Review comment:
       remove the commented-out code

##########
File path: hardware/chisel/src/main/scala/core/TensorGemm.scala
##########
@@ -327,40 +335,42 @@ class TensorGemm(debug: Boolean = false)(implicit p: Parameters) extends Module
   io.uop.idx.bits := uop_idx
 
   // inp
-  io.inp.rd.idx.valid := state === sReadTensor
-  io.inp.rd.idx.bits := uop_inp
+  io.inp.rd(0).idx.valid := state === sReadTensor
+  io.inp.rd(0).idx.bits := uop_inp
   io.inp.tieoffWrite() // read-only
 
   // wgt
-  io.wgt.rd.idx.valid := state === sReadTensor
-  io.wgt.rd.idx.bits := uop_wgt
+  io.wgt.rd(0).idx.valid := state === sReadTensor
+  io.wgt.rd(0).idx.bits := uop_wgt
   io.wgt.tieoffWrite() // read-only
 
   // acc_i
-  io.acc.rd.idx.valid := state === sReadTensor
-  io.acc.rd.idx.bits := uop_acc
+  io.acc.rd(0).idx.valid := state === sReadTensor
+  io.acc.rd(0).idx.bits := uop_acc
 
   // mvc
   mvc.io.reset := dec.reset & state === sExe
-  mvc.io.inp.data <> io.inp.rd.data
-  mvc.io.wgt.data <> io.wgt.rd.data
-  mvc.io.acc_i.data <> io.acc.rd.data
+  mvc.io.inp.data <> io.inp.rd(0).data
+  mvc.io.wgt.data <> io.wgt.rd(0).data
+  mvc.io.acc_i.data <> io.acc.rd(0).data
 
   // acc_o
-  io.acc.wr.valid := mvc.io.acc_o.data.valid &
+  io.acc.wr(0).valid := mvc.io.acc_o.data.valid &
     Mux(dec.reset, true.B, wrpipe.io.deq.valid)
-  io.acc.wr.bits.idx := Mux(dec.reset, uop_acc, wrpipe.io.deq.bits)
-  io.acc.wr.bits.data <> mvc.io.acc_o.data.bits
+  io.acc.wr(0).bits.idx := Mux(dec.reset, uop_acc, wrpipe.io.deq.bits)
+  io.acc.wr(0).bits.data <> mvc.io.acc_o.data.bits
 
   // out
-  io.out.wr.valid := mvc.io.out.data.valid & wrpipe.io.deq.valid
-  io.out.wr.bits.idx := wrpipe.io.deq.bits
-  io.out.wr.bits.data <> mvc.io.out.data.bits
+  io.out.wr(0).valid := mvc.io.out.data.valid & wrpipe.io.deq.valid
+  io.out.wr(0).bits.idx := wrpipe.io.deq.bits
+  io.out.wr(0).bits.data <> mvc.io.out.data.bits
   io.out.tieoffRead() // write-only
 
   io.done := done
 
-  if (debug) {
+  if ( debug) {

Review comment:
       extra space

##########
File path: hardware/chisel/src/main/scala/core/TensorUtil.scala
##########
@@ -58,10 +67,118 @@ class TensorParams(tensorType: String = "none")(implicit p: Parameters) extends
       p(CoreKey).wgtMemDepth
     else if (tensorType == "acc")
       p(CoreKey).accMemDepth
+    else if (tensorType == "fetch") {
+      require(p(ShellKey).memParams.dataBits >= INST_BITS,
+        "-F- Cannot make fetch tensor narrower than data pulse. TODO: narrow fetch with tensors")
+      // still should be one data line
+      (1 << p(ShellKey).memParams.lenBits)*(INST_BITS / 64)
+    }
+    else if (tensorType == "uop") {
+      p(CoreKey).uopMemDepth
+    }
     else
       p(CoreKey).outMemDepth
 
+  // acc/wgt parts are grouped to form
+  // a physically compact compute entity
+  //
+  val (splitLength, splitWidth) =
+    if (tensorType == "inp") {
+      (1, 1)
+    } else if (tensorType == "wgt") {
+      (p(CoreKey).blockOutFactor, 1)
+    } else if (tensorType == "acc") {
+      // acc scratchpad is batch rows of blockout columns
+      // GEMM/ALU operation group is based on wgt tiling of blockout
+      // means acc out of a group if batch > 1 is not
+      // continous data and may be placed into different memory
+      //modules. But the whole idea of a group to localize
+      // piece of wgt to piece of acc data transformation
+      //
+      (1, p(CoreKey).blockOutFactor)
+    } else if (tensorType == "fetch") {
+      (1, 1)
+    } else if (tensorType == "uop") {
+      (1, 1)
+    } else if (tensorType == "out") {
+      (1, 1) // narrow store doesnt support split
+    } else {
+      (1, 1)
+    }
+  require (splitLength == 1 || splitWidth == 1, "-F- Can split only one dimension.")
+
+  //provide index of a group closes to IO

Review comment:
       need a space after `//`

##########
File path: hardware/chisel/src/main/scala/core/TensorUtil.scala
##########
@@ -58,10 +67,118 @@ class TensorParams(tensorType: String = "none")(implicit p: Parameters) extends
       p(CoreKey).wgtMemDepth
     else if (tensorType == "acc")
       p(CoreKey).accMemDepth
+    else if (tensorType == "fetch") {
+      require(p(ShellKey).memParams.dataBits >= INST_BITS,
+        "-F- Cannot make fetch tensor narrower than data pulse. TODO: narrow fetch with tensors")
+      // still should be one data line
+      (1 << p(ShellKey).memParams.lenBits)*(INST_BITS / 64)
+    }
+    else if (tensorType == "uop") {
+      p(CoreKey).uopMemDepth
+    }
     else
       p(CoreKey).outMemDepth
 
+  // acc/wgt parts are grouped to form
+  // a physically compact compute entity
+  //
+  val (splitLength, splitWidth) =
+    if (tensorType == "inp") {
+      (1, 1)
+    } else if (tensorType == "wgt") {
+      (p(CoreKey).blockOutFactor, 1)
+    } else if (tensorType == "acc") {
+      // acc scratchpad is batch rows of blockout columns
+      // GEMM/ALU operation group is based on wgt tiling of blockout,
+      // which means acc data of a group with batch > 1 is not
+      // continuous and may be placed into different memory
+      //modules. But the whole idea of a group is to localize
+      // a piece of wgt to a piece of acc data transformation
+      //
+      (1, p(CoreKey).blockOutFactor)
+    } else if (tensorType == "fetch") {
+      (1, 1)
+    } else if (tensorType == "uop") {
+      (1, 1)
+    } else if (tensorType == "out") {
+      (1, 1) // narrow store doesn't support split
+    } else {
+      (1, 1)
+    }
+  require (splitLength == 1 || splitWidth == 1, "-F- Can split only one dimension.")
+
+  //provide index of a group closest to IO
+  // expect 2 columns of groups io on top and indexing from bottom
+  val closestIOGrpIdx =
+    if (tensorType == "inp") {
+      splitLength - 1
+    } else if (tensorType == "wgt") {
+      if (splitLength < 2) 0 else splitLength / 2 - 1
+    } else if (tensorType == "acc") {
+      if (splitWidth < 2) 0 else splitWidth / 2 - 1
+    } else if (tensorType == "fetch") {
+      0
+    } else if (tensorType == "uop") {
+      0
+    } else if (tensorType == "out") {
+      0
+    } else {
+      0
+    }
+
   val memAddrBits = log2Ceil(memDepth)
+
+  val tensorSizeBits = tensorLength * tensorWidth * tensorElemBits
+  val tsSizeRatio = tensorSizeBits / memBlockBits
+  val clSizeRatio = memBlockBits / tensorSizeBits
+
+  val lenSplit = tensorLength / splitLength // tensor rows in a group
+  val widthSplit = tensorWidth / splitWidth // tensor columns in a group
+  require(lenSplit > 0 && widthSplit > 0, "-F- wrong split")
+
+  // tensor considers groups as continuous data, gemm generates a data window
+  // Map data index from a window index to a continuous groups index
+  def reindexDataFromGroup (grpIdx : Int, lenIdx: Int, wdtIdx: Int) = {
+
+    val grpLen = lenSplit // tensor rows in a group
+    val grpWdt = widthSplit // tensor columns in a group
+    val srcGrpRow = grpIdx / splitWidth // group row
+    val srcGrpCol = grpIdx % splitWidth // group column
+    val tnzRow = srcGrpRow * grpLen
+    val tnzCol = srcGrpCol * grpWdt
+    val flatIdx = (tnzRow + lenIdx) * tensorWidth + tnzCol + wdtIdx
+
+    val outGroupIdx = flatIdx / (grpLen * grpWdt)
+    val outGroupOffset = flatIdx % (grpLen * grpWdt)
+    val outGroupLenIdx = outGroupOffset / grpWdt
+    val outGroupWdthIdx = outGroupOffset % grpWdt
+    (outGroupIdx, outGroupLenIdx, outGroupWdthIdx)
+  }
+  //map data index from a continuous to a window index

Review comment:
       need a space after "//"

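The reindexDataFromGroup arithmetic above is plain Scala evaluated at
elaboration time. A self-contained sketch, with hypothetical parameter values
(tensorWidth = 4, splitWidth = 2, lenSplit = 1, widthSplit = 2), showing how a
window-relative (group, row, column) index maps back to a continuous group
index:

    object ReindexDemo extends App {
      // Hypothetical parameters for illustration only.
      val tensorWidth = 4  // tensor columns
      val splitWidth  = 2  // groups along the width
      val lenSplit    = 1  // tensor rows per group
      val widthSplit  = 2  // tensor columns per group

      // Same arithmetic as TensorParams.reindexDataFromGroup.
      def reindex(grpIdx: Int, lenIdx: Int, wdtIdx: Int): (Int, Int, Int) = {
        val srcGrpRow = grpIdx / splitWidth  // group row in the window
        val srcGrpCol = grpIdx % splitWidth  // group column in the window
        val flatIdx = (srcGrpRow * lenSplit + lenIdx) * tensorWidth +
          srcGrpCol * widthSplit + wdtIdx    // flattened tensor element index
        val grpSize = lenSplit * widthSplit
        (flatIdx / grpSize, (flatIdx % grpSize) / widthSplit,
          (flatIdx % grpSize) % widthSplit)
      }

      // Window group 1, row 0, column 1 lands in continuous group 1 at (0, 1).
      println(reindex(grpIdx = 1, lenIdx = 0, wdtIdx = 1)) // prints (1,0,1)
    }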
##########
File path: hardware/chisel/src/test/scala/unittest/TensorAluTest.scala
##########
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package unittest
+
+import chisel3._
+import chisel3.util._
+import chisel3.iotesters.{ChiselFlatSpec, Driver, PeekPokeTester}
+import scala.util.Random
+import unittest.util._
+import vta.core._
+import vta.util.config._
+
+class TensorAluIndexGeneratorTester(c: TensorAluIndexGenerator, alu_use_imm : Int = 0) extends PeekPokeTester(c) {
+
+
+  val uop_begin = 0
+  val uop_end = 2
+  assert( uop_begin < uop_end)
+
+  val lp_0 = 2
+  val lp_1 = 3
+  val dst_0 = 1*lp_1
+  val src_0 = 2*lp_1
+  val dst_1 = 1
+  val src_1 = 2
+
+  poke( c.io.dec.reset, 0)
+  poke( c.io.dec.alu_use_imm, alu_use_imm)
+  poke( c.io.dec.uop_begin, uop_begin)
+  poke( c.io.dec.uop_end, uop_end)
+  poke( c.io.dec.lp_0, lp_0)
+  poke( c.io.dec.lp_1, lp_1)
+  poke( c.io.dec.dst_0, dst_0)
+  poke( c.io.dec.dst_1, dst_1)
+  poke( c.io.dec.src_0, src_0)
+  poke( c.io.dec.src_1, src_1)
+  // Don't need empty_0,{push,pop}_{next,prev},op
+
+
+  class Mocks {
+    val uop_indices = new scala.collection.mutable.Queue[BigInt]
+    val dst_indices = new scala.collection.mutable.Queue[BigInt]
+    val src_indices = new scala.collection.mutable.Queue[BigInt]
+
+    def logical_step() {
+      step(1)
+      if ( peek( c.io.valid) == 1) {
+        expect( c.io.uop_idx, uop_indices.dequeue())
+        expect( c.io.dst_idx, dst_indices.dequeue())
+      }
+      if ( peek( c.io.src_valid) == 1) {
+        expect( c.io.src_idx, src_indices.dequeue())
+      }
+    }
+
+    def test_if_done() {
+      println( s"uop_indices remaining: ${uop_indices.size}")
+      println( s"dst_indices remaining: ${dst_indices.size}")
+      println( s"src_indices remaining: ${src_indices.size}")
+      assert( uop_indices.isEmpty)
+      assert( dst_indices.isEmpty)
+      assert( src_indices.isEmpty)
+    }
+  }
+
+  val mocks = new Mocks
+  for { cnt_o <- 0 until lp_0
+        cnt_i <- 0 until lp_1
+        uop_idx <- uop_begin until uop_end} {
+    mocks.uop_indices.enqueue( uop_idx)
+    mocks.dst_indices.enqueue( dst_0*cnt_o + dst_1*cnt_i)
+    if (alu_use_imm == 0) {
+      mocks.src_indices.enqueue( src_0*cnt_o + src_1*cnt_i)
+    }
+  }
+
+  poke( c.io.start, 1)
+  mocks.logical_step()
+  poke( c.io.start, 0)
+
+  val end = (uop_end-uop_begin)*lp_0*lp_1
+  var count = 0
+  while( peek( c.io.last) == 0 && count < 10*end + 100) { 
+    mocks.logical_step()
+    count += 1
+  }
+  mocks.test_if_done()
+  step(1)
+}
+
+class TensorAluIndexGenerator_0_Test extends GenericTest( "TensorAluIndexGenerator_0", (p:Parameters) => new TensorAluIndexGenerator()(p), (c:TensorAluIndexGenerator) => new TensorAluIndexGeneratorTester(c, 0))
+
+class TensorAluIndexGenerator_1_Test extends GenericTest( "TensorAluIndexGenerator_1", (p:Parameters) => new TensorAluIndexGenerator()(p), (c:TensorAluIndexGenerator) => new TensorAluIndexGeneratorTester(c, 1))
+
+
+class TensorAluPipelinedTester(c: TensorAlu) extends PeekPokeTester(c) {
+  poke( c.io.start, 0)
+
+  val uop_begin = 0
+  val uop_end = 1
+  assert( uop_begin < uop_end)
+  val alu_use_imm = 1
+  val lp_0 = 2
+  val lp_1 = 3
+  val dst_0 = 1*lp_1
+  val src_0 = 2*lp_1
+  val dst_1 = 1
+  val src_1 = 2
+
+  val dst_offset = BigInt( "000", 16)
+  val src_offset = BigInt( "100", 16)
+
+  val u0 = dst_offset
+  val u1 = src_offset
+  val u2 = 0 // if src_offset is big, some bits go here
+
+  poke( c.io.dec.reset, 0)
+  poke( c.io.dec.alu_op, 2) // ADD or ADDI 1
+  poke( c.io.dec.alu_imm, 1)
+  poke( c.io.dec.alu_use_imm, alu_use_imm)
+  poke( c.io.dec.uop_begin, uop_begin)
+  poke( c.io.dec.uop_end, uop_end)
+  poke( c.io.dec.lp_0, lp_0)
+  poke( c.io.dec.lp_1, lp_1)
+  poke( c.io.dec.dst_0, dst_0)
+  poke( c.io.dec.dst_1, dst_1)
+  poke( c.io.dec.src_0, src_0)
+  poke( c.io.dec.src_1, src_1)
+
+  // Don't need empty_0,{push,pop}_{next,prev},op
+
+  poke( c.io.uop.data.bits.u0, u0)
+  poke( c.io.uop.data.bits.u1, u1)
+  poke( c.io.uop.data.bits.u2, u2)
+  
  require(c.io.acc.splitWidth == 1, "-F- Test doesn't support acc data access split")
  require(c.io.acc.splitLength == 1, "-F- Test doesn't support acc data access split")
+  
+  val acc = IndexedSeq.tabulate(c.io.acc.rd(0).data.bits(0).size){ i => BigInt(i) }
+  for { lhs <- c.io.acc.rd(0).data.bits} {
+    poke( lhs, acc.reverse)
+  }
+
+  class TensorMasterMock( tm: TensorMaster) {
+    poke( tm.rd(0).data.valid, 0)
+    var valid = peek(tm.rd(0).idx.valid)
+    def logical_step( v: Option[BigInt]) {
+      poke( tm.rd(0).data.valid, valid)
+      valid = peek( tm.rd(0).idx.valid)
+      for { x <- v} expect( tm.rd(0).idx.valid, x)
+    }
+  }
+
+  class UopMasterMock( um: UopMaster) {
+    poke( um.data.valid, 0)
+    var valid = peek( um.idx.valid)
+    def logical_step( v: Option[BigInt]) {
+      poke( um.data.valid, valid)
+      valid = peek( um.idx.valid)
+      for { x <- v} expect( um.idx.valid, x)
+    }
+  }
+
+  class Mocks {
+    val uop_mock = new UopMasterMock( c.io.uop)
+    val acc_mock = new TensorMasterMock( c.io.acc)
+
+    val uop_indices = new scala.collection.mutable.Queue[BigInt]
+    val acc_indices = new scala.collection.mutable.Queue[BigInt]
+    val accout_indices = new scala.collection.mutable.Queue[BigInt]
+    val out_indices = new scala.collection.mutable.Queue[BigInt]
+
+    def logical_step() {
+      step(1)
+      uop_mock.logical_step( None)
+      acc_mock.logical_step( None)
+      if ( peek( c.io.uop.idx.valid) == 1) {
+        expect( c.io.uop.idx.bits, uop_indices.dequeue())
+      }
+      if ( peek( c.io.acc.rd(0).idx.valid) == 1) {
+        expect( c.io.acc.rd(0).idx.bits, acc_indices.dequeue())
+      }
+      if ( peek( c.io.acc.wr(0).valid) == 1) {
+        expect( c.io.acc.wr(0).bits.idx, accout_indices.dequeue())
+      }
+      if ( peek( c.io.out.wr(0).valid) == 1) {
+        expect( c.io.out.wr(0).bits.idx, out_indices.dequeue())
+      }
+    }
+
+    def test_if_done() {
+      println( s"uop_indices remaining: ${uop_indices.size}")
+      println( s"acc_indices remaining: ${acc_indices.size}")
+      println( s"accout_indices remaining: ${accout_indices.size}")
+      println( s"out_indices remaining: ${out_indices.size}")
+      assert( uop_indices.isEmpty)
+      assert( acc_indices.isEmpty)
+      assert( accout_indices.isEmpty)
+      assert( out_indices.isEmpty)
+    }
+  }
+
+  val mocks = new Mocks
+  for { cnt_o <- 0 until lp_0
+        cnt_i <- 0 until lp_1
+        uop_idx <- uop_begin until uop_end} {
+    mocks.uop_indices.enqueue( uop_idx)
+    // if ( alu_use_imm == 0) {
+    //   mocks.acc_indices.enqueue( src_offset + src_0*cnt_o + src_1*cnt_i)
+    // }
+    mocks.acc_indices.enqueue( src_offset + src_0*cnt_o + src_1*cnt_i)
+    mocks.accout_indices.enqueue( dst_offset + dst_0*cnt_o + dst_1*cnt_i)
+    mocks.out_indices.enqueue( dst_offset + dst_0*cnt_o + dst_1*cnt_i)
+  }
+
+  poke( c.io.start, 0)

Review comment:
       extra space

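The TensorMasterMock and UopMasterMock above model a scratchpad read port with
one-cycle latency: the data.valid driven in the current cycle echoes the
idx.valid observed on the previous cycle. A stand-alone sketch of the same
idea (hypothetical class, independent of the tester harness):

    // Hypothetical one-cycle read-latency model; each call to cycle()
    // represents one clock edge.
    class OneCycleReadLatency {
      private var lastIdxValid = BigInt(0)  // request seen on the previous cycle
      def cycle(idxValid: BigInt): BigInt = {
        val dataValid = lastIdxValid        // response valid lags by one cycle
        lastIdxValid = idxValid
        dataValid
      }
    }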
##########
File path: hardware/chisel/src/test/scala/unittest/TensorAluTest.scala
##########
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package unittest
+
+import chisel3._
+import chisel3.util._
+import chisel3.iotesters.{ChiselFlatSpec, Driver, PeekPokeTester}
+import scala.util.Random
+import unittest.util._
+import vta.core._
+import vta.util.config._
+
+class TensorAluIndexGeneratorTester(c: TensorAluIndexGenerator, alu_use_imm : Int = 0) extends PeekPokeTester(c) {
+
+
+  val uop_begin = 0
+  val uop_end = 2
+  assert( uop_begin < uop_end)
+
+  val lp_0 = 2
+  val lp_1 = 3
+  val dst_0 = 1*lp_1
+  val src_0 = 2*lp_1
+  val dst_1 = 1
+  val src_1 = 2
+
+  poke( c.io.dec.reset, 0)
+  poke( c.io.dec.alu_use_imm, alu_use_imm)
+  poke( c.io.dec.uop_begin, uop_begin)
+  poke( c.io.dec.uop_end, uop_end)
+  poke( c.io.dec.lp_0, lp_0)
+  poke( c.io.dec.lp_1, lp_1)
+  poke( c.io.dec.dst_0, dst_0)
+  poke( c.io.dec.dst_1, dst_1)
+  poke( c.io.dec.src_0, src_0)
+  poke( c.io.dec.src_1, src_1)
+  // Don't need empty_0,{push,pop}_{next,prev},op
+
+
+  class Mocks {
+    val uop_indices = new scala.collection.mutable.Queue[BigInt]
+    val dst_indices = new scala.collection.mutable.Queue[BigInt]
+    val src_indices = new scala.collection.mutable.Queue[BigInt]
+
+    def logical_step() {
+      step(1)
+      if ( peek( c.io.valid) == 1) {
+        expect( c.io.uop_idx, uop_indices.dequeue())
+        expect( c.io.dst_idx, dst_indices.dequeue())
+      }
+      if ( peek( c.io.src_valid) == 1) {
+        expect( c.io.src_idx, src_indices.dequeue())
+      }
+    }
+
+    def test_if_done() {
+      println( s"uop_indices remaining: ${uop_indices.size}")
+      println( s"dst_indices remaining: ${dst_indices.size}")
+      println( s"src_indices remaining: ${src_indices.size}")
+      assert( uop_indices.isEmpty)
+      assert( dst_indices.isEmpty)
+      assert( src_indices.isEmpty)
+    }
+  }
+
+  val mocks = new Mocks
+  for { cnt_o <- 0 until lp_0
+        cnt_i <- 0 until lp_1
+        uop_idx <- uop_begin until uop_end} {
+    mocks.uop_indices.enqueue( uop_idx)
+    mocks.dst_indices.enqueue( dst_0*cnt_o + dst_1*cnt_i)
+    if (alu_use_imm == 0) {
+      mocks.src_indices.enqueue( src_0*cnt_o + src_1*cnt_i)
+    }
+  }
+
+  poke( c.io.start, 1)
+  mocks.logical_step()
+  poke( c.io.start, 0)
+
+  val end = (uop_end-uop_begin)*lp_0*lp_1
+  var count = 0
+  while( peek( c.io.last) == 0 && count < 10*end + 100) { 
+    mocks.logical_step()
+    count += 1
+  }
+  mocks.test_if_done()
+  step(1)
+}
+
+class TensorAluIndexGenerator_0_Test extends GenericTest( "TensorAluIndexGenerator_0", (p:Parameters) => new TensorAluIndexGenerator()(p), (c:TensorAluIndexGenerator) => new TensorAluIndexGeneratorTester(c, 0))
+
+class TensorAluIndexGenerator_1_Test extends GenericTest( "TensorAluIndexGenerator_1", (p:Parameters) => new TensorAluIndexGenerator()(p), (c:TensorAluIndexGenerator) => new TensorAluIndexGeneratorTester(c, 1))
+
+
+class TensorAluPipelinedTester(c: TensorAlu) extends PeekPokeTester(c) {
+  poke( c.io.start, 0)
+
+  val uop_begin = 0
+  val uop_end = 1
+  assert( uop_begin < uop_end)
+  val alu_use_imm = 1
+  val lp_0 = 2
+  val lp_1 = 3
+  val dst_0 = 1*lp_1
+  val src_0 = 2*lp_1
+  val dst_1 = 1
+  val src_1 = 2
+
+  val dst_offset = BigInt( "000", 16)
+  val src_offset = BigInt( "100", 16)
+
+  val u0 = dst_offset
+  val u1 = src_offset
+  val u2 = 0 // if src_offset is big, some bits go here
+
+  poke( c.io.dec.reset, 0)
+  poke( c.io.dec.alu_op, 2) // ADD or ADDI 1
+  poke( c.io.dec.alu_imm, 1)
+  poke( c.io.dec.alu_use_imm, alu_use_imm)
+  poke( c.io.dec.uop_begin, uop_begin)
+  poke( c.io.dec.uop_end, uop_end)
+  poke( c.io.dec.lp_0, lp_0)
+  poke( c.io.dec.lp_1, lp_1)
+  poke( c.io.dec.dst_0, dst_0)
+  poke( c.io.dec.dst_1, dst_1)
+  poke( c.io.dec.src_0, src_0)
+  poke( c.io.dec.src_1, src_1)
+
+  // Don't need empty_0,{push,pop}_{next,prev},op
+
+  poke( c.io.uop.data.bits.u0, u0)
+  poke( c.io.uop.data.bits.u1, u1)
+  poke( c.io.uop.data.bits.u2, u2)
+  
  require(c.io.acc.splitWidth == 1, "-F- Test doesn't support acc data access split")
  require(c.io.acc.splitLength == 1, "-F- Test doesn't support acc data access split")
+  
+  val acc = IndexedSeq.tabulate(c.io.acc.rd(0).data.bits(0).size){ i => BigInt(i) }
+  for { lhs <- c.io.acc.rd(0).data.bits} {
+    poke( lhs, acc.reverse)
+  }
+
+  class TensorMasterMock( tm: TensorMaster) {
+    poke( tm.rd(0).data.valid, 0)
+    var valid = peek(tm.rd(0).idx.valid)
+    def logical_step( v: Option[BigInt]) {
+      poke( tm.rd(0).data.valid, valid)
+      valid = peek( tm.rd(0).idx.valid)
+      for { x <- v} expect( tm.rd(0).idx.valid, x)
+    }
+  }
+
+  class UopMasterMock( um: UopMaster) {
+    poke( um.data.valid, 0)
+    var valid = peek( um.idx.valid)
+    def logical_step( v: Option[BigInt]) {
+      poke( um.data.valid, valid)
+      valid = peek( um.idx.valid)
+      for { x <- v} expect( um.idx.valid, x)
+    }
+  }
+
+  class Mocks {
+    val uop_mock = new UopMasterMock( c.io.uop)
+    val acc_mock = new TensorMasterMock( c.io.acc)
+
+    val uop_indices = new scala.collection.mutable.Queue[BigInt]
+    val acc_indices = new scala.collection.mutable.Queue[BigInt]
+    val accout_indices = new scala.collection.mutable.Queue[BigInt]
+    val out_indices = new scala.collection.mutable.Queue[BigInt]
+
+    def logical_step() {
+      step(1)
+      uop_mock.logical_step( None)
+      acc_mock.logical_step( None)
+      if ( peek( c.io.uop.idx.valid) == 1) {
+        expect( c.io.uop.idx.bits, uop_indices.dequeue())
+      }
+      if ( peek( c.io.acc.rd(0).idx.valid) == 1) {
+        expect( c.io.acc.rd(0).idx.bits, acc_indices.dequeue())
+      }
+      if ( peek( c.io.acc.wr(0).valid) == 1) {
+        expect( c.io.acc.wr(0).bits.idx, accout_indices.dequeue())
+      }
+      if ( peek( c.io.out.wr(0).valid) == 1) {
+        expect( c.io.out.wr(0).bits.idx, out_indices.dequeue())
+      }
+    }
+
+    def test_if_done() {
+      println( s"uop_indices remaining: ${uop_indices.size}")
+      println( s"acc_indices remaining: ${acc_indices.size}")
+      println( s"accout_indices remaining: ${accout_indices.size}")
+      println( s"out_indices remaining: ${out_indices.size}")
+      assert( uop_indices.isEmpty)
+      assert( acc_indices.isEmpty)
+      assert( accout_indices.isEmpty)
+      assert( out_indices.isEmpty)
+    }
+  }
+
+  val mocks = new Mocks
+  for { cnt_o <- 0 until lp_0
+        cnt_i <- 0 until lp_1
+        uop_idx <- uop_begin until uop_end} {
+    mocks.uop_indices.enqueue( uop_idx)
+    // if ( alu_use_imm == 0) {
+    //   mocks.acc_indices.enqueue( src_offset + src_0*cnt_o + src_1*cnt_i)
+    // }
+    mocks.acc_indices.enqueue( src_offset + src_0*cnt_o + src_1*cnt_i)
+    mocks.accout_indices.enqueue( dst_offset + dst_0*cnt_o + dst_1*cnt_i)
+    mocks.out_indices.enqueue( dst_offset + dst_0*cnt_o + dst_1*cnt_i)
+  }
+
+  poke( c.io.start, 0)
+
+  step( 1)
+
+  //expect( c.io.state, c.sIdle)

Review comment:
       delete this commented-out line

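The mock queues above encode an affine index stream. Under the
TensorAluIndexGeneratorTester parameters it can be enumerated in a few lines
of plain Scala (a sketch reusing the tester's values; nothing here is new to
the PR):

    // Enumerate the (uop_idx, dst_idx, src_idx) stream the tester expects.
    val uop_begin = 0; val uop_end = 2
    val lp_0 = 2; val lp_1 = 3
    val dst_0 = 1 * lp_1; val dst_1 = 1
    val src_0 = 2 * lp_1; val src_1 = 2
    for {
      cnt_o   <- 0 until lp_0
      cnt_i   <- 0 until lp_1
      uop_idx <- uop_begin until uop_end
    } println((uop_idx, dst_0 * cnt_o + dst_1 * cnt_i, src_0 * cnt_o + src_1 * cnt_i))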



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-vta] tmoreau89 commented on pull request #27: Chisel Pipelined ALU

Posted by GitBox <gi...@apache.org>.
tmoreau89 commented on pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#issuecomment-856369457


   Thank you @adavare and the team (@suvadeep89, @stevenmburns, @pasqoc, @adavare, @sjain12intel, @aasorokiin, and @zhenkuny) for these improvements on the VTA chisel design, and @vegaluisjose for reviewing the PR. The PR has been merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-vta] adavare commented on a change in pull request #27: Chisel Pipelined ALU

Posted by GitBox <gi...@apache.org>.
adavare commented on a change in pull request #27:
URL: https://github.com/apache/tvm-vta/pull/27#discussion_r646843304



##########
File path: hardware/chisel/src/main/scala/core/Compute.scala
##########
@@ -118,44 +123,102 @@ class Compute(debug: Boolean = false)(implicit p: Parameters) extends Module {
   loadUop.io.baddr := io.uop_baddr
   io.vme_rd(0) <> loadUop.io.vme_rd
   loadUop.io.uop.idx <> Mux(dec.io.isGemm, tensorGemm.io.uop.idx, tensorAlu.io.uop.idx)
+  assert( !tensorGemm.io.uop.idx.valid || !tensorAlu.io.uop.idx.valid)
 
   // acc
   tensorAcc.io.start := state === sIdle & start & dec.io.isLoadAcc
   tensorAcc.io.inst := inst_q.io.deq.bits
   tensorAcc.io.baddr := io.acc_baddr
-  tensorAcc.io.tensor.rd.idx <> Mux(dec.io.isGemm, tensorGemm.io.acc.rd.idx, tensorAlu.io.acc.rd.idx)
-  tensorAcc.io.tensor.wr <> Mux(dec.io.isGemm, tensorGemm.io.acc.wr, tensorAlu.io.acc.wr)
+  require(tensorAcc.io.tensor.lenSplit ==
+    tensorAcc.io.tensor.tensorLength, "-F- Expecting a whole batch in acc group")
+
+  // split factor of isGemm for many groups
+  val splitFactorL0 = pow(2,log2Ceil(tensorAcc.io.tensor.splitWidth) / 2).toInt
+  val splitFactorL1 = pow(2,log2Ceil(tensorAcc.io.tensor.splitWidth)
+    - log2Ceil(tensorAcc.io.tensor.splitWidth) / 2).toInt
+  require(splitFactorL0 * splitFactorL1 == tensorAcc.io.tensor.splitWidth)
+  val accRdSelectL0 = for (idx <- 0 until splitFactorL1) yield {
+    // can save 1 stage on small design
+    if (splitFactorL1 > 1) RegNext(dec.io.isGemm, init = false.B) else dec.io.isGemm
+  }
+
+  for (idx <- 0 until tensorAcc.io.tensor.splitWidth) {
+    tensorAcc.io.tensor.rd(idx).idx <> Mux(
+      RegNext(accRdSelectL0(idx/splitFactorL0), init = false.B),
+      tensorGemm.io.acc.rd(idx).idx,
+      tensorAlu.io.acc.rd(idx).idx)
+    tensorAcc.io.tensor.wr(idx) <> Mux(
+      RegNext(accRdSelectL0(idx/splitFactorL0), init = false.B),
+      tensorGemm.io.acc.wr(idx),
+      tensorAlu.io.acc.wr(idx))
+  }
   io.vme_rd(1) <> tensorAcc.io.vme_rd
-  io.acc_wr_event := tensorAcc.io.tensor.wr.valid
+  io.acc_wr_event := tensorAcc.io.tensor.wr(topAccGrpIdx).valid
 
   // gemm
-  tensorGemm.io.start := state === sIdle & start & dec.io.isGemm
-  tensorGemm.io.inst := inst_q.io.deq.bits
+  tensorGemm.io.start := RegNext(state === sIdle & start & dec.io.isGemm, init = false.B)
+  tensorGemm.io.dec := inst_q.io.deq.bits.asTypeOf(new GemmDecode)
   tensorGemm.io.uop.data.valid := loadUop.io.uop.data.valid & dec.io.isGemm
   tensorGemm.io.uop.data.bits <> loadUop.io.uop.data.bits
   tensorGemm.io.inp <> io.inp
   tensorGemm.io.wgt <> io.wgt
-  tensorGemm.io.acc.rd.data.valid := tensorAcc.io.tensor.rd.data.valid & dec.io.isGemm
-  tensorGemm.io.acc.rd.data.bits <> tensorAcc.io.tensor.rd.data.bits
-  tensorGemm.io.out.rd.data.valid := io.out.rd.data.valid & dec.io.isGemm
-  tensorGemm.io.out.rd.data.bits <> io.out.rd.data.bits
+  for (idx <- 0 until tensorGemm.io.acc.splitWidth) {
+    tensorGemm.io.acc.rd(idx).data.valid :=
+      tensorAcc.io.tensor.rd(idx).data.valid & RegNext(dec.io.isGemm, init = false.B)
+    tensorGemm.io.acc.rd(idx).data.bits <> tensorAcc.io.tensor.rd(idx).data.bits
+  }
+  for (idx <- 0 until tensorGemm.io.out.splitWidth) {
+    tensorGemm.io.out.rd(idx).data.valid :=
+      io.out.rd(idx).data.valid & RegNext(dec.io.isGemm, init = false.B)
+    tensorGemm.io.out.rd(idx).data.bits <> io.out.rd(idx).data.bits
+  }
 
   // alu
-  tensorAlu.io.start := state === sIdle & start & dec.io.isAlu
-  tensorAlu.io.inst := inst_q.io.deq.bits
+  tensorAlu.io.start := RegNext(state === sIdle & start & dec.io.isAlu, init = false.B)
+  tensorAlu.io.dec := inst_q.io.deq.bits.asTypeOf(new AluDecode)
   tensorAlu.io.uop.data.valid := loadUop.io.uop.data.valid & dec.io.isAlu
   tensorAlu.io.uop.data.bits <> loadUop.io.uop.data.bits
-  tensorAlu.io.acc.rd.data.valid := tensorAcc.io.tensor.rd.data.valid & dec.io.isAlu
-  tensorAlu.io.acc.rd.data.bits <> tensorAcc.io.tensor.rd.data.bits
-  tensorAlu.io.out.rd.data.valid := io.out.rd.data.valid & dec.io.isAlu
-  tensorAlu.io.out.rd.data.bits <> io.out.rd.data.bits
+  for (idx <- 0 until tensorAlu.io.acc.splitWidth) {
+    tensorAlu.io.acc.rd(idx).data.valid :=
+      tensorAcc.io.tensor.rd(idx).data.valid & RegNext(dec.io.isAlu, init = false.B)
+    tensorAlu.io.acc.rd(idx).data.bits <> tensorAcc.io.tensor.rd(idx).data.bits
+  }
+  for (idx <- 0 until tensorAlu.io.out.splitWidth) {
+    tensorAlu.io.out.rd(idx).data.valid :=
+      io.out.rd(idx).data.valid & RegNext(dec.io.isAlu, init = false.B)
+    tensorAlu.io.out.rd(idx).data.bits <> io.out.rd(idx).data.bits
+  }
 
   // out
-  io.out.rd.idx <> Mux(dec.io.isGemm,
-    tensorGemm.io.out.rd.idx,
-    tensorAlu.io.out.rd.idx)
-  io.out.wr <> Mux(dec.io.isGemm, tensorGemm.io.out.wr, tensorAlu.io.out.wr)
+  for (idx <- 0 until tensorGemm.io.out.splitWidth) {
+    io.out.rd(idx).idx <> Mux(dec.io.isGemm,
+      tensorGemm.io.out.rd(idx).idx,
+      tensorAlu.io.out.rd(idx).idx)
+    assert( !tensorGemm.io.out.rd(idx).idx.valid || !tensorAlu.io.out.rd(idx).idx.valid)
+    assert( !tensorGemm.io.out.rd(idx).data.valid || !tensorAlu.io.out.rd(idx).data.valid)
 
+    assert( !tensorGemm.io.out.wr(idx).valid || !tensorAlu.io.out.wr(idx).valid)
+  }
+  require (tensorGemm.io.out.splitWidth == 1)
+  require (tensorAlu.io.out.splitWidth == 1)
+  io.out.wr(0).valid := Mux(
+    RegNext(dec.io.isGemm, init = false.B), tensorGemm.io.out.wr(0).valid, tensorAlu.io.out.wr(0).valid)
+  io.out.wr(0).bits.idx := Mux(
+    RegNext(dec.io.isGemm, init = false.B), tensorGemm.io.out.wr(0).bits.idx, tensorAlu.io.out.wr(0).bits.idx)
+  //put mux/Reg into every gemm group to build pipe (for Mux select) tree over distance

Review comment:
       All "//\S" occurrences replaced throughout PR

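Two details of this Compute.scala hunk are worth spelling out. First,
splitFactorL0 * splitFactorL1 factors the (power-of-two) splitWidth into two
roughly equal levels, which lets the isGemm select fan out as a small
registered tree rather than a single high-fanout wire. Second, each group then
muxes between the GEMM and ALU masters using a locally registered copy of the
select. A minimal Chisel sketch of the per-group idea (hypothetical module and
signal names, not the actual Compute wiring):

    import chisel3._

    // Fan a 1-bit select out to `groups` registered copies so each group
    // muxes between two sources with a locally registered select.
    class PipedSelectMux(groups: Int, w: Int) extends Module {
      val io = IO(new Bundle {
        val sel = Input(Bool())
        val a   = Input(Vec(groups, UInt(w.W)))
        val b   = Input(Vec(groups, UInt(w.W)))
        val y   = Output(Vec(groups, UInt(w.W)))
      })
      for (g <- 0 until groups) {
        val selReg = RegNext(io.sel, init = false.B)  // one register per group
        io.y(g) := Mux(selReg, io.a(g), io.b(g))      // local, short select wire
      }
    }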



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org