Posted to user@arrow.apache.org by Jiangtao Peng <pe...@gmail.com> on 2022/09/14 06:34:46 UTC
[C++][Gandiva] string expression evaluation performance issue using mimalloc

Hi all,



Arrow uses jemalloc as its default memory allocator. For various reasons, I
need to use mimalloc instead, but there seems to be a large performance
difference between the two allocators.



Here are my steps.



I built Arrow with these simple CMake options, once with jemalloc enabled and
once with mimalloc enabled:

-DCMAKE_BUILD_TYPE=debug \
-DARROW_JEMALLOC=OFF|ON \
-DARROW_MIMALLOC=ON|OFF \
-DARROW_GANDIVA=ON \
-DARROW_GANDIVA_STATIC_LIBSTDCPP=ON \
-DARROW_BUILD_TESTS=ON



Then I wrote a simple test case:

void TestPerf(int64_t char_length, int64_t num_records) {
  // schema for input fields
  auto field_a = field("a", utf8());
  auto schema = arrow::schema({field_a});

  // output fields
  auto res = field("res", utf8());

  // Build the expression res = upper(a)
  auto node_a = TreeExprBuilder::MakeField(field_a);
  auto upper_a = TreeExprBuilder::MakeFunction("upper", {node_a}, utf8());
  auto expr = TreeExprBuilder::MakeExpression(upper_a, res);

  // Build a projector for the expressions.
  std::shared_ptr<Projector> projector;
  auto status = Projector::Make(schema, {expr}, TestConfiguration(), &projector);
  ASSERT_TRUE(status.ok()) << status.message();

  // Input: num_records identical strings, each char_length characters long
  std::string val(char_length, 'a');
  arrow::StringBuilder builder;
  for (int64_t i = 0; i < num_records; i++) {
    ASSERT_TRUE(builder.Append(val).ok());
  }
  std::shared_ptr<arrow::StringArray> array_a;
  ASSERT_TRUE(builder.Finish(&array_a).ok());

  // prepare input record batch
  auto in_batch = arrow::RecordBatch::Make(schema, num_records, {array_a});
  auto start_epoch = std::chrono::duration_cast<std::chrono::milliseconds>(
                         std::chrono::system_clock::now().time_since_epoch())
                         .count();
  // Evaluate expression
  arrow::ArrayVector outputs;
  status = projector->Evaluate(*in_batch, arrow::default_memory_pool(), &outputs);
  EXPECT_TRUE(status.ok()) << status.message();
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::system_clock::now().time_since_epoch())
                       .count() -
                   start_epoch
            << "ms" << std::endl;
}



TestPerf(20, 10000);
TestPerf(20, 100000);
TestPerf(200, 10000);
TestPerf(200, 100000);
TestPerf(2000, 10000);
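For timing a single call such as the Evaluate step, a monotonic clock is generally a safer choice than the wall clock, because it cannot jump if the system time is adjusted during the run. A minimal sketch, assuming C++11 or later; `TimeMs` is a hypothetical helper name, not an Arrow or Gandiva API:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of Arrow or Gandiva): invokes fn once and
// returns the elapsed time in whole milliseconds. std::chrono::steady_clock
// is monotonic, so the result is not affected if the system (wall) clock is
// changed while fn is running.
template <typename Fn>
std::int64_t TimeMs(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto end = std::chrono::steady_clock::now();
  assert(end >= start);  // guaranteed by steady_clock's monotonicity
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
      .count();
}
```

In the test case, the timed region could then be written as `auto ms = TimeMs([&] { status = projector->Evaluate(*in_batch, arrow::default_memory_pool(), &outputs); });`, which keeps the measurement limited to the Evaluate call.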





This case evaluates the expression "upper(a)", where each value of "a" is 20,
200, or 2000 characters long. Evaluation times:



| char_length | num_records | mimalloc (ms) | jemalloc (ms) |
|-------------|-------------|---------------|---------------|
|          20 |       10000 |            29 |             3 |
|          20 |      100000 |          2686 |            26 |
|         200 |       10000 |           954 |            11 |
|         200 |      100000 |        220153 |           118 |
|        2000 |       10000 |         21162 |            89 |
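One way to read these numbers is to compare how the time grows when only num_records increases by 10x at a fixed char_length. The sketch below only does that arithmetic on the timings reported in the table (no new measurements); `ScalingRatio`, `Row`, and `PrintScaling` are hypothetical names used for illustration. The 2000-character row has no 100000-record counterpart, so it is left out.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of two timings at the same char_length, where the second run has
// 10x as many records. Linear scaling would give a ratio near 10; roughly
// quadratic scaling in num_records would give a ratio near 100.
double ScalingRatio(double ms_small, double ms_large) {
  assert(ms_small > 0);  // avoid division by zero
  return ms_large / ms_small;
}

// Timings copied from the table (milliseconds).
struct Row {
  int char_length;
  double mimalloc_10k, mimalloc_100k;
  double jemalloc_10k, jemalloc_100k;
};

const Row kRows[] = {
    {20, 29, 2686, 3, 26},
    {200, 954, 220153, 11, 118},
};

// Prints the 10k -> 100k growth factor for each allocator.
void PrintScaling() {
  for (const Row& r : kRows) {
    std::printf("char_length=%d: mimalloc x%.1f, jemalloc x%.1f\n",
                r.char_length, ScalingRatio(r.mimalloc_10k, r.mimalloc_100k),
                ScalingRatio(r.jemalloc_10k, r.jemalloc_100k));
  }
}
```

Calling `PrintScaling()` gives growth factors of about 92.6 and 230.8 with mimalloc versus 8.7 and 10.7 with jemalloc. The jemalloc timings grow roughly linearly with num_records, while the mimalloc timings grow much faster than linearly, which suggests the mimalloc slowdown is not just a constant per-allocation overhead.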



Is this performance gap expected, or are there other build options I should be
aware of? How can I improve performance when using mimalloc?



Best,

Jiangtao