Posted to user@orc.apache.org by Living Zhang <li...@gmail.com> on 2018/11/14 10:09:05 UTC

Performance issue when writing ORC file output from a Hadoop map/reduce task

I write ORC files as the output of my Hadoop task. My schema contains nested
structs and several lists (about four). The length of each list is between 0
and 200. My task input is also an ORC file with a simple struct schema.
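
For reference, a schema of the shape described here (nested structs plus
roughly four list fields, each holding up to ~200 elements) might look like
the following ORC type-description string. The field names are hypothetical,
not from the original post:

    struct<
      id:bigint,
      info:struct<name:string,score:double>,
      events:array<struct<ts:bigint,value:double>>,
      tags:array<string>,
      readings:array<double>,
      labels:array<int>
    >

Such a string can be parsed into a schema with ORC's
`TypeDescription.fromString(...)` and passed to the writer through
`OrcOutputFormat` in the orc-mapreduce module.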

The situation is that when the mappers begin to run, they all get stuck at
1.67% progress; after about 20 minutes, progress starts to move forward.

I tried to narrow down the cause:

   1. I commented out the `context.write` call. The whole map/reduce job
   finished within 10 minutes, and the stuck phase disappeared.
   2. I wrote out empty lists instead. The job still finished quickly.

So it seems the large lists are the cause of the problem. However, my
questions are:

   1. Will an ORC file with lists of length 200 cause performance
   problems? If so, are there any solutions?
   2. Why does progress get stuck at the beginning and only start moving
   after 20 minutes?

The versions are: orc-mapreduce 1.5.2, hadoop-mapreduce-client-core 2.8.0.