Posted to dev@parquet.apache.org by Nicolas Troncoso <nt...@gmail.com> on 2018/08/25 00:10:18 UTC

JVM gets killed when loading parquet file

Hi,
I'm loading a roughly 960 MB parquet file:

maeve:$ parquet-tools rowcount -d part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet row count: 77318
Total RowCount: 77318

maeve:$ parquet-tools size -d part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet: 1009297251 bytes
Total Size: 1009297251 bytes

with the following Java code snippet:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

// D, LongRunningTaskTracker, logger, parseBean, getBatchSize and
// storeAndClearBatch come from the enclosing class (not shown).
public long importProcessParquetFile(ParquetFileReader parquetReader,
                                     LongRunningTaskTracker tracker) throws IOException {
    long records = 0;
    PageReadStore pages = null;
    List<D> batch = new ArrayList<>();
    MessageType schema = parquetReader.getFooter().getFileMetaData().getSchema();
    logger.warn("Got Schema");
    // Read the file one row group at a time.
    while (null != (pages = parquetReader.readNextRowGroup())) {
        MessageColumnIO columnIo = new ColumnIOFactory().getColumnIO(schema);
        logger.warn("Got columnIo");
        RecordReader<Group> recordReader =
            columnIo.getRecordReader(pages, new GroupRecordConverter(schema));
        // ^^^^^^ this line causes the OOM kill on the production environment.
        logger.warn("Got recordReader");
        // Materialize every record of the current row group and store them in batches.
        for (int i = 0; i < pages.getRowCount(); i++) {
            D bean = parseBean(recordReader.read());
            if (bean == null) {
                logger.warn("Could not create bean while importing Zillow Region Master",
                    new Throwable());
                continue;
            }
            batch.add(bean);
            if (batch.size() >= getBatchSize()) {
                records += storeAndClearBatch(batch);
                tracker.heartbeat(records);
            }
        }
    }
    // Flush whatever is left in the last, partially filled batch.
    records += storeAndClearBatch(batch);
    tracker.heartbeat(records);
    return records;
}
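
One thing I have considered but not tried yet is requesting only the columns I actually need, assuming I'm reading ParquetFileReader.setRequestedSchema() right (i.e. that it limits which column chunks readNextRowGroup() pulls in). The column names below are placeholders, not my real schema; this is just a rough sketch of how I'd wire it into the method above:

    // Before the readNextRowGroup() loop: build a projection from the file schema.
    // "region_id" and "region_name" are placeholder column names.
    MessageType projection = new MessageType("projection",
        schema.getType("region_id"),
        schema.getType("region_name"));
    parquetReader.setRequestedSchema(projection);

    // Inside the loop: build the record reader against the projection
    // instead of the full file schema.
    MessageColumnIO columnIo = new ColumnIOFactory().getColumnIO(projection);
    RecordReader<Group> recordReader =
        columnIo.getRecordReader(pages, new GroupRecordConverter(projection));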

The production environment has 16 GB of RAM and no swap.

I'm clearly not understanding something: it's a 960 MB file, and even if it
got fully loaded into memory there should be more than enough room to do the
processing.

If I run it on my dev machine with a swap file, it runs to completion.
I'm trying to understand why the memory footprint gets so big, and whether
there is a more efficient way to read the file.
Or maybe there is a more efficient way to create the file?
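
For what it's worth, the other read path I'm aware of is the higher-level ParquetReader with GroupReadSupport, which hands back one Group at a time. I haven't tried it on the production box, so the sketch below (the method name is just for illustration, and the batching/heartbeat from my method is omitted) may or may not change the memory profile:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public long importWithParquetReader(Path file) throws IOException {
        long records = 0;
        // GroupReadSupport materializes one Group per record; no explicit
        // row-group or page handling in the calling code.
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), file).build()) {
            Group group;
            while ((group = reader.read()) != null) {
                // parseBean(group), batching and storeAndClearBatch() would go here,
                // same as in importProcessParquetFile above.
                records++;
            }
        }
        return records;
    }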

The file was created with parquet-mr 1.8 and is being read with
parquet-hadoop 1.9.

cheers.