Posted to dev@parquet.apache.org by Zoltan Ivanfi <zi...@cloudera.com.INVALID> on 2018/09/10 12:37:57 UTC

Re: JVM gets killed when loading parquet file

Hi Nicolas,

Have you tried increasing the maximum Java heap size?

https://stackoverflow.com/a/15517399/5613485
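
For example, on the command line (the jar name and flag values here are only
illustrative; pick limits that fit the machine and workload):

java -Xms2g -Xmx8g -jar import-job.jar

Keep in mind that -Xmx only caps the Java heap; the JVM also uses native
memory for thread stacks, metaspace and direct buffers, so the total process
footprint will be somewhat larger than the heap limit.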

Br,

Zoltan

On Wed, Aug 29, 2018 at 8:39 PM Nicolas Troncoso <nt...@gmail.com> wrote:

> > I'm clearly not understanding something; it's a 960MB file. Even if it
> > got fully loaded into memory, it should have more than enough to do the
> > processing.
>
> I think the culprit of my problems is the size of the row groups. I'm trying
> to get the people who generate these files to make the row group size
> smaller. These files are only for transporting data from one place to
> another and are not intended for columnar manipulation, so there is not much
> gain in having massive row groups.
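
A minimal sketch of how the writer side could cap the row group size with
parquet-mr's example API; the schema, output path and the 16 MB target below
are placeholders, not values from this thread:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SmallRowGroupWriterSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema; the real one comes from whoever produces the files.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required binary name (UTF8); required int64 value; }");

        try (ParquetWriter<Group> writer =
                ExampleParquetWriter.builder(new Path("/tmp/smaller-row-groups.parquet"))
                        .withType(schema)
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        // Target ~16 MB row groups instead of the 128 MB default.
                        .withRowGroupSize(16 * 1024 * 1024)
                        .build()) {
            SimpleGroupFactory groups = new SimpleGroupFactory(schema);
            writer.write(groups.newGroup().append("name", "example").append("value", 1L));
        }
    }
}

If the files are produced by Spark or another Hadoop-based writer, the same
knob is usually exposed as the configuration key parquet.block.size.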
>
> On that subject: is it possible to read a row group in parts to avoid
> having the whole thing in memory?
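
As far as I know, readNextRowGroup() pulls in the requested column chunks of
a whole row group as a unit, so the main lever on the read side is to request
fewer columns. A minimal sketch using the higher-level ParquetReader with a
projection pushed through the configuration; the path and projection schema
are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ProjectedReaderSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Only materialize the listed columns; other column chunks are not read.
        conf.set(ReadSupport.PARQUET_READ_SCHEMA,
                "message projection { required binary name (UTF8); required int64 value; }");

        try (ParquetReader<Group> reader = ParquetReader
                .builder(new GroupReadSupport(), new Path("/path/to/file.snappy.parquet"))
                .withConf(conf)
                .build()) {
            Group record;
            while ((record = reader.read()) != null) {
                // Process one assembled record at a time.
            }
        }
    }
}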
>
>
> On Sun, Aug 26, 2018 at 9:10 PM Nicolas Troncoso <nt...@gmail.com>
> wrote:
>
> > Hi,
> > I'm loading a ~900 MB parquet file:
> >
> > maeve:$ parquet-tools rowcount -d
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet row
> > count: 77318
> > Total RowCount: 77318
> >
> > maeve:$ parquet-tools size -d
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet:
> > 1009297251 bytes
> > Total Size: 1009297251 bytes
> >
> > with the following Java code snippet:
> >
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import org.apache.parquet.column.page.PageReadStore;
> > import org.apache.parquet.example.data.Group;
> > import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
> > import org.apache.parquet.hadoop.ParquetFileReader;
> > import org.apache.parquet.io.ColumnIOFactory;
> > import org.apache.parquet.io.MessageColumnIO;
> > import org.apache.parquet.io.RecordReader;
> > import org.apache.parquet.schema.MessageType;
> >
> > public long importProcessParquetFile(ParquetFileReader parquetReader,
> >                                      LongRunningTaskTracker tracker)
> >         throws IOException {
> >     long records = 0;
> >     PageReadStore pages = null;
> >     List<D> batch = new ArrayList<>();
> >     MessageType schema =
> >             parquetReader.getFooter().getFileMetaData().getSchema();
> >     logger.warn("Got Schema");
> >     // Each iteration loads one entire row group into memory.
> >     while (null != (pages = parquetReader.readNextRowGroup())) {
> >         MessageColumnIO columnIo = new ColumnIOFactory().getColumnIO(schema);
> >         logger.warn("Got columnIo");
> >         RecordReader<Group> recordReader = columnIo.getRecordReader(pages,
> >                 new GroupRecordConverter(schema));
> >         // ^^^^^^ this line causes the OOM kill on the production environment.
> >         logger.warn("Got recordReader");
> >         for (long i = 0; i < pages.getRowCount(); i++) {
> >             D bean = parseBean(recordReader.read());
> >             if (bean == null) {
> >                 logger.warn("Could not create bean while importing Zillow Region Master",
> >                         new Throwable());
> >                 continue;
> >             }
> >             batch.add(bean);
> >             if (batch.size() >= getBatchSize()) {
> >                 records += storeAndClearBatch(batch);
> >                 tracker.heartbeat(records);
> >             }
> >         }
> >     }
> >     // Flush whatever is left in the last partial batch.
> >     records += storeAndClearBatch(batch);
> >     tracker.heartbeat(records);
> >     return records;
> > }
> >
> > The production environment has 16 GB of RAM and no swap.
> >
> > I'm clearly not understanding something; it's a 960MB file. Even if it
> > got fully loaded into memory, it should have more than enough to do the
> > processing.
> >
> > If I run it on my dev machine with a swap file, I can run it to
> > completion. I'm trying to understand why the memory footprint gets so
> > big, and whether there is a more efficient way to read the file.
> > Maybe there is a more efficient way to create the file?
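
One way to check whether a single oversized row group is behind the blow-up
is to dump the footer metadata, e.g.:

parquet-tools meta part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet

The meta command lists each row group with its row count and the compressed
and uncompressed size of every column chunk, so a file whose ~960 MB sit in
one or two row groups shows up immediately.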
> >
> > The file was created with parquet-mr 1.8 and is being read with
> > parquet-hadoop 1.9.
> >
> > cheers.
> >
> >
>

Re: JVM gets killed when loading parquet file

Posted by Nicolas Troncoso <nt...@gmail.com>.
Hi,
Sorry, this email got lost in my inbox.

The maximum Java heap size is big enough. The JVM is getting killed by the OS
because the server is physically running out of memory. I'm currently
working around the issue with a big enough swap file.
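
If it helps to confirm that the kernel OOM killer (rather than a Java-level
OutOfMemoryError) is what terminates the process, the kill is recorded in the
kernel log, e.g.:

dmesg | grep -i 'killed process'

A process killed this way receives SIGKILL and exits without a Java stack
trace or heap dump.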

This works in the meantime. The creators of the files I'm importing told
me to investigate `sqoop`. I will be doing that at some point.

On Mon, Sep 10, 2018 at 5:38 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi Nicolas,
>
> Have you tried increasing the maximum Java heap size?
>
> https://stackoverflow.com/a/15517399/5613485
>
> Br,
>
> Zoltan
>