Posted to dev@parquet.apache.org by Andrey Yatsuk <ya...@megaputer.ru> on 2018/02/26 10:51:53 UTC

Partial text read from Parquet

I have a dataset with big strings (each record is about 15 MB) in Parquet.
When I try to open all the Parquet parts I get an OutOfMemoryError.
How can I get only the header (the first 100 characters) of each string record without reading the whole record?

Code (Java):
  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import org.apache.avro.Schema;
  import org.apache.avro.SchemaBuilder;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetReader;
  import org.apache.parquet.avro.AvroReadSupport;
  import org.apache.parquet.hadoop.ParquetReader;

  // Project down to just the two columns I need.
  Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();
  Configuration conf = new Configuration();
  AvroReadSupport.setRequestedProjection(conf, avroProj);
  ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();
  GenericRecord record = parquetReader.read(); // the record already holds the full text in RAM
  Long idx = (Long) record.get("idx");
  ByteBuffer rawText = (ByteBuffer) record.get("text");
  String text = new String(rawText.array(), StandardCharsets.UTF_8);
  String header = text.substring(0, Math.min(200, text.length())); // avoid out-of-bounds on short records
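As far as I know, Parquet decodes each value as a whole, so the full 15 MB payload still lands in memory once the record is read. What can be avoided is the second full-size copy made by `new String(rawText.array())`. Below is a minimal sketch (the class and method names `HeaderExtract.header` are my own, not from any library) that decodes only the leading bytes of the ByteBuffer:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class HeaderExtract {
    // Decode at most maxBytes from the buffer's current position,
    // without copying the full payload into a second String.
    // Assumes UTF-8 text; a multi-byte character may be cut at the boundary.
    static String header(ByteBuffer rawText, int maxBytes) {
        ByteBuffer slice = rawText.duplicate();          // independent position/limit
        int len = Math.min(slice.remaining(), maxBytes); // don't overrun short records
        slice.limit(slice.position() + len);
        byte[] head = new byte[len];
        slice.get(head);
        return new String(head, StandardCharsets.UTF_8);
    }
}
```

Using `duplicate()` also respects the buffer's position and limit, which `array()` ignores, so it stays correct even if the ByteBuffer is a slice of a larger backing array.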