Posted to users@nifi.apache.org by Henrique Nascimento <He...@la.logicalis.com> on 2020/05/15 19:49:13 UTC

PutParquet generating invalid files - Can not read value at 0 in block -1 in file - Encoding DELTA_BINARY_PACKED is only supported for type INT32

Hi all,

I'm having trouble in a production environment with the PutParquet processor. When a flow file contains only the header plus 1-3 records, PutParquet succeeds and the file is written to HDFS, but the file is invalid. When the flow file contains many records, PutParquet also succeeds and the generated files are readable.
I tried to open the invalid Parquet files with parquet-tools, Hive and PySpark, and all of them fail with the same error: "Can not read value at 0 in block -1 in file".
Hive also shows this error in its log file: Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BINARY_PACKED is only supported for type INT32.
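
One way to tell a truncated or incomplete file apart from a file the reader simply cannot decode is to check the Parquet container layout directly: a Parquet file begins with the 4-byte magic "PAR1" and ends with a 4-byte little-endian footer length followed by "PAR1" again. The sketch below uses only the Python standard library and assumes the suspect file has been copied from HDFS to a local path; the function name check_parquet_container is hypothetical and not part of any tool mentioned in this thread.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker found at both ends of a Parquet file


def check_parquet_container(path):
    """Return (ok, message) for a basic structural check of a Parquet file."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)  # seek to the end to learn the file size
        size = f.tell()
        if size < 12:  # magic + footer length + magic is the bare minimum
            return False, "file too small to be Parquet"
        f.seek(size - 8)
        tail = f.read(8)  # 4-byte footer length, then trailing magic
    if head != MAGIC or tail[4:] != MAGIC:
        return False, "missing PAR1 magic bytes"
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len + 12 > size:
        return False, "footer length exceeds file size"
    return True, "container looks intact (footer is %d bytes)" % footer_len
```

If this check passes on an unreadable file, the file was probably closed properly, which would point toward an encoding that the reader does not support rather than a truncated write. It does not validate the column data itself.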

To reproduce the problem, I used a GetFile processor followed by PutParquet writing to HDFS, on NiFi version 1.11.4.

Here is an example of the content of a file that is written but invalid (I changed some characters):

timestamp,ggsn,apn,msisdn,statustype,ip,sessionid,duration
1589236199000,186.4.75.1,webapn.company.com,44895956521,Start,177945774,979cdf6b021ed038,-1,

And an example of a success case:

timestamp,ggsn,apn,msisdn,statustype,ip,sessionid,duration
1589569200000,186.6.64.1,webapn.company.com,12395856026,Start,176224166,989dhe2808a0e10c,-1,
1589569200000,186.6.96.1,webapn.movistar.com.uy,12393446203,Stop,177119485,989dhe6904515cf7,3712000,
1589569200000,186.6.0.3,webapn.movistar.com.uy,12394359006,Stop,-1407442482,989dhe0f010282f1,7092000,
1589569200000,186.6.96.1,webapn.movistar.com.uy,12394427751,Start,177550761,989dhe6904dd35df,-1,
1589569200000,186.6.64.1,webapn.movistar.com.uy,12393309416,Start,176616344,989dhe2703f93f8a,-1,
1589569200000,186.6.0.3,webapn.movistar.com.uy,12394355488,Start,176177290,989dhe10505a9af1,-1,
1589569200000,186.6.64.1,webapn.movistar.com.uy,12395478656,Start,176688933,989dhe2703f93f8b,-1,
1589569200000,186.6.96.1,webapn.movistar.com.uy,12395214244,Start,172288204,989dhe6900c48aa7,-1,
1589569200000,186.6.64.1,webapn.movistar.com.uy,12393418526,Stop,176335286,989dhe27081d0fa1,50000,
1589569200000,186.6.96.1,webapn.movistar.com.uy,12394828264,Start,177952229,989dhe6900c48aa8,-1,
1589569200000,152.146.0.1,webapn.movistar.com.uy,12394416031,Stop,-1405606344,989dhe49ccja1399,58000,
1589569200000,186.6.96.1,webapn.movistar.com.uy,12394589217,Start,177743029,989dhe6a04ee2123,-1,
1589569200000,152.146.0.1,webapn.movistar.com.uy,12394859666,Start,-1407233995,989dhe4916be3ee9,-1,
1589569200000,152.146.0.1,webapn.movistar.com.uy,12393735602,Stop,-1407845029,c83b809dde72f30a,402000,

My PutParquet is configured to write files UNCOMPRESSED, with writer version PARQUET_2_0 and TRUE for the Avro settings. It also uses a CSVReader as the record reader, with this schema:
{
"namespace": "nifi",
"name": "logs_radius",
"type": "record",
"fields": [
  { "name": "timestamp", "type": "long" },
  { "name": "ggsn", "type": "string" },
  { "name": "apn", "type": "string" },
  { "name": "msisdn", "type": "string" },
  { "name": "statustype", "type": "string" },
  { "name": "ip", "type": "int" },
  { "name": "sessionid", "type": "string" },
  { "name": "duration", "type": "long" }
]
}
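
To rule out the input data itself, the sample rows can be checked against this schema outside NiFi. The sketch below is a standalone approximation using only the Python standard library; it does not reproduce how CSVReader actually parses or coerces values, and the names RANGES and check_rows are hypothetical. It verifies the header, the field count per row, and that "int" and "long" values fit in 32 and 64 bits respectively.

```python
import csv
import io
import json

SCHEMA = json.loads("""
{
  "namespace": "nifi", "name": "logs_radius", "type": "record",
  "fields": [
    { "name": "timestamp", "type": "long" },
    { "name": "ggsn", "type": "string" },
    { "name": "apn", "type": "string" },
    { "name": "msisdn", "type": "string" },
    { "name": "statustype", "type": "string" },
    { "name": "ip", "type": "int" },
    { "name": "sessionid", "type": "string" },
    { "name": "duration", "type": "long" }
  ]
}
""")

# Value ranges of the Avro numeric types used in the schema.
RANGES = {"int": (-2**31, 2**31 - 1), "long": (-2**63, 2**63 - 1)}


def check_rows(text, schema):
    """Return a list of (line_number, problem) tuples for CSV text."""
    fields = schema["fields"]
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != [f["name"] for f in fields]:
        problems.append((1, "header does not match schema field names"))
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(fields):
            problems.append(
                (lineno, "expected %d fields, got %d" % (len(fields), len(row))))
        # zip stops at the shorter sequence, so extra fields are not type-checked
        for field, value in zip(fields, row):
            if field["type"] in RANGES:
                low, high = RANGES[field["type"]]
                try:
                    ok = low <= int(value) <= high
                except ValueError:
                    ok = False
                if not ok:
                    problems.append(
                        (lineno, "%s=%r is not a valid %s"
                         % (field["name"], value, field["type"])))
    return problems


SAMPLE = (
    "timestamp,ggsn,apn,msisdn,statustype,ip,sessionid,duration\n"
    "1589236199000,186.4.75.1,webapn.company.com,44895956521,Start,"
    "177945774,979cdf6b021ed038,-1,\n"
)
print(check_rows(SAMPLE, SCHEMA))
```

On the failing sample above, all numeric values are within range; the only thing reported is that the row has 9 fields instead of 8, because each data line ends with a trailing comma. The successful sample has the same trailing comma, so this alone may not explain the invalid files.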

My Hadoop cluster is a standard CDH 5.16.1 installation with Hive 1.1.0-cdh5.16.1.

Where is my mistake? Or should I open a Jira?

Thank you for your time.


Henrique Nascimento
Senior Software Analyst / Data Intelligence Business Unit
phone: +55 (19) 3797 6531

Av. Cambacica, 520
1º andar - Prédio 07 - 13097-160
Campinas, São Paulo, Brasil
www.logicalis.com


Re: PutParquet generating invalid files - Can not read value at 0 in block -1 in file - Encoding DELTA_BINARY_PACKED is only supported for type INT32

Posted by Henrique Nascimento <he...@la.logicalis.com>.
Hi all,

I opened a Jira:

https://issues.apache.org/jira/browse/NIFI-7495

Regards,

Henrique

