Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/11 22:17:48 UTC

[GitHub] [arrow] micomahesh1982 opened a new issue, #13125: parquet conversion failed,Bool column has NA values in column boolean__v

micomahesh1982 opened a new issue, #13125:
URL: https://github.com/apache/arrow/issues/13125

   I'm using the code below. The input data has a boolean column with both null and non-null values, but it fails at the Parquet conversion with "parquet conversion failed,Bool column has NA values in column boolean__v". Kindly let me know what the issue could be.
   
   ```python
   for chunk_number, chunk in enumerate(pd.read_csv(**read_csv_args), 1):
       fields = []
       for col, dtype in sessionSchema.items():
           # nullable=True; yet when a DataFrame that actually has nulls is
           # passed, the schema appears to be ignored
           fields.append(pa.field(col, dtype, nullable=True))
       glue_schema = pa.schema(fields)

       table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
       if chunk_number == 1:
           # Open a Parquet file for writing on the first chunk
           pq_writer = pq.ParquetWriter(targetKey, table.schema, compression='snappy')
       # Write the CSV chunk to the Parquet file
       pq_writer.write_table(table)

   pq_writer.close()
   ```
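   As a point of comparison, here is a minimal sketch (not from the original report; the data is illustrative) showing that reading the boolean column with pandas' nullable `'boolean'` extension dtype (pandas >= 1.0) preserves the nulls through `pa.Table.from_pandas`:

   ```python
   import io

   import pandas as pd
   import pyarrow as pa

   # Illustrative data: a boolean column with one missing value.
   csv_text = "id|Boolean__v\n1|True\n2|False\n3|\n"

   # dtype='boolean' keeps missing values as pd.NA instead of forcing the
   # column to object/float dtype.
   chunk = pd.read_csv(io.StringIO(csv_text), sep="|",
                       dtype={"Boolean__v": "boolean"})

   table = pa.Table.from_pandas(chunk, preserve_index=False)
   print(table.schema.field("Boolean__v").type)   # bool
   print(table.column("Boolean__v").null_count)   # 1
   ```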
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1124586991

   Have you tried using [pyarrow.csv.read_csv](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html#pyarrow.csv.read_csv)?




[GitHub] [arrow] micomahesh1982 commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
micomahesh1982 commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1126257354

   Let me explain a couple of things so you have better visibility and can suggest a solution that fits:
   (1) We have a source CSV file of about 6 GB (compressed/uncompressed). We don't read the whole file into memory with pandas; instead we read it in chunks, pass each chunk to pyarrow to convert it to Parquet, and write to S3 until all chunks are done.
   
   This chunked approach keeps memory consumption under control and avoids high memory usage. However, while writing a chunk into the s3:// folder we get the error below:
   Python Error: <>, exitCode: <139>
   
   Have you come across this scenario? Do you know where it is happening or how to overcome it? Please let me know.
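   For what it's worth, a streaming sketch of that loop using `pyarrow.csv.open_csv` (available in recent pyarrow releases; it may not exist in 5.0.0), which reads the CSV block by block rather than loading it whole. File names are placeholders; writing to s3:// would additionally need a filesystem layer such as s3fs:

   ```python
   import pyarrow as pa
   import pyarrow.csv as csv
   import pyarrow.parquet as pq

   # Tiny stand-in for the 6 GB source file.
   with open("input.csv", "w") as f:
       f.write("int__v|Boolean__v\n1|True\n2|False\n3|\n")

   # open_csv returns a streaming reader that yields RecordBatches one block
   # at a time, so the whole file is never held in memory at once.
   reader = csv.open_csv("input.csv",
                         parse_options=csv.ParseOptions(delimiter="|"))

   writer = None
   for batch in reader:
       if writer is None:
           # Open the Parquet writer lazily, once the schema is known.
           writer = pq.ParquetWriter("output.parquet", batch.schema,
                                     compression="snappy")
       writer.write_table(pa.Table.from_batches([batch]))
   writer.close()
   ```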
   
   




[GitHub] [arrow] micomahesh1982 commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
micomahesh1982 commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1124554021

   Technically:
   
   csv_file = StringIO("""int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
   1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
   2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
   6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
   10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
   17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   4|||||||
   """)
   
   params = { 'filepath_or_buffer': csv_file, 'chunksize': 10, 'encoding': 'UTF-8',  'sep': '|', 'low_memory': True, 'engine': 'python', 'skip_blank_lines': True }
   
   Next, we loop through the chunks of data, convert each to Parquet, and write it to S3.
   Notice that the 4th column, Boolean__v, has a boolean datatype and the last row has NA (i.e. null) values. That's where we get "parquet conversion failed,Bool column has NA values in column boolean__v".
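   As a side note on why this column is tricky: with plain `pd.read_csv`, a True/False column that contains a missing value cannot stay numpy `bool` (which has no NA representation), so pandas falls back to `object` dtype. A small sketch with illustrative data, not the original file:

   ```python
   import io

   import pandas as pd

   s = "Boolean__v|x\nTrue|1\nFalse|2\n|3\n"
   df = pd.read_csv(io.StringIO(s), sep="|")

   # The missing value forces the column to object dtype
   # (Python True/False mixed with NaN).
   print(df["Boolean__v"].dtype)  # object
   ```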
   




[GitHub] [arrow] micomahesh1982 commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
micomahesh1982 commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1124561696

   This is the pyarrow version pinned in our project: pyarrow==5.0.0; python_full_version >= "3.6.2" and python_version < "3.10" and python_version >= "3.6"




[GitHub] [arrow] AlenkaF commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1275898611

   There is not enough information for us to reproduce your issue and help solve it. I will close it as is.




[GitHub] [arrow] AlenkaF commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1124600121

   Have you tried using [pyarrow.csv.read_csv](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html#pyarrow.csv.read_csv) to read an Arrow table from the CSV and then write it to Parquet?
   
   Hope this will help:
   
   ```python
   >>> import io
   >>> import pyarrow.csv as csv
   
   >>> s = """int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
   ... 1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
   ... 2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   ... 3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   ... 4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   ... 5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
   ... 6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   ... 7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   ... 8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   ... 9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
   ... 10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   ... 11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   ... 12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   ... 13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   ... 14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   ... 15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   ... 16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
   ... 17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
   ... 18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
   ... 19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
   ... 4|||||||
   ... """
   >>> source = io.BytesIO(s.encode())
   
   # Read with pyarrow.csv.read_csv
   >>> parse_options = csv.ParseOptions(delimiter="|")
   >>> table = csv.read_csv(source, parse_options=parse_options)
   
   # Write to parquet
   >>> import pyarrow.parquet as pq
   >>> pq.write_table(table, 'example.parquet', compression='snappy')
   
   # Check the result
   >>> pq.read_table('example.parquet')["Boolean__v"]
   <pyarrow.lib.ChunkedArray object at 0x139459450>
   [
     [
       true,
       false,
       true,
       false,
       true,
       ...
       true,
       false,
       true,
       false,
       null
     ]
   ]
   ```
   
   You can also define [pyarrow.csv.ReadOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions) like `block_size` and `encoding`, and [pyarrow.csv.ParseOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions) like `ignore_empty_lines`.
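   For example, a small sketch with made-up data showing both option objects together:

   ```python
   import io

   import pyarrow.csv as csv

   data = b"a|b\n1|x\n\n2|y\n"

   read_options = csv.ReadOptions(block_size=1 << 20, encoding="utf-8")
   parse_options = csv.ParseOptions(delimiter="|", ignore_empty_lines=True)

   table = csv.read_csv(io.BytesIO(data),
                        read_options=read_options,
                        parse_options=parse_options)
   print(table.num_rows)  # 2 (the blank line is skipped)
   ```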




[GitHub] [arrow] AlenkaF commented on issue #13125: parquet conversion failed,Bool column has NA values in column boolean__v

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #13125:
URL: https://github.com/apache/arrow/issues/13125#issuecomment-1124525696

   What version of PyArrow are you using?
   
   I created this minimal reproducible example; can you run it and check whether it works for you?
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   chunk = pd.DataFrame([True, False, None], columns=['col1'])
   field = pa.field("col1", pa.bool_())
   glue_schema = pa.schema([field])
   
   table = pa.Table.from_pandas(chunk, preserve_index=False, schema=glue_schema)
   
   # Open a Parquet file for writing
   pq_writer = pq.ParquetWriter('example.parquet', 
                                schema = glue_schema,
                                compression='snappy')
   
   # Write CSV chunk to the parquet file
   pq_writer.write_table(table)
   pq_writer.close()
   
   # Read the chunk
   pq.read_table('example.parquet').to_pandas()
   ```
   
   It should create this output:
   
   ```python
   >>> pq.read_table('example.parquet').to_pandas()
       col1
   0   True
   1  False
   2   None
   ```

