Posted to github@arrow.apache.org by "ghuls (via GitHub)" <gi...@apache.org> on 2023/05/04 14:08:10 UTC
[GitHub] [arrow-rs] ghuls commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.
ghuls commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1534849720
@suxiaogang223 Thanks for the effort. There is still an issue in some cases.
Most likely it occurs when the gzipped file contains multiple gzip members, especially when the decompressed members end in the middle of a line. This is what happens when compressing files with bgzip (a tool from HTSlib that is used a lot in bioinformatics).
For example, the following fails for me:
```bash
# Create a compressed TSV file with bgzip (multiple gzip chunks)
$ yes $(printf 'chr1\t123456\t123465\tABCDEFGHIJKLMNOPQRSTUVX\t1\n') | tr ' ' '\t' | head -n 4000 | dd if=/dev/stdin of=/dev/stdout bs=4k | bgzip > test.tsv.bgzip.gz
43+1 records in
43+1 records out
180000 bytes (180 kB, 176 KiB) copied, 0.0034506 s, 52.2 MB/s
$ cat test_parquet.schema
message root {
OPTIONAL BYTE_ARRAY Chromosome (STRING);
OPTIONAL INT64 Start;
OPTIONAL INT64 End;
OPTIONAL BYTE_ARRAY Name (STRING);
OPTIONAL INT64 Count;
}
$ parquet-fromcsv \
--schema test_parquet.schema \
--delimiter $'\t' \
--csv-compression gzip \
--input-file test.tsv.bgzip.gz \
--output-file test.parquet
Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 1451, expected 5 got 4")))
```
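The failure mode can be sketched with Python's stdlib (arrow-rs is Rust, but the gzip member semantics are the same): a gzip file may contain several concatenated members, and a decoder that stops after the first member truncates the data, typically mid-line, which matches the "incorrect number of fields" error above. This is an illustrative sketch, not the arrow-rs code path; `zlib.decompressobj` stands in for a single-member decoder.

```python
import gzip
import zlib

# Three TSV-like lines, split mid-line into two chunks that are
# compressed as two independent gzip members (like bgzip does).
data = b"chr1\t123\tABC\n" * 3
chunk1, chunk2 = data[:20], data[20:]
stream = gzip.compress(chunk1) + gzip.compress(chunk2)

# A multi-member-aware reader recovers everything:
assert gzip.decompress(stream) == data

# A reader that stops after the first gzip member truncates mid-line;
# the rest of the stream ends up in unused_data instead of the output.
d = zlib.decompressobj(wbits=31)  # wbits=31: gzip wrapper, single member
first = d.decompress(stream)
assert first == chunk1                 # only the first member was decoded
assert not first.endswith(b"\n")       # cut in the middle of a line
assert d.unused_data == gzip.compress(chunk2)  # second member left unread
```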
The same failure, reproduced with plain gzip by compressing each 4096-byte block as a separate gzip member:
```bash
$ rm -f gzip_4096_chunked.tsv.gz; final_file_size=$(yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | wc -c); for i in $(seq 1 4096 "${final_file_size}") ; do echo $i; yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +${i} | head -c 4096 | gzip >> gzip_4096_chunked.tsv.gz; done
1
4097
8193
12289
16385
20481
24577
28673
32769
36865
40961
# Each line contains the same content:
$ zcat gzip_4096_chunked.tsv.gz | uniq -c
1000 chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
$ parquet-fromcsv --schema /staging/leuven/stg_00002/lcb/ghuls/fragments_parquet.schema --delimiter $'\t' --csv-compression gzip --input-file gzip_4096_chunked.tsv.gz --output-file gzip_4096_chunked.parquet
Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 92, expected 5 got 1")))
# Last 10 lines of the first chunk:
$ yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +1 | head -c 4096 | tail
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
c
# First 10 lines of second chunk:
$ yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +4097 | head -c 4096 | head
hr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
```
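To confirm the file itself is valid, here is a sketch (Python stdlib, as a stand-in for the Rust code path) that builds the same kind of stream as gzip_4096_chunked.tsv.gz above and reads it back with a multi-member-aware reader: every line comes back intact even though most member boundaries fall mid-line. My assumption is that the fix on the arrow-rs side is using a multi-member gzip decoder (in flate2 terms, `MultiGzDecoder` rather than `GzDecoder`, which stops after the first member).

```python
import gzip
import io

# Build the same kind of stream as gzip_4096_chunked.tsv.gz:
# 1000 identical TSV lines, cut into 4096-byte slices, each slice
# compressed as an independent gzip member and concatenated.
line = "chr1\t123456\t123465\tABCDEFGHIJKLMNOPQRSTUVX\t1\n".encode()
data = line * 1000
stream = b"".join(
    gzip.compress(data[i:i + 4096]) for i in range(0, len(data), 4096)
)

# A multi-member-aware reader (gzip.GzipFile) recovers all 1000 lines,
# even though the member boundaries fall in the middle of lines.
with gzip.GzipFile(fileobj=io.BytesIO(stream)) as fh:
    lines = fh.read().splitlines(keepends=True)

assert len(lines) == 1000
assert all(l == line for l in lines)
```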
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org