Posted to github@arrow.apache.org by "ghuls (via GitHub)" <gi...@apache.org> on 2023/02/14 23:20:43 UTC

[GitHub] [arrow-rs] ghuls opened a new issue, #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

ghuls opened a new issue, #3721:
URL: https://github.com/apache/arrow-rs/issues/3721

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Reading a gzip-compressed TSV file with `parquet-fromcsv`.
   
   **Describe the solution you'd like**
   
   Support compressed CSV/TSV files in `parquet-fromcsv`.
   
   
   It would also be nice to have a link to the Parquet schema text format, or some example schemas.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] suxiaogang223 commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1535760621

   @ghuls  Thank you for your feedback. I really didn't think about multi-compressed files🙃. Could you please open a new issue so I can keep tracking this problem and fix it?



[GitHub] [arrow-rs] ghuls commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "ghuls (via GitHub)" <gi...@apache.org>.

ghuls commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1471941769

   That is a pity.



[GitHub] [arrow-rs] suxiaogang223 commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1465212521

   Sadly, I find that this feature cannot be implemented elegantly. I want to stream compressed files, and although the `flate2` library lets me read compressed files like normal files, the arrow-csv reader requires the `Seek` trait:
   ```Rust
   arrow_csv::reader::ReaderBuilder
   pub fn build<R>(self, reader: R) -> Result<Reader<R>, ArrowError>
   where
       R: Read + Seek,
   ```
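
A minimal sketch of the constraint described above, using only the standard library. `ForwardOnly` is a hypothetical stand-in for a streaming decompressor such as a `flate2` decoder, and `needs_seek` is a hypothetical stand-in for any API bounded by `Read + Seek`; neither is part of arrow-csv. Under those assumptions, the only way to satisfy the bound is to buffer the whole decompressed stream in memory first, which gives up streaming.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a streaming decompressor: forward reads only, no `Seek`.
struct ForwardOnly<R: Read> {
    inner: R,
}

impl<R: Read> Read for ForwardOnly<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

/// Hypothetical stand-in for an API bounded by `Read + Seek`.
/// Returns the total length in bytes and the first line of the input.
fn needs_seek<R: Read + Seek>(mut reader: R) -> io::Result<(u64, String)> {
    let len = reader.seek(SeekFrom::End(0))?;
    reader.seek(SeekFrom::Start(0))?;
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    let first = text.lines().next().unwrap_or("").to_string();
    Ok((len, first))
}

/// Workaround: drain the forward-only stream into memory so it becomes seekable.
/// Memory use grows with the full decompressed size.
fn buffer_fully<R: Read>(mut reader: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut data = Vec::new();
    reader.read_to_end(&mut data)?;
    Ok(Cursor::new(data))
}

fn main() -> io::Result<()> {
    let stream = ForwardOnly { inner: &b"a\tb\n1\t2\n"[..] };
    // `needs_seek(stream)` would not compile: `ForwardOnly` does not implement `Seek`.
    let (len, first) = needs_seek(buffer_fully(stream)?)?;
    println!("{len} bytes, first line: {first:?}");
    Ok(())
}
```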



[GitHub] [arrow-rs] suxiaogang223 commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1431595650

   I would like to try to fix this👀



[GitHub] [arrow-rs] ghuls commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "ghuls (via GitHub)" <gi...@apache.org>.

ghuls commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1534849720

   @suxiaogang223 Thanks for the effort. It still fails in some cases.
   Most likely this happens when the gzipped file contains multiple gzip chunks (especially if a decompressed chunk ends in the middle of a line), as is the case for files compressed with bgzip (a tool from HTSlib that is widely used in bioinformatics).
   
   For example, the following fails for me:
   ```bash
   # Create a compressed TSV file with bgzip (multiple gzip chunks)
   ❯  yes $(printf 'chr1\t123456\t123465\tABCDEFGHIJKLMNOPQRSTUVX\t1\n') | tr ' ' '\t' |  head -n 4000 | dd if=/dev/stdin of=/dev/stdout bs=4k | bgzip > test.tsv.bgzip.gz
   43+1 records in
   43+1 records out
   180000 bytes (180 kB, 176 KiB) copied, 0.0034506 s, 52.2 MB/s
   
   $ cat test_parquet.schema
   message root {
     OPTIONAL BYTE_ARRAY Chromosome (STRING);
     OPTIONAL INT64 Start;
     OPTIONAL INT64 End;
     OPTIONAL BYTE_ARRAY Name (STRING);
     OPTIONAL INT64 Count;
   }
   
   
   parquet-fromcsv \
       --schema test_parquet.schema \
       --delimiter $'\t' \
       --csv-compression gzip \
       --input-file test.tsv.bgzip.gz \
       --output-file test.parquet
   
   Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 1451, expected 5 got 4")))
   ```
   
   The same failure, reproduced with plain gzip, where each 4096-byte block is compressed as a separate gzip chunk:
   
   ```bash
   $ rm gzip_4096_chunked.tsv.gz; final_file_size=$(yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' |  head -n 1000 | wc -c);  for i in $(seq 1 4096 "${final_file_size}") ; do echo $i; yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' |  head -n 1000 | tail -c +${i} | head -c 4096 | gzip >> gzip_4096_chunked.tsv.gz; done
   1
   4097
   8193
   12289
   16385
   20481
   24577
   28673
   32769
   36865
   40961
   
   # Each line contains the same content:
   $ zcat gzip_4096_chunked.tsv.gz | uniq -c
      1000 chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   
   
   $ parquet-fromcsv --schema /staging/leuven/stg_00002/lcb/ghuls/fragments_parquet.schema --delimiter $'\t' --csv-compression gzip --input-file gzip_4096_chunked.tsv.gz --output-file gzip_4096_chunked.parquet
   Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 92, expected 5 got 1")))
   
   # Last 10 lines of the first chunk:
   ❯  yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' |  head -n 1000 | tail -c +1 | head -c 4096 | tail
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   c
   
   # First 10 lines of second chunk:
   $  yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' |  head -n 1000 | tail -c +4097 | head -c 4096 | head
   hr1     123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
   
   ```
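
The reported line numbers and field counts are consistent with the reader stopping at the end of the first gzip chunk. As a rough check, assuming every line is the 45-byte line used in the commands above and that only the first chunk is decoded, the sketch below computes which line gets cut off and how many fields are left in it. `partial_line_at_chunk_end` is a hypothetical helper, and the 65280-byte chunk size used for the bgzip case is an assumption about bgzip's block size, not something stated in this thread.

```rust
/// For input made of one repeated ASCII line (including its trailing newline),
/// returns the 1-based number of the line that is cut off when only the first
/// `chunk_len` bytes are read, and how many delimited fields the fragment has.
/// Returns `None` when the chunk ends exactly on a line boundary.
fn partial_line_at_chunk_end(line: &str, chunk_len: usize, delimiter: char) -> Option<(usize, usize)> {
    let leftover = chunk_len % line.len();
    if leftover == 0 {
        return None;
    }
    let line_no = chunk_len / line.len() + 1;
    // Byte slicing is safe here because the line is assumed to be ASCII.
    let fragment = &line[..leftover];
    Some((line_no, fragment.split(delimiter).count()))
}

fn main() {
    let line = "chr1\t123456\t123465\tABCDEFGHIJKLMNOPQRSTUVX\t1\n";
    assert_eq!(line.len(), 45);

    // Plain gzip case: each chunk holds 4096 bytes of input.
    println!("{:?}", partial_line_at_chunk_end(line, 4096, '\t')); // Some((92, 1))

    // bgzip case, assuming 65280 bytes of input per block.
    println!("{:?}", partial_line_at_chunk_end(line, 65280, '\t')); // Some((1451, 4))
}
```

Both results match the errors above (line 92 with 1 field, line 1451 with 4 fields), which suggests that decoding stopped after the first chunk rather than continuing through the remaining ones.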
   



[GitHub] [arrow-rs] tustvold closed issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold closed issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.
URL: https://github.com/apache/arrow-rs/issues/3721



[GitHub] [arrow-rs] tustvold commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1552715474

   `label_issue.py` automatically added labels {'parquet'} from #4160



[GitHub] [arrow-rs] tustvold commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1522133686

   [`build_buffered`](https://docs.rs/arrow-csv/latest/arrow_csv/reader/struct.ReaderBuilder.html#method.build_buffered) added in #3368 removes the `Seek` requirement
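
A minimal sketch of why a `BufRead` bound is enough for streaming input, using only the standard library. `ForwardOnly` is a hypothetical stand-in for a streaming decompressor and `field_counts` is a hypothetical consumer; the code does not call arrow-csv. The assumption is that, in the same way, a decoder wrapped in `std::io::BufReader` could be handed to `build_buffered`.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical stand-in for a streaming decompressor: forward reads only, no `Seek`.
struct ForwardOnly<R: Read> {
    inner: R,
}

impl<R: Read> Read for ForwardOnly<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

/// Counts the delimiter-separated fields on each line.
/// Only `BufRead` is required, so the input is consumed front to back.
fn field_counts<R: BufRead>(reader: R, delimiter: char) -> io::Result<Vec<usize>> {
    reader
        .lines()
        .map(|line| line.map(|l| l.split(delimiter).count()))
        .collect()
}

fn main() -> io::Result<()> {
    let stream = ForwardOnly { inner: &b"chr1\t1\t2\nchr2\t3\t4\n"[..] };
    // `BufReader` adds buffering on top of any `Read`, which yields `BufRead`.
    let counts = field_counts(BufReader::new(stream), '\t')?;
    println!("{counts:?}"); // [3, 3]
    Ok(())
}
```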



[GitHub] [arrow-rs] suxiaogang223 commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1522643339

   great



[GitHub] [arrow-rs] suxiaogang223 commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1526833275

   I will fix this now🤓



[GitHub] [arrow-rs] tustvold commented on issue #3721: Support compressed CSV/TSV files in `parquet-fromcsv`.

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3721:
URL: https://github.com/apache/arrow-rs/issues/3721#issuecomment-1526085603

   https://github.com/apache/arrow-rs/issues/4130 has removed the Seek requirement from all the CSV reader APIs, so it should now be possible to support compressed CSV in addition to stdin input within parquet-fromcsv 

