Posted to dev@beam.apache.org by Shirish Jamthe <sj...@google.com> on 2018/03/15 23:38:36 UTC

Looking for I/O transform to untar a tar.gz

Hi,

My input is a tar.gz or .zip file which contains thousands of tar.gz files
and other files.
I would like to extract the tar.gz files from the archive.

Is there a transform that can do that? I couldn't find one.
If not, is it in the works? Any pointers to start work on it?

thanks

Re: Looking for I/O transform to untar a tar.gz

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Gzip is supported by TextIO. However, you are right: tar is not yet supported. It would be similar in terms of dealing with entries.

Could you please create a Jira about that?

Thanks
Regards
JB

On 16 March 2018 at 14:50, Chamikara Jayalath <ch...@google.com> wrote:
>FWIW, if you have a concat gzip file [1] TextIO and other file-based
>sources should be able to read that. But we don't support tar files. Is it
>possible to perform tar extraction before running the pipeline? This step
>probably cannot be parallelized, so there is not much value in performing
>it within the pipeline anyway (other than easy access to various
>file-systems).
>
>- Cham
>
>[1]
>https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>
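[Editor's note: Cham's concatenated-gzip point can be demonstrated with the JDK alone, since java.util.zip.GZIPInputStream transparently reads all members of a multi-member gzip stream. A self-contained sketch (the strings and class name here are illustrative, not from the thread):]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatGzipDemo {
    // Compress a string into its own complete gzip member.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress a (possibly multi-member) gzip byte stream.
    static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n =, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Two independent gzip members, concatenated byte-for-byte.
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        concat.write(gzip("hello\n"));
        concat.write(gzip("world\n"));
        // A conforming reader decompresses both members in sequence.
        System.out.println(gunzip(concat.toByteArray()));
    }
}
```

[This is why `cat a.gz b.gz > c.gz` produces a file that gzip-aware readers can still consume as one logical stream, whereas a tar inside a gzip needs genuine entry handling.]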
>On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>achuthan.sajeevan@gmail.com> wrote:
>
>> Eugene - Yes, you are correct. I tried with a text file & the Beam
>> wordcount example. The TextIO reader reads some illegal characters, as
>> seen below.
>>
>>
>> here’s: 1
>> addiction: 1
>> new: 1
>> we: 1
>> mood: 1
>> an: 1
>> incredible: 1
>> swings,: 1
>> known: 1
>> choices.: 1
>>
>^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>> 1
>> already: 2
>> today: 1
>> the: 3
>> generation: 1
>> wordcount-00002
>>
>>
>> thanks
>> Saj
>>
>>
>> On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com>
>wrote:
>>
>>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>>> useful. I'm just saying that currently we don't.
>>>
>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov
><ki...@google.com>
>>> wrote:
>>>
>>>> The code behaves as I expected, and the output is corrupt.
>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>>>> and split the .tar file by \n.
>>>> E.g. the first file of the output starts with lines:
>>>>
>A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>>> 5ustar
>>>>
>eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>>> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>>
>>>> which are clearly not the expected input.
>>>>
>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>
>>>>> Eugene, I ran the code and it works fine. I am very confident in this
>>>>> case. I appreciate you guys for the great work.
>>>>>
>>>>> The code is supposed to show that Beam TextIO can read double
>>>>> compressed files and write output without any processing, so I ignored
>>>>> the processing steps. I agree with you that further processing is not
>>>>> easy in this case.
>>>>>
>>>>>
>>>>> import org.apache.beam.sdk.Pipeline;
>>>>> import org.apache.beam.sdk.io.TextIO;
>>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>>
>>>>> public class ReadCompressedTextFile {
>>>>>
>>>>> public static void main(String[] args) {
>>>>> PipelineOptions options =
>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>     Pipeline p = Pipeline.create(options);
>>>>>
>>>>>     p.apply("ReadLines",
>>>>>     TextIO.read().from("./dataset.tar.gz")
>>>>>
>>>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>>>     @ProcessElement
>>>>>     public void processElement(ProcessContext c) {
>>>>>     c.output(c.element());
>>>>>     // Just write all the content to "/tmp/filout/outputfile"
>>>>>     }
>>>>>
>>>>>     }))
>>>>>
>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>
>>>>>     p.run().waitUntilFinish();
>>>>> }
>>>>>
>>>>> }
>>>>>
>>>>> The full code, data file & output contents are attached.
>>>>>
>>>>> thanks
>>>>> Saj
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
>>>>>> properly handle .tar. Did you run this code? Did your test .tar.gz
>>>>>> file contain multiple files? Did you obtain the expected output,
>>>>>> identical to the input except for the order of lines?
>>>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>>>> input - so it can be removed)
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> TextIO can handle tar.gz-type double compressed files. See the test
>>>>>>> code below.
>>>>>>>
>>>>>>>  PipelineOptions options =
>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>
>>>>>>>     p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>>>>>      .apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>     @ProcessElement
>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>     c.output(c.element());
>>>>>>>     }
>>>>>>>
>>>>>>>     }))
>>>>>>>
>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>
>>>>>>>     p.run().waitUntilFinish();
>>>>>>>
>>>>>>> Thanks
>>>>>>> /Saj
>>>>>>>
>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com>
>wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>> Quick questions:
>>>>>>>> - which sdk are you using?
>>>>>>>> - is this batch or streaming?
>>>>>>>>
>>>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>>>> contain text. Nothing currently handles the double decompression
>>>>>>>> that I believe you're looking for.
>>>>>>>> TextIO for Java is also able to "watch" a directory for new files.
>>>>>>>> If you're able to (outside of your pipeline) decompress your first
>>>>>>>> zip file into a directory that your pipeline is watching, you may
>>>>>>>> be able to use that as a workaround. Does that sound like a good
>>>>>>>> option?
>>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>>> logic, well then that sounds like SplittableDoFn material; and in
>>>>>>>> that case, someone that knows SDF better can give you guidance (or
>>>>>>>> clarify if my suggestions are not correct).
>>>>>>>> Best
>>>>>>>> -P.
>>>>>>>>
>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré
><jb...@nanthrax.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> TextIO supports compressed files. Do you want to read the files
>>>>>>>>> as text?
>>>>>>>>>
>>>>>>>>> Can you detail a bit the use case ?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sj...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>>> tar.gz files and other files.
>>>>>>>>>> I would like to extract the tar.gz files from the archive.
>>>>>>>>>>
>>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>
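[Editor's note: the corrupt output Eugene pasted above is recognizable as raw tar metadata: each entry in a ustar archive is preceded by a 512-byte header carrying the entry name (bytes 0-99), the size as a NUL/space-terminated octal string (bytes 124-135), a type flag at byte 156, and the "ustar" magic at byte 257 — which is exactly the "ustar" visible in the dump. Splitting that byte stream on \n mixes headers into the text. A minimal JDK-only sketch of parsing one header (illustration only, not a Beam API):]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TarHeader {
    // Extract a NUL- or space-terminated ASCII field from a 512-byte header.
    static String field(byte[] header, int off, int len) {
        int end = off;
        while (end < off + len && header[end] != 0 && header[end] != ' ') {
            end++;
        }
        return new String(Arrays.copyOfRange(header, off, end), StandardCharsets.US_ASCII);
    }

    static String entryName(byte[] header) {
        return field(header, 0, 100);                      // name: bytes 0..99
    }

    static long entrySize(byte[] header) {
        return Long.parseLong(field(header, 124, 12), 8);  // size: octal, bytes 124..135
    }

    public static void main(String[] args) {
        // Build a synthetic header for a 5-byte file named "hello.txt".
        byte[] h = new byte[512];
        byte[] name = "hello.txt".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, h, 0, name.length);
        byte[] size = "00000000005".getBytes(StandardCharsets.US_ASCII); // octal 5
        System.arraycopy(size, 0, h, 124, size.length);
        System.out.println(entryName(h) + " " + entrySize(h)); // prints: hello.txt 5
    }
}
```

[Walking these headers, rather than scanning for newlines, is the "dealing with entries" that native .tar support would require.]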

Re: Looking for I/O transform to untar a tar.gz

Posted by Sajeevan Achuthan <ac...@gmail.com>.
Thanks Cham

On 16 March 2018 at 23:28, Chamikara Jayalath <ch...@google.com> wrote:

> Actually, I could assign it to you.
>
> On Fri, Mar 16, 2018 at 4:27 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Of course. Feel free to add a comment to JIRA and send out a pull request
>> for this.
>> Can one of the JIRA admins assign this to Sajeevan ?
>>
>> Thanks,
>> Cham
>>
>> On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <
>> achuthan.sajeevan@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> Can I take a look at this issue? If you agree, my Jira id is eachsaj
>>>
>>> thanks
>>> Saj
>>>
>>>
>>>
>>> On 16 March 2018 at 22:13, Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> Created https://issues.apache.org/jira/browse/BEAM-3867.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov <ki...@google.com>
>>>> wrote:
>>>>
>>>>> Reading cannot be parallelized, but processing can be - so there is
>>>>> value in having our file-based sources automatically decompress .tar and
>>>>> .tar.gz.
>>>>> (also, I suspect that many people use Beam even for cases with a
>>>>> modest amount of data, that don't have or need parallelism, just for the
>>>>> sake of convenience of Beam's APIs and IOs)
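[Editor's note: until native tar support (BEAM-3867) lands, the workaround suggested in the thread is to untar before the pipeline runs and point TextIO at the extracted directory. A JDK-only extractor sketch for regular-file entries; a real implementation would more likely use Apache Commons Compress's TarArchiveInputStream, and the checksum field is deliberately not validated here:]

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class UntarGz {
    // NUL- or space-terminated ASCII field from a 512-byte tar header.
    static String field(byte[] h, int off, int len) {
        int end = off;
        while (end < off + len && h[end] != 0 && h[end] != ' ') end++;
        return new String(h, off, end - off, StandardCharsets.US_ASCII);
    }

    static boolean allZero(byte[] h) {
        for (byte b : h) if (b != 0) return false;
        return true;
    }

    static void readFully(InputStream in, byte[] buf) throws IOException {
        int pos = 0;
        while (pos < buf.length) {
            int n =, pos, buf.length - pos);
            if (n < 0) throw new EOFException("truncated archive");
            pos += n;
        }
    }

    // Extract every regular-file entry of a .tar.gz stream into destDir.
    public static void untar(InputStream rawTarGz, Path destDir) throws IOException {
        InputStream in = new GZIPInputStream(rawTarGz);
        byte[] header = new byte[512];
        while (true) {
            readFully(in, header);
            if (allZero(header)) break;                     // end-of-archive marker
            String name = field(header, 0, 100);
            long size = Long.parseLong(field(header, 124, 12), 8);
            byte typeFlag = header[156];                    // '0' or NUL = regular file
            byte[] data = new byte[(int) size];
            if (size > 0) readFully(in, data);
            int pad = (int) ((512 - size % 512) % 512);     // entries padded to 512 bytes
            if (pad > 0) readFully(in, new byte[pad]);
            if (typeFlag == '0' || typeFlag == 0) {
                Path out = destDir.resolve(name);
                Files.createDirectories(out.getParent() == null ? destDir : out.getParent());
                Files.write(out, data);
            }
        }
    }
}
```

[Running this as a pre-processing step, then reading the output directory with TextIO (optionally with directory watching), gives parallel processing of the extracted files even though the extraction itself is sequential.]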

Re: Looking for I/O transform to untar a tar.gz

Posted by Chamikara Jayalath <ch...@google.com>.
Actually, I could assign it to you.


Re: Looking for I/O transform to untar a tar.gz

Posted by Chamikara Jayalath <ch...@google.com>.
Of course. Feel free to add a comment to JIRA and send out a pull request
for this.
Can one of the JIRA admins assign this to Sajeevan ?

Thanks,
Cham

On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <
achuthan.sajeevan@gmail.com> wrote:

> Hi Guys,
>
> Can I take a look at this issue? If you agree, my Jira id is eachsaj
>
> thanks
> Saj
>
>
>
> On 16 March 2018 at 22:13, Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Created https://issues.apache.org/jira/browse/BEAM-3867.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov <ki...@google.com>
>> wrote:
>>
>>> Reading can not be parallelized, but processing can be - so there is
>>> value in having our file-based sources automatically decompress .tar and
>>> .tar.gz.
>>> (also, I suspect that many people use Beam even for cases with a modest
>>> amount of data, that don't have or need parallelism, just for the sake of
>>> convenience of Beam's APIs and IOs)
>>>
>>> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> FWIW, if you have a concat gzip file [1] TextIO and other file-based
>>>> sources should be able to read that. But we don't support tar files. Is it
>>>> possible to perform tar extraction before running the pipeline ? This step
>>>> probably cannot be parallelized. So not much value in performing within the
>>>> pipeline anyways (other than easy access to various file-systems).
>>>>
>>>> - Cham
>>>>
>>>> [1]
>>>> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>>>>
>>>>
>>>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>
>>>>> Eugene - Yes, you are correct. I tried with a text file &  Beam
>>>>> wordcount example. The TextIO reader reads some illegal characters as seen
>>>>> below.
>>>>>
>>>>>
>>>>> here’s: 1
>>>>> addiction: 1
>>>>> new: 1
>>>>> we: 1
>>>>> mood: 1
>>>>> an: 1
>>>>> incredible: 1
>>>>> swings,: 1
>>>>> known: 1
>>>>> choices.: 1
>>>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>>>>> 1
>>>>> already: 2
>>>>> today: 1
>>>>> the: 3
>>>>> generation: 1
>>>>> wordcount-00002
>>>>>
>>>>>
>>>>> thanks
>>>>> Saj
>>>>>
>>>>>
>>>>> On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com>
>>>>> wrote:
>>>>>
>>>>>> To clarify: I think natively supporting .tar and .tar.gz would be
>>>>>> quite useful. I'm just saying that currently we don't.
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <
>>>>>> kirpichov@google.com> wrote:
>>>>>>
>>>>>>> The code behaves as I expected, and the output is corrupt.
>>>>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>>>>>>> and split the .tar file by \n.
>>>>>>> E.g. the first file of the output starts with lines:
>>>>>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>>>>>> 5ustar
>>>>>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>>>>>> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>>>>>
>>>>>>> which are clearly not the expected input.
>>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>>
>>>>>>>> Eugene, I ran the code and it works fine.  I am very confident in
>>>>>>>> this case. I appreciate you guys for the great work.
>>>>>>>>
>>>>>>>> The code is supposed to show that Beam TextIO can read the double
>>>>>>>> compressed files and write output without any processing, so I ignored the
>>>>>>>> processing steps. I agree with you that further processing is not easy in
>>>>>>>> this case.
>>>>>>>>
>>>>>>>>
>>>>>>>> import org.apache.beam.sdk.Pipeline;
>>>>>>>> import org.apache.beam.sdk.io.TextIO;
>>>>>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>>>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>>>>>
>>>>>>>> public class ReadCompressedTextFile {
>>>>>>>>
>>>>>>>> public static void main(String[] args) {
>>>>>>>> PipelineOptions options =
>>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>>
>>>>>>>>     p.apply("ReadLines",
>>>>>>>>     TextIO.read().from("./dataset.tar.gz")
>>>>>>>>
>>>>>>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>>     @ProcessElement
>>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>>     c.output(c.element());
>>>>>>>>     // Just write all the content to "/tmp/filout/outputfile"
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     }))
>>>>>>>>
>>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>>
>>>>>>>>     p.run().waitUntilFinish();
>>>>>>>> }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> The full code, data file & output contents are attached.
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Saj
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but it
>>>>>>>>> cannot properly handle .tar. Did you run this code? Did your test .tar.gz file
>>>>>>>>> contain multiple files? Did you obtain the expected output, identical to
>>>>>>>>> the input except for order of lines?
>>>>>>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>>>>>>> input - so it can be removed)
>>>>>>>>>
>>>>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guys,
>>>>>>>>>>
>>>>>>>>>> TextIO can handle tar.gz-type double-compressed files.
>>>>>>>>>> See the test code.
>>>>>>>>>>
>>>>>>>>>>  PipelineOptions options =
>>>>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>>>>
>>>>>>>>>>    * p.apply("ReadLines",
>>>>>>>>>> TextIO.read().from("/dataset.tar.gz"))*
>>>>>>>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>>>>     @ProcessElement
>>>>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>>>>     c.output(c.element());
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     }))
>>>>>>>>>>
>>>>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>>>>
>>>>>>>>>>     p.run().waitUntilFinish();
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> /Saj
>>>>>>>>>>
>>>>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi!
>>>>>>>>>>> Quick questions:
>>>>>>>>>>> - which sdk are you using?
>>>>>>>>>>> - is this batch or streaming?
>>>>>>>>>>>
>>>>>>>>>>> As JB mentioned, TextIO is able to work with compressed files
>>>>>>>>>>> that contain text. Nothing currently handles the double decompression that
>>>>>>>>>>> I believe you're looking for.
>>>>>>>>>>> TextIO for Java is also able to "watch" a directory for new
>>>>>>>>>>> files. If you're able to (outside of your pipeline) decompress your first
>>>>>>>>>>> zip file into a directory that your pipeline is watching, you may be able
>>>>>>>>>>> to use that as a workaround. Does that sound like a good option?
>>>>>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>>>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>>>>>>>> suggestions are not correct).
>>>>>>>>>>> Best
>>>>>>>>>>> -P.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <
>>>>>>>>>>> jb@nanthrax.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> TextIO supports compressed files. Do you want to read files as
>>>>>>>>>>>> text?
>>>>>>>>>>>>
>>>>>>>>>>>> Can you detail a bit the use case ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Regards
>>>>>>>>>>>> JB
>>>>>>>>>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sj...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>>>>>> tar.gz files and other files.
>>>>>>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>
>

Re: Looking for I/O transform to untar a tar.gz

Posted by Sajeevan Achuthan <ac...@gmail.com>.
Hi Guys,

Can I take a look at this issue? If you agree, my Jira id is eachsaj

thanks
Saj



On 16 March 2018 at 22:13, Chamikara Jayalath <ch...@google.com> wrote:

> Created https://issues.apache.org/jira/browse/BEAM-3867.
>
> Thanks,
> Cham
>
> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov <ki...@google.com>
> wrote:
>
>> Reading cannot be parallelized, but processing can be - so there is
>> value in having our file-based sources automatically decompress .tar and
>> .tar.gz.
>> (also, I suspect that many people use Beam even for cases with a modest
>> amount of data, that don't have or need parallelism, just for the sake of
>> convenience of Beam's APIs and IOs)
>>
>> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> FWIW, if you have a concat gzip file [1] TextIO and other file-based
>>> sources should be able to read that. But we don't support tar files. Is it
>>> possible to perform tar extraction before running the pipeline? This step
>>> probably cannot be parallelized, so there is not much value in performing it
>>> within the pipeline anyway (other than easy access to various file-systems).
>>>
>>> - Cham
>>>
>>> [1] https://stackoverflow.com/questions/8005114/fast-
>>> concatenation-of-multiple-gzip-files
>>>
>>>
>>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>>> achuthan.sajeevan@gmail.com> wrote:
>>>
>>>> Eugene - Yes, you are correct. I tried with a text file &  Beam
>>>> wordcount example. The TextIO reader reads some illegal characters as seen
>>>> below.
>>>>
>>>>
>>>> here’s: 1
>>>> addiction: 1
>>>> new: 1
>>>> we: 1
>>>> mood: 1
>>>> an: 1
>>>> incredible: 1
>>>> swings,: 1
>>>> known: 1
>>>> choices.: 1
>>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
>>>> eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>>>> 1
>>>> already: 2
>>>> today: 1
>>>> the: 3
>>>> generation: 1
>>>> wordcount-00002
>>>>
>>>>
>>>> thanks
>>>> Saj
>>>>
>>>>
>>>> On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com>
>>>> wrote:
>>>>
>>>>> To clarify: I think natively supporting .tar and .tar.gz would be
>>>>> quite useful. I'm just saying that currently we don't.
>>>>>
>>>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <
>>>>> kirpichov@google.com> wrote:
>>>>>
>>>>>> The code behaves as I expected, and the output is corrupt.
>>>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>>>>>> and split the .tar file by \n.
>>>>>> E.g. the first file of the output starts with lines:
>>>>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/
>>>>>> 0000755000175000017500000000000013252764467016513 5ustar
>>>>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/
>>>>>> data0000644000175000017500000000360513252764467017353 0ustar
>>>>>> eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>>>>
>>>>>> which are clearly not the expected input.
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>
>>>>>>> Eugene, I ran the code and it works fine.  I am very confident in
>>>>>>> this case. I appreciate you guys for the great work.
>>>>>>>
>>>>>>> The code is supposed to show that Beam TextIO can read the double
>>>>>>> compressed files and write output without any processing, so I ignored the
>>>>>>> processing steps. I agree with you that further processing is not easy in
>>>>>>> this case.
>>>>>>>
>>>>>>>
>>>>>>> import org.apache.beam.sdk.Pipeline;
>>>>>>> import org.apache.beam.sdk.io.TextIO;
>>>>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>>>>
>>>>>>> public class ReadCompressedTextFile {
>>>>>>>
>>>>>>> public static void main(String[] args) {
>>>>>>> PipelineOptions options = PipelineOptionsFactory.
>>>>>>> fromArgs(args).withValidation().create();
>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>
>>>>>>>     p.apply("ReadLines",
>>>>>>>     TextIO.read().from("./dataset.tar.gz")
>>>>>>>
>>>>>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>     @ProcessElement
>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>     c.output(c.element());
>>>>>>>     // Just write all the content to "/tmp/filout/outputfile"
>>>>>>>     }
>>>>>>>
>>>>>>>     }))
>>>>>>>
>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>
>>>>>>>     p.run().waitUntilFinish();
>>>>>>> }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> The full code, data file & output contents are attached.
>>>>>>>
>>>>>>> thanks
>>>>>>> Saj
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but it
>>>>>>>> cannot properly handle .tar. Did you run this code? Did your test .tar.gz file
>>>>>>>> contain multiple files? Did you obtain the expected output, identical to
>>>>>>>> the input except for order of lines?
>>>>>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>>>>>> input - so it can be removed)
>>>>>>>>
>>>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guys,
>>>>>>>>>
>>>>>>>>> TextIO can handle tar.gz-type double-compressed files. See
>>>>>>>>> the test code.
>>>>>>>>>
>>>>>>>>>  PipelineOptions options = PipelineOptionsFactory.
>>>>>>>>> fromArgs(args).withValidation().create();
>>>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>>>
>>>>>>>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>>>>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>>>     @ProcessElement
>>>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>>>     c.output(c.element());
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     }))
>>>>>>>>>
>>>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>>>
>>>>>>>>>     p.run().waitUntilFinish();
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> /Saj
>>>>>>>>>
>>>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi!
>>>>>>>>>> Quick questions:
>>>>>>>>>> - which sdk are you using?
>>>>>>>>>> - is this batch or streaming?
>>>>>>>>>>
>>>>>>>>>> As JB mentioned, TextIO is able to work with compressed files
>>>>>>>>>> that contain text. Nothing currently handles the double decompression that
>>>>>>>>>> I believe you're looking for.
>>>>>>>>>> TextIO for Java is also able to "watch" a directory for new files.
>>>>>>>>>> If you're able to (outside of your pipeline) decompress your first zip file
>>>>>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>>>>>> that as a workaround. Does that sound like a good option?
>>>>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>>>>>>> suggestions are not correct).
>>>>>>>>>> Best
>>>>>>>>>> -P.
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <
>>>>>>>>>> jb@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> TextIO supports compressed files. Do you want to read files as
>>>>>>>>>>> text?
>>>>>>>>>>>
>>>>>>>>>>> Can you detail a bit the use case ?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sj...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>>>>> tar.gz files and other files.
>>>>>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>>>>
>>>>>>>>>>>> thanks
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>

Re: Looking for I/O transform to untar a tar.gz

Posted by Chamikara Jayalath <ch...@google.com>.
Created https://issues.apache.org/jira/browse/BEAM-3867.

Thanks,
Cham

On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov <ki...@google.com>
wrote:

> Reading cannot be parallelized, but processing can be - so there is value
> in having our file-based sources automatically decompress .tar and .tar.gz.
> (also, I suspect that many people use Beam even for cases with a modest
> amount of data, that don't have or need parallelism, just for the sake of
> convenience of Beam's APIs and IOs)
>
> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> FWIW, if you have a concat gzip file [1] TextIO and other file-based
>> sources should be able to read that. But we don't support tar files. Is it
> possible to perform tar extraction before running the pipeline? This step
> probably cannot be parallelized, so there is not much value in performing it
> within the pipeline anyway (other than easy access to various file-systems).
>>
>> - Cham
>>
>> [1]
>> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>>
>>
>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>> achuthan.sajeevan@gmail.com> wrote:
>>
>>> Eugene - Yes, you are correct. I tried with a text file &  Beam
>>> wordcount example. The TextIO reader reads some illegal characters as seen
>>> below.
>>>
>>>
>>> here’s: 1
>>> addiction: 1
>>> new: 1
>>> we: 1
>>> mood: 1
>>> an: 1
>>> incredible: 1
>>> swings,: 1
>>> known: 1
>>> choices.: 1
>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>>> 1
>>> already: 2
>>> today: 1
>>> the: 3
>>> generation: 1
>>> wordcount-00002
>>>
>>>
>>> thanks
>>> Saj
>>>
>>>
>>> On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com>
>>> wrote:
>>>
>>>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>>>> useful. I'm just saying that currently we don't.
>>>>
>>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <ki...@google.com>
>>>> wrote:
>>>>
>>>>> The code behaves as I expected, and the output is corrupt.
>>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>>>>> and split the .tar file by \n.
>>>>> E.g. the first file of the output starts with lines:
>>>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>>>> 5ustar
>>>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>>>> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>>>
>>>>> which are clearly not the expected input.
>>>>>
>>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>
>>>>>> Eugene, I ran the code and it works fine.  I am very confident in
>>>>>> this case. I appreciate you guys for the great work.
>>>>>>
>>>>>> The code is supposed to show that Beam TextIO can read the double
>>>>>> compressed files and write output without any processing, so I ignored the
>>>>>> processing steps. I agree with you that further processing is not easy in
>>>>>> this case.
>>>>>>
>>>>>>
>>>>>> import org.apache.beam.sdk.Pipeline;
>>>>>> import org.apache.beam.sdk.io.TextIO;
>>>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>>>
>>>>>> public class ReadCompressedTextFile {
>>>>>>
>>>>>> public static void main(String[] args) {
>>>>>> PipelineOptions options =
>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>
>>>>>>     p.apply("ReadLines",
>>>>>>     TextIO.read().from("./dataset.tar.gz")
>>>>>>
>>>>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>>>>     @ProcessElement
>>>>>>     public void processElement(ProcessContext c) {
>>>>>>     c.output(c.element());
>>>>>>     // Just write all the content to "/tmp/filout/outputfile"
>>>>>>     }
>>>>>>
>>>>>>     }))
>>>>>>
>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>
>>>>>>     p.run().waitUntilFinish();
>>>>>> }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> The full code, data file & output contents are attached.
>>>>>>
>>>>>> thanks
>>>>>> Saj
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but it
>>>>>>> cannot properly handle .tar. Did you run this code? Did your test .tar.gz file
>>>>>>> contain multiple files? Did you obtain the expected output, identical to
>>>>>>> the input except for order of lines?
>>>>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>>>>> input - so it can be removed)
>>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guys,
>>>>>>>>
>>>>>>>> TextIO can handle tar.gz-type double-compressed files. See
>>>>>>>> the test code.
>>>>>>>>
>>>>>>>>  PipelineOptions options =
>>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>>
>>>>>>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>>>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>>     @ProcessElement
>>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>>     c.output(c.element());
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     }))
>>>>>>>>
>>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>>
>>>>>>>>     p.run().waitUntilFinish();
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> /Saj
>>>>>>>>
>>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>> Quick questions:
>>>>>>>>> - which sdk are you using?
>>>>>>>>> - is this batch or streaming?
>>>>>>>>>
>>>>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>>>>> contain text. Nothing currently handles the double decompression that I
>>>>>>>>> believe you're looking for.
>>>>>>>>> TextIO for Java is also able to "watch" a directory for new files.
>>>>>>>>> If you're able to (outside of your pipeline) decompress your first zip file
>>>>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>>>>> that as a workaround. Does that sound like a good option?
>>>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>>>>>> suggestions are not correct).
>>>>>>>>> Best
>>>>>>>>> -P.
>>>>>>>>>
>>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <
>>>>>>>>> jb@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> TextIO supports compressed files. Do you want to read files as
>>>>>>>>>> text?
>>>>>>>>>>
>>>>>>>>>> Can you detail a bit the use case ?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sj...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>>>> tar.gz files and other files.
>>>>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>>>>
>>>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>>>
>>>>>>>>>>> thanks
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>

Re: Looking for I/O transform to untar a tar.gz

Posted by Eugene Kirpichov <ki...@google.com>.
Reading cannot be parallelized, but processing can be - so there is value
in having our file-based sources automatically decompress .tar and .tar.gz.
(also, I suspect that many people use Beam even for cases with a modest
amount of data, that don't have or need parallelism, just for the sake of
convenience of Beam's APIs and IOs)
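
A tar-aware file-based source would mostly be an exercise in walking the tar framing. Below is a rough, stdlib-only Java sketch of that framing; the class and method names are invented for illustration, and a real reader would also need to handle checksums, long-name extensions, and the other header types this deliberately ignores:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MiniTar {
    // A tar archive is a sequence of 512-byte headers, each followed by the
    // entry's data padded up to a 512-byte boundary, then two all-zero blocks.
    // The entry name sits at header offset 0 (100 bytes, NUL-terminated) and
    // the size at offset 124 (12 bytes of octal ASCII).
    public static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        byte[] header = new byte[512];
        while (in.readNBytes(header, 0, 512) == 512 && !allZero(header)) {
            String name = asciiField(header, 0, 100);
            int size = Integer.parseInt(asciiField(header, 124, 12).trim(), 8);
            entries.put(name, in.readNBytes(size));
            in.skipNBytes((512 - size % 512) % 512); // skip padding to block boundary
        }
        return entries;
    }

    private static String asciiField(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) end++; // stop at NUL terminator
        return new String(b, off, end - off, StandardCharsets.US_ASCII);
    }

    private static boolean allZero(byte[] b) {
        for (byte x : b) if (x != 0) return false;
        return true;
    }
}
```

Because each entry's name and size sit at fixed offsets in a 512-byte header, a reader that understands this framing can hand entries out individually for parallel processing; a text reader that instead scans the raw bytes for `\n` mixes those header fields into its output, which is exactly the `0000755...ustar` noise quoted earlier in the thread.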

On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <ch...@google.com>
wrote:

> FWIW, if you have a concat gzip file [1] TextIO and other file-based
> sources should be able to read that. But we don't support tar files. Is it
> possible to perform tar extraction before running the pipeline? This step
> probably cannot be parallelized, so there is not much value in performing it
> within the pipeline anyway (other than easy access to various file-systems).
>
> - Cham
>
> [1]
> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>
>
> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
> achuthan.sajeevan@gmail.com> wrote:
>
>> Eugene - Yes, you are correct. I tried with a text file &  Beam wordcount
>> example. The TextIO reader reads some illegal characters as seen below.
>>
>>
>> here’s: 1
>> addiction: 1
>> new: 1
>> we: 1
>> mood: 1
>> an: 1
>> incredible: 1
>> swings,: 1
>> known: 1
>> choices.: 1
>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>> 1
>> already: 2
>> today: 1
>> the: 3
>> generation: 1
>> wordcount-00002
>>
>>
>> thanks
>> Saj
>>
>>
>> On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com> wrote:
>>
>>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>>> useful. I'm just saying that currently we don't.
>>>
>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <ki...@google.com>
>>> wrote:
>>>
>>>> The code behaves as I expected, and the output is corrupt.
>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>>>> and split the .tar file by \n.
>>>> E.g. the first file of the output starts with lines:
>>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>>> 5ustar
>>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>>> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>>
>>>> which are clearly not the expected input.
>>>>
>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>
>>>>> Eugene, I ran the code and it works fine.  I am very confident in this
>>>>> case. I appreciate you guys for the great work.
>>>>>
>>>>> The code is supposed to show that Beam TextIO can read the double
>>>>> compressed files and write output without any processing, so I ignored the
>>>>> processing steps. I agree with you that further processing is not easy in
>>>>> this case.
>>>>>
>>>>>
>>>>> import org.apache.beam.sdk.Pipeline;
>>>>> import org.apache.beam.sdk.io.TextIO;
>>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>>
>>>>> public class ReadCompressedTextFile {
>>>>>
>>>>> public static void main(String[] args) {
>>>>> PipelineOptions options =
>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>     Pipeline p = Pipeline.create(options);
>>>>>
>>>>>     p.apply("ReadLines",
>>>>>     TextIO.read().from("./dataset.tar.gz")
>>>>>
>>>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>>>     @ProcessElement
>>>>>     public void processElement(ProcessContext c) {
>>>>>     c.output(c.element());
>>>>>     // Just write all the content to "/tmp/filout/outputfile"
>>>>>     }
>>>>>
>>>>>     }))
>>>>>
>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>
>>>>>     p.run().waitUntilFinish();
>>>>> }
>>>>>
>>>>> }
>>>>>
>>>>> The full code, data file & output contents are attached.
>>>>>
>>>>> thanks
>>>>> Saj
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but it
>>>>>> cannot properly handle .tar. Did you run this code? Did your test .tar.gz file
>>>>>> contain multiple files? Did you obtain the expected output, identical to
>>>>>> the input except for order of lines?
>>>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>>>> input - so it can be removed)
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> TextIO can handle tar.gz-type double-compressed files. See
>>>>>>> the test code.
>>>>>>>
>>>>>>>  PipelineOptions options =
>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>
>>>>>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>>>>     @ProcessElement
>>>>>>>     public void processElement(ProcessContext c) {
>>>>>>>     c.output(c.element());
>>>>>>>     }
>>>>>>>
>>>>>>>     }))
>>>>>>>
>>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>
>>>>>>>     p.run().waitUntilFinish();
>>>>>>>
>>>>>>> Thanks
>>>>>>> /Saj
>>>>>>>
>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>> Quick questions:
>>>>>>>> - which sdk are you using?
>>>>>>>> - is this batch or streaming?
>>>>>>>>
>>>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>>>> contain text. Nothing currently handles the double decompression that I
>>>>>>>> believe you're looking for.
>>>>>>>> TextIO for Java is also able to "watch" a directory for new files.
>>>>>>>> If you're able to (outside of your pipeline) decompress your first zip file
>>>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>>>> that as a workaround. Does that sound like a good option?
>>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>>>>> suggestions are not correct).
>>>>>>>> Best
>>>>>>>> -P.
>>>>>>>>
>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> TextIO supports compressed files. Do you want to read files as text?
>>>>>>>>>
>>>>>>>>> Can you detail a bit the use case ?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sj...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>>> tar.gz files and other files.
>>>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>>>
>>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>

Re: Looking for I/O transform to untar a tar.gz

Posted by Chamikara Jayalath <ch...@google.com>.
FWIW, if you have a concat gzip file [1] TextIO and other file-based
sources should be able to read that. But we don't support tar files. Is it
possible to perform tar extraction before running the pipeline? This step
probably cannot be parallelized, so there is not much value in performing it
within the pipeline anyway (other than easy access to various file-systems).

- Cham

[1]
https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
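
The concat-gzip behavior is easy to demonstrate with just the JDK: a gzip file may consist of several complete members back to back, and a multi-member-aware reader such as `java.util.zip.GZIPInputStream` decompresses the byte-concatenation as one continuous stream. A minimal sketch (the class name is invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatGzip {
    // Compress a string into one complete gzip member.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress a (possibly multi-member) gzip stream back to text.
    static String gunzip(byte[] bytes) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two independent members, byte-concatenated - no re-compression needed.
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        concat.writeBytes(gzip("line1\n"));
        concat.writeBytes(gzip("line2\n"));
        System.out.print(gunzip(concat.toByteArray())); // prints line1, then line2
    }
}
```

This is why concatenated gzip works with file-based sources without special handling, while tar does not: the gzip framing is transparent to the decompressor, but tar framing has to be parsed by something that knows about it.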

On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
achuthan.sajeevan@gmail.com> wrote:

> Eugene - Yes, you are correct. I tried with a text file &  Beam wordcount
> example. The TextIO reader reads some illegal characters as seen below.
>
>
> here’s: 1
> addiction: 1
> new: 1
> we: 1
> mood: 1
> an: 1
> incredible: 1
> swings,: 1
> known: 1
> choices.: 1
> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
> 1
> already: 2
> today: 1
> the: 3
> generation: 1
> wordcount-00002
>
>
> thanks
> Saj
>
>
> On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com> wrote:
>
>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>> useful. I'm just saying that currently we don't.
>>
>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <ki...@google.com>
>> wrote:
>>
>>> The code behaves as I expected, and the output is corrupt.
>>> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
>>> split the .tar file by \n.
>>> E.g. the first file of the output starts with lines:
>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>> 5ustar
>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>
>>> which are clearly not the expected input.
>>>
>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>> achuthan.sajeevan@gmail.com> wrote:
>>>
>>>> Eugene, I ran the code and it works fine.  I am very confident in this
>>>> case. I appreciate you guys for the great work.
>>>>
>>>> The code supposed to show that Beam TextIO can read the double
>>>> compressed files and write output without any processing. so ignored the
>>>> processing steps. I agree with you the further processing is not easy in
>>>> this case.
>>>>
>>>>
>>>> import org.apache.beam.sdk.Pipeline;
>>>> import org.apache.beam.sdk.io.TextIO;
>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>
>>>> public class ReadCompressedTextFile {
>>>>
>>>> public static void main(String[] args) {
>>>> PipelineOptions optios =
>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>     Pipeline p = Pipeline.create(optios);
>>>>
>>>>     p.apply("ReadLines",
>>>>     TextIO.read().from("./dataset.tar.gz")
>>>>
>>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>>     @ProcessElement
>>>>     public void processElement(ProcessContext c) {
>>>>     c.output(c.element());
>>>>     // Just write the all content to "/tmp/filout/outputfile"
>>>>     }
>>>>
>>>>     }))
>>>>
>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>
>>>>     p.run().waitUntilFinish();
>>>> }
>>>>
>>>> }
>>>>
>>>> The full code, data file & output contents are attached.
>>>>
>>>> thanks
>>>> Saj
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>>> wrote:
>>>>
>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
>>>>> handle properly .tar. Did you run this code? Did your test .tar.gz file
>>>>> contain multiple files? Did you obtain the expected output, identical to
>>>>> the input except for order of lines?
>>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>>> input - so it can be removed)
>>>>>
>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> The TextIo can handle the tar.gz type double compressed files. See
>>>>>> the code test code.
>>>>>>
>>>>>>  PipelineOptions optios =
>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>     Pipeline p = Pipeline.create(optios);
>>>>>>
>>>>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>>>     @ProcessElement
>>>>>>     public void processElement(ProcessContext c) {
>>>>>>     c.output(c.element());
>>>>>>     }
>>>>>>
>>>>>>     }))
>>>>>>
>>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>
>>>>>>     p.run().waitUntilFinish();
>>>>>>
>>>>>> Thanks
>>>>>> /Saj
>>>>>>
>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>> Quick questions:
>>>>>>> - which sdk are you using?
>>>>>>> - is this batch or streaming?
>>>>>>>
>>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>>> contain text. Nothing currently handles the double decompression that I
>>>>>>> believe you're looking for.
>>>>>>> TextIO for Java is also able to"watch" a directory for new files. If
>>>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>>> that as work around. Does that sound like a good thing?
>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>>>> suggestions are not correct).
>>>>>>> Best
>>>>>>> -P.
>>>>>>>
>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> TextIO supports compressed file. Do you want to read files in text ?
>>>>>>>>
>>>>>>>> Can you detail a bit the use case ?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a
>>>>>>>> écrit:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>> tar.gz files and other files.
>>>>>>>>> I would lile to extract the tar.gz files from the tar.
>>>>>>>>>
>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>> If not is it in works? Any pointers to start work on it?
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>> --
>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>

Re: Looking for I/O transform to untar a tar.gz

Posted by Sajeevan Achuthan <ac...@gmail.com>.
Eugene - Yes, you are correct. I tried with a text file and the Beam wordcount
example. The TextIO reader reads some illegal characters, as seen below.


here’s: 1
addiction: 1
new: 1
we: 1
mood: 1
an: 1
incredible: 1
swings,: 1
known: 1
choices.: 1
^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
1
already: 2
today: 1
the: 3
generation: 1
wordcount-00002


thanks
Saj
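The NUL runs and stray user names in the output above are consistent with tar framing rather than a reader bug: tar precedes each member with a 512-byte header (file name, octal size, owner names, the magic string `ustar`) and pads with zero bytes to block boundaries. A hedged standard-library Python sketch (no Beam involved) reproduces the effect by stripping only the gzip layer of a `.tar.gz`, which is effectively what TextIO does here:

```python
import gzip
import io
import tarfile

# Build a small .tar.gz in memory containing one text file.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tar:
    data = b"hello from inside the tar\n"
    info = tarfile.TarInfo(name="inner.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Strip only the gzip layer, leaving the raw tar bytes.
raw_tar = gzip.decompress(tar_buf.getvalue())

# The "text" now contains tar header fields and NUL padding
# interleaved with the real file contents.
print(b"ustar" in raw_tar)          # True: tar magic string from the header
print(raw_tar.count(b"\x00") > 0)   # True: 512-byte block padding
```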


On 16 March 2018 at 17:45, Eugene Kirpichov <ki...@google.com> wrote:

> To clarify: I think natively supporting .tar and .tar.gz would be quite
> useful. I'm just saying that currently we don't.
>
> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <ki...@google.com>
> wrote:
>
>> The code behaves as I expected, and the output is corrupt.
>> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
>> split the .tar file by \n.
>> E.g. the first file of the output starts with lines:
>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/
>> 0000755000175000017500000000000013252764467016513 5ustar
>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/
>> data0000644000175000017500000000360513252764467017353 0ustar
>> eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>
>> which are clearly not the expected input.
>>
>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>> achuthan.sajeevan@gmail.com> wrote:
>>
>>> Eugene, I ran the code and it works fine.  I am very confident in this
>>> case. I appreciate you guys for the great work.
>>>
>>> The code supposed to show that Beam TextIO can read the double
>>> compressed files and write output without any processing. so ignored the
>>> processing steps. I agree with you the further processing is not easy in
>>> this case.
>>>
>>>
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.io.TextIO;
>>> import org.apache.beam.sdk.options.PipelineOptions;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.transforms.DoFn;
>>> import org.apache.beam.sdk.transforms.ParDo;
>>>
>>> public class ReadCompressedTextFile {
>>>
>>> public static void main(String[] args) {
>>> PipelineOptions optios = PipelineOptionsFactory.
>>> fromArgs(args).withValidation().create();
>>>     Pipeline p = Pipeline.create(optios);
>>>
>>>     p.apply("ReadLines",
>>>     TextIO.read().from("./dataset.tar.gz")
>>>
>>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>>     @ProcessElement
>>>     public void processElement(ProcessContext c) {
>>>     c.output(c.element());
>>>     // Just write the all content to "/tmp/filout/outputfile"
>>>     }
>>>
>>>     }))
>>>
>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>>     p.run().waitUntilFinish();
>>> }
>>>
>>> }
>>>
>>> The full code, data file & output contents are attached.
>>>
>>> thanks
>>> Saj
>>>
>>>
>>>
>>>
>>>
>>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com>
>>> wrote:
>>>
>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
>>>> handle properly .tar. Did you run this code? Did your test .tar.gz file
>>>> contain multiple files? Did you obtain the expected output, identical to
>>>> the input except for order of lines?
>>>> (also, the ParDo in this code doesn't do anything - it outputs its
>>>> input - so it can be removed)
>>>>
>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>>> achuthan.sajeevan@gmail.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> The TextIo can handle the tar.gz type double compressed files. See the
>>>>> code test code.
>>>>>
>>>>>  PipelineOptions optios = PipelineOptionsFactory.
>>>>> fromArgs(args).withValidation().create();
>>>>>     Pipeline p = Pipeline.create(optios);
>>>>>
>>>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>>     @ProcessElement
>>>>>     public void processElement(ProcessContext c) {
>>>>>     c.output(c.element());
>>>>>     }
>>>>>
>>>>>     }))
>>>>>
>>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>
>>>>>     p.run().waitUntilFinish();
>>>>>
>>>>> Thanks
>>>>> /Saj
>>>>>
>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>> Quick questions:
>>>>>> - which sdk are you using?
>>>>>> - is this batch or streaming?
>>>>>>
>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>> contain text. Nothing currently handles the double decompression that I
>>>>>> believe you're looking for.
>>>>>> TextIO for Java is also able to"watch" a directory for new files. If
>>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>> that as work around. Does that sound like a good thing?
>>>>>> Finally, if you want to implement a transform that does all your
>>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>>> suggestions are not correct).
>>>>>> Best
>>>>>> -P.
>>>>>>
>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> TextIO supports compressed file. Do you want to read files in text ?
>>>>>>>
>>>>>>> Can you detail a bit the use case ?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Regards
>>>>>>> JB
>>>>>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a
>>>>>>> écrit:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>> tar.gz files and other files.
>>>>>>>> I would lile to extract the tar.gz files from the tar.
>>>>>>>>
>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>> If not is it in works? Any pointers to start work on it?
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>> --
>>>>>> Got feedback? go/pabloem-feedback
>>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>>
>>>>>
>>>>>
>>>

Re: Looking for I/O transform to untar a tar.gz

Posted by Eugene Kirpichov <ki...@google.com>.
To clarify: I think natively supporting .tar and .tar.gz would be quite
useful. I'm just saying that currently we don't.
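For reference, the extraction the original poster asked about (an outer archive whose members are themselves `.tar.gz` files) can be sketched with Python's standard library. This is a hedged illustration of the logic a native Beam transform would need to perform, not Beam code; the function names are hypothetical:

```python
import io
import tarfile

def make_tar_gz(files):
    """Build a .tar.gz in memory from a {name: bytes} mapping (test fixture)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def extract_inner_tar_gz(outer_bytes):
    """Yield (inner_archive, member_name, member_bytes) for each .tar.gz
    member nested inside the outer .tar.gz, skipping other files."""
    with tarfile.open(fileobj=io.BytesIO(outer_bytes), mode="r:gz") as outer:
        for entry in outer:
            if not entry.isfile() or not entry.name.endswith(".tar.gz"):
                continue
            inner_bytes = outer.extractfile(entry).read()
            with tarfile.open(fileobj=io.BytesIO(inner_bytes), mode="r:gz") as inner:
                for member in inner:
                    if member.isfile():
                        yield entry.name, member.name, inner.extractfile(member).read()

# Demo: an outer archive with one nested .tar.gz and one plain file.
inner = make_tar_gz({"a.txt": b"payload"})
outer = make_tar_gz({"inner.tar.gz": inner, "notes.txt": b"other"})
print(list(extract_inner_tar_gz(outer)))
# [('inner.tar.gz', 'a.txt', b'payload')]
```

Because tar entries must be read sequentially, this inner loop is inherently serial per archive; parallelism would come from processing many archives at once.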

On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <ki...@google.com>
wrote:

> The code behaves as I expected, and the output is corrupt.
> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
> split the .tar file by \n.
> E.g. the first file of the output starts with lines:
> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
> 5ustar
> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>
> which are clearly not the expected input.
>
> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
> achuthan.sajeevan@gmail.com> wrote:
>
>> Eugene, I ran the code and it works fine.  I am very confident in this
>> case. I appreciate you guys for the great work.
>>
>> The code supposed to show that Beam TextIO can read the double compressed
>> files and write output without any processing. so ignored the processing
>> steps. I agree with you the further processing is not easy in this case.
>>
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.TextIO;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.ParDo;
>>
>> public class ReadCompressedTextFile {
>>
>> public static void main(String[] args) {
>> PipelineOptions optios =
>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>     Pipeline p = Pipeline.create(optios);
>>
>>     p.apply("ReadLines",
>>     TextIO.read().from("./dataset.tar.gz")
>>
>>     ).apply(ParDo.of(new DoFn<String, String>(){
>>     @ProcessElement
>>     public void processElement(ProcessContext c) {
>>     c.output(c.element());
>>     // Just write the all content to "/tmp/filout/outputfile"
>>     }
>>
>>     }))
>>
>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>>     p.run().waitUntilFinish();
>> }
>>
>> }
>>
>> The full code, data file & output contents are attached.
>>
>> thanks
>> Saj
>>
>>
>>
>>
>>
>> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com> wrote:
>>
>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
>>> handle properly .tar. Did you run this code? Did your test .tar.gz file
>>> contain multiple files? Did you obtain the expected output, identical to
>>> the input except for order of lines?
>>> (also, the ParDo in this code doesn't do anything - it outputs its input
>>> - so it can be removed)
>>>
>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>> achuthan.sajeevan@gmail.com> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> The TextIo can handle the tar.gz type double compressed files. See the
>>>> code test code.
>>>>
>>>>  PipelineOptions optios =
>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>     Pipeline p = Pipeline.create(optios);
>>>>
>>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>>     @ProcessElement
>>>>     public void processElement(ProcessContext c) {
>>>>     c.output(c.element());
>>>>     }
>>>>
>>>>     }))
>>>>
>>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>
>>>>     p.run().waitUntilFinish();
>>>>
>>>> Thanks
>>>> /Saj
>>>>
>>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>>>>
>>>>> Hi!
>>>>> Quick questions:
>>>>> - which sdk are you using?
>>>>> - is this batch or streaming?
>>>>>
>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>> contain text. Nothing currently handles the double decompression that I
>>>>> believe you're looking for.
>>>>> TextIO for Java is also able to"watch" a directory for new files. If
>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>> that as work around. Does that sound like a good thing?
>>>>> Finally, if you want to implement a transform that does all your
>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>> case, someone that knows SDF better can give you guidance (or clarify if my
>>>>> suggestions are not correct).
>>>>> Best
>>>>> -P.
>>>>>
>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> TextIO supports compressed file. Do you want to read files in text ?
>>>>>>
>>>>>> Can you detail a bit the use case ?
>>>>>>
>>>>>> Thanks
>>>>>> Regards
>>>>>> JB
>>>>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a
>>>>>> écrit:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>>>> files and other files.
>>>>>>> I would lile to extract the tar.gz files from the tar.
>>>>>>>
>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>> If not is it in works? Any pointers to start work on it?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>> --
>>>>> Got feedback? go/pabloem-feedback
>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>
>>>>
>>>>
>>

Re: Looking for I/O transform to untar a tar.gz

Posted by Eugene Kirpichov <ki...@google.com>.
The code behaves as I expected, and the output is corrupt.
Beam unzipped the .gz, but then interpreted the .tar as a text file, and
split the .tar file by \n.
E.g. the first file of the output starts with lines:
A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
5ustar
eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>

which are clearly not the expected input.

On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
achuthan.sajeevan@gmail.com> wrote:

> Eugene, I ran the code and it works fine.  I am very confident in this
> case. I appreciate you guys for the great work.
>
> The code supposed to show that Beam TextIO can read the double compressed
> files and write output without any processing. so ignored the processing
> steps. I agree with you the further processing is not easy in this case.
>
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
>
> public class ReadCompressedTextFile {
>
> public static void main(String[] args) {
> PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>     Pipeline p = Pipeline.create(optios);
>
>     p.apply("ReadLines",
>     TextIO.read().from("./dataset.tar.gz")
>
>     ).apply(ParDo.of(new DoFn<String, String>(){
>     @ProcessElement
>     public void processElement(ProcessContext c) {
>     c.output(c.element());
>     // Just write the all content to "/tmp/filout/outputfile"
>     }
>
>     }))
>
>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>
>     p.run().waitUntilFinish();
> }
>
> }
>
> The full code, data file & output contents are attached.
>
> thanks
> Saj
>
>
>
>
>
> On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com> wrote:
>
>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
>> handle properly .tar. Did you run this code? Did your test .tar.gz file
>> contain multiple files? Did you obtain the expected output, identical to
>> the input except for order of lines?
>> (also, the ParDo in this code doesn't do anything - it outputs its input
>> - so it can be removed)
>>
>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>> achuthan.sajeevan@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> The TextIo can handle the tar.gz type double compressed files. See the
>>> code test code.
>>>
>>>  PipelineOptions optios =
>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>     Pipeline p = Pipeline.create(optios);
>>>
>>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>>     @ProcessElement
>>>     public void processElement(ProcessContext c) {
>>>     c.output(c.element());
>>>     }
>>>
>>>     }))
>>>
>>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>>     p.run().waitUntilFinish();
>>>
>>> Thanks
>>> /Saj
>>>
>>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>>>
>>>> Hi!
>>>> Quick questions:
>>>> - which sdk are you using?
>>>> - is this batch or streaming?
>>>>
>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>> contain text. Nothing currently handles the double decompression that I
>>>> believe you're looking for.
>>>> TextIO for Java is also able to"watch" a directory for new files. If
>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>> into a directory that your pipeline is watching, you may be able to use
>>>> that as work around. Does that sound like a good thing?
>>>> Finally, if you want to implement a transform that does all your logic,
>>>> well then that sounds like SplittableDoFn material; and in that case,
>>>> someone that knows SDF better can give you guidance (or clarify if my
>>>> suggestions are not correct).
>>>> Best
>>>> -P.
>>>>
>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> TextIO supports compressed file. Do you want to read files in text ?
>>>>>
>>>>> Can you detail a bit the use case ?
>>>>>
>>>>> Thanks
>>>>> Regards
>>>>> JB
>>>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a écrit:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>>> files and other files.
>>>>>> I would lile to extract the tar.gz files from the tar.
>>>>>>
>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>> If not is it in works? Any pointers to start work on it?
>>>>>>
>>>>>> thanks
>>>>>>
>>>>> --
>>>> Got feedback? go/pabloem-feedback
>>>> <https://goto.google.com/pabloem-feedback>
>>>>
>>>
>>>
>

Re: Looking for I/O transform to untar a tar.gz

Posted by Sajeevan Achuthan <ac...@gmail.com>.
Eugene, I ran the code and it works fine.  I am very confident in this
case. I appreciate you guys for the great work.

The code was supposed to show that Beam TextIO can read the doubly compressed
files and write the output without any processing, so I left out the processing
steps. I agree with you that further processing is not easy in this case.


import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ReadCompressedTextFile {

  public static void main(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Pass each line through unchanged to "/tmp/filout/outputfile".
            c.output(c.element());
          }
        }))
        .apply(TextIO.write().to("/tmp/filout/outputfile"));

    p.run().waitUntilFinish();
  }
}

The full code, data file & output contents are attached.

thanks
Saj





On 16 March 2018 at 16:56, Eugene Kirpichov <ki...@google.com> wrote:

> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
> handle properly .tar. Did you run this code? Did your test .tar.gz file
> contain multiple files? Did you obtain the expected output, identical to
> the input except for order of lines?
> (also, the ParDo in this code doesn't do anything - it outputs its input -
> so it can be removed)
>
> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
> achuthan.sajeevan@gmail.com> wrote:
>
>> Hi Guys,
>>
>> The TextIo can handle the tar.gz type double compressed files. See the
>> code test code.
>>
>>  PipelineOptions optios = PipelineOptionsFactory.
>> fromArgs(args).withValidation().create();
>>     Pipeline p = Pipeline.create(optios);
>>
>>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>     @ProcessElement
>>     public void processElement(ProcessContext c) {
>>     c.output(c.element());
>>     }
>>
>>     }))
>>
>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>>     p.run().waitUntilFinish();
>>
>> Thanks
>> /Saj
>>
>> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>>
>>> Hi!
>>> Quick questions:
>>> - which sdk are you using?
>>> - is this batch or streaming?
>>>
>>> As JB mentioned, TextIO is able to work with compressed files that
>>> contain text. Nothing currently handles the double decompression that I
>>> believe you're looking for.
>>> TextIO for Java is also able to"watch" a directory for new files. If
>>> you're able to (outside of your pipeline) decompress your first zip file
>>> into a directory that your pipeline is watching, you may be able to use
>>> that as work around. Does that sound like a good thing?
>>> Finally, if you want to implement a transform that does all your logic,
>>> well then that sounds like SplittableDoFn material; and in that case,
>>> someone that knows SDF better can give you guidance (or clarify if my
>>> suggestions are not correct).
>>> Best
>>> -P.
>>>
>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> TextIO supports compressed file. Do you want to read files in text ?
>>>>
>>>> Can you detail a bit the use case ?
>>>>
>>>> Thanks
>>>> Regards
>>>> JB
>>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a écrit:
>>>>>
>>>>> Hi,
>>>>>
>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>> files and other files.
>>>>> I would lile to extract the tar.gz files from the tar.
>>>>>
>>>>> Is there a transform that can do that? I couldn't find one.
>>>>> If not is it in works? Any pointers to start work on it?
>>>>>
>>>>> thanks
>>>>>
>>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>>
>>

Re: Looking for I/O transform to untar a tar.gz

Posted by Eugene Kirpichov <ki...@google.com>.
Sajeevan - I'm quite confident that TextIO can handle .gz, but it cannot
properly handle .tar. Did you run this code? Did your test .tar.gz file
contain multiple files? Did you obtain the expected output, identical to
the input except for the order of lines?
(Also, the ParDo in this code doesn't do anything - it outputs its input -
so it can be removed.)

On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
achuthan.sajeevan@gmail.com> wrote:

> Hi Guys,
>
> The TextIo can handle the tar.gz type double compressed files. See the
> code test code.
>
>  PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>     Pipeline p = Pipeline.create(optios);
>
>    * p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>                       .apply(ParDo.of(new DoFn<String, String>(){
>     @ProcessElement
>     public void processElement(ProcessContext c) {
>     c.output(c.element());
>     }
>
>     }))
>
>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>
>     p.run().waitUntilFinish();
>
> Thanks
> /Saj
>
> On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:
>
>> Hi!
>> Quick questions:
>> - which sdk are you using?
>> - is this batch or streaming?
>>
>> As JB mentioned, TextIO is able to work with compressed files that
>> contain text. Nothing currently handles the double decompression that I
>> believe you're looking for.
>> TextIO for Java is also able to"watch" a directory for new files. If
>> you're able to (outside of your pipeline) decompress your first zip file
>> into a directory that your pipeline is watching, you may be able to use
>> that as work around. Does that sound like a good thing?
>> Finally, if you want to implement a transform that does all your logic,
>> well then that sounds like SplittableDoFn material; and in that case,
>> someone that knows SDF better can give you guidance (or clarify if my
>> suggestions are not correct).
>> Best
>> -P.
>>
>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>>> Hi
>>>
>>> TextIO supports compressed file. Do you want to read files in text ?
>>>
>>> Can you detail a bit the use case ?
>>>
>>> Thanks
>>> Regards
>>> JB
>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a écrit:
>>>>
>>>> Hi,
>>>>
>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>> files and other files.
>>>> I would lile to extract the tar.gz files from the tar.
>>>>
>>>> Is there a transform that can do that? I couldn't find one.
>>>> If not is it in works? Any pointers to start work on it?
>>>>
>>>> thanks
>>>>
>>> --
>> Got feedback? go/pabloem-feedback
>> <https://goto.google.com/pabloem-feedback>
>>
>
>

Re: Looking for I/O transform to untar a tar.gz

Posted by Sajeevan Achuthan <ac...@gmail.com>.
Hi Guys,

TextIO can handle tar.gz-type doubly compressed files. See the test code
below.

PipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);

p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
    .apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element());
      }
    }))
    .apply(TextIO.write().to("/tmp/filout/outputfile"));

p.run().waitUntilFinish();

Thanks
/Saj
On 16 March 2018 at 04:29, Pablo Estrada <pa...@google.com> wrote:

> Hi!
> Quick questions:
> - which sdk are you using?
> - is this batch or streaming?
>
> As JB mentioned, TextIO is able to work with compressed files that contain
> text. Nothing currently handles the double decompression that I believe
> you're looking for.
> TextIO for Java is also able to"watch" a directory for new files. If
> you're able to (outside of your pipeline) decompress your first zip file
> into a directory that your pipeline is watching, you may be able to use
> that as work around. Does that sound like a good thing?
> Finally, if you want to implement a transform that does all your logic,
> well then that sounds like SplittableDoFn material; and in that case,
> someone that knows SDF better can give you guidance (or clarify if my
> suggestions are not correct).
> Best
> -P.
>
> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi
>>
>> TextIO supports compressed file. Do you want to read files in text ?
>>
>> Can you detail a bit the use case ?
>>
>> Thanks
>> Regards
>> JB
>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a écrit:
>>>
>>> Hi,
>>>
>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>> files and other files.
>>> I would lile to extract the tar.gz files from the tar.
>>>
>>> Is there a transform that can do that? I couldn't find one.
>>> If not is it in works? Any pointers to start work on it?
>>>
>>> thanks
>>>
>> --
> Got feedback? go/pabloem-feedback
>

Re: Looking for I/O transform to untar a tar.gz

Posted by Pablo Estrada <pa...@google.com>.
Hi!
Quick questions:
- which sdk are you using?
- is this batch or streaming?

As JB mentioned, TextIO is able to work with compressed files that contain
text. Nothing currently handles the double decompression that I believe
you're looking for.
TextIO for Java is also able to "watch" a directory for new files. If you're
able to (outside of your pipeline) decompress your first zip file into a
directory that your pipeline is watching, you may be able to use that as a
workaround. Does that sound like a good option?
Finally, if you want to implement a transform that does all your logic,
well then that sounds like SplittableDoFn material; and in that case,
someone that knows SDF better can give you guidance (or clarify if my
suggestions are not correct).
Best
-P.
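The staging half of this workaround (the part that runs outside the pipeline, before TextIO's file-watching picks anything up) can be sketched with standard-library code. This is a hedged, Beam-independent Python illustration; the `stage_zip` name and paths are hypothetical:

```python
import pathlib
import tempfile
import zipfile

def stage_zip(zip_path, watch_dir):
    """Extract every member of the outer zip into the directory the
    pipeline watches, so the watching source can pick them up."""
    watch = pathlib.Path(watch_dir)
    watch.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(watch)
    return sorted(p.name for p in watch.iterdir())

# Demo: build a small outer zip and stage it into a watch directory.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = pathlib.Path(tmp) / "outer.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("a.tar.gz", "inner archive bytes")
        zf.writestr("b.txt", "plain file")
    print(stage_zip(zip_path, pathlib.Path(tmp) / "watched"))  # ['a.tar.gz', 'b.txt']
```

On the Beam side, recent Java SDKs expose the watching behavior via something like `TextIO.read().from(pattern).watchForNewFiles(...)`; check the TextIO javadoc for the exact signature in your SDK version.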

On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:

> Hi
>
> TextIO supports compressed file. Do you want to read files in text ?
>
> Can you detail a bit the use case ?
>
> Thanks
> Regards
> JB
> Le 15 mars 2018, à 18:28, Shirish Jamthe <sj...@google.com> a écrit:
>>
>> Hi,
>>
>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>> files and other files.
>> I would lile to extract the tar.gz files from the tar.
>>
>> Is there a transform that can do that? I couldn't find one.
>> If not is it in works? Any pointers to start work on it?
>>
>> thanks
>>
> --
Got feedback? go/pabloem-feedback

Re: Looking for I/O transform to untar a tar.gz

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi

TextIO supports compressed files. Do you want to read the files as text?

Can you detail the use case a bit?

Thanks
Regards
JB

Le 15 mars 2018 à 18:28, à 18:28, Shirish Jamthe <sj...@google.com> a écrit:
>Hi,
>
>My input is a tar.gz or .zip file which contains thousands of tar.gz
>files
>and other files.
>I would like to extract the tar.gz files from the tar.
>
>Is there a transform that can do that? I couldn't find one.
>If not, is it in the works? Any pointers to start work on it?
>
>thanks