Posted to user@spark.apache.org by Paul Tremblay <pa...@gmail.com> on 2017/02/04 21:25:24 UTC
Turning rows into columns
I am using PySpark 2.1 and am wondering how to convert a flat file, in
which each record is spread over several lines (one field per line), into
a columnar format.
Here is an example of the data:
[u'WARC/1.0',
u'WARC-Type: warcinfo',
u'WARC-Date: 2016-12-08T13:00:23Z',
u'WARC-Record-ID: <urn:uuid:f609f246-df68-46ef-a1c5-2f66e833ffd6>',
u'Content-Length: 344',
u'Content-Type: application/warc-fields',
u'WARC-Filename: CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz',
u'',
u'robots: classic',
u'hostname: ip-10-31-129-80.ec2.internal',
u'software: Nutch 1.6 (CC)/CC WarcExport 1.0',
u'isPartOf: CC-MAIN-2016-50',
u'operator: CommonCrawl Admin',
u'description: Wide crawl of the web for November 2016',
u'publisher: CommonCrawl',
u'format: WARC File Format 1.0',
u'conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf',
u'',
u'',
u'WARC/1.0',
u'WARC-Type: request',
u'WARC-Date: 2016-12-02T17:54:09Z',
u'WARC-Record-ID: <urn:uuid:cc7ddf8b-4646-4440-a70a-e253818cf10b>',
u'Content-Length: 220',
u'Content-Type: application/http; msgtype=request',
u'WARC-Warcinfo-ID: <urn:uuid:f609f246-df68-46ef-a1c5-2f66e833ffd6>',
u'WARC-IP-Address: 217.197.115.133',
u'WARC-Target-URI: http://1018201.vkrugudruzei.ru/blog/',
u'',
u'GET /blog/ HTTP/1.0',
u'Host: 1018201.vkrugudruzei.ru',
u'Accept-Encoding: x-gzip, gzip, deflate',
u'User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)',
u'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
u'',
u'',
u'',
u'WARC/1.0',
u'WARC-Type: response',
u'WARC-Date: 2016-12-02T17:54:09Z',
u'WARC-Record-ID: <urn:uuid:4c5e6d1a-e64f-4b6e-8101-c5e46feb84a0>',
u'Content-Length: 577',
u'Content-Type: application/http; msgtype=response',
u'WARC-Warcinfo-ID: <urn:uuid:f609f246-df68-46ef-a1c5-2f66e833ffd6>',
u'WARC-Concurrent-To: <urn:uuid:cc7ddf8b-4646-4440-a70a-e253818cf10b>',
u'WARC-IP-Address: 217.197.115.133',
u'WARC-Target-URI: http://1018201.vkrugudruzei.ru/blog/',
u'WARC-Payload-Digest: sha1:Y4TZFLB6UTXHU4HUVONBXC5NZQW2LYMM',
u'WARC-Block-Digest: sha1:3J7HHBMWTSC7W53DDB7BHTUVPM26QS4B',
u'']
I want to convert it to something like:
{warc-type='request', warc-date='2016-12-02',
warc-record-id='<urn:uuid:cc7ddf8b-4646-4440-a70a-e253818cf10b....}
In plain Python I would simply set a flag and read the file line by line
(that is, build a small state machine). I don't see how to do this in
Spark, though.
Thanks
Henry
--
Paul Henry Tremblay
Robert Half Technology
Re: Turning rows into columns
Posted by Paul Tremblay <pa...@gmail.com>.
Yes, that's what I need. Thanks.
P.
On 02/05/2017 12:17 PM, Koert Kuipers wrote:
> since there is no key to group by and assemble records i would suggest
> to write this in RDD land and then convert to data frame. you can use
> sc.wholeTextFiles to process text files and create a state machine
Re: Turning rows into columns
Posted by Koert Kuipers <ko...@tresata.com>.
Since there is no key to group by and assemble records, I would suggest
writing this in RDD land and then converting to a DataFrame. You can use
sc.wholeTextFiles to read each text file whole and run a state machine
over its contents.
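A minimal sketch of that suggestion, not code from this thread. It assumes
that each record starts with a "WARC/" version line, that its headers end at
the first blank line, and that no payload line begins with "WARC/" (a more
robust parser would honour Content-Length). The names parse_warc_headers,
COLUMNS, to_row and warc_headers_dataframe are hypothetical, as is the choice
of columns.

```python
# Hypothetical column selection; pick whichever WARC headers you need.
COLUMNS = ["warc-type", "warc-date", "warc-record-id", "warc-target-uri"]


def parse_warc_headers(text):
    """Yield one dict of lower-cased WARC header fields per record in text."""
    record = None  # None: skipping payload lines; dict: collecting headers
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("WARC/"):
            # A version line opens a new record (assumption: payload lines
            # never start with "WARC/").
            if record:
                yield record
            record = {}
        elif record is not None:
            if not line:
                # The first blank line closes the header block.
                if record:
                    yield record
                record = None
            else:
                name, sep, value = line.partition(":")
                if sep:
                    record[name.strip().lower()] = value.strip()
    if record:
        yield record  # file ended without a closing blank line


def to_row(record):
    """Project a header dict onto COLUMNS, using None for missing fields."""
    return tuple(record.get(column) for column in COLUMNS)


def warc_headers_dataframe(spark, path):
    """Build a DataFrame of WARC headers; needs pyspark and a SparkSession."""
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType(
        [StructField(c.replace("-", "_"), StringType(), True) for c in COLUMNS]
    )
    rows = (
        spark.sparkContext.wholeTextFiles(path)  # RDD of (path, content)
        .flatMap(lambda pair: parse_warc_headers(pair[1]))
        .map(to_row)
    )
    return spark.createDataFrame(rows, schema)
```

With a live SparkSession named spark (an assumption), calling
warc_headers_dataframe(spark, path) should give one row per WARC record.
Note that wholeTextFiles loads each file into memory as a single string, so
this suits many moderately sized files better than a few very large ones.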