Posted to dev@flink.apache.org by JOHN MILLER <jm...@gmail.com> on 2015/12/04 16:39:06 UTC

Fwd: DATA TRANSFORMATION PROBLEM

---------- Forwarded message ----------
From: JOHN MILLER <jm...@gmail.com>
Date: Fri, Dec 4, 2015 at 10:24 AM
Subject: DATA TRANSFORMATION PROBLEM
To: info@data-artisans.com




*Greetings*


*I am writing to ask for an approach to a data transformation problem.
I want to build a new dataset in a format that allows processing to
continue instead of failing. The dataset I want to convert is a series of
WARC files (currently read in as text; examples are attached):
CC-MAIN-TEXT-20130516092621-00003-ip-10-60-113-...
<https://drive.google.com/file/d/0B5QdPKF22EFxMDlTX3BzSW9uTDg/view?usp=drive_web>
I am trying to parse out the field names and values and write them to a
new dataset, which would then be converted to CSV or TSV.*











*The fields in question:
Header: {WARC-Type=warcinfo,
WARC-Filename=CC-MAIN-TEXT-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz,
reader-identifier=/home/jmill383/wdcdemobucket/CC-MAIN-TEXT-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz,
WARC-Date=2013-11-22T14:51:12Z, absolute-offset=0, Content-Length=372,
WARC-Record-ID=<urn:uuid:efdf19de-e663-4747-8a98-754bd224520f>,
Content-Type=application/warc-fields}
URL: null*
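For illustration only, here is a minimal Scala sketch of the kind of parsing described above: it pulls the key=value pairs out of one "Header: {...}" line and writes the chosen fields as a TSV row. It rests on several assumptions that the message does not confirm: that each header arrives as a single line of text, that fields are separated by ", ", and that no value contains ", ". The names WarcHeaderToTsv, parseHeader and toTsvRow are hypothetical, and the sample line is a shortened version of the header shown above.

```scala
object WarcHeaderToTsv {
  // Captures the text between "Header: {" and the last "}" on the line.
  private val HeaderPattern = """Header:\s*\{(.*)\}""".r.unanchored

  /** Parse "Header: {k1=v1, k2=v2, ...}" into key/value pairs, in order.
    * Assumes fields are separated by ", " and values contain no ", ".
    * Lines that do not contain a header yield an empty result instead of
    * throwing, so one bad line does not stop the whole run.
    */
  def parseHeader(line: String): Seq[(String, String)] = line match {
    case HeaderPattern(body) =>
      body.split(", ").toSeq.flatMap { field =>
        // Split on the first "=" only, so values may themselves contain "=".
        field.split("=", 2) match {
          case Array(k, v) => Some(k.trim -> v.trim)
          case _           => None // fragment without "=": skip it
        }
      }
    case _ => Seq.empty
  }

  /** Build one TSV row for the requested columns; missing fields become "". */
  def toTsvRow(fields: Seq[(String, String)], columns: Seq[String]): String = {
    val lookup = fields.toMap
    columns.map(c => lookup.getOrElse(c, "")).mkString("\t")
  }

  def main(args: Array[String]): Unit = {
    // Shortened sample of the header above, joined onto a single line.
    val sample =
      "Header: {WARC-Type=warcinfo, WARC-Date=2013-11-22T14:51:12Z, " +
        "absolute-offset=0, Content-Length=372, " +
        "Content-Type=application/warc-fields}URL: null"
    val columns = Seq("WARC-Type", "WARC-Date", "Content-Length", "Content-Type")

    println(columns.mkString("\t"))                 // TSV header row
    println(toTsvRow(parseHeader(sample), columns)) // TSV data row
  }
}
```

Because parseHeader is a plain function from a line of text to a list of pairs, it could in principle be applied per line inside a Flink or Scalding map step. How the WARC records are actually read in is left open here.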


*Please advise if you can assist with an approach to resolve this problem.
I am using Apache Flink, Scalding, and Scala. I haven't been able to get
very far yet.*

*John M*