Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2018/03/13 09:36:22 UTC

EDI (Electronic Data Interchange) parser on Spark

Hi,

Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON
using Spark?

Thanks,
Aakash.

Re: EDI (Electronic Data Interchange) parser on Spark

Posted by Kurt Fehlhauer <kf...@gmail.com>.
If no pre-built solution exists, writing your own would not be that
difficult. I suggest looking at a parser combinator library such as
FastParse.

http://www.lihaoyi.com/fastparse/

Regards,
Kurt

On Tue, Mar 13, 2018 at 7:47 AM Aakash Basu <aa...@gmail.com>
wrote:

> Thanks again for the detailed explanation; I would like to go through it.
>
> In my case, I have to parse large-scale .as2, .P3193, .edi, and .txt
> data, mapping it to the respective standards and then building JSON
> (so XML doesn't come into the picture), containing the following (small
> example of EDI):
>
> ISA*00*          *00*          *ZZ*D00XXX         *ZZ*00AA           *070305*1832*^*00501*676048320*0*P*\~
> GS*BE*D00XXX*00AA*20150305*1832*260007982*X*005010X220A1~
> ST*834*0001*005010X220A1~
> BGN*00*88880070301  00*20150305*181245****4~
> DTP*007*D8*20150301~
> N1*P5*PAYER 1*FI*999999999~
> N1*IN*KCMHSAS*FI*999999999~
> INS*Y*18*030*XN*A*C   **FT~
> REF*0F*00389999~
> REF*1L*000003409999~
> REF*3H*K129999A~
> DTP*356*D8*20150301~
> NM1*IL*1*DOE*JOHN*A***34*999999999~
> N3*777 ELM ST~
> N4*ALLEGAN*MI*49010**CY*03~
> DMG*D8*19670330*M**O~
> LUI***ESSPANISH~
> HD*030**AK*064703*IND~
> DTP*348*D8*20150301~
> AMT*P3*45.34~
> REF*17*E  1F~
> SE*20*0001~
> GE*1*260007982~
> IEA*1*676048320~
>
>
>
> Thanks,
> Aakash.
>
> On Tue, Mar 13, 2018 at 6:37 PM, Darin McBeath <dd...@yahoo.com>
> wrote:
>
>> I'm not familiar with EDI, but perhaps one option might be
>> spark-xml-utils (https://github.com/elsevierlabs-os/spark-xml-utils).
>> You could transform the XML to the XML format required by  the xml-to-json
>> function and then return the json.  Spark-xml-utils wraps the open source
>> Saxon project and supports XPath, XQuery, and XSLT.    Spark-xml-utils
>> doesn't parallelize the parsing of an individual document, but if you have
>> your documents split across a cluster, the processing can be parallelized.
>> We use this package extensively within our company to process millions of
>> XML records.  If you happen to be attending Spark summit in a few months,
>> someone will be presenting on this topic (
>> https://databricks.com/session/mining-the-worlds-science-large-scale-data-matching-and-integration-from-xml-corpora
>> ).
>>
>>
>> Below is a snippet for xquery.
>>
>> let $retval :=
>>      <map>
>>        <string key="doi">{$doi}</string>
>>        <string key="cid">{$cid}</string>
>>        <string key="pii">{$pii}</string>
>>        <string key="contentType">{$content-type}</string>
>>        <string key="srctitle">{$srctitle}</string>
>>        <string key="documentType">{$document-type}</string>
>>        <string key="documentSubtype">{$document-subtype}</string>
>>        <string key="publicationDate">{$publication-date}</string>
>>        <string key="articleTitle">{$article-title}</string>
>>        <string key="issn">{$issn}</string>
>>        <string key="isbn">{$isbn}</string>
>>        <string key="lang">{$lang}</string>
>>        {$tables}
>>      </map>
>>
>> return xml-to-json($retval)
>>
>>
>> Darin.
>>
>> On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu <
>> aakash.spark.raj@gmail.com> wrote:
>>
>>
>> Hi Jörn,
>>
>> Thanks for the quick revert. I have already built an EDI-to-JSON parser from
>> scratch using the 811 and 820 standard mapping documents. It can run on any
>> standard and for any type of EDI. But my build is in native Python and
>> doesn't leverage Spark's parallel processing, which I want to do for large
>> amounts of EDI data.
>>
>> Any pointers on that?
>>
>> Thanks,
>> Aakash.
>>
>> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jo...@gmail.com>
>> wrote:
>>
>> Maybe there are commercial ones. You could also use some of the open-source
>> parsers for XML.
>>
>> However, XML is very inefficient and you need to do a lot of tricks to
>> make it run in parallel. This also depends on the type of EDI message, etc.
>> Sophisticated unit testing and performance testing is key.
>>
>> Nevertheless, it is also not as difficult as I made it sound just now.
>>
>> > On 13. Mar 2018, at 10:36, Aakash Basu <aa...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON
>> using Spark?
>> >
>> > Thanks,
>> > Aakash.
>>
>>
>>
>

Re: EDI (Electronic Data Interchange) parser on Spark

Posted by Aakash Basu <aa...@gmail.com>.
Thanks again for the detailed explanation; I would like to go through it.

In my case, I have to parse large-scale .as2, .P3193, .edi, and .txt
data, mapping it to the respective standards and then building JSON (so
XML doesn't come into the picture), containing the following (small
example of EDI):

ISA*00*          *00*          *ZZ*D00XXX         *ZZ*00AA           *070305*1832*^*00501*676048320*0*P*\~
GS*BE*D00XXX*00AA*20150305*1832*260007982*X*005010X220A1~
ST*834*0001*005010X220A1~
BGN*00*88880070301  00*20150305*181245****4~
DTP*007*D8*20150301~
N1*P5*PAYER 1*FI*999999999~
N1*IN*KCMHSAS*FI*999999999~
INS*Y*18*030*XN*A*C   **FT~
REF*0F*00389999~
REF*1L*000003409999~
REF*3H*K129999A~
DTP*356*D8*20150301~
NM1*IL*1*DOE*JOHN*A***34*999999999~
N3*777 ELM ST~
N4*ALLEGAN*MI*49010**CY*03~
DMG*D8*19670330*M**O~
LUI***ESSPANISH~
HD*030**AK*064703*IND~
DTP*348*D8*20150301~
AMT*P3*45.34~
REF*17*E  1F~
SE*20*0001~
GE*1*260007982~
IEA*1*676048320~
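
For illustration, a minimal plain-Python sketch of splitting such an
interchange into segments and elements and dumping them as JSON. It assumes
'~' terminates segments and '*' separates elements, exactly as in the sample
above; a real parser should read the delimiters from the ISA header, and the
function and sample string below are only illustrative.

import json

def parse_interchange(edi_text):
    # Assumes '~' is the segment terminator and '*' the element separator,
    # as in the sample above; real X12 delimiters come from the ISA segment.
    segments = [s.strip() for s in edi_text.split("~") if s.strip()]
    parsed = []
    for seg in segments:
        parts = seg.split("*")
        parsed.append({"segment_id": parts[0], "elements": parts[1:]})
    return parsed

sample = "ST*834*0001*005010X220A1~REF*0F*00389999~SE*20*0001~"
print(json.dumps(parse_interchange(sample), indent=2))

The per-segment dictionaries would then still have to be grouped into the
loops defined by the 834/811/820 mapping documents, which is where the
standard-specific logic lives.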



Thanks,
Aakash.

On Tue, Mar 13, 2018 at 6:37 PM, Darin McBeath <dd...@yahoo.com> wrote:

> I'm not familiar with EDI, but perhaps one option might be spark-xml-utils
> (https://github.com/elsevierlabs-os/spark-xml-utils).  You could
> transform the XML to the XML format required by  the xml-to-json function
> and then return the json.  Spark-xml-utils wraps the open source Saxon
> project and supports XPath, XQuery, and XSLT.    Spark-xml-utils doesn't
> parallelize the parsing of an individual document, but if you have your
> documents split across a cluster, the processing can be parallelized.  We
> use this package extensively within our company to process millions of XML
> records.  If you happen to be attending Spark summit in a few months,
> someone will be presenting on this topic (https://databricks.com/
> session/mining-the-worlds-science-large-scale-data-
> matching-and-integration-from-xml-corpora).
>
>
> Below is a snippet for xquery.
>
> let $retval :=
>      <map>
>        <string key="doi">{$doi}</string>
>        <string key="cid">{$cid}</string>
>        <string key="pii">{$pii}</string>
>        <string key="contentType">{$content-type}</string>
>        <string key="srctitle">{$srctitle}</string>
>        <string key="documentType">{$document-type}</string>
>        <string key="documentSubtype">{$document-subtype}</string>
>        <string key="publicationDate">{$publication-date}</string>
>        <string key="articleTitle">{$article-title}</string>
>        <string key="issn">{$issn}</string>
>        <string key="isbn">{$isbn}</string>
>        <string key="lang">{$lang}</string>
>        {$tables}
>      </map>
>
> return xml-to-json($retval)
>
>
> Darin.
>
> On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu <
> aakash.spark.raj@gmail.com> wrote:
>
>
> Hi Jörn,
>
> Thanks for the quick revert. I have already built an EDI-to-JSON parser from
> scratch using the 811 and 820 standard mapping documents. It can run on any
> standard and for any type of EDI. But my build is in native Python and
> doesn't leverage Spark's parallel processing, which I want to do for large
> amounts of EDI data.
>
> Any pointers on that?
>
> Thanks,
> Aakash.
>
> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jo...@gmail.com> wrote:
>
> Maybe there are commercial ones. You could also use some of the open-source
> parsers for XML.
>
> However, XML is very inefficient and you need to do a lot of tricks to make
> it run in parallel. This also depends on the type of EDI message, etc.
> Sophisticated unit testing and performance testing is key.
>
> Nevertheless, it is also not as difficult as I made it sound just now.
>
> > On 13. Mar 2018, at 10:36, Aakash Basu <aa...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON
> using Spark?
> >
> > Thanks,
> > Aakash.
>
>
>

Re: EDI (Electronic Data Interchange) parser on Spark

Posted by Darin McBeath <dd...@yahoo.com.INVALID>.
I'm not familiar with EDI, but perhaps one option might be spark-xml-utils
(https://github.com/elsevierlabs-os/spark-xml-utils). You could transform the
XML to the XML format required by the xml-to-json function and then return the
JSON. Spark-xml-utils wraps the open source Saxon project and supports XPath,
XQuery, and XSLT. Spark-xml-utils doesn't parallelize the parsing of an
individual document, but if you have your documents split across a cluster,
the processing can be parallelized. We use this package extensively within our
company to process millions of XML records. If you happen to be attending
Spark Summit in a few months, someone will be presenting on this topic
(https://databricks.com/session/mining-the-worlds-science-large-scale-data-matching-and-integration-from-xml-corpora).

Below is a snippet for XQuery.

let $retval :=
     <map>
       <string key="doi">{$doi}</string>
       <string key="cid">{$cid}</string>
       <string key="pii">{$pii}</string>
       <string key="contentType">{$content-type}</string>
       <string key="srctitle">{$srctitle}</string>
       <string key="documentType">{$document-type}</string>
       <string key="documentSubtype">{$document-subtype}</string>
       <string key="publicationDate">{$publication-date}</string>
       <string key="articleTitle">{$article-title}</string>
       <string key="issn">{$issn}</string>
       <string key="isbn">{$isbn}</string>
       <string key="lang">{$lang}</string>
       {$tables}
     </map>

return xml-to-json($retval)

Darin.
On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu <aa...@gmail.com> wrote:

Hi Jörn,

Thanks for the quick revert. I have already built an EDI-to-JSON parser from scratch using the 811 and 820 standard mapping documents. It can run on any standard and for any type of EDI. But my build is in native Python and doesn't leverage Spark's parallel processing, which I want to do for large amounts of EDI data.

Any pointers on that?

Thanks,
Aakash.

On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jo...@gmail.com> wrote:

Maybe there are commercial ones. You could also use some of the open-source parsers for XML.

However, XML is very inefficient and you need to do a lot of tricks to make it run in parallel. This also depends on the type of EDI message, etc. Sophisticated unit testing and performance testing is key.

Nevertheless, it is also not as difficult as I made it sound just now.

> On 13. Mar 2018, at 10:36, Aakash Basu <aa...@gmail.com> wrote:
>
> Hi,
>
> Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON using Spark?
>
> Thanks,
> Aakash.


  

Re: EDI (Electronic Data Interchange) parser on Spark

Posted by Jörn Franke <jo...@gmail.com>.
Ah sorry, I thought you were using EDI XML.

Then you would need to build your own Spark data source. Depending on the number of different message types, this will be more or less effort.

I am not aware of any commercial or open source solution for it.
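
Short of a full custom data source, a lighter-weight sketch (not a data source
implementation, and with an illustrative rather than standard-derived schema)
is to parse the segments in a flatMap and expose them as a DataFrame with an
explicit schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("edi-dataframe").getOrCreate()

# Illustrative schema: one row per segment, keyed by the source file.
schema = StructType([
    StructField("source", StringType(), False),
    StructField("segment_id", StringType(), False),
    StructField("elements", ArrayType(StringType()), True),
])

def segments(path, edi_text):
    # Assumes '~' segment terminator and '*' element separator.
    for seg in (s.strip() for s in edi_text.split("~")):
        if seg:
            parts = seg.split("*")
            yield (path, parts[0], parts[1:])

raw = spark.sparkContext.wholeTextFiles("hdfs:///data/edi/*.edi")  # illustrative path
df = spark.createDataFrame(raw.flatMap(lambda kv: segments(kv[0], kv[1])), schema)
df.show(truncate=False)

A proper data source would put the same parsing behind spark.read.format(...),
with options and schemas for the different message types, which is where the
per-message-type effort comes in.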

> On 13. Mar 2018, at 13:52, Aakash Basu <aa...@gmail.com> wrote:
> 
> Hi Jörn,
> 
> Thanks for the quick revert. I have already built an EDI-to-JSON parser from scratch using the 811 and 820 standard mapping documents. It can run on any standard and for any type of EDI. But my build is in native Python and doesn't leverage Spark's parallel processing, which I want to do for large amounts of EDI data.
> 
> Any pointers on that?
> 
> Thanks,
> Aakash.
> 
>> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jo...@gmail.com> wrote:
>> Maybe there are commercial ones. You could also use some of the open-source parsers for XML.
>> 
>> However, XML is very inefficient and you need to do a lot of tricks to make it run in parallel. This also depends on the type of EDI message, etc. Sophisticated unit testing and performance testing is key.
>> 
>> Nevertheless, it is also not as difficult as I made it sound just now.
>> 
>> > On 13. Mar 2018, at 10:36, Aakash Basu <aa...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON using Spark?
>> >
>> > Thanks,
>> > Aakash.
> 

Re: EDI (Electronic Data Interchange) parser on Spark

Posted by Aakash Basu <aa...@gmail.com>.
Hi Jörn,

Thanks for the quick revert. I have already built an EDI-to-JSON parser from
scratch using the 811 and 820 standard mapping documents. It can run on any
standard and for any type of EDI. But my build is in native Python and
doesn't leverage Spark's parallel processing, which I want to do for large
amounts of EDI data.

Any pointers on that?
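
For concreteness, a minimal sketch of the kind of thing being asked for here,
assuming the existing parser is importable on the executors as a function
parse_edi_to_json(text) (a hypothetical name standing in for the native Python
build described above): distribute whole files with wholeTextFiles and call
the parser once per interchange.

from pyspark.sql import SparkSession

# Hypothetical module/function names standing in for the existing parser.
from my_edi_parser import parse_edi_to_json

spark = SparkSession.builder.appName("edi-to-json").getOrCreate()
sc = spark.sparkContext

# One EDI interchange per file; wholeTextFiles yields (path, content) pairs
# and keeps each file on a single executor, so segments are never split.
files = sc.wholeTextFiles("hdfs:///data/edi/*.edi")   # path is illustrative
json_docs = files.map(lambda kv: parse_edi_to_json(kv[1]))

json_docs.saveAsTextFile("hdfs:///data/edi_json")     # one JSON document per record

The parser itself stays untouched; Spark only supplies the distribution, so
the parsing code just has to be importable on every executor (shipped with
--py-files or installed on the cluster).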

Thanks,
Aakash.

On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jo...@gmail.com> wrote:

> Maybe there are commercial ones. You could also use some of the open-source
> parsers for XML.
>
> However, XML is very inefficient and you need to do a lot of tricks to make
> it run in parallel. This also depends on the type of EDI message, etc.
> Sophisticated unit testing and performance testing is key.
>
> Nevertheless, it is also not as difficult as I made it sound just now.
>
> > On 13. Mar 2018, at 10:36, Aakash Basu <aa...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON
> using Spark?
> >
> > Thanks,
> > Aakash.
>

Re: EDI (Electronic Data Interchange) parser on Spark

Posted by Jörn Franke <jo...@gmail.com>.
Maybe there are commercial ones. You could also use some of the open-source parsers for XML.

However, XML is very inefficient and you need to do a lot of tricks to make it run in parallel. This also depends on the type of EDI message, etc. Sophisticated unit testing and performance testing is key.

Nevertheless, it is also not as difficult as I made it sound just now.
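
If the XML route is taken, one open-source option (an assumption about fit,
not something verified against EDI-derived XML) is the spark-xml package,
which parallelizes by splitting the input on a repeating row element; a
hedged sketch, with a hypothetical <Transaction> row tag:

from pyspark.sql import SparkSession

# Requires the open-source spark-xml package (com.databricks:spark-xml)
# on the classpath, e.g. via spark-submit --packages.
spark = SparkSession.builder.appName("edi-xml").getOrCreate()

# 'Transaction' is a hypothetical repeating element in the EDI-derived XML.
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Transaction")
      .load("hdfs:///data/edi_xml/"))        # illustrative path

df.printSchema()
df.write.mode("overwrite").json("hdfs:///data/edi_json/")  # illustrative path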

> On 13. Mar 2018, at 10:36, Aakash Basu <aa...@gmail.com> wrote:
> 
> Hi,
> 
> Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON using Spark?
> 
> Thanks,
> Aakash.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org