Posted to dev@airavata.apache.org by Lahiru Jayathilake <la...@cse.mrt.ac.lk> on 2018/06/22 04:37:53 UTC

Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Hi Everyone,

In the last couple of days, I've been working on the data parsing tasks. To
give an update, I have already converted the code-base of the Gaussian,
Molpro, NWChem, and Gamess parsers to Python [1]. Compared to the
seagrid-data code-base, there is no experiment-specific code in the project
(for example, no JSON mappings). The main reason for this is to decouple the
experiments from the data parsing tasks.

While converting the Gaussian, Molpro, NWChem, and Gamess code I found that
some JSON key-value pairs in the data-catalog Docker container are not used
by seagrid-data to generate the final output file. I have commented out the
unused key-value pairs in the code itself [2], [3], [4], [5]. I would like
to know whether there is any specific reason for this; I hope @Supun
Nakandala <https://plus.google.com/u/1/103731766138074233701?prsrc=4> can
answer it.

The next update is about the data parsing architecture.
The new requirement is to come up with a framework that is capable of
parsing any kind of document into a known type when the metadata is given.
With this new design, data parsing will not be restricted to
experiments (Gaussian, Molpro, etc.).

The following architecture is designed according to the requirements
specified by @dimuthu in the last GSoC meeting.

The following diagram depicts the top level architecture.


[image: suggested architecture.png]
Following are the key components.

*Abstract Parser *

This is a basic template for a Parser; it specifies the parameters required
for a parsing task, for example, the input file type, output file type,
experiment type (if the task is related to an experiment), etc.
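
To make this concrete, here is a rough Java sketch of such a template (the
class and method names are made up for illustration and are not part of the
current code-base):

    // Illustrative sketch only; names are hypothetical.
    public abstract class AbstractParser {

        private final String inputFileType;   // e.g. "pdf", "out"
        private final String outputFileType;  // e.g. "json", "xml"
        private final String experimentType;  // e.g. "gaussian", or null if not experiment related

        protected AbstractParser(String inputFileType, String outputFileType, String experimentType) {
            this.inputFileType = inputFileType;
            this.outputFileType = outputFileType;
            this.experimentType = experimentType;
        }

        public String getInputFileType()  { return inputFileType; }
        public String getOutputFileType() { return outputFileType; }
        public String getExperimentType() { return experimentType; }

        // Concrete parsers implement the actual conversion from inputFile to outputFile.
        public abstract void parse(java.nio.file.Path inputFile, java.nio.file.Path outputFile) throws Exception;
    }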


*Parser Manager*

Constructs the set of parsers considering the input file type, output file
type, and the experiment type.
The Parser Manager will construct a graph to find the shortest path between
the input file type and the output file type, and will then return the
constructed set of Parsers.
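
A minimal sketch of the shortest-path idea, assuming an unweighted graph
whose vertices are file types and whose edges are the parsers registered in
the catalog (experiment type is omitted for brevity; names are illustrative):

    import java.util.*;

    // Illustrative sketch: BFS over file types to find the shortest parser chain.
    public class ParserManager {

        // inputFileType -> parsers that accept that type (filled from the catalog)
        private final Map<String, List<AbstractParser>> parsersByInput = new HashMap<>();

        public void register(AbstractParser parser) {
            parsersByInput.computeIfAbsent(parser.getInputFileType(), k -> new ArrayList<>()).add(parser);
        }

        public List<AbstractParser> resolve(String inputType, String outputType) {
            Map<String, AbstractParser> reachedVia = new HashMap<>(); // fileType -> parser used to reach it
            Map<String, String> previous = new HashMap<>();
            Set<String> visited = new HashSet<>(Collections.singleton(inputType));
            Deque<String> queue = new ArrayDeque<>(Collections.singleton(inputType));

            while (!queue.isEmpty()) {
                String current = queue.poll();
                if (current.equals(outputType)) {
                    // Found the target type: walk back to build the ordered parser chain.
                    LinkedList<AbstractParser> chain = new LinkedList<>();
                    for (String t = current; reachedVia.containsKey(t); t = previous.get(t)) {
                        chain.addFirst(reachedVia.get(t));
                    }
                    return chain;
                }
                for (AbstractParser p : parsersByInput.getOrDefault(current, Collections.emptyList())) {
                    if (visited.add(p.getOutputFileType())) {
                        reachedVia.put(p.getOutputFileType(), p);
                        previous.put(p.getOutputFileType(), current);
                        queue.add(p.getOutputFileType());
                    }
                }
            }
            throw new IllegalStateException("No parser path from " + inputType + " to " + outputType);
        }
    }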


[image: graph.png]

*Catalog *

A mapping whose records identify the Docker container that can be used to
parse one file type into another. For example, if the requirement is to
parse a *Gaussian .out file to JSON*, then the *"app/gaussian .out to JSON"*
Docker container will be fetched.


*Parsers*

There are two types of parsers (according to the suggested approach):


The first type is parsers that are coded directly into the project
code-base. For example, parsing a text file to JSON is straightforward, so
it is not necessary to maintain a separate Docker container for it; using a
library and putting an entry into the catalog is enough to get the work
done.

The second type is parsers that run as a separate Docker container, for
example the Gaussian .out file to JSON Docker container.
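
For instance, a first-type parser could be as small as the following sketch
(assuming the hypothetical AbstractParser above and a JSON library such as
Jackson; the class is illustrative, not part of the actual code-base):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Hypothetical first-type parser: coded directly in the project, no Docker container.
    // It simply turns the lines of a text file into a JSON array using a library.
    public class TextToJsonParser extends AbstractParser {

        public TextToJsonParser() {
            super("txt", "json", null); // no experiment type
        }

        @Override
        public void parse(Path inputFile, Path outputFile) throws Exception {
            List<String> lines = Files.readAllLines(inputFile);
            new ObjectMapper().writeValue(outputFile.toFile(), lines);
        }
    }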


For the overall scenario, consider the following examples to get an idea.

*Example 1*
Suppose a PDF should be parsed to XML.
The Parser Manager will look up the catalog and find the shortest path to
get the XML output from the PDF. The available parsers (both the coded
parsers in the project and the dockerized parsers) are:

• PDF to Text
• Text to JSON
• JSON to XML
• application/gaussian .out to JSON (this is a very specific parsing
mechanism, not similar to parsing a simple .out file to JSON)

and the rest, which I have included in the diagram.

Then Parser Manager will construct the graph and find the shortest path as
*PDF -> Text -> JSON -> XML* from the available parsers.


Then the Parser Manager will return three Parsers, and from those three
parsers a DAG will be constructed as follows,


[image: parser dag.png]
The reason for the architectural decision to use three parsers, rather than
doing everything in a single parser, is that if one of the parsers fails it
is easy to identify which one it was.
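
Executing such a DAG then boils down to feeding each parser's output file
into the next parser, with each step running as its own task so a failure
can be pinned to a single parser. A rough sketch using the hypothetical
classes above:

    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    // Illustrative sketch: run the resolved PDF -> Text -> JSON -> XML chain stage by stage.
    public class ChainRunner {
        public static void runChain(ParserManager parserManager, Path pdfInput) throws Exception {
            List<AbstractParser> chain = parserManager.resolve("pdf", "xml");
            Path current = pdfInput;
            for (AbstractParser parser : chain) {
                Path next = Paths.get("stage." + parser.getOutputFileType());
                parser.parse(current, next); // in the real framework each stage would be a separate DAG task
                current = next;
            }
        }
    }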

*Example 2*
Consider a separate example of parsing a Gaussian *.out* file to *JSON*;
this is pretty straightforward. As in the aforementioned example, the Parser
Manager will construct a Parser linking the dockerized *app/gaussian .out to
JSON* container.

*Example 3*
The problem arises when a Gaussian *.out* file needs to be parsed to *XML*.
There are two options.

*1st option* - If application-related parsing should happen, there must be
application-typed parsers to get the work done; if there are none, it is not
allowed. In the list of parsers, there is no application-related parser to
convert a *.out* file to *XML*. In this case, even though the Parser Manager
could construct a path like
*.out/gaussian -> JSON/gaussian -> XML*, this process is not allowed.

*2nd option* - Once the application-specific content has been parsed, the
rest is the same as converting a normal JSON to XML, assuming that we allow
the path
*.out/gaussian -> JSON/gaussian -> XML*.
Which should it be, the 1st option or the 2nd option? This is one point
where I need a suggestion.

I would really appreciate any suggestions to improve this.

[1] https://github.com/Lahiru-J/airavata-data-parser
[2] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288
[3]
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175

[4]
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
[5]
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175

Cheers,

On 28 May 2018 at 18:05, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
wrote:

> Note this is the High-level architecture diagram. (Since it was not
> visible in the previous email)
>
>
> ​Thanks,
> Lahiru
>
> On 28 May 2018 at 18:02, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
> wrote:
>
>> Hi Everyone,
>>
>> During the past few days, I’ve been implementing the tasks which are
>> related to the Data Parsing. To give a heads up, the following image
>> depicts the top level architecture of the implementation.
>>
>>
>> ​
>> The following are the main task components that have been identified:
>>
>> *1. DataParsing Task*
>>
>> This task will get the stored output, find the matching Parser (Gaussian,
>> Lammps, QChem, etc.), and send the output through the selected parser to
>> get a well-structured JSON.
>>
>>
>> *2. Validating Task*
>>
>> This is to validate whether the desired JSON output has been achieved or
>> not, that is, the JSON output should match the respective schema (Gaussian
>> Schema, Lammps Schema, QChem Schema, etc.).
>>
>>
>> *3. Persisting Task*
>>
>> This task will persist the validated JSON outputs
>>
>> The successfully stored outputs will be exposed to the outer world.
>>
>>
>> According to the diagram, the generated JSON should be shared between the
>> tasks (DataParsing, Validating, and Persisting). Neither the DataParsing
>> task nor the Validating task persists the JSON; therefore, the Helix task
>> framework should make sure the content is shared between the tasks.
>>
>> This Helix tutorial [1] describes how to share content between Helix
>> tasks. The problem is that the given method [2] is only capable of sharing
>> String-typed key-value data.
>> However, I can come up with an implementation to share all the values
>> related to the JSON output. That involves calling this method [2] many
>> times. I believe that is not very efficient because the Helix task
>> framework then has to call this method [3] many times (taking into
>> consideration that the generated JSON output can be large).
>>
>> I have already sent an email to the Helix mailing list to clarify whether
>> there is another way, and also whether it will be efficient if this method
>> [2] is called multiple times to get the work done.
>>
>> Am I on the right track? Your suggestions would be very helpful and
>> please add if anything is missing.
>>
>>
>> [1] http://helix.apache.org/0.8.0-docs/tutorial_task_framewo
>> rk.html#Share_Content_Across_Tasks_and_Jobs
>> [2] https://github.com/apache/helix/blob/helix-0.6.x/helix-c
>> ore/src/main/java/org/apache/helix/task/UserContentStore.java#L75
>> [3] https://github.com/apache/helix/blob/helix-0.6.x/helix-c
>> ore/src/main/java/org/apache/helix/task/TaskUtil.java#L361
>>
>> Thanks,
>> Lahiru
>>
>> On 26 March 2018 at 19:44, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>> wrote:
>>
>>> Hi Dimuthu, Suresh,
>>>
>>> Thanks a lot for the feedback. I will update the proposal accordingly.
>>>
>>> Regards,
>>> Lahiru
>>>
>>> On 26 March 2018 at 08:48, Suresh Marru <sm...@apache.org> wrote:
>>>
>>>> Hi Lahiru,
>>>>
>>>> I echo Dimuthu’s comment. You have a good starting point, it will be
>>>> nice if you can cover how users can interact with the parsed data.
>>>> Essentially adding API access to the parsed metadata database and having
>>>> proof of concept UI’s. This task could be challenging as the queries are
>>>> very data specific and generalizing API access and building custom UI’s can
>>>> be explanatory (less  defined) portions of your proposal.
>>>>
>>>> Cheers,
>>>> Suresh
>>>>
>>>>
>>>> On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <
>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>
>>>> Hi Lahiru,
>>>>
>>>> Nice document. And I like how you illustrate the systems through
>>>> diagrams. However try to address how you are going to expose parsed data to
>>>> outside through thrift APIs and how to design those data APIs in
>>>> application specific manner. And in the persisting task, you have to make
>>>> sure data integrity is preserved. For example in a Gaussian parsed output,
>>>> you might have to validate the parsed output using a schema before
>>>> persisting them in the database.
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>> On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <
>>>> lahiruj.14@cse.mrt.ac.lk> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718
>>>>> [2]. Any comments would be very helpful to improve it.
>>>>>
>>>>> [1] https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJ
>>>>> JNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
>>>>> [2] https://issues.apache.org/jira/browse/AIRAVATA-2718
>>>>>
>>>>> Thanks & Regards,
>>>>> --
>>>>> Lahiru Jayathilake
>>>>> Department of Computer Science and Engineering,
>>>>> Faculty of Engineering,
>>>>> University of Moratuwa
>>>>>
>>>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Lahiru Jayathilake
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering,
>>> University of Moratuwa
>>>
>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>
>>
>>
>>
>> --
>> Lahiru Jayathilake
>> Department of Computer Science and Engineering,
>> Faculty of Engineering,
>> University of Moratuwa
>>
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>
>
>
> --
> Lahiru Jayathilake
> Department of Computer Science and Engineering,
> Faculty of Engineering,
> University of Moratuwa
>
> <https://lk.linkedin.com/in/lahirujayathilake>
>



-- 
Lahiru Jayathilake
Department of Computer Science and Engineering,
Faculty of Engineering,
University of Moratuwa

<https://lk.linkedin.com/in/lahirujayathilake>

Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Posted by Lahiru Jayathilake <la...@cse.mrt.ac.lk>.
Hi Everyone,

This is to give an idea of what the interfaces of the UI-based approach for
the Data Parsing Framework look like.

Basically, parsers can be dragged and dropped and then connected to form a
parser DAG depending on the requirement. There are mainly two ways a parser
can be added to the working area.

A user can import parsers using a JSON catalog file. For a particular
parser, there is a set of required parameters, and the JSON keys
corresponding to those parameters should match the keys in the user-defined
catalog file. For example, the following image depicts an example catalog
file.

[image: Screen Shot 2018-08-31 at 10.06.54 PM.png]
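
In case the screenshot is not visible in the plain-text archive, a made-up
entry with the same shape, and a sketch of how the UI could read it, might
look like this (all keys and values are illustrative; the real file uses
whatever keys the user declared):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: a single made-up catalog entry and how it could be loaded.
    public class CatalogImportSketch {
        public static void main(String[] args) throws Exception {
            String catalogJson = "[{"
                    + "\"id\": \"gaussian-out-to-json\","
                    + "\"dockerImageName\": \"app/gaussian-out-to-json\","
                    + "\"inputFileExtension\": \".out\","
                    + "\"outputFileExtension\": \".json\","
                    + "\"applicationType\": \"gaussian\","
                    + "\"dockerWorkingDirPath\": \"/data\""
                    + "}]";
            List<Map<String, Object>> parsers = new ObjectMapper().readValue(catalogJson, List.class);
            System.out.println(parsers.get(0).get("dockerImageName")); // a parameter the UI would display
        }
    }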

Once the parser catalog file is imported, the set of parsers will appear on
the left side of the window, as the following image illustrates.
[image: Screen Shot 2018-08-31 at 1.56.43 PM.png]

Then a user can drag and drop parsers to make a flow. The following diagram
shows parsers dragged into the working area from the imported set of
parsers.
[image: Screen Shot 2018-08-31 at 2.00.26 PM.png]

The following image shows how a new parser can be configured. (Note that
this parser is not one of the parsers imported from the catalog file)
[image: Screen Shot 2018-08-31 at 2.12.18 PM.png]
After defining the DAG, the user can generate the configuration file, which
can then be used in the Airavata Data Parsing Framework.

I would really appreciate any suggestions to improve this.

Cheers!
Lahiru

On Sun, 12 Aug 2018 at 02:13, Supun Nakandala <su...@gmail.com>
wrote:

> Hi Devs,
>
> Sorry for joining in late.
>
> Regarding the challenges that Lahiru mentioned, I think it is a question
> of whether to use configurations or conventions. Personally, I like
> following a convention based approach in this context, as more and more
> configurations will make the system more cumbersome for the users. But I
> agree that it has its downsides too.
>
> But as Lahiru mentioned, I think using a UI based approach will be a
> better approach. It helps to shield the complexities of the configurations
> and provide an intuitive interface for the users.
>
> Overall, I feel the task of output data parsing perfectly aligns with new
> Airavata architecture on distributed task execution. Maybe we should
> brainstorm what it would take to incorporate these parsers into the
> application catalog (extend or abstract out a generic catalog). If we can
> incorporate without making it overly complicated, I feel that it will be a
> good direction to follow up.
>
> On Sat, Aug 11, 2018 at 12:50 PM Lahiru Jayathilake <
> lahiruj.14@cse.mrt.ac.lk> wrote:
>
>> Hi Everyone,
>>
>> First of all Suresh, Marlon, and Dimuthu thanks for the suggestions and
>> comments.
>>
>> Yes, Suresh, we can include QC_JSON_Schema [1] in airavata-data-parser [2].
>> However, there is a challenge with using InterMol [3]: as you mentioned,
>> its dependency ParmEd [4] is LGPL, and according to the Apache Legal page,
>> LGPL is not to be used.
>> About the Data Parsing Project, yes, I did look into whether I can use
>> Apache Tika, and there are some challenges in making a generic framework.
>> I will discuss them in detail later in this same thread.
>>
>> *This is an update about what I have accomplished.*
>>
>> I have created a separate parser project [2] for Gaussian, Gamess,
>> Molpro, and NwChem. One advantage of separating the Gaussian, Molpro, etc.
>> code from the core is that it keeps the project highly cohesive and loosely
>> coupled, which makes it far easier to maintain.
>>
>> Next regarding the Data Parsing Framework.
>>
>> As I mentioned in my previous email, I have implemented the Data Parsing
>> Framework with some additional features. The method of achieving the goal
>> had to be changed slightly in order to accommodate some features. I will
>> start from the bottom.
>>
>> Here is the scenario, A user has to define what are the Catalog Entries.
>> A Catalog Entry is nothing more than basic key-value properties of a
>> Dockerized parser. The following image shows an example of how I have
>> defined it.
>>
>>
>> The above image shows the entry corresponding to the Dockerized Gaussian
>> Parser. There are both mandatory and optional properties. For example,
>> *dockerImageName, inputFileExtension, dockerWorkingDirPath* must be stated,
>> whereas properties like *securityOpt* and *envVariables* are optional. Some
>> of the properties are needed to run the Docker container.
>>
>> There are two special properties called *applicationType* and *operation*.
>> *applicationType* states whether the Docker container is for parsing
>> Gaussian, Molpro, NwChem, or Gamess files. The *operation* property
>> indicates that the parser can perform some operation on the file, for
>> example, converting all the text characters to lower/upper case, removing
>> the last *n* lines, appending some text... you get the point. A Dockerized
>> Parser cannot have both an application and an operation; it should have
>> either an application, or an operation, or neither. (This is a design
>> decision.)
>>
>> For the time being catalog file is a JSON file which will be picked by
>> the DataParsing Framework according to the user given file path.
>> For the further explanation consider the following set of parsers. Note
>> that I have only mentioned the most essential properties just to explain
>> the example scenarios.
>>
>>
>>
>> Once the user has defined the catalog entries then DataParsing Framework
>> expects a Parser Request to parse a given file. Consider the user has given
>> the following set of Parser Requests.
>>
>>
>>
>> Once the above steps have been completed, the baton is in the hands of the
>> DataParsing Framework.
>>
>> At runtime the Data Parsing Framework picks up the Catalog File and makes
>> a directed graph G(V,E) using the indicated parsers. I have already given a
>> detailed summary of how the path will be identified, but in this
>> implementation I changed it a little to facilitate application parsing
>> (Gaussian, Molpro, etc.) as well as multiple operations on a single file.
>> Every vertex of the graph is a file extension type and every edge of the
>> graph represents a Catalog Entry. The DataParsing Framework then generates
>> the directed graph as follows.
>>
>> The graph is based on the aforementioned Catalog Parsers and only the
>> required properties have been defined on the graph edges for simplicity.
>>
>>
>>
>> This is how it connects with the file extensions. In the previous method
>> we had nodes like *.out/gaussian* but instead of that multiple edges are
>> allowed here.
>>
>> When a parser request comes, the DataParsing Framework will find the
>> shortest possible path that fulfils all the requirements to parse the
>> particular file.
>> The following DAGs will be created for the aforementioned parser requests.
>>
>>
>>
>> *Parser Requests*
>>
>> *Parser Request 1*
>>
>> This is straightforward: the *P6* parser is selected to parse *.txt* to
>> *.xml*.
>>
>>
>> *Parser Request 2*
>>
>> The file should go through a Gaussian parser. *P1* is selected as the
>> Gaussian parser, but that parser's output file extension is *.json*. Since
>> the request expects the output file extension to be *.xml*, the *P7*
>> parser is selected at the end of the DAG.
>>
>>
>> *Parser Request 3*
>>
>> Similar to the Parser Request 2 but need an extra operation to be
>> incorporated. The file's text should be converted to lower case. Only the
>> *P9* parser exhibits the desired property. Hence *P9* parser is also
>> used to create the DAG
>>
>>
>> *Parser Request 4*
>>
>> Similar to Parser Request 3 however, in this case, two more operations
>> should be considered. Those are *operation1* and *operation2. **P11* and
>> *P12* parsers provide those operations respectively hence they are used
>> when creating the DAG
>>
>>
>>
>> I completed these parts a couple of weeks back. In a discussion, Dimuthu
>> suggested that I should try to generify the framework.
>> The goal is not to declare properties such as "*inputFileExtension*",
>> "*outputFileExtension*", etc. at the coding level. The user is the one
>> who defines that vocabulary in the Catalog Entries and in the Parser
>> Request. The Data Parsing Framework should be capable of taking any kind of
>> metadata keys and creating the parser DAG.
>>
>> For example, one user can specify the input file extension as "
>> *inputFileExtension*", another can specify it as "*inputEx*", another
>> can specify it as "*input*", and another can declare it as "*x*". The
>> method of defining is totally up to the user.
>>
>> While I was researching how to find a solution to this I faced some
>> challenges.
>>
>> *Challenge 1*
>>
>> Without knowing the exact name of the keys it is not possible to identify
>> the dependencies between such keys.
>>
>> For example, *inputFileExtension* and *outputFileExtension* exhibit a
>> dependency that is 1st parser's *outputFileExtension* should be equal to
>> the 2nd parser's *inputFileExtension*.
>>
>>
>> One solution to overcome this problem is, a user has to give those kinds
>> of dependency relationship between keys. For example, think a user has
>> defined the keys of input file extension and output file extension as "
>> *x*" and "*y*" respectively. Then he has to indicate that using some
>> kind of a notation(eg. *key(y) ≊ key(x)) *inside a separate file.
>>
>> *Challenge 2*
>>
>> When a file needs to be parsed through an application, the parser
>> associated with the application should always come first. For example,
>> suppose I need to parse a file through Gaussian plus some kind of
>> operation. Then I cannot first parse the file through the parser which
>> holds the operation and then pass it through the Gaussian parser. What I
>> always have to do is first parse it through the Gaussian parser and then
>> parse the rest of the content through the other one.
>>
>> As I mentioned earlier, this could also be solved by maintaining a file
>> which details which Parsers should be given priority.
>>
>>
>> Overall, we might never know what kinds of keys there will be, what kinds
>> of relationships should be maintained between the keys, and so on. The
>> solution to the above challenges was to introduce another file which
>> maintains all the relationships, dependencies, priorities, etc. held by
>> keys and parsers. Our main goal in making this DataParsing Framework
>> generic is to make the user's life easier, but with this approach that goal
>> is quite hard to achieve.
>>
>> There was another solution suggested by Dimuthu, which is to come up with
>> a UI-based framework where users can point to their Catalog Entries and
>> drag and drop Parsers to make their own DAG. This is the next milestone we
>> are going to achieve. However, I am still looking for a solution that gets
>> the work done by extending/modifying the work I have already completed.
>>
>> A detailed description of this Data Parsing Framework with contributions
>> can be found here[5]
>>
>> Cheers,
>> Lahiru
>>
>> [1] https://github.com/MolSSI/QC_JSON_Schema
>> [2] https://github.com/Lahiru-J/airavata-data-parser/tree/master/datacat
>> [3] https://github.com/shirtsgroup/InterMol
>> [4] https://github.com/ParmEd/ParmEd
>> [5]
>> https://medium.com/@lahiru_j/gsoc-2018-re-architect-output-data-parsing-into-airavata-core-81da4b37057e
>>
>>
>> On 26 June 2018 at 21:27, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>>
>>> Hi Lahiru
>>>
>>> Thanks for sharing this with the dev list. I would like to suggest few
>>> changes to your data parsing framework. Please have a look at following
>>> diagram
>>>
>>>
>>>
>>> I would like to come up with a sample use case so that you can
>>> understand the data flow.
>>>
>>> I have an application output file called gaussian.out and I need to parse
>>> it to a JSON file. However, you have a parser that can parse Gaussian
>>> files into XML format, and you have another parser that can parse XML
>>> files into JSON. You have a parser catalog that contains all the details
>>> about the parsers you currently have, and you can filter out the necessary
>>> parsers based on metadata like application type, output type, input type,
>>> etc.
>>>
>>> The challenge is how we are going to combine these two parsers in the
>>> correct order and how the data passing between these parsers is going to
>>> be handled. That's where we need a workflow manager. The workflow manager
>>> takes your requirement, then talks to the catalog to fetch the necessary
>>> parser information and builds the correct parser DAG. Once the DAG is
>>> finalized, it can be passed to Helix to execute. There could be multiple
>>> DAGs that can achieve the same requirement, but the workflow manager
>>> should select the most constrained path.
>>>
>>> What do you think?
>>>
>>> Thanks
>>> Dimuthu
>>>
>>> On Fri, Jun 22, 2018 at 8:49 AM, Pierce, Marlon <ma...@iu.edu> wrote:
>>>
>>>> Yes, +1 on the detailed email summaries.
>>>>
>>>>
>>>>
>>>> Marlon
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From: *Suresh Marru <sm...@apache.org>
>>>> *Reply-To: *"dev@airavata.apache.org" <de...@airavata.apache.org>
>>>> *Date: *Friday, June 22, 2018 at 8:46 AM
>>>> *To: *Airavata Dev <de...@airavata.apache.org>
>>>> *Cc: *Supun Nakandala <su...@gmail.com>
>>>> *Subject: *Re: [GSoC] Re-architect Output Data Parsing into Airavata
>>>> core
>>>>
>>>>
>>>>
>>>> Hi Lahiru,
>>>>
>>>>
>>>>
>>>> Thank you for sharing the detailed summary. I do not have comments on
>>>> your questions, may be Supun can weigh in. I have couple of meta requests
>>>> though:
>>>>
>>>>
>>>>
>>>> Can you consider adding few Molecular dynamics parsers in this order
>>>> LAMMPS,  Amber, and CHARMM. The cclib library you used for others do not
>>>> cover these, but InterMol [1] provides a python library to parse these. We
>>>> have to be careful here, InterMol itself is MIT licensed and we can have
>>>> its dependency but it depends upon ParamEd[2] which is LGPL license. Its a
>>>> TODO for me on how to deal wit this but please see if you can include
>>>> adding these parsers into your timeline.
>>>>
>>>>
>>>>
>>>> Can you evaluate if we can provide export to Quantum Chemistry JSON
>>>> Scheme [3]? Is this is trivial we can pursue it.
>>>>
>>>>
>>>>
>>>> Lastly, can you see if Apache Tikka will help with any of your efforts.
>>>>
>>>>
>>>>
>>>> I will say my kudos again for your mailing list communications,
>>>>
>>>> Suresh
>>>>
>>>>
>>>>
>>>> [1] - https://github.com/shirtsgroup/InterMol
>>>>
>>>> [2] - https://github.com/ParmEd/ParmEd
>>>>
>>>> [3] - https://github.com/MolSSI/QC_JSON_Schema
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <
>>>> lahiruj.14@cse.mrt.ac.lk> wrote:
>>>>
>>>>
>>>>
>>>> Hi Everyone,
>>>>
>>>>
>>>>
>>>> In the last couple of days, I've been working on the data parsing
>>>> tasks. To give an update about it, I have already converted the code-base
>>>> of Gaussian, Molpro, Newchem, and Gamess parsers to python[1]. With
>>>> compared to code-base of seagrid-data there won't be any codes related to
>>>> experiments in the project(for example no JSON mappings). The main reason
>>>> for doing this because to de-couple experiments with the data parsing
>>>> tasks.
>>>>
>>>>
>>>>
>>>> While I was converting the codes of Gaussian, Molpro, Newchem, and
>>>> Gamess I found some JSON key value-pairs in the data-catalog docker
>>>> container have not been used in the seagrid-data to generate the final
>>>> output file. I have commented unused key-value pairs in the code itself
>>>> [2], [3], [4], [5]. I would like to know is there any specific reason for
>>>> this, hope @Supun Nakandala
>>>> <https://plus.google.com/u/1/103731766138074233701?prsrc=4> can answer
>>>> it.
>>>>
>>>>
>>>>
>>>> The next update about the data parsing architecture.
>>>>
>>>> The new requirement is to come up with a framework which is capable of
>>>> parsing any kind of document to a known type when the metadata is given. By
>>>> this new design, data parsing will not be restricted only to
>>>> experiments(Gaussian, Molpro, etc.)
>>>>
>>>>
>>>>
>>>> The following architecture is designed according to the requirements
>>>> specified by @dimuthu in the last GSoC meeting.
>>>>
>>>>
>>>>
>>>> The following diagram depicts the top level architecture.
>>>>
>>>>
>>>>
>>>> <suggested architecture.png>
>>>>
>>>> Following are the key components.
>>>>
>>>>
>>>>
>>>> *Abstract Parser *
>>>>
>>>> This is a basic template for the Parser which specifies the parameters
>>>> required for parsing task. For example, input file type, output file type,
>>>> experiment type( if this is related to an experiment), etc.
>>>>
>>>>
>>>>
>>>> *Parser Manager*
>>>>
>>>> Constructs the set of parsers considering the input file type, output
>>>> file type, and the experiment type.
>>>>
>>>> Parser Manager will construct a graph to find the shortest path between
>>>> input file type and output file type. Then it will return the constructed
>>>> set of Parsers.
>>>>
>>>>
>>>>
>>>> <graph.png>
>>>>
>>>> *Catalog *
>>>>
>>>> A mapping which has records to get a Docker container that can be used
>>>> to parse from one file type to another file type. For example, if the
>>>> requirement is to parse a *Gaussian .out file to JSON* then *"app/gaussian
>>>> .out to JSON"* docker will be fetched
>>>>
>>>>
>>>>
>>>> *Parsers*
>>>>
>>>> There are two types of parsers (according to the suggested way)
>>>>
>>>>
>>>>
>>>> The first type is the parsers those will be directly coded into the
>>>> project code-base. For example, parsing Text file to a JSON will be
>>>> straightforward, then it is not necessarily required to maintain a separate
>>>> docker container to convert text file to JSON. With the help of a library
>>>> and putting an entry to the catalog will be enough to get the work done.
>>>>
>>>>
>>>>
>>>> The second type is parsers which have a separate docker container. For
>>>> example Gaussian .out file to JSON docker container
>>>>
>>>>
>>>>
>>>> For the overall scenario consider the following examples to get an idea
>>>>
>>>>
>>>>
>>>> *Example 1*
>>>>
>>>> Suppose a PDF should be parsed to XML
>>>>
>>>> Parser Manager will look the catalog and find the shortest path to get
>>>> the XML output from PDF. The available parsers are(both the coded parsers
>>>> in the project and the dockerized parsers),
>>>>
>>>> • PDF to Text
>>>>
>>>> • Text to JSON
>>>>
>>>> • JSON to XML
>>>>
>>>> • application/gaussian .out to JSON (This is a very specific parsing
>>>> mechanism not similar  to parsing a simple .out file to a JSON)
>>>>
>>>> and the rest which I have included in the diagram
>>>>
>>>>
>>>>
>>>> Then Parser Manager will construct the graph and find the shortest path
>>>> as
>>>>
>>>> *PDF -> Text -> JSON -> XML* from the available parsers.
>>>>
>>>>
>>>>
>>>> <graph 2.png>
>>>>
>>>> Then Parser Manager will return 3 Parsers. From the three parsers a DAG
>>>> will be constructed as follows,
>>>>
>>>>
>>>>
>>>> <parser dag.png>
>>>>
>>>> The reason for this architectural decision to have three parsers than
>>>> doing in the single parser because if one of the parsers fails it would be
>>>> easy to identify which parser it is.
>>>>
>>>>
>>>>
>>>> *Example 2*
>>>>
>>>> Consider a separate example to parse a Gaussian *.out* file to *JSON* then
>>>> it is pretty straightforward. Same as the aforementioned example it will
>>>> construct a Parser which linking the dockerized *app/gaussian .out to
>>>> JSON* container.
>>>>
>>>>
>>>>
>>>> *Example 3*
>>>>
>>>> Problem is when it is needed to parse a Gaussian *.out* file to *XML*.
>>>> There are two options.
>>>>
>>>>
>>>>
>>>> *1st option* - If an application related parsing should happen there
>>>> must be application typed parsers to get the work done if not it is not
>>>> allowed.
>>>>
>>>> In the list of parsers, there is no application related parser to
>>>> convert *.out* file to *XML*. In this case even Parser Manager could
>>>> construct a path like,
>>>>
>>>> *.out/gaussian -> JSON/gaussian -> XML*, this process is not allowed.
>>>>
>>>>
>>>>
>>>> *2nd option* - Once the application-specific content has been parsed
>>>> it will be same as converting a normal JSON to XML assuming that we could
>>>> allow the path
>>>>
>>>> *.out/gaussian -> JSON/gaussian -> XML*.
>>>>
>>>> What actually should be done? 1st option or the 2nd option? This is one
>>>> point I need a suggestion.
>>>>
>>>>
>>>>
>>>> I would really appreciate any suggestions to improve this.
>>>>
>>>>
>>>>
>>>> [1] https://github.com/Lahiru-J/airavata-data-parser
>>>>
>>>> [2]
>>>> https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288
>>>>
>>>> [3]
>>>> https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175
>>>>
>>>>
>>>> [4]
>>>> https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
>>>>
>>>> [5]
>>>> https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
>>>>
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>>
>>>> On 28 May 2018 at 18:05, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>>>> wrote:
>>>>
>>>> Note this is the High-level architecture diagram. (Since it was not
>>>> visible in the previous email)
>>>>
>>>>
>>>>
>>>> <Screen Shot 2018-05-28 at 9.30.43 AM.png>
>>>>
>>>> Thanks,
>>>>
>>>> Lahiru
>>>>
>>>>
>>>>
>>>> On 28 May 2018 at 18:02, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>>>> wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>>
>>>>
>>>> During the past few days, I’ve been implementing the tasks which are
>>>> related to the Data Parsing. To give a heads up, the following image
>>>> depicts the top level architecture of the implementation.
>>>>
>>>>
>>>>
>>>> [image: mage removed by sender.]
>>>>
>>>> Following are the main task components have been identified,
>>>>
>>>>
>>>>
>>>> *1. DataParsing Task*
>>>>
>>>> This task will get the stored output and will find the matching Parser
>>>> (Gaussian, Lammps, QChem, etc.) and send the output through the selected
>>>> parser to get a well-structured JSON
>>>>
>>>>
>>>>
>>>> *2. Validating Task*
>>>>
>>>> This is to validate the desired JSON output is achieved or not. That is
>>>> JSON output should match with the respective schema(Gaussian Schema, Lammps
>>>> Schema, QChem Schema, etc.)
>>>>
>>>>
>>>>
>>>> *3. Persisting Task*
>>>>
>>>> This task will persist the validated JSON outputs
>>>>
>>>>
>>>>
>>>> The successfully stored outputs will be exposed to the outer world.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> According to the diagram the generated JSON should be shared between
>>>> the tasks(DataParsing, Validating, and, Persisting tasks). Neither
>>>> DataParsing task nor Validating task persists the JSON, therefore, helix
>>>> task framework should make sure to share the content between the tasks.
>>>>
>>>>
>>>>
>>>> In this Helix tutorial [1] it says how to share the content between
>>>> Helix tasks. The problem is, the method [2] which has been given only
>>>> capable of sharing String typed key-value data.
>>>>
>>>> However, I can come up with an implementation to share all the values
>>>> related to the JSON output. That involves calling this method [2] many
>>>> times. I believe that is not a very efficient method because Helix task
>>>> framework has to call this [3] method many times (taking into consideration
>>>> that the generated JSON output can be larger).
>>>>
>>>>
>>>>
>>>> I have already sent an email to the Helix mailing list to clarify
>>>> whether there is another way and also will it be efficient if this method
>>>> [2] is called multiple times to get the work done.
>>>>
>>>>
>>>>
>>>> Am I on the right track? Your suggestions would be very helpful and
>>>> please add if anything is missing.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> [1]
>>>> http://helix.apache.org/0.8.0-docs/tutorial_task_framework.html#Share_Content_Across_Tasks_and_Jobs
>>>>
>>>> [2]
>>>> https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/UserContentStore.java#L75
>>>>
>>>> [3]
>>>> https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L361
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Lahiru
>>>>
>>>>
>>>>
>>>> On 26 March 2018 at 19:44, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>>>> wrote:
>>>>
>>>> Hi Dimuthu, Suresh,
>>>>
>>>>
>>>>
>>>> Thanks a lot for the feedback. I will update the proposal accordingly.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Lahiru
>>>>
>>>>
>>>>
>>>> On 26 March 2018 at 08:48, Suresh Marru <sm...@apache.org> wrote:
>>>>
>>>> Hi Lahiru,
>>>>
>>>>
>>>>
>>>> I echo Dimuthu’s comment. You have a good starting point, it will be
>>>> nice if you can cover how users can interact with the parsed data.
>>>> Essentially adding API access to the parsed metadata database and having
>>>> proof of concept UI’s. This task could be challenging as the queries are
>>>> very data specific and generalizing API access and building custom UI’s can
>>>> be explanatory (less  defined) portions of your proposal.
>>>>
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Suresh
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <
>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Hi Lahiru,
>>>>
>>>>
>>>>
>>>> Nice document. And I like how you illustrate the systems through
>>>> diagrams. However try to address how you are going to expose parsed data to
>>>> outside through thrift APIs and how to design those data APIs in
>>>> application specific manner. And in the persisting task, you have to make
>>>> sure data integrity is preserved. For example in a Gaussian parsed output,
>>>> you might have to validate the parsed output using a schema before
>>>> persisting them in the database.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Dimuthu
>>>>
>>>>
>>>>
>>>> On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <
>>>> lahiruj.14@cse.mrt.ac.lk> wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>>
>>>>
>>>> I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718
>>>> [2]. Any comments would be very helpful to improve it.
>>>>
>>>>
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJJNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
>>>>
>>>>
>>>> [2] https://issues.apache.org/jira/browse/AIRAVATA-2718
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> --
>>>>
>>>> Lahiru Jayathilake
>>>>
>>>> Department of Computer Science and Engineering,
>>>>
>>>> Faculty of Engineering,
>>>>
>>>> University of Moratuwa
>>>>
>>>>
>>>>
>>>> [image: mage removed by sender.]
>>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Lahiru Jayathilake
>>>>
>>>> Department of Computer Science and Engineering,
>>>>
>>>> Faculty of Engineering,
>>>>
>>>> University of Moratuwa
>>>>
>>>>
>>>>
>>>> [image: mage removed by sender.]
>>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Lahiru Jayathilake
>>>>
>>>> Department of Computer Science and Engineering,
>>>>
>>>> Faculty of Engineering,
>>>>
>>>> University of Moratuwa
>>>>
>>>>
>>>>
>>>> [image: mage removed by sender.]
>>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Lahiru Jayathilake
>>>>
>>>> Department of Computer Science and Engineering,
>>>>
>>>> Faculty of Engineering,
>>>>
>>>> University of Moratuwa
>>>>
>>>>
>>>>
>>>> [image: mage removed by sender.]
>>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Lahiru Jayathilake
>>>>
>>>> Department of Computer Science and Engineering,
>>>>
>>>> Faculty of Engineering,
>>>>
>>>> University of Moratuwa
>>>>
>>>>
>>>>
>>>> [image: mage removed by sender.]
>>>> <https://lk.linkedin.com/in/lahirujayathilake>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Lahiru Jayathilake
>> Department of Computer Science and Engineering,
>> Faculty of Engineering,
>> University of Moratuwa
>>
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>

-- 
Lahiru Jayathilake
Department of Computer Science and Engineering,
Faculty of Engineering,
University of Moratuwa

<https://lk.linkedin.com/in/lahirujayathilake>

Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Posted by Supun Nakandala <su...@gmail.com>.
Hi Devs,

Sorry for joining in late.

Regarding the challenges that Lahiru mentioned, I think it is a question of
whether to use configurations or conventions. Personally, I like following
a convention based approach in this context, as more and more
configurations will make the system more cumbersome for the users. But I
agree that it has its downsides too.

But as Lahiru mentioned, I think using a UI based approach will be a better
approach. It helps to shield the complexities of the configurations and
provide an intuitive interface for the users.

Overall, I feel the task of output data parsing perfectly aligns with new
Airavata architecture on distributed task execution. Maybe we should
brainstorm what it would take to incorporate these parsers into the
application catalog (extend or abstract out a generic catalog). If we can
incorporate without making it overly complicated, I feel that it will be a
good direction to follow up.

On Sat, Aug 11, 2018 at 12:50 PM Lahiru Jayathilake <
lahiruj.14@cse.mrt.ac.lk> wrote:

> Hi Everyone,
>
> First of all Suresh, Marlon, and Dimuthu thanks for the suggestions and
> comments.
>
> Yes Suresh we can include QC_JSON_Schema[1] for airavata-data-parser[2].
> However, there is a challenge of using InterMol[3], as you mentioned the
> depending ParamEd[4] is LGPL and according to Apache Legal page[5], LGPL is
> not to be used.
> About the Data Parsing Project, yes I did look into the Apache Tikka
> whether I can use it and there are some challenges while making a generic
> framework. I will discuss them in detail in this same thread later.
>
> *This is an update about what I have accomplished.*
>
> I have created a separate parser project [2] for Gaussian, Gamess, Molpro,
> and NwChem. One advantage of separating the Gaussian, Molpro, etc. codes
> from the core is to make high cohesive and less coupled. When it comes to
> the maintainability it is far easier in this manner.
>
> Next regarding the Data Parsing Framework.
>
> As I have mentioned in my previous email I have implemented the Data
> Parsing Framework with some additional features. The method of achieving
> the goal had to be slightly changed in order to accompany some features. I
> will start from the bottom.
>
> Here is the scenario, A user has to define what are the Catalog Entries. A
> Catalog Entry is nothing more than basic key-value properties of a
> Dockerized parser. The following image shows an example of how I have
> defined it.
>
>
> The above image shows the entry corresponds to the Dockerized Gaussian
> Parser. There are both mandatory and optional properties. For example, *dockerImageName,
> inputFileExtension, dockerWorkingDirPath *must be stated whereas *securityOpt,
> envVariables *like properties are optional. Some of the properties are
> needed to run the Docker container.
>
> There are two special properties called *applicationType *and *operation*
> . *applicationType *states whether the Docker container is for parsing
> Gaussian, Molpro, NwChem, or Gamess files. Property *operation* is to
> mention that the parser can perform some operation to the file. For
> example, converting all the text characters to lower/upper case, removing
> last *n *lines, appending some text.. you get the point. A Dockerized
> Parser cannot have both the application and operation. It should be either
> an application or operation or none of them. (This is a design decision)
>
> For the time being catalog file is a JSON file which will be picked by the
> DataParsing Framework according to the user given file path.
> For the further explanation consider the following set of parsers. Note
> that I have only mentioned the most essential properties just to explain
> the example scenarios.
>
>
>
> Once the user has defined the catalog entries then DataParsing Framework
> expects a Parser Request to parse a given file. Consider the user has given
> the following set of Parser Requests.
>
>
>
> Once the above facts have been completed then the baton is on the hand of
> DataParsing Framework.
>
> At the runtime Data Parsing Framework pick Catalog File and makes a
> directed graph G(V,E) using the indicated parsers. I have already given a
> detailed summary about how the path will be identified but in this
> implementation, I changed it a little bit to facilitate application parsing
> (Gaussian, Molpro, etc) as well as multiple operations to a single file.
> Every vertex of the graph will be a file extension type and every edge of
> the graph represents a Catalog Entry. Then DataParsing Framework generates
> the directed graph as follows.
>
> The graph is based on the aforementioned Catalog Parsers and only the
> required properties have been defined on the graph edges for simplicity.
>
>
>
> This is how it connects with the file extensions. In the previous method
> we had nodes like *.out/gaussian* but instead of that multiple edges are
> allowed here.
>
> When a parser request comes the DataParsing Framework will find the
> shortest possible path with fulfilling all the requirements to parse the
> particular file.
> Following DAGs will be created for the aforementioned parser requests
>
>
>
> *Parser Requests*
>
> *Parser Request 1*
>
> This is Straightforward *P**6* parser is selected to parser *.txt* to
> *.xml*
>
>
> *Parser Request 2*
>
> The file should go through a Gaussian parser. *P1* is selected as the
> Gaussian parser but that parser's output file extension is *.json *since
> the request expect the output file extension to be *.xml, *the* P7*
> parser is selected at the end of the DAG
>
>
> *Parser Request 3*
>
> Similar to the Parser Request 2 but need an extra operation to be
> incorporated. The file's text should be converted to lower case. Only the
> *P9* parser exhibits the desired property. Hence *P9* parser is also used
> to create the DAG
>
>
> *Parser Request 4*
>
> Similar to Parser Request 3 however, in this case, two more operations
> should be considered. Those are *operation1* and *operation2. **P11* and
> *P12* parsers provide those operations respectively hence they are used
> when creating the DAG
>
>
>
> I completed these parts, a couple of weeks back. With the discussion of
> Dimuthu, he suggested to me, that I should try to generify the framework.
> The goal was not to declare the properties such as "*inputFileExtension*",
> "*outputFileExtension*" etc. at the coding level. The user is the one who
> is defining that language in Catalog Entries and in the Parser Request. The
> Data Parsing Framework should be capable of getting any kind of metadata
> keys and create the parser DAG.
>
> For example, one user can specify the input file extension as "
> *inputFileExtension*", another can specify it as "*inputEx*", another can
> specify it as "*input*", and another can declare it as "*x*". The method
> of defining is totally up to the user.
>
> While I was researching how to find a solution to this I faced some
> challenges.
>
> *Challenge 1*
>
> Without knowing the exact name of the keys it is not possible to identify
> the dependencies between such keys.
>
> For example, *inputFileExtension* and *outputFileExtension* exhibit a
> dependency that is 1st parser's *outputFileExtension* should be equal to
> the 2nd parser's *inputFileExtension*.
>
>
> One solution to overcome this problem is, a user has to give those kinds
> of dependency relationship between keys. For example, think a user has
> defined the keys of input file extension and output file extension as "*x*"
> and "*y*" respectively. Then he has to indicate that using some kind of a
> notation(eg. *key(y) ≊ key(x)) *inside a separate file.
>
> *Challenge 2*
>
> When it is required to parse a file through an application, the parser
> with the application should always come first. For example, suppose I need
> to parse a file through Gaussian and some kind of an operation. Then I
> cannot first parse the file through the parser which holds the operation
> and then pass through the Gaussian parser. What I always have to do is,
> initially parse through the Gaussian parser and then parse the rest of the
> content through the other one.
>
> As I mentioned earlier this could also be solvable by maintaining a file
> which has the details to what Parsers the priority should be given.
>
>
> Overall we might never know what kind of keys will be there what kind of
> relationships should be maintained with the keys etc. The solutions for the
> above challenges were to introduce another file which maintains all the
> relationships, dependencies, priorities, etc. held by keys and parsers. Our
> main goal to make this DataParsing Framework to be generic is, make the
> user's life easier. But with this approach, that goal is quite harder to
> achieve.
>
> There was another solution suggested by Dimuthu which is to come up with a
> UI based framework. Where users can direct towards their Catalog Entries
> and drag and drop Parsers to make their own DAG. This is the next milestone
> we are going to achieve. However, I am still looking for a solution to get
> the work done extending/modifying the work I have already completed.
>
> A detailed description of this Data Parsing Framework with contributions
> can be found here[5]
>
> Cheers,
> Lahiru
>
> [1] https://github.com/MolSSI/QC_JSON_Schema
> [2] https://github.com/Lahiru-J/airavata-data-parser/tree/master/datacat
> [3] https://github.com/shirtsgroup/InterMol
> [4] https://github.com/ParmEd/ParmEd
> [5]
> https://medium.com/@lahiru_j/gsoc-2018-re-architect-output-data-parsing-into-airavata-core-81da4b37057e
>
>
> On 26 June 2018 at 21:27, DImuthu Upeksha <di...@gmail.com>
> wrote:
>
>> Hi Lahiru
>>
>> Thanks for sharing this with the dev list. I would like to suggest few
>> changes to your data parsing framework. Please have a look at following
>> diagram
>>
>>
>>
>> I would like to come up with a sample use case so that you can understand
>> the data flow.
>>
>> I have a application output file called gaussian.out and I need to parse
>> it to a JSON file. However your have a parser that can parse gaussian files
>> into xml format. But you have another parser that can parse XML files into
>> JSON. You have a parser catalog that contains all details about parsers you
>> currently have and you can filter out necessary parsers based on metadata
>> like application type, output type, input type and etc.
>>
>> Challenge is how we are going to combine these two parsers in correct
>> order and how the data passing within these parsers are going to handle.
>> That's where we need a workflow manager. Workflow manager gets your
>> requirement then talk to the catalog to fetch necessary parser information
>> and build the correct parser DAG. Once the DAG is finalized, it can be
>> passed to helix to execute. There could be multiple DAGs that can achieve
>> same requirement, but workflow manager should select the highest
>> constrained path.
>>
>> What do you think?
>>
>> Thanks
>> Dimuthu
>>
>> On Fri, Jun 22, 2018 at 8:49 AM, Pierce, Marlon <ma...@iu.edu> wrote:
>>
>>> Yes, +1 on the detailed email summaries.
>>>
>>>
>>>
>>> Marlon
>>>
>>>
>>>
>>>
>>>
>>> *From: *Suresh Marru <sm...@apache.org>
>>> *Reply-To: *"dev@airavata.apache.org" <de...@airavata.apache.org>
>>> *Date: *Friday, June 22, 2018 at 8:46 AM
>>> *To: *Airavata Dev <de...@airavata.apache.org>
>>> *Cc: *Supun Nakandala <su...@gmail.com>
>>> *Subject: *Re: [GSoC] Re-architect Output Data Parsing into Airavata
>>> core
>>>
>>>
>>>
>>> Hi Lahiru,
>>>
>>>
>>>
>>> Thank you for sharing the detailed summary. I do not have comments on
>>> your questions, may be Supun can weigh in. I have couple of meta requests
>>> though:
>>>
>>>
>>>
>>> Can you consider adding few Molecular dynamics parsers in this order
>>> LAMMPS,  Amber, and CHARMM. The cclib library you used for others do not
>>> cover these, but InterMol [1] provides a python library to parse these. We
>>> have to be careful here, InterMol itself is MIT licensed and we can have
>>> its dependency but it depends upon ParamEd[2] which is LGPL license. Its a
>>> TODO for me on how to deal wit this but please see if you can include
>>> adding these parsers into your timeline.
>>>
>>>
>>>
>>> Can you evaluate if we can provide export to Quantum Chemistry JSON
>>> Scheme [3]? Is this is trivial we can pursue it.
>>>
>>>
>>>
>>> Lastly, can you see if Apache Tikka will help with any of your efforts.
>>>
>>>
>>>
>>> I will say my kudos again for your mailing list communications,
>>>
>>> Suresh
>>>
>>>
>>>
>>> [1] - https://github.com/shirtsgroup/InterMol
>>>
>>> [2] - https://github.com/ParmEd/ParmEd
>>>
>>> [3] - https://github.com/MolSSI/QC_JSON_Schema
>>>
>>>
>>>
>>>
>>>
>>> On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <
>>> lahiruj.14@cse.mrt.ac.lk> wrote:
>>>
>>>
>>>
>>> Hi Everyone,
>>>
>>>
>>>
>>> In the last couple of days, I've been working on the data parsing tasks.
>>> To give an update about it, I have already converted the code-base of
>>> Gaussian, Molpro, Newchem, and Gamess parsers to python[1]. With compared
>>> to code-base of seagrid-data there won't be any codes related to
>>> experiments in the project(for example no JSON mappings). The main reason
>>> for doing this because to de-couple experiments with the data parsing
>>> tasks.
>>>
>>>
>>>
>>> While I was converting the codes of Gaussian, Molpro, Newchem, and
>>> Gamess I found some JSON key value-pairs in the data-catalog docker
>>> container have not been used in the seagrid-data to generate the final
>>> output file. I have commented unused key-value pairs in the code itself
>>> [2], [3], [4], [5]. I would like to know is there any specific reason for
>>> this, hope @Supun Nakandala
>>> <https://plus.google.com/u/1/103731766138074233701?prsrc=4> can answer
>>> it.
>>>
>>>
>>>

Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Posted by Lahiru Jayathilake <la...@cse.mrt.ac.lk>.
Hi Everyone,

First of all, Suresh, Marlon, and Dimuthu, thanks for the suggestions and
comments.

Yes Suresh, we can include QC_JSON_Schema[1] in airavata-data-parser[2].
However, there is a challenge in using InterMol[3]: as you mentioned, its
ParmEd[4] dependency is LGPL, and according to the Apache Legal page, LGPL
dependencies are not to be used.
About the Data Parsing Project, yes, I did look into whether I can use
Apache Tika, and there are some challenges in making a generic framework
with it. I will discuss them in detail in this same thread later.

*This is an update about what I have accomplished.*

I have created a separate parser project [2] for Gaussian, Gamess, Molpro,
and NwChem. One advantage of separating the Gaussian, Molpro, etc. code
from the core is higher cohesion and looser coupling, which also makes the
parsers far easier to maintain.

Next regarding the Data Parsing Framework.

As I mentioned in my previous email, I have implemented the Data Parsing
Framework with some additional features. The method of achieving the goal
had to be changed slightly in order to accommodate those features. I will
start from the bottom.

Here is the scenario: a user has to define the Catalog Entries. A Catalog
Entry is nothing more than a set of basic key-value properties describing a
Dockerized parser. The following image shows an example of how I have
defined it.


The above image shows the entry corresponding to the Dockerized Gaussian
Parser. There are both mandatory and optional properties. For example,
*dockerImageName*, *inputFileExtension*, and *dockerWorkingDirPath* must be
stated, whereas properties such as *securityOpt* and *envVariables* are
optional. Some of the properties are needed to run the Docker container.

There are two special properties called *applicationType* and *operation*.
*applicationType* states whether the Docker container is for parsing
Gaussian, Molpro, NwChem, or Gamess files. The *operation* property
indicates that the parser can perform some operation on the file, for
example converting all the text characters to lower/upper case, removing
the last *n* lines, appending some text... you get the point. A Dockerized
Parser cannot have both an application and an operation; it should have
either an application, or an operation, or neither. (This is a design
decision.)
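
To make this concrete (the image above may not be visible in the archive),
a catalog entry for the Dockerized Gaussian parser would look roughly like
the following, shown here as the Python dict you get after loading the JSON
catalog file. The property names are the ones mentioned above; the id,
image name, paths, and values are only illustrative guesses, not the exact
entry I use.

gaussian_entry = {
    # mandatory properties
    "id": "P1",                                      # hypothetical identifier
    "dockerImageName": "app/gaussian-out-to-json",   # illustrative image name
    "inputFileExtension": ".out",
    "outputFileExtension": ".json",
    "dockerWorkingDirPath": "/opt/parser",           # illustrative path
    # either applicationType or operation (or neither), never both
    "applicationType": "gaussian",
    # optional properties needed to run the Docker container
    "securityOpt": "seccomp=unconfined",             # illustrative value
    "envVariables": {"LC_ALL": "C"},                 # illustrative value
}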

For the time being the catalog file is a JSON file which will be picked up
by the DataParsing Framework from the user-given file path.
For the further explanation, consider the following set of parsers. Note
that I have only mentioned the most essential properties, just to explain
the example scenarios.
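
The actual parser set was attached as an image; since it may not render
here, the subset that the example requests below rely on would look roughly
like this. Only the extensions, application type, and operations of P1, P6,
P7, P9, P11, and P12 are taken from the walkthrough below; everything else
(including the extensions on the operation parsers) is guessed purely for
illustration.

example_parsers = [
    {"id": "P1",  "applicationType": "gaussian",
     "inputFileExtension": ".out", "outputFileExtension": ".json"},
    {"id": "P6",  "inputFileExtension": ".txt",  "outputFileExtension": ".xml"},
    {"id": "P7",  "inputFileExtension": ".json", "outputFileExtension": ".xml"},
    {"id": "P9",  "operation": "lowercase",
     "inputFileExtension": ".json", "outputFileExtension": ".json"},
    {"id": "P11", "operation": "operation1",
     "inputFileExtension": ".json", "outputFileExtension": ".json"},
    {"id": "P12", "operation": "operation2",
     "inputFileExtension": ".json", "outputFileExtension": ".json"},
]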



Once the user has defined the catalog entries, the DataParsing Framework
expects a Parser Request to parse a given file. Consider that the user has
given the following set of Parser Requests.


Once the above steps have been completed, the baton is in the hands of the
DataParsing Framework.

At runtime the Data Parsing Framework picks up the Catalog File and builds
a directed graph G(V,E) using the indicated parsers. I have already given a
detailed summary of how the path will be identified, but in this
implementation I changed it a little bit to facilitate application parsing
(Gaussian, Molpro, etc.) as well as multiple operations on a single file.
Every vertex of the graph is a file extension type and every edge of the
graph represents a Catalog Entry. The DataParsing Framework then generates
the directed graph as follows.

The graph is based on the aforementioned Catalog Parsers and only the
required properties have been defined on the graph edges for simplicity.



This is how the file extensions are connected. In the previous method we
had nodes like *.out/gaussian*, but instead of that, multiple edges between
the same file extensions are allowed here.
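
As a rough sketch of the idea (not the actual implementation), building the
multigraph and finding the constrained shortest path could look like this:
vertices are file extensions, every catalog entry becomes an edge, and a
breadth-first search returns the shortest chain of parsers that starts with
the requested application parser (if any) and covers all requested
operations.

from collections import defaultdict, deque

def build_graph(catalog_entries):
    # adjacency list keyed by input extension; parallel edges are allowed, which
    # is how ".out -> .json" can exist both as a plain and as a Gaussian parser
    graph = defaultdict(list)
    for entry in catalog_entries:
        graph[entry["inputFileExtension"]].append(entry)
    return graph

def find_parser_chain(graph, request):
    # BFS over (extension, operations still to cover, application done) states,
    # so the first chain found is also the shortest one fulfilling the request
    app = request.get("applicationType")
    start = (request["inputFileExtension"],
             frozenset(request.get("operations", [])),
             app is None)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (ext, pending, app_done), chain = queue.popleft()
        if ext == request["outputFileExtension"] and not pending and app_done:
            return chain
        for entry in graph.get(ext, []):
            entry_app = entry.get("applicationType")
            if entry_app and (app_done or entry_app != app):
                continue  # application parsers are usable only as the requested first step
            if not app_done and not entry_app:
                continue  # the application parser must come first (see Challenge 2 below)
            state = (entry["outputFileExtension"],
                     pending - {entry.get("operation")},
                     app_done or bool(entry_app))
            if state not in seen:
                seen.add(state)
                queue.append((state, chain + [entry["id"]]))
    return None  # no combination of catalogued parsers satisfies the request

With the hypothetical catalog subset and requests sketched earlier,
find_parser_chain(build_graph(example_parsers), example_requests[1])
returns ["P1", "P7"], which is the behaviour described for Parser Request 2
below.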

When a parser request comes in, the DataParsing Framework will find the
shortest possible path that fulfils all the requirements to parse the
particular file.
The following DAGs will be created for the aforementioned parser requests.



*Parser Requests*

*Parser Request 1*

This is straightforward: the *P6* parser is selected to parse *.txt* to *.xml*


*Parser Request 2*

The file should go through a Gaussian parser. *P1* is selected as the
Gaussian parser, but that parser's output file extension is *.json*. Since
the request expects the output file extension to be *.xml*, the *P7* parser
is selected at the end of the DAG


*Parser Request 3*

Similar to Parser Request 2, but an extra operation needs to be
incorporated: the file's text should be converted to lower case. Only the
*P9* parser exhibits the desired property, hence the *P9* parser is also
used when creating the DAG


*Parser Request 4*

Similar to Parser Request 3; however, in this case two more operations
should be considered: *operation1* and *operation2*. The *P11* and *P12*
parsers provide those operations respectively, hence they are used when
creating the DAG



I completed these parts a couple of weeks back. In a discussion, Dimuthu
suggested that I should try to make the framework generic. The goal is not
to declare properties such as "*inputFileExtension*", "*outputFileExtension*",
etc. at the code level. The user is the one who defines that vocabulary in
the Catalog Entries and in the Parser Request. The Data Parsing Framework
should be capable of accepting any kind of metadata keys and creating the
parser DAG.

For example, one user can specify the input file extension as
"*inputFileExtension*", another can specify it as "*inputEx*", another as
"*input*", and another can declare it as "*x*". The naming is totally up to
the user.

While researching a solution to this, I faced some challenges.

*Challenge 1*

Without knowing the exact names of the keys, it is not possible to identify
the dependencies between them.

For example, *inputFileExtension* and *outputFileExtension* exhibit a
dependency: the 1st parser's *outputFileExtension* should be equal to the
2nd parser's *inputFileExtension*.


One solution to overcome this problem is for the user to specify those
kinds of dependency relationships between keys. For example, suppose a user
has defined the keys for the input file extension and the output file
extension as "*x*" and "*y*" respectively. Then they have to indicate that
using some kind of notation (e.g. *key(y) ≊ key(x)*) inside a separate file.
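
Purely as an illustration of that idea, the separate file could reduce to a
list of such rules, and the framework could apply them without knowing what
any key means. A minimal sketch, assuming the user picked "x" and "y" as in
the example:

# hypothetical contents of the dependency file, once loaded
key_dependencies = [
    {"producer_key": "y", "consumer_key": "x"},  # key(y) of parser N must equal key(x) of parser N+1
]

def chain_is_consistent(parser_chain, key_dependencies):
    # generic check that works for whatever key names the user chose
    for first, second in zip(parser_chain, parser_chain[1:]):
        for rule in key_dependencies:
            if first.get(rule["producer_key"]) != second.get(rule["consumer_key"]):
                return False
    return True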

*Challenge 2*

When it is required to parse a file through an application, the parser with
the application should always come first. For example, suppose I need to
parse a file through Gaussian and apply some kind of an operation. Then I
cannot first parse the file through the parser which holds the operation
and then pass it through the Gaussian parser. What I always have to do is
initially parse it through the Gaussian parser and then run the rest of the
content through the other one.

As I mentioned earlier, this could also be solved by maintaining a file
which details which Parsers should be given priority.
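
Again only as an illustration, the simplest version of such a file might
just record the one rule from Challenge 2, with a generic check applied
while assembling the chain:

# hypothetical contents of the priority file
priority_rules = {"must_come_first": ["applicationType"]}

def respects_priority(parser_chain, rules):
    for key in rules["must_come_first"]:
        # a parser carrying this key may only appear at the very start of the chain
        if any(key in parser for parser in parser_chain[1:]):
            return False
    return True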


Overall, we might never know what kinds of keys will be there, what kinds
of relationships should be maintained between the keys, etc. The solution
to the above challenges was to introduce another file which maintains all
the relationships, dependencies, priorities, etc. held by keys and parsers.
Our main goal in making this DataParsing Framework generic is to make the
user's life easier, but with that approach this goal is quite a bit harder
to achieve.

There was another solution suggested by Dimuthu, which is to come up with a
UI-based framework where users can point to their Catalog Entries and drag
and drop Parsers to make their own DAG. This is the next milestone we are
going to work towards. However, I am still looking for a solution that gets
the work done by extending/modifying the work I have already completed.

A detailed description of this Data Parsing Framework, with contributions,
can be found here [5].

Cheers,
Lahiru

[1] https://github.com/MolSSI/QC_JSON_Schema
[2] https://github.com/Lahiru-J/airavata-data-parser/tree/master/datacat
[3] https://github.com/shirtsgroup/InterMol
[4] https://github.com/ParmEd/ParmEd
[5] https://medium.com/@lahiru_j/gsoc-2018-re-architect-output-data-parsing-into-airavata-core-81da4b37057e


On 26 June 2018 at 21:27, DImuthu Upeksha <di...@gmail.com>
wrote:

> Hi Lahiru
>
> Thanks for sharing this with the dev list. I would like to suggest few
> changes to your data parsing framework. Please have a look at following
> diagram
>
>
>
> I would like to come up with a sample use case so that you can understand
> the data flow.
>
> I have a application output file called gaussian.out and I need to parse
> it to a JSON file. However your have a parser that can parse gaussian files
> into xml format. But you have another parser that can parse XML files into
> JSON. You have a parser catalog that contains all details about parsers you
> currently have and you can filter out necessary parsers based on metadata
> like application type, output type, input type and etc.
>
> Challenge is how we are going to combine these two parsers in correct
> order and how the data passing within these parsers are going to handle.
> That's where we need a workflow manager. Workflow manager gets your
> requirement then talk to the catalog to fetch necessary parser information
> and build the correct parser DAG. Once the DAG is finalized, it can be
> passed to helix to execute. There could be multiple DAGs that can achieve
> same requirement, but workflow manager should select the highest
> constrained path.
>
> What do you think?
>
> Thanks
> Dimuthu
>
> On Fri, Jun 22, 2018 at 8:49 AM, Pierce, Marlon <ma...@iu.edu> wrote:
>
>> Yes, +1 on the detailed email summaries.
>>
>>
>>
>> Marlon
>>
>>
>>
>>
>>
>> *From: *Suresh Marru <sm...@apache.org>
>> *Reply-To: *"dev@airavata.apache.org" <de...@airavata.apache.org>
>> *Date: *Friday, June 22, 2018 at 8:46 AM
>> *To: *Airavata Dev <de...@airavata.apache.org>
>> *Cc: *Supun Nakandala <su...@gmail.com>
>> *Subject: *Re: [GSoC] Re-architect Output Data Parsing into Airavata core
>>
>>
>>
>> Hi Lahiru,
>>
>>
>>
>> Thank you for sharing the detailed summary. I do not have comments on
>> your questions, may be Supun can weigh in. I have couple of meta requests
>> though:
>>
>>
>>
>> Can you consider adding few Molecular dynamics parsers in this order
>> LAMMPS,  Amber, and CHARMM. The cclib library you used for others do not
>> cover these, but InterMol [1] provides a python library to parse these. We
>> have to be careful here, InterMol itself is MIT licensed and we can have
>> its dependency but it depends upon ParamEd[2] which is LGPL license. Its a
>> TODO for me on how to deal wit this but please see if you can include
>> adding these parsers into your timeline.
>>
>>
>>
>> Can you evaluate if we can provide export to Quantum Chemistry JSON
>> Scheme [3]? Is this is trivial we can pursue it.
>>
>>
>>
>> Lastly, can you see if Apache Tikka will help with any of your efforts.
>>
>>
>>
>> I will say my kudos again for your mailing list communications,
>>
>> Suresh
>>
>>
>>
>> [1] - https://github.com/shirtsgroup/InterMol
>>
>> [2] - https://github.com/ParmEd/ParmEd
>>
>> [3] - https://github.com/MolSSI/QC_JSON_Schema
>>
>>
>>
>>
>>
>> On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <
>> lahiruj.14@cse.mrt.ac.lk> wrote:
>>
>>
>>
>> Hi Everyone,
>>
>>
>>
>> In the last couple of days, I've been working on the data parsing tasks.
>> To give an update about it, I have already converted the code-base of
>> Gaussian, Molpro, Newchem, and Gamess parsers to python[1]. With compared
>> to code-base of seagrid-data there won't be any codes related to
>> experiments in the project(for example no JSON mappings). The main reason
>> for doing this because to de-couple experiments with the data parsing
>> tasks.
>>
>>
>>
>> While I was converting the codes of Gaussian, Molpro, Newchem, and Gamess
>> I found some JSON key value-pairs in the data-catalog docker container have
>> not been used in the seagrid-data to generate the final output file. I have
>> commented unused key-value pairs in the code itself [2], [3], [4], [5]. I
>> would like to know is there any specific reason for this, hope @Supun
>> Nakandala <https://plus.google.com/u/1/103731766138074233701?prsrc=4> can
>> answer it.
>>
>>
>>
>> The next update about the data parsing architecture.
>>
>> The new requirement is to come up with a framework which is capable of
>> parsing any kind of document to a known type when the metadata is given. By
>> this new design, data parsing will not be restricted only to
>> experiments(Gaussian, Molpro, etc.)
>>
>>
>>
>> The following architecture is designed according to the requirements
>> specified by @dimuthu in the last GSoC meeting.
>>
>>
>>
>> The following diagram depicts the top level architecture.
>>
>>
>>
>> <suggested architecture.png>
>>
>> Following are the key components.
>>
>>
>>
>> *Abstract Parser *
>>
>> This is a basic template for the Parser which specifies the parameters
>> required for parsing task. For example, input file type, output file type,
>> experiment type( if this is related to an experiment), etc.
>>
>>
>>
>> *Parser Manager*
>>
>> Constructs the set of parsers considering the input file type, output
>> file type, and the experiment type.
>>
>> Parser Manager will construct a graph to find the shortest path between
>> input file type and output file type. Then it will return the constructed
>> set of Parsers.
>>
>>
>>
>> <graph.png>
>>
>> *Catalog *
>>
>> A mapping which has records to get a Docker container that can be used to
>> parse from one file type to another file type. For example, if the
>> requirement is to parse a *Gaussian .out file to JSON* then *"app/gaussian
>> .out to JSON"* docker will be fetched
>>
>>
>>
>> *Parsers*
>>
>> There are two types of parsers (according to the suggested way)
>>
>>
>>
>> The first type is the parsers those will be directly coded into the
>> project code-base. For example, parsing Text file to a JSON will be
>> straightforward, then it is not necessarily required to maintain a separate
>> docker container to convert text file to JSON. With the help of a library
>> and putting an entry to the catalog will be enough to get the work done.
>>
>>
>>
>> The second type is parsers which have a separate docker container. For
>> example Gaussian .out file to JSON docker container
>>
>>
>>
>> For the overall scenario consider the following examples to get an idea
>>
>>
>>
>> *Example 1*
>>
>> Suppose a PDF should be parsed to XML
>>
>> Parser Manager will look the catalog and find the shortest path to get
>> the XML output from PDF. The available parsers are(both the coded parsers
>> in the project and the dockerized parsers),
>>
>> • PDF to Text
>>
>> • Text to JSON
>>
>> • JSON to XML
>>
>> • application/gaussian .out to JSON (This is a very specific parsing
>> mechanism not similar  to parsing a simple .out file to a JSON)
>>
>> and the rest which I have included in the diagram
>>
>>
>>
>> Then Parser Manager will construct the graph and find the shortest path
>> as
>>
>> *PDF -> Text -> JSON -> XML* from the available parsers.
>>
>>
>>
>> <graph 2.png>
>>
>> Then Parser Manager will return 3 Parsers. From the three parsers a DAG
>> will be constructed as follows,
>>
>>
>>
>> <parser dag.png>
>>
>> The reason for this architectural decision to have three parsers than
>> doing in the single parser because if one of the parsers fails it would be
>> easy to identify which parser it is.
>>
>>
>>
>> *Example 2*
>>
>> Consider a separate example to parse a Gaussian *.out* file to *JSON* then
>> it is pretty straightforward. Same as the aforementioned example it will
>> construct a Parser which linking the dockerized *app/gaussian .out to
>> JSON* container.
>>
>>
>>
>> *Example 3*
>>
>> Problem is when it is needed to parse a Gaussian *.out* file to *XML*.
>> There are two options.
>>
>>
>>
>> *1st option* - If an application related parsing should happen there
>> must be application typed parsers to get the work done if not it is not
>> allowed.
>>
>> In the list of parsers, there is no application related parser to convert
>> *.out* file to *XML*. In this case even Parser Manager could construct a
>> path like,
>>
>> *.out/gaussian -> JSON/gaussian -> XML*, this process is not allowed.
>>
>>
>>
>> *2nd option* - Once the application-specific content has been parsed it
>> will be same as converting a normal JSON to XML assuming that we could
>> allow the path
>>
>> *.out/gaussian -> JSON/gaussian -> XML*.
>>
>> What actually should be done? 1st option or the 2nd option? This is one
>> point I need a suggestion.
>>
>>
>>
>> I would really appreciate any suggestions to improve this.
>>
>>
>>
>> [1] https://github.com/Lahiru-J/airavata-data-parser
>>
>> [2] https://github.com/Lahiru-J/airavata-data-parser/blob/ma
>> ster/datacat/gaussian/gaussian.py#L191-L288
>>
>> [3] https://github.com/Lahiru-J/airavata-data-parser/blob/ma
>> ster/datacat/gamess/gamess.py#L76-L175
>>
>> [4] https://github.com/Lahiru-J/airavata-data-parser/blob/ma
>> ster/datacat/molpro/molpro.py#L76-L175
>>
>> [5] https://github.com/Lahiru-J/airavata-data-parser/blob/ma
>> ster/datacat/molpro/molpro.py#L76-L175
>>
>>
>>
>> Cheers,
>>
>>
>>
>> On 28 May 2018 at 18:05, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>> wrote:
>>
>> Note this is the High-level architecture diagram. (Since it was not
>> visible in the previous email)
>>
>>
>>
>> <Screen Shot 2018-05-28 at 9.30.43 AM.png>
>>
>> Thanks,
>>
>> Lahiru
>>
>>
>>
>> On 28 May 2018 at 18:02, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>> wrote:
>>
>> Hi Everyone,
>>
>>
>>
>> During the past few days, I’ve been implementing the tasks which are
>> related to the Data Parsing. To give a heads up, the following image
>> depicts the top level architecture of the implementation.
>>
>>
>>
>> [image: mage removed by sender.]
>>
>> Following are the main task components have been identified,
>>
>>
>>
>> *1. DataParsing Task*
>>
>> This task will get the stored output and will find the matching Parser
>> (Gaussian, Lammps, QChem, etc.) and send the output through the selected
>> parser to get a well-structured JSON
>>
>>
>>
>> *2. Validating Task*
>>
>> This is to validate the desired JSON output is achieved or not. That is
>> JSON output should match with the respective schema(Gaussian Schema, Lammps
>> Schema, QChem Schema, etc.)
>>
>>
>>
>> *3. Persisting Task*
>>
>> This task will persist the validated JSON outputs
>>
>>
>>
>> The successfully stored outputs will be exposed to the outer world.
>>
>>
>>
>>
>>
>> According to the diagram the generated JSON should be shared between the
>> tasks(DataParsing, Validating, and, Persisting tasks). Neither DataParsing
>> task nor Validating task persists the JSON, therefore, helix task framework
>> should make sure to share the content between the tasks.
>>
>>
>>
>> In this Helix tutorial [1] it says how to share the content between Helix
>> tasks. The problem is, the method [2] which has been given only capable of
>> sharing String typed key-value data.
>>
>> However, I can come up with an implementation to share all the values
>> related to the JSON output. That involves calling this method [2] many
>> times. I believe that is not a very efficient method because Helix task
>> framework has to call this [3] method many times (taking into consideration
>> that the generated JSON output can be larger).
>>
>>
>>
>> I have already sent an email to the Helix mailing list to clarify whether
>> there is another way and also will it be efficient if this method [2] is
>> called multiple times to get the work done.
>>
>>
>>
>> Am I on the right track? Your suggestions would be very helpful and
>> please add if anything is missing.
>>
>>
>>
>>
>>
>> [1] http://helix.apache.org/0.8.0-docs/tutorial_task_framewo
>> rk.html#Share_Content_Across_Tasks_and_Jobs
>>
>> [2] https://github.com/apache/helix/blob/helix-0.6.x/helix-c
>> ore/src/main/java/org/apache/helix/task/UserContentStore.java#L75
>>
>> [3] https://github.com/apache/helix/blob/helix-0.6.x/helix-c
>> ore/src/main/java/org/apache/helix/task/TaskUtil.java#L361
>>
>>
>>
>> Thanks,
>>
>> Lahiru
>>
>>
>>
>> On 26 March 2018 at 19:44, Lahiru Jayathilake <la...@cse.mrt.ac.lk>
>> wrote:
>>
>> Hi Dimuthu, Suresh,
>>
>>
>>
>> Thanks a lot for the feedback. I will update the proposal accordingly.
>>
>>
>>
>> Regards,
>>
>> Lahiru
>>
>>
>>
>> On 26 March 2018 at 08:48, Suresh Marru <sm...@apache.org> wrote:
>>
>> Hi Lahiru,
>>
>>
>>
>> I echo Dimuthu’s comment. You have a good starting point, it will be nice
>> if you can cover how users can interact with the parsed data. Essentially
>> adding API access to the parsed metadata database and having proof of
>> concept UI’s. This task could be challenging as the queries are very data
>> specific and generalizing API access and building custom UI’s can be
>> explanatory (less  defined) portions of your proposal.
>>
>>
>>
>> Cheers,
>>
>> Suresh
>>
>>
>>
>>
>>
>> On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>>
>>
>>
>> Hi Lahiru,
>>
>>
>>
>> Nice document. And I like how you illustrate the systems through
>> diagrams. However try to address how you are going to expose parsed data to
>> outside through thrift APIs and how to design those data APIs in
>> application specific manner. And in the persisting task, you have to make
>> sure data integrity is preserved. For example in a Gaussian parsed output,
>> you might have to validate the parsed output using a schema before
>> persisting them in the database.
>>
>>
>>
>> Thanks
>>
>> Dimuthu
>>
>>
>>
>> On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <
>> lahiruj.14@cse.mrt.ac.lk> wrote:
>>
>> Hi Everyone,
>>
>>
>>
>> I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718
>> [2]. Any comments would be very helpful to improve it.
>>
>>
>>
>> [1] https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJ
>> JNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
>>
>> [2] https://issues.apache.org/jira/browse/AIRAVATA-2718
>>
>>
>>
>> Thanks & Regards,
>>
>> --
>>
>> Lahiru Jayathilake
>>
>> Department of Computer Science and Engineering,
>>
>> Faculty of Engineering,
>>
>> University of Moratuwa
>>
>>
>>
>> [image: mage removed by sender.]
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Lahiru Jayathilake
>>
>> Department of Computer Science and Engineering,
>>
>> Faculty of Engineering,
>>
>> University of Moratuwa
>>
>>
>>
>> [image: mage removed by sender.]
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>>
>>
>>
>>
>> --
>>
>> Lahiru Jayathilake
>>
>> Department of Computer Science and Engineering,
>>
>> Faculty of Engineering,
>>
>> University of Moratuwa
>>
>>
>>
>> [image: mage removed by sender.]
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>>
>>
>>
>>
>> --
>>
>> Lahiru Jayathilake
>>
>> Department of Computer Science and Engineering,
>>
>> Faculty of Engineering,
>>
>> University of Moratuwa
>>
>>
>>
>> [image: mage removed by sender.]
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>>
>>
>>
>>
>> --
>>
>> Lahiru Jayathilake
>>
>> Department of Computer Science and Engineering,
>>
>> Faculty of Engineering,
>>
>> University of Moratuwa
>>
>>
>>
>> [image: mage removed by sender.]
>> <https://lk.linkedin.com/in/lahirujayathilake>
>>
>>
>>
>
>


-- 
Lahiru Jayathilake
Department of Computer Science and Engineering,
Faculty of Engineering,
University of Moratuwa

<https://lk.linkedin.com/in/lahirujayathilake>

Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi Lahiru

Thanks for sharing this with the dev list. I would like to suggest a few
changes to your data parsing framework. Please have a look at the following
diagram.



I would like to come up with a sample use case so that you can understand
the data flow.

I have an application output file called gaussian.out and I need to parse
it into a JSON file. However, you only have a parser that can parse
Gaussian files into XML format, and another parser that can parse XML files
into JSON. You have a parser catalog that contains all the details about
the parsers you currently have, and you can filter out the necessary
parsers based on metadata like application type, output type, input type,
etc.

The challenge is how we are going to combine these two parsers in the
correct order and how the data passing between these parsers is going to be
handled. That's where we need a workflow manager. The workflow manager gets
your requirement, then talks to the catalog to fetch the necessary parser
information and builds the correct parser DAG. Once the DAG is finalized,
it can be passed to Helix to execute. There could be multiple DAGs that can
achieve the same requirement, but the workflow manager should select the
most constrained path.
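
Just to make the "most constrained path" idea concrete, one way the
workflow manager could rank candidate chains before handing the winner to
Helix is sketched below; the scoring is only one possible reading of "most
constrained" (prefer application-specific parsers, then shorter chains),
not an existing Airavata or Helix API.

def select_most_constrained(candidate_chains, request):
    # each candidate chain is a list of parser catalog entries (dicts)
    app = request.get("applicationType")
    def score(chain):
        app_matches = sum(1 for parser in chain
                          if app and parser.get("applicationType") == app)
        # more application-specific steps first, then the shorter chain
        return (app_matches, -len(chain))
    return max(candidate_chains, key=score)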

What do you think?

Thanks
Dimuthu


Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Posted by "Pierce, Marlon" <ma...@iu.edu>.
Yes, +1 on the detailed email summaries.

 

Marlon

 

 

From: Suresh Marru <sm...@apache.org>
Reply-To: "dev@airavata.apache.org" <de...@airavata.apache.org>
Date: Friday, June 22, 2018 at 8:46 AM
To: Airavata Dev <de...@airavata.apache.org>
Cc: Supun Nakandala <su...@gmail.com>
Subject: Re: [GSoC] Re-architect Output Data Parsing into Airavata core

 

Hi Lahiru, 

 

Thank you for sharing the detailed summary. I do not have comments on your questions, may be Supun can weigh in. I have couple of meta requests though:

 

Can you consider adding few Molecular dynamics parsers in this order LAMMPS,  Amber, and CHARMM. The cclib library you used for others do not cover these, but InterMol [1] provides a python library to parse these. We have to be careful here, InterMol itself is MIT licensed and we can have its dependency but it depends upon ParamEd[2] which is LGPL license. Its a TODO for me on how to deal wit this but please see if you can include adding these parsers into your timeline. 

 

Can you evaluate if we can provide export to Quantum Chemistry JSON Scheme [3]? Is this is trivial we can pursue it. 

 

Lastly, can you see if Apache Tikka will help with any of your efforts. 

 

I will say my kudos again for your mailing list communications,

Suresh 

 

[1] - https://github.com/shirtsgroup/InterMol

[2] - https://github.com/ParmEd/ParmEd 

[3] - https://github.com/MolSSI/QC_JSON_Schema 

 



On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <la...@cse.mrt.ac.lk> wrote:

 

Hi Everyone, 

 

In the last couple of days, I've been working on the data parsing tasks. To give an update about it, I have already converted the code-base of Gaussian, Molpro, Newchem, and Gamess parsers to python[1]. With compared to code-base of seagrid-data there won't be any codes related to experiments in the project(for example no JSON mappings). The main reason for doing this because to de-couple experiments with the data parsing tasks. 

 

While I was converting the codes of Gaussian, Molpro, Newchem, and Gamess I found some JSON key value-pairs in the data-catalog docker container have not been used in the seagrid-data to generate the final output file. I have commented unused key-value pairs in the code itself [2], [3], [4], [5]. I would like to know is there any specific reason for this, hope @Supun Nakandala can answer it. 

 

The next update about the data parsing architecture.

The new requirement is to come up with a framework which is capable of parsing any kind of document to a known type when the metadata is given. By this new design, data parsing will not be restricted only to experiments(Gaussian, Molpro, etc.)  

 

The following architecture is designed according to the requirements specified by @dimuthu in the last GSoC meeting.

 

The following diagram depicts the top level architecture.

 

<suggested architecture.png>

​

Following are the key components.

 

Abstract Parser 

This is a basic template for the Parser which specifies the parameters required for parsing task. For example, input file type, output file type, experiment type( if this is related to an experiment), etc.

 

Parser Manager

Constructs the set of parsers considering the input file type, output file type, and the experiment type.

Parser Manager will construct a graph to find the shortest path between input file type and output file type. Then it will return the constructed set of Parsers.

 

<graph.png>

​Catalog 

A mapping which has records to get a Docker container that can be used to parse from one file type to another file type. For example, if the requirement is to parse a Gaussian .out file to JSON then "app/gaussian .out to JSON" docker will be fetched

 

Parsers

There are two types of parsers (according to the suggested way) 

 

The first type is the parsers those will be directly coded into the project code-base. For example, parsing Text file to a JSON will be straightforward, then it is not necessarily required to maintain a separate docker container to convert text file to JSON. With the help of a library and putting an entry to the catalog will be enough to get the work done.

 

The second type is parsers which have a separate docker container. For example Gaussian .out file to JSON docker container

 

For the overall scenario consider the following examples to get an idea

 

Example 1

Suppose a PDF should be parsed to XML

Parser Manager will look the catalog and find the shortest path to get the XML output from PDF. The available parsers are(both the coded parsers in the project and the dockerized parsers),

• PDF to Text

• Text to JSON

• JSON to XML

• application/gaussian .out to JSON (This is a very specific parsing mechanism not similar  to parsing a simple .out file to a JSON)

and the rest which I have included in the diagram

 

Then Parser Manager will construct the graph and find the shortest path as 

PDF -> Text -> JSON -> XML from the available parsers. 

 

<graph 2.png>

Then Parser Manager will return 3 Parsers. From the three parsers a DAG will be constructed as follows,

 

<parser dag.png>

​

The reason for this architectural decision to have three parsers than doing in the single parser because if one of the parsers fails it would be easy to identify which parser it is. 

 

Example 2

Consider a separate example to parse a Gaussian .out file to JSON then it is pretty straightforward. Same as the aforementioned example it will construct a Parser which linking the dockerized app/gaussian .out to JSON container. 

 

Example 3

Problem is when it is needed to parse a Gaussian .out file to XML. There are two options.

 

1st option - If an application related parsing should happen there must be application typed parsers to get the work done if not it is not allowed. 

In the list of parsers, there is no application related parser to convert .out file to XML. In this case even Parser Manager could construct a path like, 

.out/gaussian -> JSON/gaussian -> XML, this process is not allowed.

 

2nd option - Once the application-specific content has been parsed it will be same as converting a normal JSON to XML assuming that we could allow the path 

.out/gaussian -> JSON/gaussian -> XML. 

What actually should be done? 1st option or the 2nd option? This is one point I need a suggestion.

 

I would really appreciate any suggestions to improve this.

 

[1] https://github.com/Lahiru-J/airavata-data-parser

[2] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288

[3] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175 

[4] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175

[5] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175

 

Cheers,

 

On 28 May 2018 at 18:05, Lahiru Jayathilake <la...@cse.mrt.ac.lk> wrote:

Note this is the High-level architecture diagram. (Since it was not visible in the previous email) 

 

<Screen Shot 2018-05-28 at 9.30.43 AM.png>

​Thanks,

Lahiru

 

On 28 May 2018 at 18:02, Lahiru Jayathilake <la...@cse.mrt.ac.lk> wrote:

Hi Everyone, 

 

During the past few days, I’ve been implementing the tasks which are related to the Data Parsing. To give a heads up, the following image depicts the top level architecture of the implementation.

 

​

Following are the main task components have been identified,

 

1. DataParsing Task

This task will get the stored output and will find the matching Parser (Gaussian, Lammps, QChem, etc.) and send the output through the selected parser to get a well-structured JSON

 

2. Validating Task

This is to validate the desired JSON output is achieved or not. That is JSON output should match with the respective schema(Gaussian Schema, Lammps Schema, QChem Schema, etc.)

 

3. Persisting Task

This task will persist the validated JSON outputs

 

The successfully stored outputs will be exposed to the outer world. 

 

 

According to the diagram the generated JSON should be shared between the tasks(DataParsing, Validating, and, Persisting tasks). Neither DataParsing task nor Validating task persists the JSON, therefore, helix task framework should make sure to share the content between the tasks.

 

In this Helix tutorial [1] it says how to share the content between Helix tasks. The problem is, the method [2] which has been given only capable of sharing String typed key-value data. 

However, I can come up with an implementation to share all the values related to the JSON output. That involves calling this method [2] many times. I believe that is not a very efficient method because Helix task framework has to call this [3] method many times (taking into consideration that the generated JSON output can be larger).

 

I have already sent an email to the Helix mailing list to clarify whether there is another way and also will it be efficient if this method [2] is called multiple times to get the work done.

 

Am I on the right track? Your suggestions would be very helpful and please add if anything is missing.

 

 

[1] http://helix.apache.org/0.8.0-docs/tutorial_task_framework.html#Share_Content_Across_Tasks_and_Jobs

[2] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/UserContentStore.java#L75

[3] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L361

 

Thanks,

Lahiru

 

On 26 March 2018 at 19:44, Lahiru Jayathilake <la...@cse.mrt.ac.lk> wrote:

Hi Dimuthu, Suresh, 

 

Thanks a lot for the feedback. I will update the proposal accordingly.

 

Regards,

Lahiru

 

On 26 March 2018 at 08:48, Suresh Marru <sm...@apache.org> wrote:

Hi Lahiru, 

 

I echo Dimuthu’s comment. You have a good starting point, it will be nice if you can cover how users can interact with the parsed data. Essentially adding API access to the parsed metadata database and having proof of concept UI’s. This task could be challenging as the queries are very data specific and generalizing API access and building custom UI’s can be explanatory (less  defined) portions of your proposal. 

 

Cheers,

Suresh 

 



On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <di...@gmail.com> wrote:

 

Hi Lahiru, 

 

Nice document, and I like how you illustrate the systems through diagrams. However, try to address how you are going to expose parsed data to the outside world through Thrift APIs, and how to design those data APIs in an application-specific manner. Also, in the persisting task you have to make sure data integrity is preserved; for example, for a Gaussian parsed output you might have to validate the parsed output against a schema before persisting it in the database.

 

Thanks

Dimuthu

 

On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <la...@cse.mrt.ac.lk> wrote:

Hi Everyone, 

 

I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718 [2]. Any comments would be very helpful to improve it.

 

[1] https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJJNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing 

[2] https://issues.apache.org/jira/browse/AIRAVATA-2718

 

Thanks & Regards,

-- 

Lahiru Jayathilake 

Department of Computer Science and Engineering,

Faculty of Engineering,

University of Moratuwa

 

 

 



 


 

 


Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Posted by Suresh Marru <sm...@apache.org>.
Hi Lahiru,

Thank you for sharing the detailed summary. I do not have comments on your questions, may be Supun can weigh in. I have couple of meta requests though:

Can you consider adding a few molecular dynamics parsers, in this order: LAMMPS, Amber, and CHARMM. The cclib library you used for the others does not cover these, but InterMol [1] provides a Python library to parse them. We have to be careful here: InterMol itself is MIT-licensed and we can take it as a dependency, but it depends upon ParmEd [2], which is LGPL-licensed. It is a TODO for me to figure out how to deal with this, but please see if you can include adding these parsers in your timeline.

Can you evaluate whether we can provide export to the Quantum Chemistry JSON Schema [3]? If this is trivial, we can pursue it.

Lastly, can you see if Apache Tika will help with any of your efforts?

My kudos again for your mailing list communications,
Suresh 

[1] - https://github.com/shirtsgroup/InterMol
[2] - https://github.com/ParmEd/ParmEd
[3] - https://github.com/MolSSI/QC_JSON_Schema


> On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <la...@cse.mrt.ac.lk> wrote:
> 
> Hi Everyone,
> 
> In the last couple of days, I've been working on the data parsing tasks. To give an update about it, I have already converted the code-base of Gaussian, Molpro, Newchem, and Gamess parsers to python[1]. With compared to code-base of seagrid-data there won't be any codes related to experiments in the project(for example no JSON mappings). The main reason for doing this because to de-couple experiments with the data parsing tasks. 
> 
> While I was converting the codes of Gaussian, Molpro, Newchem, and Gamess I found some JSON key value-pairs in the data-catalog docker container have not been used in the seagrid-data to generate the final output file. I have commented unused key-value pairs in the code itself [2], [3], [4], [5]. I would like to know is there any specific reason for this, hope @Supun Nakandala <https://plus.google.com/u/1/103731766138074233701?prsrc=4> can answer it. 
> 
> The next update about the data parsing architecture.
> The new requirement is to come up with a framework which is capable of parsing any kind of document to a known type when the metadata is given. By this new design, data parsing will not be restricted only to experiments(Gaussian, Molpro, etc.)  
> 
> The following architecture is designed according to the requirements specified by @dimuthu in the last GSoC meeting.
> 
> The following diagram depicts the top level architecture.
> 
> <suggested architecture.png>
> ​
> Following are the key components.
> 
> Abstract Parser 
> This is a basic template for the Parser which specifies the parameters required for parsing task. For example, input file type, output file type, experiment type( if this is related to an experiment), etc.
> 
> Parser Manager
> Constructs the set of parsers considering the input file type, output file type, and the experiment type.
> Parser Manager will construct a graph to find the shortest path between input file type and output file type. Then it will return the constructed set of Parsers.
> 
> <graph.png>
> ​Catalog 
> A mapping which has records to get a Docker container that can be used to parse from one file type to another file type. For example, if the requirement is to parse a Gaussian .out file to JSON then "app/gaussian .out to JSON" docker will be fetched
> 
> Parsers
> There are two types of parsers (according to the suggested way) 
> 
> The first type is the parsers those will be directly coded into the project code-base. For example, parsing Text file to a JSON will be straightforward, then it is not necessarily required to maintain a separate docker container to convert text file to JSON. With the help of a library and putting an entry to the catalog will be enough to get the work done.
> 
> The second type is parsers which have a separate docker container. For example Gaussian .out file to JSON docker container
> 
> For the overall scenario consider the following examples to get an idea
> 
> Example 1
> Suppose a PDF should be parsed to XML
> Parser Manager will look the catalog and find the shortest path to get the XML output from PDF. The available parsers are(both the coded parsers in the project and the dockerized parsers),
> • PDF to Text
> • Text to JSON
> • JSON to XML
> • application/gaussian .out to JSON (This is a very specific parsing mechanism not similar  to parsing a simple .out file to a JSON)
> and the rest which I have included in the diagram
> 
> Then Parser Manager will construct the graph and find the shortest path as 
> PDF -> Text -> JSON -> XML from the available parsers. 
> 
> <graph 2.png>
> Then Parser Manager will return 3 Parsers. From the three parsers a DAG will be constructed as follows,
> 
> <parser dag.png>
> ​
> The reason for this architectural decision to have three parsers than doing in the single parser because if one of the parsers fails it would be easy to identify which parser it is. 
> 
> Example 2
> Consider a separate example to parse a Gaussian .out file to JSON then it is pretty straightforward. Same as the aforementioned example it will construct a Parser which linking the dockerized app/gaussian .out to JSON container. 
> 
> Example 3
> Problem is when it is needed to parse a Gaussian .out file to XML. There are two options.
> 
> 1st option - If an application related parsing should happen there must be application typed parsers to get the work done if not it is not allowed. 
> In the list of parsers, there is no application related parser to convert .out file to XML. In this case even Parser Manager could construct a path like, 
> .out/gaussian -> JSON/gaussian -> XML, this process is not allowed.
> 
> 2nd option - Once the application-specific content has been parsed it will be same as converting a normal JSON to XML assuming that we could allow the path 
> .out/gaussian -> JSON/gaussian -> XML. 
> What actually should be done? 1st option or the 2nd option? This is one point I need a suggestion.
>  
> I would really appreciate any suggestions to improve this.
> 
> [1] https://github.com/Lahiru-J/airavata-data-parser <https://github.com/Lahiru-J/airavata-data-parser>
> [2] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288 <https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288>
> [3] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175 <https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175> 
> [4] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175 <https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175>
> [5] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175 <https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175>
> 
> Cheers,
> 