Posted to dev@marmotta.apache.org by Junyue Wang <ju...@gmail.com> on 2015/07/01 05:16:28 UTC

Re: (GSoC 2015: MARMOTTA-593) Project Midterm Report

Hello Peter,

The problem is that there is no up-to-date, complete, and detailed
specification of RDF HDT. The W3C submission [1] from 2011 is out of date.
The documentation [2] is newer, but it gives only a general description
without much detail. For example, the first few bytes of an RDF HDT file
are the "Global ControlInformation", but neither of the two documents
describes its layout. The format field of the "Global ControlInformation"
should be "<http://purl.org/HDT/hdt#HDTv1>", but neither document
mentions this.
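
For illustration, here is a minimal Java sketch of reading that leading
section. The byte layout is an assumption taken from reading the hdt-java
source, not from either document: a "$HDT" cookie, one type byte, a
NUL-terminated format string, a NUL-terminated "key=value;" property list,
and a trailing CRC16 that this sketch does not verify. The class, field,
and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ControlInfoSketch {

    /** Parsed header fields. All names here are hypothetical. */
    public static final class ControlInfo {
        public int type;
        public String format;
        public final Map<String, String> properties = new LinkedHashMap<>();
    }

    // ASSUMPTION: layout inferred from the hdt-java source, not from a spec:
    // "$HDT" cookie, one type byte, NUL-terminated format string,
    // NUL-terminated "key=value;" properties, then a CRC16 (not checked here).
    public static ControlInfo read(InputStream in) throws IOException {
        byte[] cookie = new byte[4];
        for (int i = 0; i < cookie.length; i++) {
            cookie[i] = (byte) readByte(in);
        }
        if (!"$HDT".equals(new String(cookie, StandardCharsets.US_ASCII))) {
            throw new IOException("Not an HDT control section");
        }
        ControlInfo ci = new ControlInfo();
        ci.type = readByte(in);
        ci.format = readUntilNul(in);
        // Split "k1=v1;k2=v2;" into entries; items without '=' are skipped.
        for (String item : readUntilNul(in).split(";")) {
            int eq = item.indexOf('=');
            if (eq > 0) {
                ci.properties.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return ci;
    }

    // Reads one byte and fails loudly on end of stream.
    private static int readByte(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("Truncated control section");
        }
        return b;
    }

    // Collects bytes up to (not including) the next NUL byte.
    private static String readUntilNul(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = readByte(in)) != 0) {
            buf.write(b);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hand-built sample bytes following the assumed layout (no CRC).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write("$HDT".getBytes(StandardCharsets.US_ASCII));
        out.write(1); // placeholder type value
        out.write("<http://purl.org/HDT/hdt#HDTv1>".getBytes(StandardCharsets.UTF_8));
        out.write(0);
        out.write("key=value;".getBytes(StandardCharsets.UTF_8));
        out.write(0);
        ControlInfo ci = read(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(ci.format + " " + ci.properties);
    }
}
```

A real parser would also need to validate the trailing checksum and the
type byte against whatever the current HDT implementation writes.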

I've asked the authors of RDF HDT for an up-to-date specification, and
I've also asked about the licence issue on legal-discuss@, but I have not
received a useful reply so far.

To write the parser from scratch, I had to study the source code of the
HDT Java implementation (LGPL licence), specifically HDTImpl.java [3]. I
then rewrote the code in my own way with the same functionality. For
example, ControlInformation in the HDT Java implementation is written in
an object-oriented style, whereas I implemented it with a few plain
functions/methods, with much of the approach inspired by BinaryRDFParser
[4] in Sesame (BSD licence?). However, I did borrow some low-level
byte-processing code from the HDT Java implementation. Is this OK with
respect to the licence?

yours,
Junyue

[1] http://www.w3.org/Submission/2011/SUBM-HDT-20110330/
[2] http://www.rdfhdt.org/hdt-internals/
[3] https://github.com/rdfhdt/hdt-java/blob/master/hdt-java-core/src/main/java/org/rdfhdt/hdt/hdt/impl/HDTImpl.java
[4] http://grepcode.com/file/repo1.maven.org/maven2/org.openrdf.sesame/sesame-rio-binary/2.7.14/org/openrdf/rio/binary/BinaryRDFParser.java/



On Sun, Jun 28, 2015 at 3:34 PM, Peter Ansell <an...@gmail.com> wrote:

> Hi Junyue,
>
> Thanks for the update. See some comments inline below.
>
> On 28 June 2015 at 00:17, Junyue Wang <ju...@gmail.com> wrote:
> > Hi Peter, Sergio,
> >
> > Here is a summary of the status for the first half of the GSoC
> > project:
> >
> > 1. Test data preparation
> > It's useful to have HDT test files prepared for testing the new
> > parser, but the datasets from [1] are too big for small tests, so I
> > borrowed some examples from the W3C RDF documentation [2]. I used the
> > HDT Java implementation to transform example02.rdf through
> > example20.rdf into test02.hdt through test20.hdt in the code base [3].
>
> Having small tight examples is vital for unit testing, so that sounds
> good to me, as long as the current spec is backwards compatible with
> it.
>
> > 2. HDT RDF parser based on HDT Java implementation
> > I'm sorry that the project goal was misunderstood during the project
> > proposal period. For the first few weeks of the project, I worked on
> > coding the HDT RDF parser based on the HDT Java implementation. I also
> > sent an email to legal-discuss@ to clarify the licence issue, but
> > there has been no response so far. Anyway, I committed the code [4]
> > in case it is useful in the future.
>
> We can always rebase that commit out when contributing the final patch
> back, if it is an issue.
>
> > 3. HDT RDF parser from scratch
> > I've begun to code the HDT RDF parser from scratch. The new parser
> > can now parse the Global Information of the HDT files [5]. I'll
> > continue this way for the second half of the project.
>
> That looks like a good start. See how you go after that parsing the
> other two sections and do let us know if you have any issues or
> queries.
>
> Thanks,
>
> Peter
>
> > yours,
> > Junyue
> >
> > [1] http://www.rdfhdt.org/datasets/
> > [2] https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-xml/index.html
> > [3] https://github.com/junyuew/marmotta/tree/MARMOTTA-593/commons/marmotta-sesame-tools/marmotta-rio-rdfhdt/src/test/resources/org/apache/marmotta/commons/sesame/rio/rdfhdt
> > [4] https://github.com/junyuew/marmotta/commit/e4b5d7492f102711c1227f592a36e26353f33812
> > [5] https://github.com/junyuew/marmotta/commit/a7711b8338aafda9d812f0f2bb98cbde53a7cefa
> >
> >
>

Re: (GSoC 2015: MARMOTTA-593) Project Midterm Report

Posted by Peter Ansell <an...@gmail.com>.
Hi Junyue,

I was under the impression when we proposed the project that the W3C
Submission was up to date. Given that it isn't up to date and the HDT
team have not submitted any updates or formally documented their
changes since then, I would focus on making it work and we can focus
on the legal issues later. It is a shame that they referred us to
updated documentation that isn't actually a specification and they
haven't written up their changes properly outside of their codebase.

As far as I know, you are fine to derive code patterns from Sesame,
based on its BSD license. Marmotta already includes Sesame as a
dependency.

The most important thing at this stage is for you to get the work
done, given that the Apache lawyers have not replied at all yet.

We won't merge your patches into Marmotta until the legal team sign
off on them, but I don't want that to affect your GSOC progress so try
not to stress about that part right now. I would be happy for you to
have a fully functioning, interoperable, HDT parser and writer by the
end of your GSOC project, even if it isn't merged in when you complete
the GSOC time.

Thanks,

Peter

