Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2015/10/03 20:21:03 UTC

JSON-LD upgrade - impact on Elephas

Upgrading the dependency for jsonld-java to 0.7.0 picks up a bug fix 
(jsonld-java issue 144 [1]) that Jena has a workaround for.

The issue is that the Jackson JSON parser does not flag trailing junk. 
It reads the JSON object and stops there.  Worse, it creates a buffered 
reader so the caller can't handle the stream afterwards.

---------------
{
   "@id" : "http://example/s",
   "http://example/p" : "str"
}
xxxxxxxxxxxxxxx
---------------
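
A minimal sketch of the kind of check involved, using only the Java 
standard library rather than Jackson: find the end of the first balanced 
{...} and see whether anything other than whitespace follows.  The class 
and method names are hypothetical; this is not the code in JsonLdReader 
or in jsonld-java, and it does not look at text before the opening brace.

```java
final class TrailingJunkCheck {
    /**
     * Returns the index just past the first balanced top-level {...},
     * or -1 if no object is closed. Braces inside strings are ignored.
     */
    public static int endOfFirstObject(String text) {
        int depth = 0;
        boolean inString = false;
        boolean started = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                    // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
                started = true;
            } else if (c == '}') {
                depth--;
                if (started && depth == 0) {
                    return i + 1;
                }
            }
        }
        return -1;
    }

    /** True if anything other than whitespace follows the first object. */
    public static boolean hasTrailingJunk(String text) {
        int end = endOfFirstObject(text);
        if (end < 0) {
            throw new IllegalArgumentException("no complete JSON object");
        }
        return !text.substring(end).trim().isEmpty();
    }

    public static void main(String[] args) {
        String good = "{ \"@id\" : \"http://example/s\", \"http://example/p\" : \"str\" }\n";
        String bad = good + "xxxxxxxxxxxxxxx\n";
        System.out.println(hasTrailingJunk(good));  // clean document
        System.out.println(hasTrailingJunk(bad));   // document above, with the junk line
    }
}
```

The point of the sketch is only that the check needs access to the text 
after the closing brace, which is what a buffering parser takes away 
from the caller.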

Jena (JsonLdReader) contains code taken from jsonld-java and modified to 
run the Jackson JSON parser, produce triples and then check for trailing 
junk.  The trailing-junk detection was contributed back to that project 
(PR 145).

jsonld-java treats it more systematically.

If the JSON is syntactically bad inside the {}, no triples emerge. The 
process is to read the JSON object completely, then let the RDF 
conversion run.  Bad object -> no RDF at all.

If there is trailing junk, it is detected before the JSON object is 
passed up, so trailing junk -> no triples, unlike Jena currently.

I had hoped to remove the workaround and not duplicate jsonld-java code.

Elephas testing is impacted. It is sensitive to the "JSON object, 
trailing junk, triples" vs "JSON object, triples, trailing junk" 
differences.

Unless there is a specific reason to support the current Jena behaviour, 
I'd like to switch to the jsonld-java behaviour.

(Rob) Thoughts?

	Andy

[1] https://github.com/jsonld-java/jsonld-java/issues/144

Re: JSON-LD upgrade - impact on Elephas

Posted by Andy Seaborne <an...@apache.org>.

On 13/10/15 14:21, Rob Vesse wrote:
> If the counts are different purely because we are failing in a different
> (but predictable) way then I see no reason not to change them
>
> Rob

Change made - the tests now ask what the expected answers are, and the 
JSON-LD (triples, quads) tests return their own, different answers.

https://github.com/apache/jena/commit/3607c1c9a0760db3a7b92f4bd9a6d2be66fe7d50
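
As a rough illustration of that shape (not the code in the commit 
above), the shared test can ask a hook method for the expected count and 
the JSON-LD subclass can override it.  All class and method names below 
are hypothetical, and readBadFile() is a stand-in for running the real 
input format; the figure 50 is taken from the single_input_05 failure.

```java
final class ExpectedCountSketch {
    /** Tuples in the valid part of the generated bad file. */
    static final int VALID_TUPLES = 50;

    abstract static class AbstractInputCheck {
        /** Stand-in for the real reader: how many tuples reach the consumer. */
        public abstract int readBadFile();

        /** Hook: formats that deliver tuples before hitting the junk keep this default. */
        public int expectedFromBadFile() {
            return VALID_TUPLES;
        }

        /** The shared check, written once for all formats. */
        public final boolean passes() {
            return readBadFile() == expectedFromBadFile();
        }
    }

    static final class StreamingFormatCheck extends AbstractInputCheck {
        @Override
        public int readBadFile() {
            return VALID_TUPLES;    // tuples arrive, then the junk is hit
        }
    }

    static final class JsonLdFormatCheck extends AbstractInputCheck {
        @Override
        public int readBadFile() {
            return 0;               // junk is detected before any conversion
        }

        @Override
        public int expectedFromBadFile() {
            return 0;               // JSON-LD returns its own answer
        }
    }

    public static void main(String[] args) {
        System.out.println(new StreamingFormatCheck().passes());
        System.out.println(new JsonLdFormatCheck().passes());
    }
}
```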

	Andy

Re: JSON-LD upgrade - impact on Elephas

Posted by Rob Vesse <rv...@dotnetrdf.org>.

If the counts are different purely because we are failing in a different
(but predictable) way then I see no reason not to change them

Rob

On 09/10/2015 16:42, "Andy Seaborne" <an...@apache.org> wrote:

>Rob - Would changing the count results be acceptable?
>
>	Andy

Re: JSON-LD upgrade - impact on Elephas

Posted by Andy Seaborne <an...@apache.org>.

Rob - Would changing the count results be acceptable?

	Andy

On 05/10/15 13:22, Andy Seaborne wrote:
> On 05/10/15 09:31, Rob Vesse wrote:
>> Yes the tests are designed to be pragmatic
>>
>> If you are processing large amounts of data on Hadoop there are two
>> cases:
>>
>> - You want to skip/ignore bad data
>> - You want to fail fast on bad data
>>
>> The failing tests are presumably the ones testing the second case.
>
> The failing tests are:
>
> org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDTripleInputTest
>
> single_input_05
> java.lang.AssertionError: expected:<50> but was:<0>
>
> multiple_inputs_02
> java.lang.AssertionError: expected:<10150> but was:<10100>
>
> org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDQuadInputTest
>
> single_input_05
> java.lang.AssertionError: expected:<50> but was:<0>
>
> multiple_inputs_02
> java.lang.AssertionError: expected:<10150> but was:<10100>
>
> so 2 tests, repeated.
>
> See also JENA-1013 which was previous work done in this area - JSON-LD
> Elephas tests were not failing when they were supposed to.
>
>> My
>> general hacky approach to testing that is simply to generate some valid
>> data followed by some junk data.  If we change to the JSON-LD behaviour
>> then those tests in Elephas that cover JSON-LD will need to change to
>> generate a valid JSON object that happens to be invalid wrt. JSON-LD but
>> since I don't know JSON-LD (and have zero desire to learn) I don't know
>> what we'd need to generate to do that
>
> No need to learn anything about JSON-LD.  My knowledge of how Hadoop
> processing works in the presence of failures isn't very strong.
>
> The tests already generate bad data by adding the trailing text "junk
> data" to a valid document - same for all formats.  JSON-LD does not have
> (and never has) the partial set of triples case that other formats have.
> But the Elephas tests don't test for that anyway - the only bad data is
> with the trailing string "junk data".
>
> So the issue is that the JSON-LD processor we use has a particular
> failure mode (which is correct for JSON-LD according to that community)
> that makes those two abstract tests need different answers for JSON-LD.
>   Would changing the count results be acceptable?
>
> This looks like the long-term solution that leads to the least
> maintenance.  We can retain our own code with its different
> characteristics but then we have to maintain it and probably get the
> occasional question as to why Jena is different in behaviour to other
> systems.
>
>      Andy

Re: JSON-LD upgrade - impact on Elephas

Posted by Andy Seaborne <an...@apache.org>.

On 05/10/15 09:31, Rob Vesse wrote:
> Yes the tests are designed to be pragmatic
>
> If you are processing large amounts of data on Hadoop there are two cases:
>
> - You want to skip/ignore bad data
> - You want to fail fast on bad data
>
> The failing tests are presumably the ones testing the second case.

The failing tests are:

org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDTripleInputTest

single_input_05
java.lang.AssertionError: expected:<50> but was:<0>

multiple_inputs_02
java.lang.AssertionError: expected:<10150> but was:<10100>

org.apache.jena.hadoop.rdf.io.input.jsonld.JsonLDQuadInputTest

single_input_05
java.lang.AssertionError: expected:<50> but was:<0>

multiple_inputs_02
java.lang.AssertionError: expected:<10150> but was:<10100>

so 2 tests, repeated.

See also JENA-1013 which was previous work done in this area - JSON-LD 
Elephas tests were not failing when they were supposed to.

> My
> general hacky approach to testing that is simply to generate some valid
> data followed by some junk data.  If we change to the JSON-LD behaviour
> then those tests in Elephas that cover JSON-LD will need to change to
> generate a valid JSON object that happens to be invalid wrt. JSON-LD but
> since I don't know JSON-LD (and have zero desire to learn) I don't know
> what we'd need to generate to do that

No need to learn anything about JSON-LD.  My knowledge of how Hadoop 
processing works in the presence of failures isn't very strong.

The tests already generate bad data by adding the trailing text "junk 
data" to a valid document - same for all formats.  JSON-LD does not have 
(and never has had) the partial-set-of-triples case that other formats 
have.  But the Elephas tests don't test for that anyway - the only bad 
data is the trailing string "junk data".

So the issue is that the JSON-LD processor we use has a particular 
failure mode (which is correct for JSON-LD according to that community) 
that makes those two abstract tests need different answers for JSON-LD. 
Would changing the count results be acceptable?

This looks like the long-term solution that leads to the least 
maintenance.  We can retain our own code with its different 
characteristics but then we have to maintain it and probably get the 
occasional question as to why Jena is different in behaviour to other 
systems.

	Andy

Re: JSON-LD upgrade - impact on Elephas

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Yes, the tests are designed to be pragmatic.

If you are processing large amounts of data on Hadoop there are two cases:

- You want to skip/ignore bad data
- You want to fail fast on bad data
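
The two cases above can be sketched as a single flag on the reader.  
This is a toy under stated assumptions, not the Elephas implementation: 
integers stand in for tuples, and the names parseAll and ignoreBad are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BadDataPolicySketch {
    /**
     * Parses each record as an integer. With ignoreBad set, unparseable
     * records are dropped; otherwise the first one aborts the whole run.
     */
    public static List<Integer> parseAll(List<String> records, boolean ignoreBad) {
        List<Integer> out = new ArrayList<>();
        for (String record : records) {
            try {
                out.add(Integer.parseInt(record.trim()));
            } catch (NumberFormatException e) {
                if (!ignoreBad) {
                    // fail fast: surface the bad record to the caller
                    throw new IllegalStateException("bad record: " + record, e);
                }
                // skip/ignore: drop this record and carry on
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("1", "2", "junk data", "3");
        System.out.println(parseAll(input, true).size());
        try {
            parseAll(input, false);
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```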

The failing tests are presumably the ones testing the second case.  My
general hacky approach to testing that is simply to generate some valid
data followed by some junk data.  If we change to the JSON-LD behaviour
then those tests in Elephas that cover JSON-LD will need to change to
generate a valid JSON object that happens to be invalid wrt. JSON-LD, but
since I don't know JSON-LD (and have zero desire to learn) I don't know
what we'd need to generate to do that.

Rob

On 04/10/2015 10:02, "Andy Seaborne" <an...@apache.org> wrote:

>Claude,
>
>The point is more on the pragmatic side than the ideal design with a
>tradeoff between maintaining our own code vs using a maintained library.
>
>The jsonld-java parsing process isn't streaming in either use case so
>it's not a case of some triples read from the input.  The jsonld-java
>process is layered, not streamed - all the JSON parsing is done, then
>the conversion to RDF happens.
>
>The two processes are:
>
>(Jena calling low level, non-API calls of jsonld-java):
>1a/ Parse JSON
>2a/ Do all triples
>3a/ Check for trailing junk
>
>vs
>
>(jsonld-java API)
>1b/ Parse JSON
>2b/ Check for trailing junk
>3b/ Do all triples
>
>I am wondering if the Elephas tests are tuned to the way Jena works in
>these error cases, rather than relying on a feature of it.
>
>	Andy
>
>AbstractWholeFileQuadInputFormatTests

Re: JSON-LD upgrade - impact on Elephas

Posted by Andy Seaborne <an...@apache.org>.

Claude,

The point is more pragmatic than about ideal design: it is a tradeoff 
between maintaining our own code vs using a maintained library.

The jsonld-java parsing process isn't streaming in either use case so 
it's not a case of some triples read from the input.  The jsonld-java 
process is layered, not streamed - all the JSON parsing is done, then 
the conversion to RDF happens.

The two processes are:

(Jena calling low level, non-API calls of jsonld-java):
1a/ Parse JSON
2a/ Do all triples
3a/ Check for trailing junk

vs

(jsonld-java API)
1b/ Parse JSON
2b/ Check for trailing junk
3b/ Do all triples
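
The difference between the two orderings can be illustrated with a toy 
model.  Nothing here calls jsonld-java: Parsed, triplesThenCheck and 
checkThenTriples are hypothetical names, and the document is assumed to 
be already parsed, with a flag recording whether junk followed the 
object.

```java
import java.util.ArrayList;
import java.util.List;

final class ParseOrderSketch {
    /** Stand-in for step 1: the triples of the object, plus whether junk followed it. */
    public static final class Parsed {
        final List<String> triples;
        final boolean trailingJunk;

        public Parsed(List<String> triples, boolean trailingJunk) {
            this.triples = triples;
            this.trailingJunk = trailingJunk;
        }
    }

    /** Ordering a: do all triples (2a), then check for trailing junk (3a). */
    public static void triplesThenCheck(Parsed doc, List<String> sink) {
        sink.addAll(doc.triples);
        if (doc.trailingJunk) {
            throw new IllegalStateException("trailing junk");
        }
    }

    /** Ordering b: check for trailing junk (2b), then do all triples (3b). */
    public static void checkThenTriples(Parsed doc, List<String> sink) {
        if (doc.trailingJunk) {
            throw new IllegalStateException("trailing junk");
        }
        sink.addAll(doc.triples);
    }

    /** Counts what the consumer has received once the parse ends or fails. */
    public static int delivered(boolean checkFirst, Parsed doc) {
        List<String> sink = new ArrayList<>();
        try {
            if (checkFirst) {
                checkThenTriples(doc, sink);
            } else {
                triplesThenCheck(doc, sink);
            }
        } catch (IllegalStateException e) {
            // the sink keeps whatever arrived before the failure
        }
        return sink.size();
    }

    public static void main(String[] args) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            triples.add("<http://example/s> <http://example/p> \"" + i + "\"");
        }
        Parsed bad = new Parsed(triples, true);
        System.out.println(delivered(false, bad)); // prints 50
        System.out.println(delivered(true, bad));  // prints 0
    }
}
```

Either way the error is reported; the orderings differ only in what the 
consumer has already seen by then.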

I am wondering if the Elephas tests are tuned to the way Jena works in 
these error cases, rather than relying on a feature of it.

	Andy

AbstractWholeFileQuadInputFormatTests

On 04/10/15 09:19, Claude Warren wrote:
> not Rob but my 2 cents.....
>
> I think that when we read turtle documents, if there is an error, the triples
> we have already read are left in the graph/model (yes, transactions can
> change this).  Shouldn't all parsers follow the same pattern?
>
> Currently that pattern seems to be:  read until eof or error and process
> what was read.
>
> Unless I am wrong about the above, I think that the JSON parser should
> return the json object that was parsed before the junk.
>
>
> Claude
>
> On Sat, Oct 3, 2015 at 7:21 PM, Andy Seaborne <an...@apache.org> wrote:
>
>> Upgrading the dependency for jsonld-java to 0.7.0 picks up a bug fix
>> (jsonld-java issue 144) that Jena has a workaround for.
>>
>> The issue is that the Jackson JSON parser does not flag trailing junk. It
>> reads the JSON object and stops there.  Worse, it creates a buffered reader
>> so the caller can't handle the stream afterwards.
>>
>> ---------------
>> {
>>    "@id" : "http://example/s",
>>    "http://example/p" : "str"
>> }
>> xxxxxxxxxxxxxxx
>> ---------------
>>
>> Jena (JsonLdReader) contains code taken from jsonld-java and modified to
>> run the Jackson JSON parser, produce triples and then check for trailing
>> junk.  The detect end of junk was contributed back to the project.  PR 145.
>>
>> jsonld-java treats it more systematically.
>>
>> If the JSON is syntactically bad in the {}, no triples merge. The process
>> is completely read the JSON object then let the RDF conversion run.  Bad
>> object -> no RDF at all.
>>
>> If there is trailing junk, it is detected before passing up the JSON
>> object so trailing junk, no triples unlike Jena currently.
>>
>> I had hoped to remove the workaround and not duplicate jsonld-java code.
>>
>> Elephas testing is impacted. It is sensitive to the "JSON object, trailing
>> junk, triples" vs "JSON object, triples, trailing junk" differences.
>>
>> Unless there is a specific reason to support that behaviour, I'd like to
>> switch to jsonld-java behaviour.
>>
>> (Rob) Thoughts?
>>
>>          Andy
>>
>> [1] https://github.com/jsonld-java/jsonld-java/issues/144
>>
>
>
>

Re: JSON-LD upgrade - impact on Elephas

Posted by Claude Warren <cl...@xenei.com>.

not Rob but my 2 cents.....

I think that when we read Turtle documents, if there is an error the triples
we have already read are left in the graph/model (yes, transactions can
change this).  Shouldn't all parsers follow the same pattern?

Currently that pattern seems to be:  read until EOF or error and process
what was read.

Unless I am wrong about the above, I think that the JSON parser should
return the JSON object that was parsed before the junk.
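
To make the "read until EOF or error" pattern concrete, here is a toy
sketch (Python, with a made-up one-triple-per-line format and a
hypothetical read_streaming function; this is not Jena's Turtle parser):

```python
def read_streaming(lines, graph):
    """Add each triple to graph as soon as its line has been parsed.

    Stops at the first bad line by raising ValueError; triples from
    earlier lines stay in graph.
    """
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        # Toy syntax: exactly "subject predicate object ." per line.
        if len(parts) != 4 or parts[3] != ".":
            raise ValueError("line %d: not a triple: %r" % (number, line))
        graph.append(tuple(parts[:3]))


def triples_kept(lines):
    """Return the triples left in the graph after a read attempt."""
    graph = []
    try:
        read_streaming(lines, graph)
    except ValueError:
        pass  # the error is reported, but earlier triples remain
    return graph
```

With two good lines followed by a junk line, the graph ends up holding
the two triples that were read before the error.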


Claude

On Sat, Oct 3, 2015 at 7:21 PM, Andy Seaborne <an...@apache.org> wrote:

> Upgrading the dependency for jsonld-java to 0.7.0 picks up a bug fix
> (jsonld-java issue 144) that Jena has a workaround for.
>
> The issue is that the Jackson JSON parser does not flag trailing junk. It
> reads the JSON object and stops there.  Worse, it creates a buffered reader
> so the caller can't handle the stream afterwards.
>
> ---------------
> {
>   "@id" : "http://example/s",
>   "http://example/p" : "str"
> }
> xxxxxxxxxxxxxxx
> ---------------
>
> Jena (JsonLdReader) contains code taken from jsonld-java and modified to
> run the Jackson JSON parser, produce triples and then check for trailing
> junk.  The detect end of junk was contributed back to the project.  PR 145.
>
> jsonld-java treats it more systematically.
>
> If the JSON is syntactically bad in the {}, no triples merge. The process
> is completely read the JSON object then let the RDF conversion run.  Bad
> object -> no RDF at all.
>
> If there is trailing junk, it is detected before passing up the JSON
> object so trailing junk, no triples unlike Jena currently.
>
> I had hoped to remove the workaround and not duplicate jsonld-java code.
>
> Elephas testing is impacted. It is sensitive to the "JSON object, trailing
> junk, triples" vs "JSON object, triples, trailing junk" differences.
>
> Unless there is a specific reason to support that behaviour, I'd like to
> switch to jsonld-java behaviour.
>
> (Rob) Thoughts?
>
>         Andy
>
> [1] https://github.com/jsonld-java/jsonld-java/issues/144
>


