Posted to dev@jena.apache.org by "Andy Seaborne (JIRA)" <ji...@apache.org> on 2014/11/27 18:02:13 UTC

[jira] [Commented] (JENA-820) Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines

    [ https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227830#comment-14227830 ] 

Andy Seaborne commented on JENA-820:
------------------------------------

There are mechanisms for this in RIOT.

1 -- Write a bNode as {{<_:LABEL>}}; the label is then preserved when the data is read back.

2 -- Consistent parsing across streams:

The {{ParserProfile}} dictates the conversion from the syntax read to Nodes.  One operation is {{createBlankNode(String label, ..)}}.  {{ParserProfileBase}} controls this "label to node" step with a {{LabelToNode}} policy object.

For example, {{LabelToNode.createScopeByDocumentHash}} takes a seed (a UUID, so very large). The default is to use a new UUID per parser.  This scales to arbitrarily large data files because parsing retains no growing state to track labels throughout the run.

You can change the policy to start at a fixed seed for all files.
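
To illustrate why a fixed seed gives consistent bNodes across files, here is a minimal sketch in plain Java. It is not RIOT's {{LabelToNode}} code: the class {{SeededBlankNodeIds}} and the method {{idFor}} are hypothetical names, and the choice of SHA-256 is an assumption made for the example. It only shows the idea of deriving the identifier from (seed, label) with no per-label table.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

/** Illustrative only: a stand-in for the seed-plus-label idea, not Jena's implementation. */
final class SeededBlankNodeIds {
    private SeededBlankNodeIds() {}

    /**
     * Derive an identifier from the seed and the syntactic label.
     * Nothing is remembered between calls, so memory use does not grow with the input.
     */
    static String idFor(UUID seed, String label) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(seed.toString().getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between seed and label
            md.update(label.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }
}
```

With one fixed seed shared by every reader, {{idFor(seed, "b0")}} gives the same identifier whichever file the label {{b0}} appears in. With a fresh {{UUID.randomUUID()}} per parser, the same label in two files gives two different identifiers, which corresponds to the default behaviour described above.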

The Thrift format preserves bNode labels.

> Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
>                 Key: JENA-820
>                 URL: https://issues.apache.org/jira/browse/JENA-820
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RDF Tools for Hadoop
>            Reporter: Rob Vesse
>            Assignee: Rob Vesse
>             Fix For: Jena 2.12.2
>
>
> In writing up the documentation on the RDF Tools for Hadoop and enumerating the possible issues that blank nodes imply I discovered an issue that I hadn't previously considered.
> For a single job the input and output formats all ensure that blank nodes are consistently given the same identifiers if they had the same syntactic ID and were in the same file.  This is done even when a file is being read in multiple chunks by multiple map tasks.  However by its nature each reduce task will create an output file so potentially you can end up with blank nodes spread over multiple files.
> However, if we then read these files into a subsequent job, the blank nodes may now be spread across multiple files, so even though they were the same node originally, our allocation policy will cause the identifiers to diverge and become distinct blank nodes, which is incorrect behaviour.
> Since there is no clear universal fix for this, what I am considering doing instead is introducing a configuration setting that will allow the file path to be ignored for the purpose of blank node identifier allocations within a job.  This will mean that identifiers are allocated purely on the basis of the Job ID, and thus the same syntactic ID in any file will result in the same blank node identifier.  As the user will hopefully have left this turned off for the first job, even if we start with the same syntactic ID in different files, the normal allocation policy for the first job should ensure unique IDs for the later jobs.
> My next step on this is to implement a failing unit test (and then temporarily ignore it) which demonstrates this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)