Posted to dev@jena.apache.org by "Andy Seaborne (Jira)" <ji...@apache.org> on 2022/03/13 12:59:00 UTC

[jira] [Commented] (JENA-2309) Enhancing Riot for Big Data

    [ https://issues.apache.org/jira/browse/JENA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505442#comment-17505442 ] 

Andy Seaborne commented on JENA-2309:
-------------------------------------

It is difficult to understand the details here outside the context of your big data processing stack.

{quote}However, for use with Big Data we need to 
 * disable blank node relabeling
{quote}

Already possible:

{code:java}
RDFParser.source(...).labelToNode(LabelToNode.createUseLabelAsGiven()). ...
{code}
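
For example, a minimal sketch of that configuration (the input name and the destination {{StreamRDF}} are placeholders, not part of the snippet above):

{code:java}
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.lang.LabelToNode;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;

// Keep blank node labels exactly as they appear in the input.
StreamRDF destination = StreamRDFLib.writer(System.out);   // any StreamRDF sink
RDFParser.source("data.nt")                                 // placeholder input
         .labelToNode(LabelToNode.createUseLabelAsGiven())
         .parse(destination);
{code}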

In addition, the internal system id for a blank node in the default configuration is generated by a consistent algorithm.

See {{LabelToNode.createScopeByDocumentHash}}. 

At the start of parsing, a large random number is created (a UUID, of which 122 bits are random). Blank node labels in the parser stream are combined with the UUID bits (using MurmurHash3). There is an LRU cache for this; being a cache, a consistent calculation is critical.

You are configuring each processing node anyway. Pass the same UUID to all of them during setup.
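
A hedged sketch of that setup (assuming the seeded {{LabelToNode.createScopeByDocumentHash(UUID)}} overload is available in your Jena version; the names and inputs are placeholders):

{code:java}
import java.util.UUID;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.lang.LabelToNode;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFLib;

// Generate once on the driver and broadcast to every worker,
// so all of them hash blank node labels against the same seed.
UUID sharedSeed = UUID.randomUUID();

StreamRDF destination = StreamRDFLib.writer(System.out);    // any StreamRDF sink
RDFParser.source("part-00000.nt")                           // placeholder per-partition input
         .labelToNode(LabelToNode.createScopeByDocumentHash(sharedSeed))
         .parse(destination);
{code}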

{quote}
 * preconfigure the StreamRDF with a given set of prefixes (that is broadcasted to each node)
{quote}
From the description in this ticket, this isn't a Jena issue.

{quote}
IRIxResolver
{quote}

The JDK implementation of URIs is buggy for semantic web usage. What aspect of {{IRIProviderJDK}} are you
relying on? Do you normalize relative IRIs? Why does {{IRIProviderJenaIRI}} not work for you?

See also [iri4ld|https://github.com/afs/x4ld/tree/main/iri4ld]. There is a provider (not in the Jena code base).

{quote}
Prologue: We use reflection to set the resolver
{quote}
Prologue is historical - see the graph and datasetgraph writers - they don't use prologues.

Why do you wish to modify one in place?

Create a new one, or use the constructor {{Prologue(PrefixMapping pmap, IRIxResolver resolver)}}, and have a switchable {{IRIxResolver}}.

Prologues can be shared and that includes with parsed queries (historical).
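
A minimal sketch of constructing a fresh one instead of mutating a shared instance (the base IRI and the {{IRIxResolver}} builder call are assumptions; check the builder methods in your version):

{code:java}
import org.apache.jena.irix.IRIxResolver;
import org.apache.jena.shared.PrefixMapping;
import org.apache.jena.sparql.core.Prologue;

// Build a resolver for the base you want (builder usage is an assumption).
IRIxResolver resolver = IRIxResolver.create("http://example/base/").build();
PrefixMapping pmap = PrefixMapping.Factory.create();

// A new Prologue with the desired resolver - no reflection needed.
Prologue prologue = new Prologue(pmap, resolver);
{code}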

{quote} * The default PrefixMapping implementation is very inefficient when it comes to handling a dump of prefix.cc. I am using 2500 prefixes. Each RDF term in the output results in a scan of the full prefix map
{quote}

Don't use {{PrefixMapping}}, use {{PrefixMap}}.

We have looked at this issue in the past for 500+ prefixes. 

There was, for a while, a trie-based {{PrefixMap}}. After experimentation, tuning {{PrefixMapStd}} with a URI-to-prefix cache was just as fast. A cache-based approach also adapts to the case of a large prefix set with small data.

This is abstracted in {{PrefixMapFactory.createForOutput}}. Jena writers build a per-output prefix map to ensure they get {{PrefixMapStd}} and not a projection of, say, a TDB2-backed prefix map. See {{RDFWriter.prefixMap}}. 
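
A hedged sketch of that pattern (the prefix entries are placeholders; check the {{createForOutput}} overloads in your version):

{code:java}
import org.apache.jena.riot.system.PrefixMap;
import org.apache.jena.riot.system.PrefixMapFactory;
import org.apache.jena.shared.PrefixMapping;

// A large prefix set, e.g. loaded from a prefix.cc dump (entries are placeholders).
PrefixMapping large = PrefixMapping.Factory.create();
large.setNsPrefix("foaf", "http://xmlns.com/foaf/0.1/");
large.setNsPrefix("dcat", "http://www.w3.org/ns/dcat#");

// Copy into an output-oriented PrefixMap so abbreviation goes through
// PrefixMapStd and its URI-to-prefix cache, not the backing store.
PrefixMap forOutput = PrefixMapFactory.createForOutput(large);
String curie = forOutput.abbreviate("http://xmlns.com/foaf/0.1/name");  // "foaf:name"
{code}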

{quote}
AsyncParser
{quote}

{{AsyncParser}} reads ahead and sends blocks of work to the receiver. If you want to synchronously control the parser, you probably want
{{PipedRDFIterator}}. If you want receiver control, we can expose {{AsyncParser.asyncParseIterator}} and {{EltStreamRDF}} (or some variant of it) to give receiver-side control of the incoming stream. It would also be possible to do this with logic in the receiving {{StreamRDF}}.
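
For reference, a minimal sketch of the pull-style entry point that already exists (method name per the current {{AsyncParser}}; check your version, and the input name is a placeholder):

{code:java}
import java.util.Iterator;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.system.AsyncParser;

// Parsing runs on a background thread; triples arrive through this iterator.
Iterator<Triple> triples = AsyncParser.asyncParseTriples("data.nt");   // placeholder input
while (triples.hasNext()) {
    Triple t = triples.next();
    // Receiver-side logic goes here; stopping the background parser early
    // (e.g. a close() on the iterator) is what the ticket asks to expose.
}
{code}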

{quote}
I'd prefer to address these things with only one or a few PRs
{quote}

As above - some of them are already addressed, and some may be addressed in different
ways.

There are several independent changes being suggested. They have different timescales.

Composite PRs make it hard to review now, and hard to track feature changes back later. Better to have a history that can be looked
back at in 1-5 years' time.


> Enhancing Riot for Big Data
> ---------------------------
>
>                 Key: JENA-2309
>                 URL: https://issues.apache.org/jira/browse/JENA-2309
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>    Affects Versions: Jena 4.5.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> We have successfully managed to adapt Jena Riot to work quite efficiently within Apache Spark; however, we needed to make certain adaptations that rely on brittle reflection hacks and on APIs that are marked for removal (namely PipedRDFIterator):
> In principle, for writing RDF data out, we implemented a mapPartition operation that maps the input RDF to lines of text via StreamRDF, which is understood by Apache Spark's RDD.saveAsTextFile();
> However, for use with Big Data we need to
>  * disable blank node relabeling
>  * preconfigure the StreamRDF with a given set of prefixes (that is broadcasted to each node)
> Furthermore
>  * The default PrefixMapping implementation is very inefficient when it comes to handling a dump of prefix.cc. I am using 2500 prefixes. Each RDF term in the output results in a scan of the full prefix map
>  * Even if the PrefixMapping is optimized, the recently added PrefixMap adapter again does scanning - and it's a final class, so there is no easy override.
> And finally, we have a use case to allow for relative IRIs in the RDF: We are creating DCAT catalogs from directory content as in this file:
> DCAT catalog with relative IRIs over directory content: [work-in-progress example|https://hobbitdata.informatik.uni-leipzig.de/lsqv2/dumps/dcat.trig]
> If you retrieve the file with a semantic web client (riot, rapper, etc.) it will automatically use the download location as the base URL and thus give absolute URLs to the published artifacts - regardless of which URL the directory is hosted under.
>  * IRIxResolver: We rely on IRIProviderJDK, which states "do not use in production"; however, it is the only one that let us achieve the goal. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/irixresolver/IRIxResolverUtils.java#L30]
>  * Prologue: We use reflection to set the resolver and would like a setResolver method [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prologue/PrologueUtils.java#L65]
>  * WriterStreamRDFBase: We need to be able to create instances of WriterStreamRDF classes that we can configure with our own PrefixMap instance (e.g. trie-backed) and our own LabelToNode strategy ("asGiven") - [our code|https://github.com/SANSA-Stack/SANSA-Stack/blob/40fa6f89f421eee22c9789973ec828ec3f970c33/sansa-spark-jena-java/src/main/java/net/sansa_stack/spark/io/rdf/output/RddRdfWriter.java#L387]
>  * PrefixMapAdapter: We need an adapter that inherits the performance characteristics of the backing PrefixMapping [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMapAdapter.java#L57]
>  * PrefixMapping: We need a trie-based implementation for efficiency. We created one based on the trie class in Jena, which in initial experiments was sufficiently fast, though we did not benchmark whether e.g. PatriciaTrie from Commons Collections would be faster. [our code|https://github.com/Scaseco/jenax/blob/dd51ef9a39013d4ddbb4806fcad36b03a4dbaa7c/jenax-arq-parent/jenax-arq-utils/src/main/java/org/aksw/jenax/arq/util/prefix/PrefixMappingTrie.java#L27]
> With PrefixMapTrie, the profiler showed that the amount of time spent on abbreviate went from ~100% to 1% - though we are not totally sure about standards conformance here.
>  * PipedRDFIterator / AsyncParser: We can read TriG as a splittable format (which is pretty cool) - however, this requires being able to start and stop the RDF parser at will for probing. In other words, AsyncParser needs to return ClosableIterators whose close method actually stops the parsing thread. Also, when scanning for prefixes we want to be able to create rules such as "as long as the parser emits a prefix with fewer than e.g. 100 non-prefix events in between, keep looking for prefixes" - AsyncParser has the API for it with EltStreamRDF, but it is private.
> For future-proofing, we'd like these use cases to be reflected in Jena.
> Because we have mostly sorted out all the above issues, I'd prefer to address these things with only one or a few PRs (maybe the ClosableIterators on AsyncParser would be more work, because our code only did that for PipedRDFIterator and I haven't looked in detail into the new architecture).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)