Posted to dev@jena.apache.org by "Andy Seaborne (Jira)" <ji...@apache.org> on 2021/04/25 12:01:00 UTC

[jira] [Updated] (JENA-1505) add function apf:strIndexSplit

     [ https://issues.apache.org/jira/browse/JENA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne updated JENA-1505:
--------------------------------
    Labels: First  (was: )

> add function apf:strIndexSplit
> ------------------------------
>
>                 Key: JENA-1505
>                 URL: https://issues.apache.org/jira/browse/JENA-1505
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Vladimir Alexiev
>            Priority: Major
>              Labels: First
>
> We use Tarql to convert some company CSV data to RDF.
>  We had cases of multiple values in a field (eg aliases) that we handle with apf:strSplit.
> But now we've hit another case: several multi-value fields arranged in parallel arrays.
>  Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3 newline-separated parallel arrays that describe the participant companies: ?coIds, ?coNames, ?coIndustries.
>  If we use several apf:strSplit in one query, that will cause a Cartesian product, and mix up all company ids, names, industries together.
> Tarql allows multiple CONSTRUCT queries in one script, and "the triples generated by previous CONSTRUCT clauses can be queried in subsequent WHERE clauses to retrieve additional data". So my idea is to split each column in a separate CONSTRUCT, attach the values to temporary nodes, and reassemble them in a final CONSTRUCT.
> But we can't do this with apf:strSplit, since it loses the index (ordering) of the individual values.
>  We need a new Jena ARQ function, e.g. with a signature like this, where ? indicates unbound and $ indicates bound:
> {noformat}
> (?index ?value) apf:strIndexSplit ($string $separator)
> Splits $string on regex $separator and produces a number of binding pairs
> where ?index is bound to a sequential number (starting from 1)
> and ?value is bound to the consecutive string part that is split off.
> {noformat}
> Then we could hack the problem with something like this:
> {noformat}
> construct { # get first multiValue field
>  ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>  bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>  (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
> }
> construct { # get second multiValue field
>  ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>  bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>  (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
> }
> construct { # get third multiValue field
>  ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>  bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>  (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
> }
> construct { # make JV node
>  ?JV ex:id ?jvId; ex:name ?jvName.
> } where {
>  bind(uri(concat("jv/",?jvId)) as ?JV)
> }
> construct { # make Company node and relation
>  ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
>  ?JV ex:hasParticipant ?CO
> } where {
>  bind(uri(concat("jv/",?jvId)) as ?JV)
>  bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>            ?ROW tmp:coIds        [tmp:index ?INDEX; tmp:value ?coId]
>  optional {?ROW tmp:coNames      [tmp:index ?INDEX; tmp:value ?coName]}
>  optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
>  bind(uri(concat("company/",?coId)) as ?CO)
>  bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
> }
> {noformat}
>  
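The semantics proposed in the quoted issue for apf:strIndexSplit can be modelled outside Jena in a few lines. The following is only a rough Python sketch under stated assumptions: the name str_index_split is hypothetical, Python's re module stands in for whatever regex engine ARQ would use, and nothing here is Jena API. It shows the one property the request depends on, namely that each split value keeps a 1-based position.

```python
import re


def str_index_split(string, separator):
    """Hypothetical model of the proposed apf:strIndexSplit.

    Splits `string` on the regex `separator` and returns a list of
    (index, value) pairs, with the index starting from 1.
    """
    parts = re.split(separator, string)
    return list(enumerate(parts, start=1))


# Made-up sample field: three newline-separated company ids.
pairs = str_index_split("c1\nc2\nc3", r"\n")
```

Each pair corresponds to one (?index ?value) binding in the proposed signature.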

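The difference between splitting the parallel columns independently and reassembling them by index, as the final CONSTRUCT in the quoted issue does, can also be sketched in Python. This is an illustration only: the function names and the sample values are made up, and plain str.split is used instead of a regex for brevity. Names and industries are looked up optionally, mirroring the OPTIONAL patterns in the quoted query.

```python
from itertools import product


def index_split(text, sep="\n"):
    """Map each split value to its 1-based position."""
    return {i: v for i, v in enumerate(text.split(sep), start=1)}


def cartesian(co_ids, co_names, co_industries):
    """What several independent splits in one query amount to:
    every id is combined with every name and every industry."""
    return list(product(co_ids.split("\n"),
                        co_names.split("\n"),
                        co_industries.split("\n")))


def join_by_index(co_ids, co_names, co_industries):
    """Reassemble the parallel arrays on a shared index.

    Ids are required; a missing name or industry becomes None,
    like an unmatched OPTIONAL pattern.
    """
    ids = index_split(co_ids)
    names = index_split(co_names)
    industries = index_split(co_industries)
    return [(ids[i], names.get(i), industries.get(i)) for i in sorted(ids)]


# Made-up sample row: two participants, only one industry value.
mixed = cartesian("c1\nc2", "Acme\nBolt", "steel")
rows = join_by_index("c1\nc2", "Acme\nBolt", "steel")
```

With this sample, the independent splits give four mixed-up combinations, while the index join gives one tuple per company.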


--
This message was sent by Atlassian Jira
(v8.3.4#803005)