Posted to dev@jena.apache.org by "Vladimir Alexiev (JIRA)" <ji...@apache.org> on 2018/03/10 12:47:00 UTC
[jira] [Created] (JENA-1505) add function apf:strIndexSplit
Vladimir Alexiev created JENA-1505:
--------------------------------------
Summary: add function apf:strIndexSplit
Key: JENA-1505
URL: https://issues.apache.org/jira/browse/JENA-1505
Project: Apache Jena
Issue Type: Improvement
Components: ARQ
Reporter: Vladimir Alexiev
We use Tarql to convert some company CSV data to RDF.
We had cases of multiple values in a field (e.g. aliases), which we handle with apf:strSplit.
But now we've hit another case: several multi-value fields arranged in parallel arrays.
Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3 newline-separated parallel arrays that describe the participant companies: ?coIds, ?coNames, ?coIndustries.
If we use several apf:strSplit in one query, that will cause a Cartesian product, and mix up all company ids, names, industries together.
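To illustrate the mix-up, here is a small Python sketch (not Tarql or ARQ code; the sample values are made up). It contrasts independent splits, which combine like a Cartesian product, with the position-aligned pairing we actually want.

```python
from itertools import product

# Hypothetical sample row: two parallel newline-separated fields.
co_ids = "c1\nc2"
co_names = "Acme\nBolt"

ids = co_ids.split("\n")
names = co_names.split("\n")

# Independent splits behave like a cross join: every id meets every name.
cartesian = list(product(ids, names))

# What we want instead: values paired by their position in each field.
aligned = list(zip(ids, names))

print(cartesian)
print(aligned)
```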
Tarql allows multiple CONSTRUCT queries in one script, and "the triples generated by previous CONSTRUCT clauses can be queried in subsequent WHERE clauses to retrieve additional data".
So my idea is to split each column in a separate CONSTRUCT, attach the values to temporary nodes, and reassemble them in a final CONSTRUCT.
But we can't do this with apf:strSplit, since it loses the index (ordering) of the individual values.
We need a new Jena ARQ function, e.g. with a signature like this, where ? indicates unbound and $ indicates bound:
{noformat}
(?index ?value) apf:strIndexSplit ($string $separator)
Splits $string on regex $separator and produces a number of binding pairs
where ?index is bound to a sequential number (starting from 1)
and ?value is bound to the consecutive string part that is split off.
{noformat}
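The intended semantics can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not an ARQ implementation; the name str_index_split is made up, and how to treat an empty input string is an open question that this sketch does not settle.

```python
import re

def str_index_split(string, separator):
    """Split string on the regex separator and return (index, value) pairs.

    The index is a sequential number starting from 1, as in the proposal.
    """
    parts = re.split(separator, string)
    return list(enumerate(parts, start=1))

pairs = str_index_split("id1\nid2\nid3", r"\n")
print(pairs)
```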
Then we could hack the problem with something like this:
{noformat}
construct { # get first multiValue field
?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
(?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
}
construct { # get second multiValue field
?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
(?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
}
construct { # get third multiValue field
?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
(?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
}
construct { # make JV node
?JV ex:id ?jvId; ex:name ?jvName.
} where {
bind(uri(concat("jv/",?jvId)) as ?JV)
}
construct { # make Company node and relation
?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
?JV ex:hasParticipant ?CO
} where {
bind(uri(concat("jv/",?jvId)) as ?JV)
bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
bind(uri(concat("company/",?coId)) as ?CO)
bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
}
{noformat}
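The reassembly step of the final CONSTRUCT amounts to a join on the index, with names and industries treated as optional. A rough Python sketch of that logic follows; the function name reassemble and the sample values are made up, and a missing value is represented as None to mirror an unbound variable.

```python
def reassemble(co_ids, co_names, co_industries, sep="\n"):
    """Join three parallel multi-value fields on their 1-based index."""
    ids = dict(enumerate(co_ids.split(sep), start=1))
    names = dict(enumerate(co_names.split(sep), start=1))
    industries = dict(enumerate(co_industries.split(sep), start=1))
    # The ids drive the join; names and industries are looked up by index
    # and left as None when absent, like the OPTIONAL patterns in the query.
    return [
        {"id": co_id, "name": names.get(i), "industry": industries.get(i)}
        for i, co_id in ids.items()
    ]

companies = reassemble("c1\nc2", "Acme\nBolt", "Mining")
print(companies)
```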