Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/04/07 12:10:17 UTC

tdbloader2 ignore ill-formed nquads

I'm trying to import the LOV dump [1] into Fuseki using tdbloader2. Unfortunately, some quads are "broken" in the sense that they're not well-formed. For example:
 
ERROR [line: 203556, col: 152] Bad character in IRI (space): <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
org.apache.jena.riot.RiotException: [line: 203556, col: 152] Bad character in IRI (space): <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
 
Is there an option to tell tdbloader2 to simply ignore these nquads (or show a warning) and keep going instead of raising an exception and halting?
 
-----------------
[1] http://lov.okfn.org/lov.nq.gz

Re: tdbloader2 ignore ill-formed nquads

Posted by ba...@gmail.com.
> Spaces in URIs are particularly problematic; even if you can get them  
> into the data, using the data will likely break.
>
> When ingesting data from somewhere else, it is good to check it before  
> loading, then fix as needed before loading.
>
>     riot --check file....
>
>      Andy
>
> http://lov.okfn.org/lov.nq.gz is only 749810 quads. tdbloader2 is  
> overkill. Use tdbloader.  tdbloader2 is an advantage for much larger  
> data (100 million+ and even then it is not always faster)
>
> On 07/04/17 13:17, Martynas Jusevičius wrote:
>> This question comes up regularly:
>> http://markmail.org/message/seqiw74hhdx2u64j
>>
>> On Fri, Apr 7, 2017 at 2:10 PM, Laura Morales <la...@mail.com> wrote:
>>> I'm trying to import the LOV dump [1] into Fuseki using tdbloader2.  
>>> Unfortunately some quads are "broken" in the sense that they're not  
>>> well-formed. For example this one
>>>
>>> ERROR [line: 203556, col: 152] Bad character in IRI (space):  
>>> <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
>>> org.apache.jena.riot.RiotException: [line: 203556, col: 152] Bad  
>>> character in IRI (space):  
>>> <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
>>>
>>> Is there an option to tell tdbloader2 to simply ignore these nquads  
>>> (or show a warning) and keep going instead of raising an exception and  
>>> halting?

-----------------

The problem is much more the 'Spaces'.

Last but not least, I think a utility that builds a database for Fuseki
should not 'encourage' users to throw away this or that triple/quad line
just because they want the load to run to the end. It is clear where that
ends: then there is no logic left in what you are doing...

I have usually had this problem with downloaded DBpedia files:

long_abstracts_en.nt
long_abstracts_en_uris_de.nt

I repaired each line in an editor so that our utility would accept it, and
if I couldn't guess what the problem was in an object string, I wrote 'not
readable' for it...

Yes, I did it that way; maybe I was an idiot...

baran
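
The manual repair described above can be partly automated when the only problem is an illegal character inside an IRI. What follows is a minimal Python sketch, assuming percent-encoding is an acceptable fix for the data at hand (it changes the IRI, so the result may no longer match the identifier used elsewhere). repair_line is a hypothetical helper, not part of Jena, and it leaves literals, blank nodes, and datatype IRIs untouched.

```python
import re
from urllib.parse import quote

# One RDF term on an N-Triples/N-Quads line: IRI, blank node, or literal
# (with optional datatype or language tag). Literals are matched as a whole
# so that a "<" inside a string is not mistaken for the start of an IRI.
TERM = re.compile(
    r'<(?P<iri>[^>]*)>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)

# Characters the grammar forbids inside <...>: controls, space, " { } | ^ `
# (backslash is left alone because it may start a \uXXXX escape).
BAD = re.compile(r'[\x00-\x20"{}|^`]')


def repair_line(line):
    """Percent-encode forbidden characters in the IRIs of one line."""
    def fix(match):
        iri = match.group("iri")
        if iri is None:
            return match.group(0)  # blank node or literal: unchanged
        encoded = BAD.sub(lambda c: quote(c.group(0), safe=""), iri)
        return "<" + encoded + ">"
    return TERM.sub(fix, line)
```

Running the repaired file through riot --check afterwards is still advisable, since this sketch only handles one class of error.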



Re: tdbloader2 ignore ill-formed nquads

Posted by Andy Seaborne <an...@apache.org>.
Spaces in URIs are particularly problematic; even if you can get them 
into the data, using the data will likely break.

When ingesting data from somewhere else, it is good to check it first, 
then fix it as needed before loading.

    riot --check file....

     Andy

http://lov.okfn.org/lov.nq.gz is only 749810 quads, so tdbloader2 is 
overkill. Use tdbloader. tdbloader2 is an advantage only for much larger 
data (100 million+, and even then it is not always faster).
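
For the original question (skip the bad lines and keep going), one option is to pre-filter the dump before handing it to the loader. The following is a minimal Python sketch, not a Jena feature: the names bad_iris and filter_quads are hypothetical, and it only tests for the characters that the N-Quads grammar forbids inside <...> (such as the space reported above), so it is no substitute for riot --check. Rejected lines go to a separate stream so they can be repaired rather than silently lost.

```python
import re

# Characters the N-Triples/N-Quads grammar forbids inside <...>
# (backslash is left out because it may start a \uXXXX escape).
BAD_IRI_CHARS = re.compile(r'[\x00-\x20<>"{}|^`]')

# One RDF term: IRI, blank node, or literal with optional datatype/language
# tag. Literals are matched whole so a "<" inside a string is not treated
# as the start of an IRI.
TERM = re.compile(
    r'<(?P<iri>[^>]*)>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<(?P<dt>[^>]*)>|@[A-Za-z0-9-]+)?'
)


def bad_iris(line):
    """Return the IRIs on one line that contain a forbidden character."""
    found = []
    for match in TERM.finditer(line):
        for iri in (match.group("iri"), match.group("dt")):
            if iri is not None and BAD_IRI_CHARS.search(iri):
                found.append(iri)
    return found


def filter_quads(src, good_out, bad_out):
    """Copy clean lines to good_out; write rejects, with line numbers, to bad_out."""
    kept = dropped = 0
    for lineno, line in enumerate(src, start=1):
        if bad_iris(line):
            bad_out.write(f"{lineno}: {line}")
            dropped += 1
        else:
            good_out.write(line)
            kept += 1
    return kept, dropped
```

All three arguments are ordinary text file objects, so a decompressed copy of the dump can be split into a clean file for tdbloader and a reject file for later repair.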

On 07/04/17 13:17, Martynas Jusevičius wrote:
> This question comes up regularly: http://markmail.org/message/seqiw74hhdx2u64j
>
> On Fri, Apr 7, 2017 at 2:10 PM, Laura Morales <la...@mail.com> wrote:
>> I'm trying to import the LOV dump [1] into Fuseki using tdbloader2. Unfortunately some quads are "broken" in the sense that they're not well-formed. For example this one
>>
>> ERROR [line: 203556, col: 152] Bad character in IRI (space): <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
>> org.apache.jena.riot.RiotException: [line: 203556, col: 152] Bad character in IRI (space): <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
>>
>> Is there an option to tell tdbloader2 to simply ignore these nquads (or show a warning) and keep going instead of raising an exception and halting?
>>
>> -----------------
>> [1] http://lov.okfn.org/lov.nq.gz

Re: tdbloader2 ignore ill-formed nquads

Posted by Martynas Jusevičius <ma...@graphity.org>.
This question comes up regularly: http://markmail.org/message/seqiw74hhdx2u64j

On Fri, Apr 7, 2017 at 2:10 PM, Laura Morales <la...@mail.com> wrote:
> I'm trying to import the LOV dump [1] into Fuseki using tdbloader2. Unfortunately some quads are "broken" in the sense that they're not well-formed. For example this one
>
> ERROR [line: 203556, col: 152] Bad character in IRI (space): <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
> org.apache.jena.riot.RiotException: [line: 203556, col: 152] Bad character in IRI (space): <http://securitytoolbox.appspot.com/MASO#Objectif[space]...>
>
> Is there an option to tell tdbloader2 to simply ignore these nquads (or show a warning) and keep going instead of raising an exception and halting?
>
> -----------------
> [1] http://lov.okfn.org/lov.nq.gz