Posted to users@jena.apache.org by Shengyu Li <sl...@jagmail.southalabama.edu> on 2017/12/25 04:51:06 UTC

Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Hello,

I am uploading my .ttl data to my database. There are about 10,000 files
in total, and each file is about 4 MB. My new data totals about 40 GB. My
original database is also about 40 GB. The server is on my local computer.

I use tdbloader.bat --loc to upload data. After the "Finish quads load"
message, it pauses at this status for a long time (about half an hour for
one 4 MB file; for 200 files at once (200 * 4 MB), the pause is about 2
hours). After the pause, control returns to the command prompt.

I guess the pause means the database is organizing the data I just
uploaded, so it won't return for a long time. Am I right? Is there any way
to shorten the waiting time?

Thank you very much! Jena is really a useful thing!

Best,
Shengyu

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Andy Seaborne <an...@apache.org>.


On 26/12/17 19:39, Dick Murray wrote:
> On 26 Dec 2017 19:10, "Laura Morales" <la...@mail.com> wrote:
> 
>>> What is more, it gets bNode labels across files right (so using _:a in
>>> two files is two bNodes).
>> 
>> Thinking about this...
>> 
>> - if the files contain anonymous blank nodes (for example in Turtle), each
>> node (converted with RIOT) should be assigned a random name (this is where
>> rapper fails, and RIOT works)
>> - if the files already contain named blank nodes (eg _:node1 <predicate>
>> <object>) then I guess these nodes should probably keep their names and not
>> be reassigned a random ID, because they are probably intended to mean the
>> same blank node
>> 
> 
> Blank node identifiers are only limited in scope to a serialization of a
> particular RDF graph, i.e. the node _:b does not represent the same node as
> a node named _:b in any other graph.
> 

Right.

Don't mix up "internal identifiers" with "blank node labels".

Strictly the latter are only in syntax.

What Node.getBlankNodeLabel does is strictly "getInternalId".

If the app knows something beyond what the parser can assume, it can 
join up blank nodes. (Sometimes called "smushing".)

     Andy

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by ajs6f <aj...@apache.org>.

At which point it would seem to be the responsibility of that project to ensure that there are no collisions, because that's not a couple of RDF graphs, that's a single RDF graph in several pieces. If they reuse bnode identifiers for different nodes in different file-chunks, that is incorrect serialization, from the POV of the _whole_ graph (not from the POV of subgraphs).

Either concatenate the files while manually swapping out for new non-colliding identifiers (perhaps use UUIDs), or use the multi-file tdbloader idiom that Andy mentioned (which works with many of the Jena CLI tools, actually).

Introducing new identifiers for bnodes to avoid collisions is pretty standard fare. The RDF Semantics document:

https://www.w3.org/TR/rdf11-mt/#shared-blank-nodes-unions-and-merges
https://www.w3.org/TR/rdf11-mt/#dfn-standardize

gives a really clear explanation.
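
As a rough illustration of the "swap in new non-colliding identifiers" option, here is a minimal Python sketch. It assumes the inputs are already N-Triples text, and it uses a simplified regular expression that would also match `_:` inside a literal or IRI, so a real conversion should go through a proper parser such as RIOT. The function names `relabel` and `merge` are made up for this example.

```python
import re
import uuid

# Simplified pattern for a blank node label such as _:b0 or _:node1.
# Assumption: labels use only letters, digits and underscores, and "_:"
# never appears inside a literal or an IRI.
BNODE = re.compile(r"_:([A-Za-z0-9_]+)")


def relabel(ntriples_text, suffix=None):
    """Append a file-specific suffix to every blank node label in one file."""
    suffix = suffix or uuid.uuid4().hex
    return BNODE.sub(lambda m: f"_:{m.group(1)}x{suffix}", ntriples_text)


def merge(texts):
    """Concatenate N-Triples texts after standardizing their bnodes apart."""
    return "".join(relabel(text) for text in texts)


file_a = '_:a <http://example.org/p> "1" .\n'
file_b = '_:a <http://example.org/p> "2" .\n'
merged = merge([file_a, file_b])
```

Each input gets its own random suffix, so `_:a` in the first file and `_:a` in the second end up as two different labels in the merged output, while repeated uses of `_:a` inside one file still refer to the same node.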

Adam Soroka

> On Dec 26, 2017, at 3:27 PM, Laura Morales <la...@mail.com> wrote:
> 
>> Blank node identifiers are only limited in scope to a serialization of a
>> particular RDF graph, i.e. the node _:b does not represent the same node as
>> a node named _:b in any other graph.
> 
> Yes I understand this, but I've seen some projects distribute their data as one graph split into multiple files (eg one file per item).

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Shengyu Li <sl...@jagmail.southalabama.edu>.

Thanks, everyone! I will try to convert the files into a single .nt file.
P.S. My server runs Windows 10, which I didn't mention earlier.

On Tue, Dec 26, 2017 at 4:50 PM, Laura Morales <la...@mail.com> wrote:

> > That's one graph in many pieces and the owner of the graph should clearly
> > state what is what!
>
> Yes, agreed. I was only trying to say if the publisher publishes a graph
> in chunks and blank nodes are not anonymous, then whatever software is
> converting the files into another format should probably not rename the
> blank nodes.
>

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Laura Morales <la...@mail.com>.

> That's one graph in many pieces and the owner of the graph should clearly
> state what is what!

Yes, agreed. I was only trying to say if the publisher publishes a graph in chunks and blank nodes are not anonymous, then whatever software is converting the files into another format should probably not rename the blank nodes.

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Dick Murray <da...@gmail.com>.

That's one graph in many pieces and the owner of the graph should clearly
state what is what!

On 26 Dec 2017 20:28, "Laura Morales" <la...@mail.com> wrote:

>> Blank node identifiers are only limited in scope to a serialization of a
>> particular RDF graph, i.e. the node _:b does not represent the same node as
>> a node named _:b in any other graph.
>
> Yes I understand this, but I've seen some projects distribute their data as
> one graph split into multiple files (eg one file per item).

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Laura Morales <la...@mail.com>.

> Blank node identifiers are only limited in scope to a serialization of a
> particular RDF graph, i.e. the node _:b does not represent the same node as
> a node named _:b in any other graph.

Yes I understand this, but I've seen some projects distribute their data as one graph split into multiple files (eg one file per item).

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Dick Murray <da...@gmail.com>.

On 26 Dec 2017 19:10, "Laura Morales" <la...@mail.com> wrote:

>> What is more, it gets bNode labels across files right (so using _:a in
>> two files is two bNodes).
>
> Thinking about this...
>
> - if the files contain anonymous blank nodes (for example in Turtle), each
> node (converted with RIOT) should be assigned a random name (this is where
> rapper fails, and RIOT works)
> - if the files already contain named blank nodes (eg _:node1 <predicate>
> <object>) then I guess these nodes should probably keep their names and not
> be reassigned a random ID, because they are probably intended to mean the
> same blank node


Blank node identifiers are only limited in scope to a serialization of a
particular RDF graph, i.e. the node _:b does not represent the same node as
a node named _:b in any other graph.

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Laura Morales <la...@mail.com>.

> What is more, it gets bNode labels across files right (so using _:a in
> two files is two bNodes).

Thinking about this...

- if the files contain anonymous blank nodes (for example in Turtle), each node (converted with RIOT) should be assigned a random name (this is where rapper fails, and RIOT works)
- if the files already contain named blank nodes (eg _:node1 <predicate> <object>) then I guess these nodes should probably keep their names and not be reassigned a random ID, because they are probably intended to mean the same blank node

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Andy Seaborne <an...@apache.org>.


On 25/12/17 06:41, Laura Morales wrote:
>  From what I can tell, and from my little experience, you should not see such long waiting/idling times. But I've never used Windows (and I'm confident you'll get a better environment if you just switched to gnu/linux).
> Anyway, you could try to merge all your files into a single .nt (using RIOT) and load this file only.

Yes - a single call of tdbloader on an empty database can load more 
efficiently than if there are already triples.

"the pause time" suggests the database already has contents (it prints 
progress otherwise).

It only has to be a single call; it can load multiple files (subject to 
shell limits).

    tdbloader --loc DB file1 file2 ...

tdbloader will in effect do the file merge as it loads.

What is more, it gets bNode labels across files right (so using _:a in 
two files is two bNodes).
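
For a directory with thousands of files, the single call can be assembled by a small script rather than typed by hand. The Python sketch below only builds the argument list for the `tdbloader --loc DB file1 file2 ...` form shown above. The function name `tdbloader_command` and the `max_chars` guard are assumptions made for this example; the actual command-line length limit depends on the shell and operating system.

```python
def tdbloader_command(db_dir, files, loader="tdbloader", max_chars=None):
    """Build the argv for one tdbloader call that loads every given file.

    `loader` would be "tdbloader.bat" on Windows. `max_chars` is a
    hypothetical guard: pass whatever limit your shell imposes.
    """
    files = [str(f) for f in files]
    if not files:
        raise ValueError("no input files given")
    argv = [loader, "--loc", str(db_dir), *files]
    if max_chars is not None and len(" ".join(argv)) > max_chars:
        raise ValueError("command line too long for the given limit")
    return argv


argv = tdbloader_command("DB", ["file1.ttl", "file2.ttl"])
# To actually run it (requires Jena's scripts on the PATH):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` without a shell avoids quoting problems with file names that contain spaces.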

     Andy


Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Laura Morales <la...@mail.com>.

From what I can tell, and in my limited experience, you should not see such long waiting/idling times. But I've never used Windows (and I'm confident you'd get a better environment if you just switched to gnu/linux).
Anyway, you could try to merge all your files into a single .nt (using RIOT) and load this file only.


Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Dick Murray <da...@gmail.com>.

That seems slow for the size.

We bulk load triples on Windows and get similar times to CentOS/Fedora on
the same hardware.

You can hack the tdbloader2 to run on Windows as basically you're
exploiting the OS sort which on Windows is;

sort [/r] [/+n] [/m kilobytes] [/l locale] [/rec characters]
[[drive1:][path1]filename1] [/t [drive2:][path2]]
[/o [drive3:][path3]filename3]

Merge all the files together using copy *.txt newfile.txt. This assumes you
understand what that does to the blank nodes..?

Then remove duplicates with uniq from the GNU utilities for Windows, or
with the following native script:

@ECHO ON

REM Input and output files; use the commented line to take the input
REM file from the first argument instead.
SET InputFile=C:\folder\path\Input.txt
::SET InputFile=%~1
SET OutputFile=C:\folder\path\Output.txt

REM Write a one-line PowerShell script that sorts and de-duplicates.
SET PSScript=%Temp%\~tmpRemoveDupe.ps1
IF EXIST "%PSScript%" DEL /Q /F "%PSScript%"
ECHO Get-Content "%InputFile%" ^| Sort-Object ^| Get-Unique ^> "%OutputFile%">>"%PSScript%"

REM Run the generated script.
SET PowerShellDir=C:\Windows\System32\WindowsPowerShell\v1.0
CD /D "%PowerShellDir%"
Powershell -ExecutionPolicy Bypass -Command "& '%PSScript%'"

GOTO :EOF



If you use the SET InputFile=%~1 line instead, Windows will allow you to
drag and drop the source file into the CMD... Got to be some advantage to
using Windows.!?
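
The same sort-then-deduplicate idea can be sketched in Python. This is only an approximation of the script above: ordering is by code point rather than by PowerShell's culture-aware comparison, the whole file is read into memory (so it is not suitable as-is for tens of gigabytes), and it only makes sense for a line-based format such as N-Triples. The names `unique_sorted_lines` and `dedupe_file` are invented for this example.

```python
def unique_sorted_lines(lines):
    """Return the distinct non-blank lines in sorted order."""
    return sorted({line for line in lines if line.strip()})


def dedupe_file(src, dst):
    """Read `src`, then write its distinct lines, sorted, to `dst`."""
    with open(src, encoding="utf-8") as f:
        lines = f.read().splitlines()
    with open(dst, "w", encoding="utf-8", newline="\n") as f:
        for line in unique_sorted_lines(lines):
            f.write(line + "\n")
```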

Dick


Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.

> Rapper doesn't work, I've already had to go through that pain.
Well, you should be careful with such a generic statement as an opening
sentence! Not every Turtle file contains blank nodes, and I, at least,
don't know anything about the data used by the original poster.

On the other hand, I totally agree with you that in the case of blank
nodes an alternative like RIOT should be used. But to be honest, I don't
know anything about the Windows world - luckily, I don't have to work with
this OS; I just read that Windows PowerShell has now added cURL :D


Cheers,

Lorenz


On 25.12.2017 15:06, Laura Morales wrote:
>> I'd suggest to use command line tools to convert the files to
>> N-Triples (e.g. using rapper), then concat (cat), then load the single file.
>
> Rapper doesn't work; I've already had to go through that pain. The problem with rapper is that, when converting to nt, it creates blank nodes with sequential identifiers, more or less like _:bn1 _:bn2 _:bn3 etc... So if you convert 2 files, each one containing a blank node, both nodes will be given the same name, such as _:bn1. If you then `cat` the nt files, you've basically merged the two nodes, since they're given the same name even though they could be completely unrelated nodes :/
> I've also discovered that there is an open issue from 2013 that nobody cares about: https://github.com/dajobe/raptor/pull/8
>
> I think the only option is RIOT (or at least, I couldn't find anything else), which as far as I can tell does indeed generate randomized IDs.

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Laura Morales <la...@mail.com>.

> I'd suggest to use command line tools to convert the files to
> N-Triples (e.g. using rapper), then concat (cat), then load the single file.


Rapper doesn't work; I've already had to go through that pain. The problem with rapper is that, when converting to nt, it creates blank nodes with sequential identifiers, more or less like _:bn1 _:bn2 _:bn3 etc... So if you convert 2 files, each one containing a blank node, both nodes will be given the same name, such as _:bn1. If you then `cat` the nt files, you've basically merged the two nodes, since they're given the same name even though they could be completely unrelated nodes :/
I've also discovered that there is an open issue from 2013 that nobody cares about: https://github.com/dajobe/raptor/pull/8

I think the only option is RIOT (or at least, I couldn't find anything else), which as far as I can tell does indeed generate randomized IDs.
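
The collision described above can be reproduced with two tiny N-Triples snippets. The Python sketch below is only an illustration: the example triples are invented, and the regular expression is a simplified stand-in for a real N-Triples parser.

```python
import re

# Simplified pattern for blank node labels (letters, digits, underscores).
BNODE = re.compile(r"_:([A-Za-z0-9_]+)")


def distinct_bnodes(ntriples_text):
    """Return the set of blank node labels that occur in the text."""
    return set(BNODE.findall(ntriples_text))


# Two independently converted files that both start numbering at bn1.
file_a = '_:bn1 <http://example.org/name> "Alice" .\n'
file_b = '_:bn1 <http://example.org/name> "Bob" .\n'

merged = file_a + file_b  # what `cat` would produce

# Each file describes its own node, but after concatenation only one
# label is left, so a loader sees a single node with two names.
labels_after_cat = distinct_bnodes(merged)
```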

Re: Is There Any Way to Shorten The Waiting Time After Upload Triples in Jena?

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.

Half an hour for a 4MB Turtle file?! Even without knowing the number of
triples per file, I'd say, something is wrong here.

Are you calling

tdbloader --loc

for each file separately?

As Laura already said:

If you'd use Linux, I'd suggest to use command line tools to convert the
files to N-Triples (e.g. using rapper), then concat (cat), then load the
single file. In your case, maybe create a few files instead of a single
file, or even compression might also be suitable. But ok, you're using
Windows ...

What I'm missing from your email is a description of your system and
TDB setup. "Local computer" can be anything. File sizes are indeed
interesting; the number of triples might also be relevant.

Cheers,

Lorenz
