
Posted to dev@jena.apache.org by Stian Soiland-Reyes <st...@apache.org> on 2015/01/26 17:53:39 UTC

Fuseki 2 docker image - some questions

I have created an experimental Docker image for running Apache Jena Fuseki 2.

See https://registry.hub.docker.com/u/stain/jena-fuseki/ for details.

The short story is:

    docker run -p 3030:3030 -it stain/jena-fuseki

Look for the password in the console, log in to http://localhost:3030/
and enjoy the beautiful interface, upload data, and start querying it!


(At a Bring Your Own Data meeting last week, several people were
asking what cool triple store I was running when they saw I had Fuseki
2 on my laptop. (They were using Virtuoso 6). We really need to get
this out to the people! :)

There was also one guy who was using Jena to do automatic ontology
reconstruction on a SPARQL endpoint, he said he had some performance
patches to Jena which I encouraged him to join this mailing list and
tell us about.)


# Questions to dev@jena

Q: I don't know why Fuseki (here with basic auth) always asks for the
admin password - perhaps some Fuseki / shiro.ini experts can help? Is
it because of the server status icon?


Q: How should I do mass-data-loading? Can you run tdbloader /
tdbloader2 through the jena-fuseki2.jar, or do I need to bundle in the
shell-scripts from the normal jena distribution?

Qa: What --loc do I give? /fuseki/databases/EXAMPLE I guess? They look
like tdb folders

    $ docker run -v $HOME/ops/chembl19:/staging --volumes-from fuseki-data \
        stain/jena-fuseki ./fuseki-server tdb.loader2 --loc=/fuseki/databases/test
    Multiple dataset path names given



Qb: How do I tell Fuseki about the dataset? Do I need to create it on
the web first and then stop the server? Do I need to create one of those
assembler magic files?



Q: Should I base this on openjdk7 or 8? See
https://registry.hub.docker.com/u/library/java/ for options.


Q: Should I run the standalone Fuseki or a WAR inside say Tomcat?


Q: Are there any memory settings that should be made accessible?


Q: Is it safe to just stop fuseki-server with Ctrl-C / docker stop? It
seems to stop a bit too fast, with no log outputs.




## How was it made?

https://github.com/stain/jena/tree/fuseki2-docker/jena-fuseki2/jena-fuseki-docker

Note that I had to do it with a tricky download from the Maven
snapshot server (Xpath!), as if I try to do the whole Maven build of
Jena from within the Dockerfile, I get a DVD-sized image:

stain@biggie-utopic:~/src/jena$ du -hs .
1.5G    .

(the one above clocks in at nice 51 MB+18+

Once fuseki2 is released in stable versions this should be a bit
easier as there would be no need to xpath out the latest snapshot
timestamp.



-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718

Re: Fuseki 2 docker image - some questions

Posted by Andy Seaborne <an...@apache.org>.

On 28/01/15 12:23, Stian Soiland-Reyes wrote:
> On 28 January 2015 at 10:34, Andy Seaborne <an...@apache.org> wrote:
>
>> tdbloader does not do better when it's an existing, non-empty database.  It
>> avoids some transactional scaling issues but otherwise uploading to the
>> server live is much the same.
>
> Uploading many files of several GB of data over HTTP sounds fragile to
> me.. and anyway, at least in my case, I don't have those files on the
> machine with the browser :)

curl and GSP.

If you are ultra clever/careful/... , rsync.
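
A minimal sketch of the curl and GSP route. It assumes a dataset named
"test" whose Graph Store Protocol endpoint is
http://localhost:3030/test/data and a Turtle file /staging/data.ttl; both
are made up for illustration, so check the service names in your own
configuration. The helper gsp_upload is hypothetical. By default it only
prints the curl command; set DRY_RUN=0 to actually send the file.

```shell
#!/bin/sh
# Hypothetical helper: POST a Turtle file to the default graph of a
# Graph Store Protocol endpoint. POST adds to the graph; PUT would replace it.
# $1 = file to upload, $2 = GSP endpoint URL (assumed, see above)
gsp_upload() {
    file=$1
    endpoint=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run: show the command instead of contacting a server.
        echo "curl --fail -X POST -H 'Content-Type: text/turtle' --data-binary @$file $endpoint?default"
    else
        curl --fail -X POST -H 'Content-Type: text/turtle' \
             --data-binary "@$file" "$endpoint?default"
    fi
}

gsp_upload /staging/data.ttl http://localhost:3030/test/data
```

Run from the machine that holds the files, this avoids going through the
browser at all.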

>
> tdbloader2 might be good for the biggest stuff, though, as for me at
> least it seems to give a big improvement in performance - but at the
> risk of not loading anything at all if something goes wrong with one
> of the files, as it does indexing in the end. (?)
>
>
> BTW - is it normal that tdbloader performance decreases as more triples
> go in? Getting more expensive to maintain the indexes or match
> existing identifiers?
>
> INFO  Add: 338,100,000 triples (Batch: 2,853 / Avg: 6,824)
> INFO  Add: 351,550,000 triples (Batch: 3,943 / Avg: 6,503)
> INFO  Add: 357,200,000 triples (Batch: 7,685 / Avg: 6,195)
> INFO  Add: 371,900,000 triples (Batch: 6,900 / Avg: 5,565)
> INFO  Add: 386,000,000 triples (Batch: 4,506 / Avg: 5,094)

Yes.

Both the node table and the SPO or GSPO indexes slow down over time. 
Mainly the node table; that also affects tdbloader2.

> This is from before I understood JVM_ARGS so it's probably memory
> unbounded (my tdbloader didn't set -Xmx), using about 5 GB or so of
> heap.

Too much.

On 64 bit - most of the work and index bytes are in memory mapped files, 
not the heap.

>
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  20203 root      20   0 21.655g 5.066g 3.225g S  30.6 74.7 257:36.85 java
>
> (it's IO-bound - I don't have those fancy SSD raids at home :)
>
>
>> Don't follow.  config-tdb-dir is that minimal config isn't it?
>> The templates take NAME and DIRectory.  That's it.
>
> Right, I can replace those variables with sed, so that should work - I
> guess from the dist I could just unzip the template from the
> fuseki-server.jar.  (Or should it perhaps better be exposed in the
> dist?)
>
>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
>>
>>
>> There is a timestamp and an incremental count (ATM "21")
>> jena-fuseki-dist-2.0.0-20150128.100051-21.zip
>
> I already parse that with xpath:
>
> https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/Dockerfile#L52
>
> .. but a more robust way would be to temporarily install mvn and have a
> dummy pom.xml whose <version> is updated from the parent. I went with
> xpath for now as that would need more cleanup..  Maven would download
> many things that it won't really need, that then again must be deleted
> from /root/.m2
>
>
> Would there be general interest in adding the Fuseki2 docker image to
> Jena, or have I made another novelty thing :) ? (Here is Virtuoso:
> https://registry.hub.docker.com/u/stain/virtuoso/

If it's universal.

If there are any "decisions" (and given any need to reset shiro.ini
before containerization there might well be), adding the instructions might
be better.

Centralization of artifacts can be convenient but also a limitation on 
the team.

>
>
> If it went in, it would need to add that kind of Maven polishing, of
> course, so that it works smoothly in releases. I don't think I would
> try to get Maven to actually build the Docker image, that would put in
> quite strong OS requirements.
>
> Presumably - if Jena was to upload such an image officially - it would
> then have to be voted over as (part of) a release, even though it
> would primarily contain the fuseki2 dist.

Yes.

> Would there be licensing issues over the docker image depending on
> Linux and OpenJDK (or Oracle JDK)?
> Docker folks seem to just not worry much about licensing :-/
>
> There could be some issues with layering - my image now contains one
> layer which adds both pwgen (GPL) and Fuseki (Apache 2.0) - but that's
> easy enough to split.
>
> (or FreeBSD to the rescue!?)
>
>
> There's also this docker image by Tazro Inutano Ohta:
>    https://registry.hub.docker.com/u/inutano/jena/dockerfile/
> which simply provides the official Jena distribution under
>    /apache-jena-2.12.1/
>
> With a bit more work this could allow usage of the tdb commands as-is.
>

	Andy

Re: Fuseki 2 docker image - some questions

Posted by Stian Soiland-Reyes <st...@apache.org>.

On 28 January 2015 at 10:34, Andy Seaborne <an...@apache.org> wrote:

> tdbloader does not do better when it's an existing, non-empty database.  It
> avoids some transactional scaling issues but otherwise uploading to the
> server live is much the same.

Uploading many files of several GB of data over HTTP sounds fragile to
me.. and anyway, at least in my case, I don't have those files on the
machine with the browser :)

tdbloader2 might be good for the biggest stuff, though, as for me at
least it seems to give a big improvement in performance - but at the
risk of not loading anything at all if something goes wrong with one
of the files, as it does indexing in the end. (?)

BTW - is it normal that tdbloader performance decreases as more triples
go in? Getting more expensive to maintain the indexes or match
existing identifiers?

INFO  Add: 338,100,000 triples (Batch: 2,853 / Avg: 6,824)
INFO  Add: 351,550,000 triples (Batch: 3,943 / Avg: 6,503)
INFO  Add: 357,200,000 triples (Batch: 7,685 / Avg: 6,195)
INFO  Add: 371,900,000 triples (Batch: 6,900 / Avg: 5,565)
INFO  Add: 386,000,000 triples (Batch: 4,506 / Avg: 5,094)

This is from before I understood JVM_ARGS so it's probably memory
unbounded (my tdbloader didn't set -Xmx), using about 5 GB or so of
heap.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20203 root      20   0 21.655g 5.066g 3.225g S  30.6 74.7 257:36.85 java

(it's IO-bound - I don't have those fancy SSD raids at home :)

> Don't follow.  config-tdb-dir is that minimal config isn't it?
> The templates take NAME and DIRectory.  That's it.

Right, I can replace those variables with sed, so that should work - I
guess from the dist I could just unzip the template from the
fuseki-server.jar.  (Or should it perhaps better be exposed in the
dist?)
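
A rough sketch of that sed step. It assumes the placeholders in the
template are spelled {NAME} and {DIR}; I have not checked the exact
spelling, so look at the template inside fuseki-server.jar first. The
two-line template below is a made-up stand-in, and render_template is a
hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: fill in the two template variables.
# $1 = dataset name, $2 = TDB directory; the template is read from stdin.
# "|" is the sed delimiter because directories contain "/". Values that
# contain "|" or "&" would need escaping first.
render_template() {
    sed -e "s|{NAME}|$1|g" -e "s|{DIR}|$2|g"
}

# Stand-in template text, NOT the real config-tdb-dir template.
template='fuseki:name "{NAME}" ;
tdb:location "{DIR}" ;'

rendered=$(printf '%s\n' "$template" | render_template test /fuseki/databases/test)
printf '%s\n' "$rendered"
```

The rendered text would then be written wherever the server expects its
configuration to live.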

>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
>
>
> There is a timestamp and an incremental count (ATM "21")
> jena-fuseki-dist-2.0.0-20150128.100051-21.zip

I already parse that with xpath:

https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/Dockerfile#L52

.. but a more robust way would be to temporarily install mvn and have a
dummy pom.xml whose <version> is updated from the parent. I went with
xpath for now as that would need more cleanup..  Maven would download
many things that it won't really need, that then again must be deleted
from /root/.m2
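
For illustration, here is roughly what the lookup amounts to. This is not
the xpath call from the Dockerfile; it swaps in plain sed so the sketch has
no extra dependencies, and it only copes with elements that sit on a single
line. The metadata below is a cut-down sample built from the values quoted
in this thread, and xml_value is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the text of the first <tag>...</tag> on stdin.
xml_value() {
    sed -n "s|.*<$1>\(.*\)</$1>.*|\1|p" | head -n 1
}

# Cut-down sample of maven-metadata.xml (values taken from this thread).
metadata='<metadata>
  <versioning>
    <snapshot>
      <timestamp>20150128.100051</timestamp>
      <buildNumber>21</buildNumber>
    </snapshot>
  </versioning>
</metadata>'

timestamp=$(printf '%s\n' "$metadata" | xml_value timestamp)
build=$(printf '%s\n' "$metadata" | xml_value buildNumber)

# Snapshot file name: artifact, base version, timestamp, build number.
zip="jena-fuseki-dist-2.0.0-${timestamp}-${build}.zip"
echo "$zip"
```

In the real build the metadata would be fetched from the snapshot
repository rather than embedded in the script.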

Would there be general interest in adding the Fuseki2 docker image to
Jena, or have I made another novelty thing :) ? (Here is Virtuoso:
https://registry.hub.docker.com/u/stain/virtuoso/ )

If it went in, it would need to add that kind of Maven polishing, of
course, so that it works smoothly in releases. I don't think I would
try to get Maven to actually build the Docker image, that would put in
quite strong OS requirements.

Presumably - if Jena was to upload such an image officially - it would
then have to be voted over as (part of) a release, even though it
would primarily contain the fuseki2 dist.

Would there be licensing issues over the docker image depending on
Linux and OpenJDK (or Oracle JDK)?
Docker folks seem to just not worry much about licensing :-/

There could be some issues with layering - my image now contains one
layer which adds both pwgen (GPL) and Fuseki (Apache 2.0) - but that's
easy enough to split.

(or FreeBSD to the rescue!?)

There's also this docker image by Tazro Inutano Ohta:
  https://registry.hub.docker.com/u/inutano/jena/dockerfile/
which simply provides the official Jena distribution under
  /apache-jena-2.12.1/

With a bit more work this could allow usage of the tdb commands as-is.

-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718

Re: Fuseki 2 docker image - some questions

Posted by Andy Seaborne <an...@apache.org>.

On 27/01/15 16:27, Stian Soiland-Reyes wrote:
> On 27 January 2015 at 15:23, Andy Seaborne <an...@apache.org> wrote:
>


> I didn't use tdbloader2 for now as it will only work with fresh
> databases (could add confusion to the README).

tdbloader does not do better when it's an existing, non-empty database.
It avoids some transactional scaling issues but otherwise uploading to
the server live is much the same.

> Would I get reasonable defaults if I add a minimal config with just
> DatasetTDB, fuseki:dataset and tdb:location - or do I need to include
> all of those from the template?

Don't follow.  config-tdb-dir is that minimal config isn't it?

The templates take NAME and DIRectory.  That's it.

> The datasets created in the UI seem to live under
> system-configuration/ rather than in config.ttl

Yes - config.ttl is not updated by the server, just read. config.ttl is 
under user control.

> -- should such a register-load shell script put things in there
> instead of config.ttl?
>
> Are those UUIDs registered somewhere?

The system database IIRC.

>>> Q: Should I run the standalone Fuseki or a WAR inside say Tomcat?
>> Your choice!
>
> I'll stick with the standalone then, as it means I get the tdb.loader
> for free :)
>
>
>>> Q: Are there any memory settings that should be made accessible?
>> Heap - usually it does not need changing but if you have lots of datasets it
>> might.
>
> I see - now it is this line:
>
>      JVM_ARGS=${JVM_ARGS:--Xmx1200M}
>
> Why is this using ":" here? I don't understand that bit. :-)

As Rob says.  And the "--" is really two different "-", one for the
":-", one for the "-Xmx".

>
> I guess I would need to freshen my sed skills to set the memory limit
> from a shell variable.
>
>
> So you generally mean.. a high number of datasets? Size of dataset or
> query doesn't matter much with TDB?
>
>
>>> Q: Is it safe to just stop fuseki-server with Ctrl-C / docker stop? It
>>> seems to stop a bit too fast, with no log outputs.
>> Yes - or even "kill -9"
>
> Nice! So
>
>    docker stop fuseki
>
> just works safely.
>
>
>> The distribution has an unversioned file in it IIRC
>
> Could you clarify..?
>
> apache-jena-2.12.1.tar.gz and jena-fuseki-1.1.1-distribution.tar.gz
> don't seem to include fuseki2.


>
>
> So I have to (at least until Fuseki 2.0.0 is released) download
> jena-fuseki-dist from
>
> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/
>
> by finding the timestamp in
>
> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml

There is a timestamp and an incremental count (ATM "21")
jena-fuseki-dist-2.0.0-20150128.100051-21.zip

Scroll down the list.

(artifact may change jena-fuseki-dist => apache-jena-fuseki to agree 
with our style of delivery artifacts are "apache-jena-*")

	Andy

Re: Fuseki 2 docker image - some questions

Posted by Stian Soiland-Reyes <st...@apache.org>.

Ah, great, then my job is done, beyond a bit of README :)


On 27 January 2015 at 17:21, Rob Vesse <rv...@dotnetrdf.org> wrote:
> Re: the scripting is a use of parameter substitution -
> http://tldp.org/LDP/abs/html/parameter-substitution.html - the general
> form of which is as follows:
>
> VAR=${VAR:-DEFAULT}
>
> This initialises a variable from the existing value if available, and the
> :- says that if it is currently undeclared/null, the given default is used
>
> So in the specific example you gave, it says to honour the user's JVM_ARGS
> variable if set in their environment otherwise use -Xmx1200M as the value
>
> Rob
>
> On 27/01/2015 08:27, "Stian Soiland-Reyes" <st...@apache.org> wrote:
>
>>On 27 January 2015 at 15:23, Andy Seaborne <an...@apache.org> wrote:
>>
>>> tdbloader is pure java and is included in the fuseki jar.
>>> tdbloader2 has a shell script to drive it.
>>> You can't run either loader on a live database
>>
>>Right. A nice feature (or side-effect?) I found was that Fuseki can
>>pick up live any existing datasets (made by tdbloader) by simply
>>"Creating a new dataset" that has exactly the same name:
>>
>>https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/README.md#recognizing-the-dataset-in-fuseki
>>
>>
>>
>>
>>> did you mean
>>> java -cp fuseki-server.jar tdb.tdbloader --loc
>>
>>
>>Right - I got this to work, and even added a little shell-script to
>>help with *ttl.gz expansion (as that could get tricky through the
>>shell on the host).
>>
>>See
>>https://github.com/stain/jena/tree/fuseki2-docker/jena-fuseki2/jena-fuseki-docker#data-loading
>>
>>
>>With not much adjustment (usual FUSEKI_HOME/FUSEKI_BASE stuff) this
>>script could work well also in the normal fuseki2 dist?
>>
>>https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/load.sh
>>
>>
>>The default is to load all *ttl *rdf.gz etc. in the /staging directory.
>>
>>
>>As tdbloader with just filenames chucks everything into the default
>>graph, I guess the only option to get named graphs is through an
>>assembler file?
>>
>>I didn't use tdbloader2 for now as it will only work with fresh
>>databases (could add confusion to the README).
>>
>>
>>> config.ttl or the UI.
>>
>>I went with just telling people to click the button in the UI, this
>>worked smoothly. :-)
>>
>>I guess it would only be a challenge if you wanted to load many datasets.
>>
>>
>>It would be nice to just add it to config.ttl, but it seems then I
>>have to copy the defaults for "query" endpoints etc?  (Where from?).
>>It would also have to detect if the dataset already exists.
>>
>>
>>Would I get reasonable defaults if I add a minimal config with just
>>DatasetTDB, fuseki:dataset and tdb:location - or do I need to include
>>all of those from the template?
>>
>>
>>The datasets created in the UI seem to live under
>>system-configuration/ rather than in config.ttl
>>-- should such a register-load shell script put things in there
>>instead of config.ttl?
>>
>>Are those UUIDs registered somewhere?
>>
>>
>>
>>>> Q: Should I run the standalone Fuseki or a WAR inside say Tomcat?
>>> Your choice!
>>
>>I'll stick with the standalone then, as it means I get the tdb.loader
>>for free :)
>>
>>
>>>> Q: Are there any memory settings that should be made accessible?
>>> Heap - usually it does not need changing but if you have lots of
>>>datasets it
>>> might.
>>
>>I see - now it is this line:
>>
>>    JVM_ARGS=${JVM_ARGS:--Xmx1200M}
>>
>>Why is this using ":" here? I don't understand that bit. :-)
>>
>>I guess I would need to freshen my sed skills to set the memory limit
>>from a shell variable.
>>
>>
>>So you generally mean.. a high number of datasets? Size of dataset or
>>query doesn't matter much with TDB?
>>
>>
>>>> Q: Is it safe to just stop fuseki-server with Ctrl-C / docker stop? It
>>>> seems to stop a bit too fast, with no log outputs.
>>> Yes - or even "kill -9"
>>
>>Nice! So
>>
>>  docker stop fuseki
>>
>>just works safely.
>>
>>
>>> The distribution has an unversioned file in it IIRC
>>
>>Could you clarify..?
>>
>>apache-jena-2.12.1.tar.gz and jena-fuseki-1.1.1-distribution.tar.gz
>>don't seem to include fuseki2.
>>
>>
>>So I have to (at least until Fuseki 2.0.0 is released) download
>>jena-fuseki-dist from
>>
>>https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/
>>
>>by finding the timestamp in
>>
>>https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
>>
>>
>>
>>
>>--
>>Stian Soiland-Reyes
>>Apache Taverna (incubating)
>>http://orcid.org/0000-0001-9842-9718
>
>
>
>



-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718

Re: Fuseki 2 docker image - some questions

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Re: the scripting is a use of parameter substitution -
http://tldp.org/LDP/abs/html/parameter-substitution.html - the general
form of which is as follows:

VAR=${VAR:-DEFAULT}

This initialises a variable from the existing value if available, and the
:- says that if it is currently undeclared/null, the given default is used instead.

So in the specific example you gave, it says to honour the user's JVM_ARGS
variable if set in their environment otherwise use -Xmx1200M as the value
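
A small self-contained illustration of the three cases. The -Xmx4G value
is just an example of something a user might set, and the *_result
variable names are only for this sketch.

```shell
#!/bin/sh
# Case 1: JVM_ARGS is unset, so the default after ":-" is used.
unset JVM_ARGS
JVM_ARGS=${JVM_ARGS:--Xmx1200M}
unset_result=$JVM_ARGS

# Case 2: JVM_ARGS is set but empty; ":-" treats that like unset.
JVM_ARGS=""
JVM_ARGS=${JVM_ARGS:--Xmx1200M}
empty_result=$JVM_ARGS

# Case 3: the user already chose a value, so it is kept as-is.
JVM_ARGS="-Xmx4G"
JVM_ARGS=${JVM_ARGS:--Xmx1200M}
user_result=$JVM_ARGS

printf 'unset: %s\n' "$unset_result"
printf 'empty: %s\n' "$empty_result"
printf 'user:  %s\n' "$user_result"
```

With a plain "-" instead of ":-", case 2 would keep the empty string,
because only a truly unset variable triggers the default.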

Rob

On 27/01/2015 08:27, "Stian Soiland-Reyes" <st...@apache.org> wrote:

>On 27 January 2015 at 15:23, Andy Seaborne <an...@apache.org> wrote:
>
>> tdbloader is pure java and is included in the fuseki jar.
>> tdbloader2 has a shell script to drive it.
>> You can't run either loader on a live database
>
>Right. A nice feature (or side-effect?) I found was that Fuseki can
>pick up live any existing datasets (made by tdbloader) by simply
>"Creating a new dataset" that has exactly the same name:
>
>https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/README.md#recognizing-the-dataset-in-fuseki
>
>
>
>
>> did you mean
>> java -cp fuseki-server.jar tdb.tdbloader --loc
>
>
>Right - I got this to work, and even added a little shell-script to
>help with *ttl.gz expansion (as that could get tricky through the
>shell on the host).
>
>See 
>https://github.com/stain/jena/tree/fuseki2-docker/jena-fuseki2/jena-fuseki-docker#data-loading
>
>
>With not much adjustment (usual FUSEKI_HOME/FUSEKI_BASE stuff) this
>script could work well also in the normal fuseki2 dist?
>
>https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/load.sh
>
>
>The default is to load all *ttl *rdf.gz etc. in the /staging directory.
>
>
>As tdbloader with just filenames chucks everything into the default
>graph, I guess the only option to get named graphs is through an
>assembler file?
>
>I didn't use tdbloader2 for now as it will only work with fresh
>databases (could add confusion to the README).
>
>
>> config.ttl or the UI.
>
>I went with just telling people to click the button in the UI, this
>worked smoothly. :-)
>
>I guess it would only be a challenge if you wanted to load many datasets.
>
>
>It would be nice to just add it to config.ttl, but it seems then I
>have to copy the defaults for "query" endpoints etc?  (Where from?).
>It would also have to detect if the dataset already exists.
>
>
>Would I get reasonable defaults if I add a minimal config with just
>DatasetTDB, fuseki:dataset and tdb:location - or do I need to include
>all of those from the template?
>
>
>The datasets created in the UI seem to live under
>system-configuration/ rather than in config.ttl
>-- should such a register-load shell script put things in there
>instead of config.ttl?
>
>Are those UUIDs registered somewhere?
>
>
>
>>> Q: Should I run the standalone Fuseki or a WAR inside say Tomcat?
>> Your choice!
>
>I'll stick with the standalone then, as it means I get the tdb.loader
>for free :)
>
>
>>> Q: Are there any memory settings that should be made accessible?
>> Heap - usually it does not need changing but if you have lots of
>>datasets it
>> might.
>
>I see - now it is this line:
>
>    JVM_ARGS=${JVM_ARGS:--Xmx1200M}
>
>Why is this using ":" here? I don't understand that bit. :-)
>
>I guess I would need to freshen my sed skills to set the memory limit
>from a shell variable.
>
>
>So you generally mean.. a high number of datasets? Size of dataset or
>query doesn't matter much with TDB?
>
>
>>> Q: Is it safe to just stop fuseki-server with Ctrl-C / docker stop? It
>>> seems to stop a bit too fast, with no log outputs.
>> Yes - or even "kill -9"
>
>Nice! So
>
>  docker stop fuseki
>
>just works safely.
>
>
>> The distribution has an unversioned file in it IIRC
>
>Could you clarify..?
>
>apache-jena-2.12.1.tar.gz and jena-fuseki-1.1.1-distribution.tar.gz
>don't seem to include fuseki2.
>
>
>So I have to (at least until Fuseki 2.0.0 is released) download
>jena-fuseki-dist from
>
>https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/
>
>by finding the timestamp in
>
>https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml
>
>
>
>
>-- 
>Stian Soiland-Reyes
>Apache Taverna (incubating)
>http://orcid.org/0000-0001-9842-9718

Re: Fuseki 2 docker image - some questions

Posted by Stian Soiland-Reyes <st...@apache.org>.

On 27 January 2015 at 15:23, Andy Seaborne <an...@apache.org> wrote:

> tdbloader is pure java and is included in the fuseki jar.
> tdbloader2 has a shell script to drive it.
> You can't run either loader on a live database

Right. A nice feature (or side-effect?) I found was that Fuseki can
pick up live any existing datasets (made by tdbloader) by simply
"Creating a new dataset" that has exactly the same name:

https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/README.md#recognizing-the-dataset-in-fuseki

> did you mean
> java -cp fuseki-server.jar tdb.tdbloader --loc

Right - I got this to work, and even added a little shell-script to
help with *ttl.gz expansion (as that could get tricky through the
shell on the host).

See https://github.com/stain/jena/tree/fuseki2-docker/jena-fuseki2/jena-fuseki-docker#data-loading

With not much adjustment (usual FUSEKI_HOME/FUSEKI_BASE stuff) this
script could work well also in the normal fuseki2 dist?

https://github.com/stain/jena/blob/fuseki2-docker/jena-fuseki2/jena-fuseki-docker/load.sh

The default is to load all *ttl *rdf.gz etc. in the /staging directory.
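
The idea, as a simplified sketch rather than the actual load.sh: expand the
patterns inside the container so the host shell never sees them, then hand
the resulting file list to the loader. staged_files is a hypothetical
helper, the demo uses a throwaway directory instead of /staging, and the
final line only echoes the loader command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print files in directory $1 matching the remaining
# arguments (glob patterns), one per line. Patterns that match nothing are
# skipped. File names containing whitespace are not handled here.
staged_files() {
    dir=$1
    shift
    for pattern in "$@"; do
        # $pattern is left unquoted on purpose so the shell expands it.
        for f in "$dir"/$pattern; do
            [ -f "$f" ] && printf '%s\n' "$f"
        done
    done
    return 0
}

# Demo against a temporary directory standing in for /staging.
staging=$(mktemp -d)
touch "$staging/a.ttl" "$staging/b.ttl.gz" "$staging/notes.txt"
files=$(staged_files "$staging" '*.ttl' '*.ttl.gz' '*.rdf.gz')

# Echo only: show the loader invocation without loading anything.
echo java -cp fuseki-server.jar tdb.tdbloader --loc=/fuseki/databases/test $files

rm -rf "$staging"
```

Here notes.txt is left out because no pattern matches it.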

As tdbloader with just filenames chucks everything into the default
graph, I guess the only option to get named graphs is through an
assembler file?

I didn't use tdbloader2 for now as it will only work with fresh
databases (could add confusion to the README).

> config.ttl or the UI.

I went with just telling people to click the button in the UI, this
worked smoothly. :-)

I guess it would only be a challenge if you wanted to load many datasets.

It would be nice to just add it to config.ttl, but it seems then I
have to copy the defaults for "query" endpoints etc?  (Where from?).
It would also have to detect if the dataset already exists.

Would I get reasonable defaults if I add a minimal config with just
DatasetTDB, fuseki:dataset and tdb:location - or do I need to include
all of those from the template?

The datasets created in the UI seem to live under
system-configuration/ rather than in config.ttl
-- should such a register-load shell script put things in there
instead of config.ttl?

Are those UUIDs registered somewhere?

>> Q: Should I run the standalone Fuseki or a WAR inside say Tomcat?
> Your choice!

I'll stick with the standalone then, as it means I get the tdb.loader
for free :)

>> Q: Are there any memory settings that should be made accessible?
> Heap - usually it does not need changing but if you have lots of datasets it
> might.

I see - now it is this line:

    JVM_ARGS=${JVM_ARGS:--Xmx1200M}

Why is this using ":" here? I don't understand that bit. :-)

I guess I would need to freshen my sed skills to set the memory limit
from a shell variable.

So you generally mean.. a high number of datasets? Size of dataset or
query doesn't matter much with TDB?

>> Q: Is it safe to just stop fuseki-server with Ctrl-C / docker stop? It
>> seems to stop a bit too fast, with no log outputs.
> Yes - or even "kill -9"

Nice! So

  docker stop fuseki

just works safely.

> The distribution has an unversioned file in it IIRC

Could you clarify..?

apache-jena-2.12.1.tar.gz and jena-fuseki-1.1.1-distribution.tar.gz
don't seem to include fuseki2.

So I have to (at least until Fuseki 2.0.0 is released) download
jena-fuseki-dist from

https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/

by finding the timestamp in

https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-fuseki-dist/2.0.0-SNAPSHOT/maven-metadata.xml

-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718

Re: Fuseki 2 docker image - some questions

Posted by Andy Seaborne <an...@apache.org>.

On 26/01/15 16:53, Stian Soiland-Reyes wrote:
> I have created an experimental Docker image for running Apache Jena Fuseki 2.
>
> See https://registry.hub.docker.com/u/stain/jena-fuseki/ for details.
>
> The short story is:
>
>      docker run -p 3030:3030 -it stain/jena-fuseki
>
> Look for the password in the console, log in to http://localhost:3030/
> and enjoy the beautiful interface, upload data, and start querying it!
>
>
> (At a Bring Your Own Data meeting last week, several people were
> asking what cool triple store I was running when they saw I had Fuseki
> 2 on my laptop. (They were using Virtuoso 6). We really need to get
> this out to the people! :)
>
> There was also one guy who was using Jena to do automatic ontology
> reconstruction on a SPARQL endpoint, he said he had some performance
> patches to Jena which I encouraged him to join this mailing list and
> tell us about.)
>
>
> # Questions to dev@jena
>
> Q: I don't know why Fuseki (here with basic auth) always asks for the
> admin password - perhaps some Fuseki / shiro.ini experts can help? Is
> it because of the server status icon?

Don't know but one thought --- maybe (unlikely) the connection is not 
being preserved.  More likely, the shiro setup is wrong.  I think Trveo 
had similar issues.

> Q: How should I do mass-data-loading? Can you run tdbloader /
> tdbloader2 through the jena-fuseki2.jar, or do I need to bundle in the
> shell-scripts from the normal jena distribution?

tdbloader is pure java and is included in the fuseki jar.

tdbloader2 has a shell script to drive it.

You can't run either loader on a live database

>
> Qa: What --loc do I give? /fuseki/databases/EXAMPLE I guess? They look
> like tdb folders

Yes.

>
>      $ docker run -v $HOME/ops/chembl19:/staging --volumes-from fuseki-data \
>          stain/jena-fuseki ./fuseki-server tdb.loader2 --loc=/fuseki/databases/test
>      Multiple dataset path names given

did you mean

java -cp fuseki-server.jar tdb.tdbloader --loc

(./fuseki-server starts the server using -jar)

> Qb: How do I tell Fuseki about the dataset? Do I need to create it on
> the web first and then stop the server? Do I need to create one of those
> assembler magic files?

config.ttl or the UI.

A missing feature is the ability to upload an assembler into a running
server.

There are example assembler files.  The UI is just templating one.

> Q: Should I base this on openjdk7 or 8? See
> https://registry.hub.docker.com/u/library/java/ for options.

java7

Jena (and Fuseki) are at Java7.

> Q: Should I run the standalone Fuseki or a WAR inside say Tomcat?

Your choice!

> Q: Are there any memory settings that should be made accessible?

Heap - usually it does not need changing but if you have lots of 
datasets it might.

> Q: Is it safe to just stop fuseki-server with Ctrl-C / docker stop? It
> seems to stop a bit too fast, with no log outputs.

Yes - or even "kill -9"

Transactions!

> ## How was it made?
>
> https://github.com/stain/jena/tree/fuseki2-docker/jena-fuseki2/jena-fuseki-docker
>
> Note that I had to do it with a tricky download from the Maven
> snapshot server (Xpath!), as if I try to do the whole Maven build of
> Jena from within the Dockerfile, I get a DVD-sized image:
>
> stain@biggie-utopic:~/src/jena$ du -hs .
> 1.5G    .
>
> (the one above clocks in at nice 51 MB+18+
>
> Once fuseki2 is released in stable versions this should be a bit
> easier as there would be no need to xpath out the latest snapshot
> timestamp.

The distribution has an unversioned file in it IIRC

	Andy

>
>
>

Re: Fuseki 2 docker image - some questions

Posted by Andy Seaborne <an...@apache.org>.

On 28/01/15 13:45, Stian Soiland-Reyes wrote:
> Right - but ideally I should be allowed to stay anonymous until trying
> to do sysadmin stuff (e.g. "Manage datasets")

Shiro only triggers for certain URLs.  See shiro.ini.

[urls]
## Control functions open to anyone
/$/status = anon
/$/ping   = anon

## and the rest are restricted to localhost.
/$/** = localhost
...
# Everything else
/**=anon

> Looking at this closer, it seems to give the first 401 on a GET
> http://localhost:8081/$/server
>
> I can cancel the dialogue, and now I don't get any drop-down list
> under http://localhost:8081/dataset.html

It is making an admin call to find out about datasets.  Whether that is 
to be made admin-only or not depends on style.  It's not SPARQL protocol.

And if you want high grade security - put it behind an Apache httpd or
similar and use that.

>
> The suggested shiro.ini does indeed protect almost everything under /$/:
>
> https://github.com/apache/jena/blob/master/jena-fuseki2/jena-fuseki-core/src/main/resources/org/apache/jena/fuseki/server/shiro.ini#L25
>
> I edited this to do basicauth exactly as it says in the comments there
>   (but with a pwgen-generated password)
>
>
>
> What does /$/server do?

The machine that goes 'ping'

See web.xml.

It calls ActionServerStatus to check that the server is alive and to 
drive the green/red indicator.



> On 28 January 2015 at 13:30, Andy Seaborne <an...@apache.org> wrote:
>> On 26/01/15 16:53, Stian Soiland-Reyes wrote:
>>>
>>> # Questions to dev@jena
>>>
>>> Q: I don't know why Fuseki (here with basic auth) always asks for the
>>> admin password - perhaps some Fuseki / shiro.ini experts can help? Is
>>> it because of the server status icon?
>>
>>
>> Not an expert but with some digging ...
>>
>> What seems to be happening is that if you turn off localhost authentication
>> and turn on basic authentication, then the user/password has to go in every
>> request.  There is no session state, cookies or whatever in basic
>> authentication.
>>
>> Usually the way to do that is to have the browser remember the details and
>> it will send them when asked.
>>
>> I guess you have told your browser not to record user/password.
>>
>> There is a two-hop exchange via a 401 - the web UI does not know the
>> user/password to add to each GET/POST - maybe it is easy to configure the
>> environment to do that, I don't know, not my area of expertise.
>>
>> Having the Fuseki code add user/password seems rather heavy and with such a
>> common setup I wonder if there is a way to tell jQuery or whatever is doing
>> the HTTP bit to learn the details (not that basic auth is very safe unless
>> it's an SSL connection).
>>
>>          Andy
>>
>> https://issues.apache.org/jira/browse/SHIRO-20 for digest auth.

Re: Fuseki 2 docker image - some questions

Posted by Stian Soiland-Reyes <st...@apache.org>.

Right - but ideally I should be allowed to stay anonymous until trying
to do sysadmin stuff (e.g. "Manage datasets")


Looking at this closer, it seems to give the first 401 on a GET
http://localhost:8081/$/server

I can cancel the dialogue, and now I don't get any drop-down list
under http://localhost:8081/dataset.html

The suggested shiro.ini does indeed protect almost everything under /$/:

https://github.com/apache/jena/blob/master/jena-fuseki2/jena-fuseki-core/src/main/resources/org/apache/jena/fuseki/server/shiro.ini#L25

I edited this to do basicauth exactly as it says in the comments there
 (but with a pwgen-generated password)



What does /$/server do?



On 28 January 2015 at 13:30, Andy Seaborne <an...@apache.org> wrote:
> On 26/01/15 16:53, Stian Soiland-Reyes wrote:
>>
>> # Questions to dev@jena
>>
>> Q: I don't know why Fuseki (here with basic auth) always asks for the
>> admin password - perhaps some Fuseki / shiro.ini experts can help? Is
>> it because of the server status icon?
>
>
> Not an expert but with some digging ...
>
> What seems to be happening is that if you turn off localhost authentication
> and turn on basic authentication, then the user/password has to go in every
> request.  There is no session state, cookies or whatever in basic
> authentication.
>
> Usually the way to do that is to have the browser remember the details and
> it will send them when asked.
>
> I guess you have told your browser not to record user/password.
>
> There is a two-hop exchange via a 401 - the web UI does not know the
> user/password to add to each GET/POST - maybe it is easy to configure the
> environment to do that, I don't know, not my area of expertise.
>
> Having the Fuseki code add user/password seems rather heavy and with such a
> common setup I wonder if there is a way to tell jQuery or whatever is doing
> the HTTP bit to learn the details (not that basic auth is very safe unless
> it's an SSL connection).
>
>         Andy
>
> https://issues.apache.org/jira/browse/SHIRO-20 for digest auth.



-- 
Stian Soiland-Reyes
Apache Taverna (incubating)
http://orcid.org/0000-0001-9842-9718

Re: Fuseki 2 docker image - some questions

Posted by Andy Seaborne <an...@apache.org>.

On 26/01/15 16:53, Stian Soiland-Reyes wrote:
> # Questions to dev@jena
>
> Q: I don't know why Fuseki (here with basic auth) always asks for the
> admin password - perhaps some Fuseki / shiro.ini experts can help? Is
> it because of the server status icon?

Not an expert but with some digging ...

What seems to be happening is that if you turn off localhost 
authentication and turn on basic authentication, then the user/password 
has to go in every request.  There is no session state, cookies or 
whatever in basic authentication.
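
As a rough illustration of what "in every request" means: with HTTP 
Basic auth the client sends an Authorization header built from 
user:password on each request, and that header is only base64-encoded, 
not encrypted.  A small Python sketch, where the function names and the 
"admin"/"pw" credentials are placeholders for illustration and not 
anything Fuseki ships:

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header value used by HTTP Basic auth."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value):
    """Recover (user, password) from the header: it is encoded, not encrypted."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic auth header")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password


# Placeholder credentials; a client has to attach this header to every request.
header = basic_auth_header("admin", "pw")
print("Authorization:", header)
print(decode_basic_auth(header))
```

Anyone who can see the request can reverse the header the same way, 
which is why Basic auth is only reasonable over an encrypted connection.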

Usually the way to do that is to have the browser remember the details 
and it will send them when asked.

I guess you have told your browser not to record user/password.

There is a two-hop exchange via a 401 - the web UI does not know 
the user/password to add to each GET/POST - maybe it is easy to configure 
the environment to do that, I don't know, not my area of expertise.

Having the Fuseki code add user/password seems rather heavy and with 
such a common setup I wonder if there is a way to tell jQuery or 
whatever is doing the HTTP bit to learn the details (not that basic auth 
is very safe unless it's an SSL connection).

	Andy

https://issues.apache.org/jira/browse/SHIRO-20 for digest auth.