Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2014/06/23 22:26:26 UTC

GSoC : CSV PropertyTables : mid-term checkpoint

Hi Ying,

It's the mid-point of the GSoC programme so it's a good time to assess 
the state of the project. It looks close to the plan and I'd like you to 
(briefly) write up how the project is going. Check you are getting what 
you want out of the project as well.  It is not just code production. Is 
the rest of the plan looking right still?


Looking at the repository, there are a few things I'd like to see:

1/ More tests - tests should be structured so that each one tests a 
specific thing; then, when/if there are test failures, it's easier to 
see what might be the root cause.

2/ Examples and documentation

3/ Evaluation :

For example, is the property table specialisation resulting in a smaller 
storage cost? And, iteratively, can the design be changed to be more 
compact? Maybe some indexing isn't needed; maybe a different way to 
index the same access patterns would take less space.



Other:

The code can be packaged under org.apache.jena.  We're trying to avoid 
com.hp.hpl.jena.

A specific question:

Access by subject is an important use case even when the rows are blank 
nodes.  It will matter for SPARQL, since "find by subject column/value 
then get row by subject", that is two graph.find calls, seems a 
reasonable access pattern.

I could not see that graph.find(subject, ANY, ANY) is using 
PropertyTable.getRow in the graph.find codepath and I expected it would 
be.  Did I miss something?
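
A rough sketch of the dispatch I expected, using simplified stand-ins
(plain strings for nodes, null for ANY, one value per cell;
PropertyTableSketch and its methods are hypothetical names, not the
code in the repository):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a property table; not the Jena API.
class PropertyTableSketch {
    // row key (subject) -> (column/predicate -> cell value)
    private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    void put(String subject, String predicate, String value) {
        rows.computeIfAbsent(subject, k -> new LinkedHashMap<>()).put(predicate, value);
    }

    // Direct row access: a single map lookup by subject.
    Map<String, String> getRow(String subject) {
        return rows.getOrDefault(subject, Collections.emptyMap());
    }

    // find(s, p, o) where null plays the role of ANY.
    // Each result is a 3-element list [subject, predicate, object].
    List<List<String>> find(String s, String p, String o) {
        List<List<String>> out = new ArrayList<>();
        if (s != null) {
            // Subject is bound: go straight to the row instead of scanning.
            addMatches(out, s, getRow(s), p, o);
        } else {
            // Subject is ANY: fall back to scanning every row.
            for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
                addMatches(out, e.getKey(), e.getValue(), p, o);
            }
        }
        return out;
    }

    private static void addMatches(List<List<String>> out, String s,
                                   Map<String, String> row, String p, String o) {
        for (Map.Entry<String, String> cell : row.entrySet()) {
            if (p != null && !p.equals(cell.getKey())) continue;
            if (o != null && !o.equals(cell.getValue())) continue;
            out.add(Arrays.asList(s, cell.getKey(), cell.getValue()));
        }
    }
}
```

With this shape, the two-call pattern is find(null, column, value) to 
get the subject, then find(subject, null, null), and only the second 
call avoids a scan.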

	Andy

Re: GSoC : CSV PropertyTables : mid-term checkpoint

Posted by Ying Jiang <jp...@gmail.com>.
Hi,

Here's the mid-term report.

In short, this project aims to give Jena the ability to run SPARQL
over CSV files or, more generally, regular table-shaped data.

Project Progress:
I changed the project plan a bit. In the first half, I actually worked on:
2.1 RIOT Reader for CSV Files ( done )
1.2 PropertyTable ( the design of the API )
1.3 GraphPropertyTable  ( a preliminary implementation based on HashMap )
In the remaining weeks, I'm supposed to complete the following:
2.2 CSV2RDF tool
2.3 OpExecutor Optimization for GraphPropertyTable
1.1 CSV Parser

The project is going well, but there are some issues pointed out by
Andy that need to be resolved:
(1) I'll evaluate the current implementations of PropertyTable and
GraphPropertyTable (e.g. by testing with large data) and try to
reduce their storage cost, e.g. by removing some indexes and adding
others so that the "find by subject column/value then get row by
subject" query, which is slow now, becomes fast. Alternatively, I
could write a second implementation (maybe based on Java arrays) for
comparison. This work should also take 2.3 into account, so the
related source code needs to be investigated beforehand.
(2) Currently the tests are limited to a few examples for
demonstration. I'll write more (structured) tests, especially for 1.1
and 2.1.
(3) Documentation with examples for users on the Jena website.
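
To make (1) a bit more concrete, here is a rough sketch of the kind of
index I have in mind. It is only an illustration under simplifying
assumptions: plain strings instead of Jena Nodes, one value per cell,
and hypothetical names (IndexedTableSketch, rowsWith, indexEntries)
that are not in the current code. A per-column value index answers
"find by subject column/value" without a scan, and counting its
entries gives a crude measure of the extra storage it costs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the project's implementation.
class IndexedTableSketch {
    // row key -> (column -> value): supports "get row by subject".
    private final Map<String, Map<String, String>> rows = new HashMap<>();
    // column -> value -> row keys: supports "find by column/value".
    private final Map<String, Map<String, Set<String>>> valueIndex = new HashMap<>();

    void put(String rowKey, String column, String value) {
        String old = rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
        if (old != null) {
            // The cell was overwritten: drop the stale index entry.
            valueIndex.get(column).get(old).remove(rowKey);
        }
        valueIndex.computeIfAbsent(column, k -> new HashMap<>())
                  .computeIfAbsent(value, k -> new LinkedHashSet<>())
                  .add(rowKey);
    }

    // Row keys whose cell in the given column holds the given value.
    Set<String> rowsWith(String column, String value) {
        return valueIndex.getOrDefault(column, Collections.emptyMap())
                         .getOrDefault(value, Collections.emptySet());
    }

    Map<String, String> getRow(String rowKey) {
        return rows.getOrDefault(rowKey, Collections.emptyMap());
    }

    // Crude storage measure: number of row keys held in the value index.
    int indexEntries() {
        int n = 0;
        for (Map<String, Set<String>> byValue : valueIndex.values()) {
            for (Set<String> keys : byValue.values()) {
                n += keys.size();
            }
        }
        return n;
    }
}
```

In this sketch the value index holds one entry per cell, so it roughly
doubles the number of references kept. Whether that is worth it is
exactly what the evaluation should show.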

So far, I've gained much more insight into Jena thanks to the project.
The challenging problem in (1) interests me the most. I appreciate
Andy's help very much.

Cheers,
Ying Jiang

