Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2014/06/04 17:07:11 UTC

Re: GSoC routine

Ying,

The next part of the project is the property tables which are compact 
storage for CSV files that exploit the regular structure of the data.

These will be useful not only for CSV files but potentially also for other 
uses storing regular data (outside this project).  These are what an SQL 
database would call .... a "table" :-)

For the "Design PropertyTable" item, it would be really good to write up 
and share a design on this list.

	Andy

On 26/05/14 11:06, Andy Seaborne wrote:
> On 24/05/14 14:32, Ying Jiang wrote:
>> Dear Andy,
>>
>> I see the discussion of JENA-699 about the CSV/TSV parser. It seems
>> that Apache Commons CSV would be a better choice for future.
>> Therefore, I'm not strictly following the project plan in the proposal
>> [1], which I'm supposed to develop the CSV parser at the beginning of
>> the project.
>
> It looks very good.
>
> As I'm finding on the CSV working group, "CSV" is a somewhat broad
> catch-all piece of terminology, ranging from using ";" for the separator
> (common in areas of the world where the decimal number separator is ",")
> to fixed width layout.  We'll stick to RFC 4180 CSV files for now. There
> is going to be a revised spec at some time but not soon.
>
> One of the advantages of Apache Commons CSV, or other parsers, is the
> ability to cope with the variety out there.  The CSV parser dropped in
> recently only does comma separated, properly escaped files.
>
> (Honestly, it was quicker to write it than to investigate all the existing
> parsers! It was needed quickly for SPARQL test cases.)
>
>> Instead, I'm working on "2.1 RIOT Reader for CSV Files". Things have
>> gone well so far. I have just added the new "LangCSV" and its unit test.
>> Please check the code committed just now. Any comments are welcome!
>
> Slight problem:
>
> --------------
> col1, col2
> abc,"23""4"
> --------------
>
> "23""4" is a CSV field using quotes and "" is an internal escaped double
> quote character - the base CSV parser deals with the quotes.
>
> So the token is the string 23"4
>
> You call LangCSV.parse which in turn invokes the tokenizer for Turtle
> which then complains as 23"4 is a mess in Turtle.
>
> There's no need to parse - either it's a string or a double (for now).
> It's not an RDF term with language, datatype etc (the SPARQL results
> in TSV do that).
>
> Fix added - I also abused the parser's use of row/col for CSV errors.
>
>>
>> In the next week, I'd like to complete 2.1, which means Jena can read
>> ".csv" file into Model.
>
> As you do this, more tests to push all the cases are going to be needed,
> both for strange cases like the one above and for other situations,
> such as what happens when a column name has a space in it, or some other
> non-URI fragment character (answer - %-encode it).
>
> For testing the outcome of parsing, you can determine if two models are
> "the same" by using model.isIsomorphicWith(otherModel)
>
> It returns true/false depending on whether there is a consistent
> renaming of bNodes from one model to the other (that's the isomorphism).
>
> So testing can have the right answer as a Turtle model, and compare it
> to the parsed CSV file.
>
>      Andy
>
>>
>> Best regards,
>> Ying Jiang
>>
>> [1]
>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>>
>
> https://issues.apache.org/jira/browse/JENA-625
>
> (I thought all accepted melange proposals became public automatically
> when accepted and the programme started.  Maybe it'll happen soon.)
>
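
The "23""4" case quoted above can be illustrated with a minimal sketch of RFC 4180-style field splitting. This is not the parser that was committed to Jena; the class and method names are hypothetical, and it assumes a single record with no embedded line breaks. It only shows that the quoted field unwraps to the plain string 23"4, which can then be used as a string value without being re-tokenized as Turtle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits one CSV record, RFC 4180 style. */
final class CsvLineSketch {

    /** Splits a line into fields, unwrapping quotes and turning "" into ". */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote of the field
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote of the field
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // The record from the message above: abc,"23""4"
        System.out.println(splitLine("abc,\"23\"\"4\""));
    }
}
```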

Re: GSoC routine

Posted by Ying Jiang <jp...@gmail.com>.

Dear Andy,

Thanks for pointing it out! IRILib.encodeUriComponent(String) works well.

Is there a tool in the Jena code base for parsing a String into a Jena
Literal of the appropriate datatype? For example:
1) "65000" -> "65000"^^<http://www.w3.org/2001/XMLSchema#integer>
2) "65000.123" -> "65000.123"^^<http://www.w3.org/2001/XMLSchema#double>
3) "true" -> "true"^^<http://www.w3.org/2001/XMLSchema#boolean>
There would be more advanced features like parsing date/time to
xsd:date or xsd:dateTime for CSV values.
The NodeFactory.createLiteral() method does not provide these features.

Best regards,
Ying Jiang
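
The three mappings asked about above could be sketched in plain Java as follows. This is not a Jena API: the class name, the method name and the choice of regular expressions are assumptions made for illustration, and date/time detection is left out. The method only returns the datatype IRI to pair with the unchanged lexical form.

```java
/** Hypothetical sketch: guesses an XSD datatype IRI for a CSV field value. */
final class CsvDatatypeSketch {
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    static String guessDatatype(String field) {
        String v = field.trim();
        if (v.equals("true") || v.equals("false")) {
            return XSD + "boolean";
        }
        if (v.matches("[+-]?[0-9]+")) {
            return XSD + "integer";
        }
        // Decimal point and/or exponent; plain integers were handled above.
        if (v.matches("[+-]?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([eE][+-]?[0-9]+)?")) {
            return XSD + "double";
        }
        return XSD + "string"; // anything else stays a plain string
    }

    public static void main(String[] args) {
        for (String s : new String[] {"65000", "65000.123", "true", "23\"4"}) {
            System.out.println(s + " -> " + guessDatatype(s));
        }
    }
}
```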





On Sun, Jun 22, 2014 at 2:16 AM, Andy Seaborne <an...@apache.org> wrote:
> I completely forgot about:
>
> IRILib.encodeUriComponent(String)
>
>         Andy
>
>
> On 19/06/14 11:09, Andy Seaborne wrote:
>>
>> On 15/06/14 18:18, Ying Jiang wrote:
>>>
>>> 1) space and other non-URI characters in column name
>>> I introduce the LangCSV.encodeURIComponent() method borrowed from [1].
>>> However it does not strictly conform to RFC 3986 [2].
>>> TestLangCSV.testNonURICharacters() [7] shows the escaping result.
>>> There's also another related standard of RFC 2396 [3]. I'm confused by
>>> them.
>>
>>
>> RFC 2396 is superseded by RFC 3986.
>>
>>> Which one is Jena URI supposed to stick to?
>>> There're other escaping method from libs, such as spring-web [4],
>>> guava [5] and the old commons-httpclient [6]. Is it OK to make Jena
>>> (jena-arq) depending on one of these libs?
>>
>>
>>
>> Jena has some IRI code that may be useful to you:
>>
>> // Includes punycode for host names!
>> IRI iri = IRIFactory.iriImplementation()
>>                      .create("http://examplé/foo bar?query=a b") ;
>> System.out.println(iri.toASCIIString()) ;
>>
>> iri = IRIFactory.iriImplementation()
>>                      .create("foo bar?query=a b") ;
>> System.out.println(iri.toASCIIString()) ;
>>
>> It's not query-string sensitive, "a b" becomes "a%20b" and not "a+b",
>> but for producing URIs in the CSV case that does not matter (?).
>>
>> You'll need to be careful about '?' anyway as you'll need to specially
>> %-encode it.
>>
>> Jena already depends on org.apache.httpcomponents.httpclient so that is
>> no extra dependency.
>>
>>      Andy
>
>

Re: GSoC routine

Posted by Andy Seaborne <an...@apache.org>.

I completely forgot about:

IRILib.encodeUriComponent(String)

	Andy

On 19/06/14 11:09, Andy Seaborne wrote:
> On 15/06/14 18:18, Ying Jiang wrote:
>> 1) space and other non-URI characters in column name
>> I introduce the LangCSV.encodeURIComponent() method borrowed from [1].
>> However it does not strictly conform to RFC 3986 [2].
>> TestLangCSV.testNonURICharacters() [7] shows the escaping result.
>> There's also another related standard of RFC 2396 [3]. I'm confused by
>> them.
>
> RFC 2396 is superseded by RFC 3986.
>
>> Which one is Jena URI supposed to stick to?
>> There're other escaping method from libs, such as spring-web [4],
>> guava [5] and the old commons-httpclient [6]. Is it OK to make Jena
>> (jena-arq) depending on one of these libs?
>
>
> Jena has some IRI code that may be useful to you:
>
> // Includes punycode for host names!
> IRI iri = IRIFactory.iriImplementation()
>                      .create("http://examplé/foo bar?query=a b") ;
> System.out.println(iri.toASCIIString()) ;
>
> iri = IRIFactory.iriImplementation()
>                      .create("foo bar?query=a b") ;
> System.out.println(iri.toASCIIString()) ;
>
> It's not query-string sensitive, "a b" becomes "a%20b" and not "a+b",
> but for producing URIs in the CSV case that does not matter (?).
>
> You'll need to be careful about '?' anyway as you'll need to specially
> %-encode it.
>
> Jena already depends on org.apache.httpcomponents.httpclient so that is
> no extra dependency.
>
>      Andy

Re: GSoC routine

Posted by Andy Seaborne <an...@apache.org>.

On 15/06/14 18:18, Ying Jiang wrote:
> 1) space and other non-URI characters in column name
> I introduce the LangCSV.encodeURIComponent() method borrowed from [1].
> However it does not strictly conform to RFC 3986 [2].
> TestLangCSV.testNonURICharacters() [7] shows the escaping result.
> There's also another related standard of RFC 2396 [3]. I'm confused by
> them.

RFC 2396 is superseded by RFC 3986.

> Which one is Jena URI supposed to stick to?
> There're other escaping method from libs, such as spring-web [4],
> guava [5] and the old commons-httpclient [6]. Is it OK to make Jena
> (jena-arq) depending on one of these libs?


Jena has some IRI code that may be useful to you:

// Includes punycode for host names!
IRI iri = IRIFactory.iriImplementation()
                     .create("http://examplé/foo bar?query=a b") ;
System.out.println(iri.toASCIIString()) ;

iri = IRIFactory.iriImplementation()
                     .create("foo bar?query=a b") ;
System.out.println(iri.toASCIIString()) ;

It's not query-string sensitive, "a b" becomes "a%20b" and not "a+b", 
but for producing URIs in the CSV case that does not matter (?).

You'll need to be careful about '?' anyway as you'll need to specially 
%-encode it.

Jena already depends on org.apache.httpcomponents.httpclient so that is 
no extra dependency.

	Andy
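
For the column-name case, %-encoding against the RFC 3986 "unreserved" set can be sketched in plain Java as below. This is an illustration only and makes no claim about what any Jena or httpclient method does internally; the class and method names are hypothetical. It encodes the UTF-8 bytes of every character outside A-Z, a-z, 0-9, "-", ".", "_" and "~", so a space becomes %20 (not "+") and '?' becomes %3F.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: %-encodes a CSV column name for use in a URI. */
final class ColumnNameEncoderSketch {
    // RFC 3986 section 2.3 "unreserved" characters are left as they are.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String encodeComponent(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && UNRESERVED.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                // Everything else, including '?' and space, is %-encoded.
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeComponent("first name"));
        System.out.println(encodeComponent("is it?"));
    }
}
```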

Re: GSoC routine

Posted by Ying Jiang <jp...@gmail.com>.

Dear Andy,

Thank you for your comments! Accordingly, I've just made 2 major
commits to jena-csv and jena-arq. Some of the issues you mentioned
have been resolved, while others require further discussions. Please
check the following ones:

1) space and other non-URI characters in column name
I introduce the LangCSV.encodeURIComponent() method borrowed from [1].
However it does not strictly conform to RFC 3986 [2].
TestLangCSV.testNonURICharacters() [7] shows the escaping result.
There's also another related standard of RFC 2396 [3]. I'm confused by
them. Which one is Jena URI supposed to stick to?
There are other escaping methods in libraries, such as spring-web [4],
guava [5] and the old commons-httpclient [6]. Is it OK to make Jena
(jena-arq) depend on one of these libs?

2)  using model.isIsomorphicWith(otherModel) for testing
Done

3)  blank nodes labels for each row
Done. Now NodeFactory.createAnon() generates the blank node label for
each row. If the subject might even be a URI calculated from the
row in future, we can just change the code of
LangCSV.caculateSubject().

4)  indexes of PropertyTableImpl for SPARQL executing efficiently on CSV data
Now PropertyTableImpl has 2 indexes of PSO and POS using HashMap.
Therefore, the current performance of queries should be (not tested
yet):
 (a) - Query for ?s <p> <o> :  use POS index directly (fast)
 (b) - Query for ?s <p> ?o :  use PSO index directly (fast)
 (c) - Query for ?s ?p <o> :  use POS index ( O(n), n= column count )
 (d) - Query for <s> ?p ?o :  use PSO index ( O(n), n= column count )
 (e) - Query for ?s ?p ?o : use PSO index to scan all the table (slow)
Since the subjects of the Triples within PropertyTable are blank nodes
or calculated URIs, the most popular queries should be (a) and (b), and
they are fast now. If the subjects are queried frequently, I can
add an SPO index in a similar way, so that access by
subject could pick a row directly. Any comments?

5)  SUM/COUNT and the other aggregates on a column
I'm not sure whether this issue is within the scope of my work. For
example, it seems that COUNT is handled by AggCountVar [8]. If my
GraphPropertyTable delivers the right graphBaseFind() method, I think
the COUNT can work well by AggCountVar (not tested yet). Am I right?
How can I modify the COUNT behavior to improve the performance through
indexes? Could you please show me how TDB does this by indexing?

6)  storing a PropertyTable as a Java array vs. HashMap
I can make several different implementations of PropertyTable to
compare. At least, the HashMap implementation now can be used for
testing performance and memory storage space as a base line.
The simplest way of storing PropertyTable is a 2 dimensional Java
array of Node. But using a Java array requires knowing the total number
of rows and columns in the first place. Yes, we can scan the lines of the
CSV file to get the number. What if we later add more rows/columns
than the Java array can hold (ArrayIndexOutOfBoundsException)? Is
PropertyTable itself immutable?

7) javadoc for the PropertyTable API
Done. Please check whether the API is sufficient. Any more
interfaces/methods required?

Best regards,
Ying Jiang
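
The PSO/POS access routes listed in item 4 can be sketched with nested HashMaps. This is not the PropertyTableImpl code: the class and method names are hypothetical, Strings stand in for Jena Nodes, and it assumes each cell is set once while the table is loaded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of PSO and POS indexes over a property table. */
final class PropertyTableIndexSketch {
    // PSO: predicate (column) -> subject (row) -> object (cell value)
    private final Map<String, Map<String, String>> pso = new HashMap<>();
    // POS: predicate (column) -> object (cell value) -> subjects (rows)
    private final Map<String, Map<String, List<String>>> pos = new HashMap<>();

    /** Assumes each (subject, predicate) cell is added at most once. */
    void add(String s, String p, String o) {
        pso.computeIfAbsent(p, k -> new HashMap<>()).put(s, o);
        pos.computeIfAbsent(p, k -> new HashMap<>())
           .computeIfAbsent(o, k -> new ArrayList<>()).add(s);
    }

    /** Pattern (a) ?s <p> <o>: two hash lookups in POS. */
    List<String> subjectsFor(String p, String o) {
        return pos.getOrDefault(p, Collections.emptyMap())
                  .getOrDefault(o, Collections.emptyList());
    }

    /** Pattern (b) ?s <p> ?o: one hash lookup in PSO gives the column. */
    Map<String, String> column(String p) {
        return pso.getOrDefault(p, Collections.emptyMap());
    }

    /** Pattern (d) <s> ?p ?o: one probe per column, O(n) in column count. */
    Map<String, String> row(String s) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> e : pso.entrySet()) {
            String o = e.getValue().get(s);
            if (o != null) {
                result.put(e.getKey(), o);
            }
        }
        return result;
    }
}
```

An SPO index keyed by subject would turn pattern (d) into a single lookup, at the cost of a third copy of the table in memory.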


[1] http://stackoverflow.com/questions/607176/java-equivalent-to-javascripts-encodeuricomponent-that-produces-identical-outpu
[2] http://www.ietf.org/rfc/rfc3986.txt
[3] http://www.ietf.org/rfc/rfc2396.txt
[4] http://docs.spring.io/spring/docs/3.0.x/api/org/springframework/web/util/UriUtils.html
[5] http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/net/UrlEscapers.html
[6] http://grepcode.com/file/repo1.maven.org/maven2/commons-httpclient/commons-httpclient/3.1/org/apache/commons/httpclient/util/URIUtil.java#URIUtil
[7] https://svn.apache.org/repos/asf/jena/trunk/jena-arq/src/test/java/org/apache/jena/riot/lang/TestLangCSV.java
[8] https://svn.apache.org/repos/asf/jena/trunk/jena-arq/src/main/java/com/hp/hpl/jena/sparql/expr/aggregate/AggCountVar.java

On Mon, Jun 9, 2014 at 8:58 PM, Andy Seaborne <an...@apache.org> wrote:
> Hi there,
>
> I think I see what's going on but could you write some javadoc for the
> interfaces please?
>
>
> RDF blank nodes have their own rules and requirements.  Each row is supposed
> to be allocated a blank node as its subject, and a later access may
> be using that blank node.  In the general case, and the WG is working on
> this, the subject might even be a URI calculated from the row.
>
> I'm afraid you can't do:
>
> PropertyTableImpl::getTripleIterator(Column)
> Node s  = NodeFactory.createAnon(AnonId.create( "_:"+rowNum  ));
>
> (although it's very very useful to have that capability for debugging! The
> LabelToNode code in ARQ, with a new policy will help you to abstract this.)
>
> because different tables end up with the same bNode for each row 1 etc.
> You'll need to create a subject bNode (and let Jena choose the label so it's
> unique for every table read) as the table is read.
>
> Then for the find operation, it would be good if access by subject could pick a
> row directly, rather than a scan of the property table.  This is one case of
> having additional indexes into the data.  Is the design suitable for this
> kind of access?
>
> Discussion points:
>
> 1/
>
>>> For now, the API of PropertyTable is enough for performing SPARQL
>>> querying. If the advanced features are required in future, it's
>>> possible to add some methods to PropertyTable, Row or Column. For
>>> example, is PropertyTable supposed to be mutable or read-only? Any
>>> other suggestion for the API, e.g. for an SQL database?
>
>
> Yes, because it provides  find() it can perform SPARQL queries but it may be
> quite slow as it can involve a whole table-scan. Could you make some
> suggestions as to how to exploit the data structures so SPARQL can execute
> efficiently on CSV data?  For example, is there a way that SUM/COUNT and the
> other aggregates on a column can be handled?  (At this stage, it's design
> work not implementation - it will help verify the implementation in
> PropertyTableImpl.java has the right datastructures.)
>
> 2/
> I'd like to get your opinion on storing a property table as a Java array,
> indexed by row number, because an array is more compact than a hash map.  (I'm
> trying to understand if the space taken by PropertyTableImpl can be reduced
> - if we think of reading in either large tables or many tables, or
> both(!!!), then space can become an issue.)
>
>         Andy
>
>
>
> On 07/06/14 14:26, Ying Jiang wrote:
>>
>> Dear Andy,
>>
>> I've just committed the API of PropertyTable with the implementations
>> [1]. The code follows the original design (which I paste below) in
>> the project proposal, which can accommodate other regular data besides
>> CSV files:
>> ------
>> 1.2 PropertyTable
>> A PropertyTable is a collection of data that is sufficiently regular in
>> shape it can be treated as a table. That means each subject has a
>> value for each one of the set of properties. Irregularity in terms of
>> missing values needs to be handled but not multiple values for the
>> same property. With special storage, a PropertyTable
>> - is more compact and more amenable to custom storage (e.g. a JSON
>> document store)
>> - can have custom indexes on specific columns
>> - can guarantee access orders
>> Providing these features is out of scope of the project but the
>> architecture of the work must be mindful of these possibilities.
>> For this project, PropertyTable is designed to be a table of RDF
>> terms, or Nodes in Jena. The interface should provide the methods of
>> getting Nodes by subject or property-value. Using the CSV Parser that
>> generates tuples of RDF terms, I can add code to take a tabular CSV
>> file, and create a PropertyTable, or literally its implementation of
>> PropertyTableImpl, using PropertyTableBuilder. To support testing,
>> there should be an RDF-tuples-to-PropertyTable path through RDF Tuple I/O
>> [5] as well.
>> 1.3 GraphPropertyTable
>> GraphPropertyTable implements the Graph interface (readonly) over a
>> PropertyTable. This is subclass from GraphBase and implements find().
>> The find() method needs to choose the access route based on the find
>> arguments. This will offer the PropertyTable interface, or an
>> appropriate subset or variation, so that such a graph can be treated
>> in a more table-like fashion.
>> GraphCSV is a sub class of GraphPropertyTable for aiming at CSV
>> powered by the CSV Parser.
>>
>> -----
>>
>> GraphPropertyTable and GraphCSV have also been implemented. Please
>> check the test case [2], which realizes the example from the project
>> proposal that performs SPARQL querying over a GraphCSV.
>>
>> For now, the API of PropertyTable is enough for performing SPARQL
>> querying. If the advanced features are required in future, it's
>> possible to add some methods to PropertyTable, Row or Column. For
>> example, is PropertyTable supposed to be mutable or read-only? Any
>> other suggestion for the API, e.g. for an SQL database?
>>
>> In the next steps, I'd like to refine the code and add more tests, for
>> more robust CSV parsing and SPARQL querying, especially for the
>> problems you pointed out in your previous email on 26th May.
>>
>> Cheers,
>> Ying Jiang
>>
>> [1]
>> https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/
>> [2]
>> https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/test/java/com/hp/hpl/jena/propertytable/impl/GraphCSVTest.java
>>

Re: GSoC routine

Posted by Andy Seaborne <an...@apache.org>.

Hi there,

I think I see what's going on but could you write some javadoc for the 
interfaces please?


RDF blank nodes have their own rules and requirements.  Each row is 
supposed to be allocated a blank node as its subject, and a later 
access may be using that blank node.  In the general case, and the WG is 
working on this, the subject might even be a URI calculated from the row.

I'm afraid you can't do:

PropertyTableImpl::getTripleIterator(Column)
Node s  = NodeFactory.createAnon(AnonId.create( "_:"+rowNum  ));

(although it's very very useful to have that capability for debugging! 
The LabelToNode code in ARQ, with a new policy will help you to abstract 
this.)

because different tables end up with the same bNode for each row 1 etc. 
  You'll need to create a subject bNode (and let Jena choose the label 
so it's unique for every table read) as the table is read.
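
The difference between row-number labels and labels chosen at read time can be sketched in plain Java. In Jena itself the label would come from Jena (the thread later uses NodeFactory.createAnon() with no argument); here java.util.UUID is only a stand-in, and the class and method names are hypothetical. The allocator keeps the label per row so that later access by row still finds the same subject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch: one allocator per table read. */
final class RowSubjectAllocatorSketch {
    // Label of the subject for each row, in the order the rows were read.
    private final List<String> labels = new ArrayList<>();

    /** Allocates a fresh, globally unique label for the next row. */
    String allocate() {
        String label = UUID.randomUUID().toString(); // stand-in for a Jena-chosen label
        labels.add(label);
        return label;
    }

    /** Returns the label allocated for a 0-based row number. */
    String labelForRow(int row) {
        return labels.get(row);
    }

    public static void main(String[] args) {
        RowSubjectAllocatorSketch tableA = new RowSubjectAllocatorSketch();
        RowSubjectAllocatorSketch tableB = new RowSubjectAllocatorSketch();
        // Row 0 of two different tables gets two different subjects.
        System.out.println(tableA.allocate().equals(tableB.allocate()));
    }
}
```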

Then for the find operation, it would be good if access by subject could 
pick a row directly, rather than a scan of the property table.  This is 
one case of having additional indexes into the data.  Is the design 
suitable for this kind of access?

Discussion points:

1/
>> For now, the API of PropertyTable is enough for performing SPARQL
>> querying. If the advanced features are required in future, it's
>> possible to add some methods to PropertyTable, Row or Column. For
>> example, is PropertyTable supposed to be mutable or read-only? Any
>> other suggestion for the API, e.g. for an SQL database?

Yes, because it provides find() it can perform SPARQL queries but it 
may be quite slow as it can involve a whole table-scan. Could you make 
some suggestions as to how to exploit the data structures so SPARQL can 
execute efficiently on CSV data?  For example, is there a way that 
SUM/COUNT and the other aggregates on a column can be handled?  (At this 
stage, it's design work not implementation - it will help verify the 
implementation in PropertyTableImpl.java has the right data structures.)

2/
I'd like to get your opinion on storing a property table as a Java 
array, indexed by row number, because an array is more compact than a hash 
map.  (I'm trying to understand if the space taken by PropertyTableImpl 
can be reduced - if we think of reading in either large tables or many 
tables, or both(!!!), then space can become an issue.)

	Andy
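
Both discussion points can be illustrated with one column stored as a Java array indexed by row number. This is a sketch under assumptions, not the PropertyTableImpl design: names are hypothetical, Strings stand in for Nodes, null marks a missing value, and sum() assumes the column holds numeric lexical forms. The array grows by doubling, so the row count need not be known up front, and COUNT/SUM become a single pass over one array rather than a scan of every triple.

```java
import java.util.Arrays;

/** Hypothetical sketch: one property-table column as a growable array. */
final class ArrayColumnSketch {
    private String[] cells = new String[4]; // index = row number, null = missing
    private int rows = 0;

    /** Appends the value for the next row, doubling the array when full. */
    void append(String value) {
        if (rows == cells.length) {
            cells = Arrays.copyOf(cells, cells.length * 2);
        }
        cells[rows++] = value;
    }

    /** Direct access by row number. */
    String get(int row) {
        if (row < 0 || row >= rows) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        return cells[row];
    }

    /** COUNT over the column: rows that actually have a value. */
    int count() {
        int n = 0;
        for (int i = 0; i < rows; i++) {
            if (cells[i] != null) n++;
        }
        return n;
    }

    /** SUM over the column, skipping missing values. */
    double sum() {
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            if (cells[i] != null) total += Double.parseDouble(cells[i]);
        }
        return total;
    }
}
```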


On 07/06/14 14:26, Ying Jiang wrote:
> Dear Andy,
>
> I've just committed the API of PropertyTable with the implementations
> [1]. The code follows the original design (which I paste below) in
> the project proposal, which can accommodate other regular data besides
> CSV files:
> ------
> 1.2 PropertyTable
> A PropertyTable is a collection of data that is sufficiently regular in
> shape it can be treated as a table. That means each subject has a
> value for each one of the set of properties. Irregularity in terms of
> missing values needs to be handled but not multiple values for the
> same property. With special storage, a PropertyTable
> - is more compact and more amenable to custom storage (e.g. a JSON
> document store)
> - can have custom indexes on specific columns
> - can guarantee access orders
> Providing these features is out of scope of the project but the
> architecture of the work must be mindful of these possibilities.
> For this project, PropertyTable is designed to be a table of RDF
> terms, or Nodes in Jena. The interface should provide the methods of
> getting Nodes by subject or property-value. Using the CSV Parser that
> generates tuples of RDF terms, I can add code to take a tabular CSV
> file, and create a PropertyTable, or literally its implementation of
> PropertyTableImpl, using PropertyTableBuilder. To support testing,
> there should be an RDF-tuples-to-PropertyTable path through RDF Tuple I/O
> [5] as well.
> 1.3 GraphPropertyTable
> GraphPropertyTable implements the Graph interface (readonly) over a
> PropertyTable. This is subclass from GraphBase and implements find().
> The find() method needs to choose the access route based on the find
> arguments. This will offer the PropertyTable interface, or an
> appropriate subset or variation, so that such a graph can be treated
> in a more table-like fashion.
> GraphCSV is a sub class of GraphPropertyTable for aiming at CSV
> powered by the CSV Parser.
> -----
>
> GraphPropertyTable and GraphCSV have also been implemented. Please
> check the test case [2], which realizes the example from the project
> proposal that performs SPARQL querying over a GraphCSV.
>
> For now, the API of PropertyTable is enough for performing SPARQL
> querying. If the advanced features are required in future, it's
> possible to add some methods to PropertyTable, Row or Column. For
> example, is PropertyTable supposed to be mutable or read-only? Any
> other suggestion for the API, e.g. for an SQL database?
>
> In the next steps, I'd like to refine the code and add more tests, for
> more robust CSV parsing and SPARQL querying, especially for the
> problems you pointed out in your previous email on 26th May.
>
> Cheers,
> Ying Jiang
>
> [1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/
> [2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/test/java/com/hp/hpl/jena/propertytable/impl/GraphCSVTest.java
>
> On Wed, Jun 4, 2014 at 11:07 PM, Andy Seaborne <an...@apache.org> wrote:
>> Ying,
>>
>> The next part of the project is the property tables which are compact
>> storage for CSV files that exploit the regular structure of the data.
>>
>> These will be useful both for CSV files but potentially for other uses
>> storing regular data (outside this project).  These are what an SQL database
>> would call .... a "table" :-)
>>
>> For the "Design PropertyTable" item, it would be really good to write up and
>> share a design on this list.
>>
>>          Andy
>>
>>
>> On 26/05/14 11:06, Andy Seaborne wrote:
>>>
>>> On 24/05/14 14:32, Ying Jiang wrote:
>>>>
>>>> Dear Andy,
>>>>
>>>> I see the discussion of JENA-699 about the CSV/TSV parser. It seems
>>>> that Apache Commons CSV would be a better choice for future.
>>>> Therefore, I'm not strictly following the project plan in the proposal
>>>> [1], which I'm supposed to develop the CSV parser at the beginning of
>>>> the project.
>>>
>>>
>>> It looks very good.
>>>
>>> As I'm finding on the CSV - working group "CSV" is a somewhat broad
>>> catch-all piece of terminology, ranging from using ";" for the separator
>>> (common in areas of the world where the decimal number separator is ",")
>>> to fixed width layout.  We'll stick to RFC 4180 CSV files for now. There
>>> is going to be a revised spec at some time but not soon.
>>>
>>> One of the advantages of Apache Commons CSV, or other parsers, is the
>>> ability to cope with the variety out there.  The CSV parser dropped in
>>> recently only does comma separated, properly escaped files.
>>>
>>> (Honestly, it was quicker to write it that investiage all the existing
>>> parser! It was needed quicker for SPARQL test cases.).
>>>
>>>> Instead, I'm working on  "2.1 RIOT Reader for CSV Files". Things are
>>>> going well until now. I just and the new "LangCSV" and its unit test.
>>>> Please check the code commited just now. Any comments are welcome!
>>>
>>>
>>> Slight problem:
>>>
>>> --------------
>>> col1, col2
>>> abc,"23""4"
>>> --------------
>>>
>>> "23""4" is a CSV field using quotes and "" is an internal escaped double
>>> quote charcater - the base CSV parser deals with the quotes.
>>>
>>> So it the token is the string 23"4
>>>
>>> You call LangCSV.parse which in turn invokes the tokenizer for Turtle
>>> which then complains as 23"4 is a mess in Turtle.
>>>
>>> There's no need to parse - either it's a string or a double (for now).
>>> It's not an a RDF term with language, datatype etc (the SPARQL results
>>> in TSV does do that)
>>>
>>> Fix added - I also abused the parsers use of row/col for CSV errors.
>>>
>>>>
>>>> In the next week, I'd like to complete 2.1, which means Jena can read
>>>> ".csv" file into Model.
>>>
>>>
>>> As you do this, more tests to push all the cases are going to be needed,
>>> both for more strange cases like the above, and other situations
>>> including what happens when a column name has a space in it?  Or other
>>> non-URI fragment character in it (answer - %-encode it).
>>>
>>> For testing the outcome of parsing, you can determine if two models are
>>> "the same" by using model.isIsomorphicWith(otherModel)
>>>
>>> It returns true/false depending on whether there is a consistent
>>> renaming of bNodes from one model to the other (that's the isomorphism).
>>>
>>> So testing can have the right answer as a Turtle model, and compare it
>>> to the parsed CSV file.
>>>
>>>       Andy
>>>
>>>>
>>>> Best regards,
>>>> Ying Jiang
>>>>
>>>> [1]
>>>>
>>>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>>>>
>>>
>>> https://issues.apache.org/jira/browse/JENA-625
>>>
>>> (I thought all accepted melange proposals became public automatically
>>> when accepted and the programme started.  Maybe it'll happen soon.)
>>>
>>

Re: GSoC routine

Posted by Ying Jiang <jp...@gmail.com>.

Dear Andy,

I've just committed the API of PropertyTable with the implementations
[1]. The code follows the original design from the project proposal
(pasted below), which can accommodate other regular data besides
CSV files:
------
1.2 PropertyTable
A PropertyTable is a collection of data that is sufficiently regular in
shape that it can be treated as a table. That means each subject has a
value for each one of the set of properties. Irregularity in terms of
missing values needs to be handled but not multiple values for the
same property. With special storage, a PropertyTable
- is more compact and more amenable to custom storage (e.g. a JSON
document store)
- can have custom indexes on specific columns
- can guarantee access orders
Providing these features is out of scope of the project but the
architecture of the work must be mindful of these possibilities.
For this project, PropertyTable is designed to be a table of RDF
terms, or Nodes in Jena. The interface should provide the methods of
getting Nodes by subject or property-value. Using the CSV Parser that
generates tuples of RDF terms, I can add code to take a tabular CSV
file and create a PropertyTable (concretely, its implementation
PropertyTableImpl) using PropertyTableBuilder. To support testing,
there should also be a path from RDF tuples to a PropertyTable through
RDF Tuple I/O [5].
1.3 GraphPropertyTable
GraphPropertyTable implements the Graph interface (read-only) over a
PropertyTable. It is a subclass of GraphBase and implements find().
The find() method needs to choose the access route based on the find
arguments. This will offer the PropertyTable interface, or an
appropriate subset or variation, so that such a graph can be treated
in a more table-like fashion.
GraphCSV is a subclass of GraphPropertyTable aimed at CSV data and
powered by the CSV Parser.
-----
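
To make the 1.2 description concrete, here is a minimal sketch of the
idea in plain Java. It is not the committed code: Strings stand in for
Jena Nodes, and the class and method names (PropertyTableSketch,
addColumn, set, get, subjectsWith) are hypothetical. It only shows one
value per subject/property cell, missing cells, and lookup by subject or
by property-value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a property table; Strings stand in for Jena Nodes. */
final class PropertyTableSketch {
    // subject (row key) -> (property (column) -> value); a missing cell is simply absent
    private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();
    private final Set<String> columns = new LinkedHashSet<>();

    void addColumn(String property) {
        columns.add(property);
    }

    /** Sets the single value of a cell; setting the same cell again overwrites it. */
    void set(String subject, String property, String value) {
        if (!columns.contains(property)) {
            throw new IllegalArgumentException("Unknown column: " + property);
        }
        rows.computeIfAbsent(subject, k -> new LinkedHashMap<>()).put(property, value);
    }

    /** Returns the cell value, or null when the cell (or the whole row) is missing. */
    String get(String subject, String property) {
        Map<String, String> row = rows.get(subject);
        return row == null ? null : row.get(property);
    }

    /** Subjects whose cell for the given property equals the value (linear scan, no index). */
    List<String> subjectsWith(String property, String value) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            if (value.equals(e.getValue().get(property))) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    Set<String> columns() {
        return Collections.unmodifiableSet(columns);
    }

    public static void main(String[] args) {
        PropertyTableSketch t = new PropertyTableSketch();
        t.addColumn("col1");
        t.addColumn("col2");
        t.set("row1", "col1", "abc");
        t.set("row2", "col1", "abc");
        t.set("row2", "col2", "12.5");
        System.out.println(t.get("row1", "col2"));         // null: missing cell
        System.out.println(t.subjectsWith("col1", "abc")); // [row1, row2]
    }
}
```

A custom index on a column would replace the linear scan in
subjectsWith; that is the kind of extension the design leaves room for.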

GraphPropertyTable and GraphCSV have also been implemented. Please
check the test case [2], which realizes the example from the project
proposal that performs SPARQL querying over a GraphCSV.
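
The "choose the access route based on the find arguments" part of 1.3
could look roughly like the sketch below. Again this is plain Java under
stated assumptions rather than the committed GraphPropertyTable: Strings
stand in for Nodes, null plays the role that Node.ANY plays in Jena, and
FindRoutingSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of find(s, p, o) routing over a table; null means "any". */
final class FindRoutingSketch {
    // subject -> (property -> value)
    private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    void set(String s, String p, String o) {
        rows.computeIfAbsent(s, k -> new LinkedHashMap<>()).put(p, o);
    }

    /** Returns matching (s, p, o) triples as three-element lists. */
    List<List<String>> find(String s, String p, String o) {
        List<List<String>> out = new ArrayList<>();
        if (s != null) {
            // Subject bound: go straight to that one row.
            Map<String, String> row = rows.get(s);
            if (row != null) {
                matchRow(s, row, p, o, out);
            }
        } else {
            // Subject unbound: every row has to be visited.
            for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
                matchRow(e.getKey(), e.getValue(), p, o, out);
            }
        }
        return out;
    }

    private static void matchRow(String s, Map<String, String> row,
                                 String p, String o, List<List<String>> out) {
        if (p != null) {
            // Property bound: a single cell lookup within the row.
            String v = row.get(p);
            if (v != null && (o == null || o.equals(v))) {
                out.add(Arrays.asList(s, p, v));
            }
        } else {
            // Property unbound: scan all cells of the row.
            for (Map.Entry<String, String> cell : row.entrySet()) {
                if (o == null || o.equals(cell.getValue())) {
                    out.add(Arrays.asList(s, cell.getKey(), cell.getValue()));
                }
            }
        }
    }

    public static void main(String[] args) {
        FindRoutingSketch g = new FindRoutingSketch();
        g.set("row1", "col1", "abc");
        g.set("row1", "col2", "x");
        g.set("row2", "col1", "abc");
        System.out.println(g.find("row1", null, null)); // [[row1, col1, abc], [row1, col2, x]]
        System.out.println(g.find(null, "col1", "abc")); // [[row1, col1, abc], [row2, col1, abc]]
    }
}
```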

For now, the API of PropertyTable is enough for performing SPARQL
querying. If advanced features are required in the future, it's
possible to add methods to PropertyTable, Row or Column. For
example, is PropertyTable supposed to be mutable or read-only? Any
other suggestions for the API, e.g. for an SQL database?

In the next steps, I'd like to refine the code and add more tests, for
more robust CSV parsing and SPARQL querying, especially for the
problems you pointed out in your previous email on 26th May.
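
As a reminder of those cases, here is a small plain-Java sketch of the
behaviour the tests should pin down: the quoted field "23""4" becomes
the string 23"4, a field is either a double or a string, and a column
name with a space is %-encoded. The class and method names are
hypothetical, and the choice of which characters to leave unencoded
(letters, digits, '-', '.', '_', '~') is my assumption, not something
fixed by the earlier email.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helpers for the CSV edge cases; not the LangCSV code. */
final class CsvFieldSketch {

    /** Strips RFC 4180 quoting: outer quotes removed, "" becomes a single ". */
    static String unquote(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            return raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
        }
        return raw;
    }

    /**
     * Treats a field as a double when it parses as one, otherwise as a string.
     * Note: Double.valueOf also accepts forms such as "NaN" or "1e3"; a real
     * reader may want a stricter check.
     */
    static Object toValue(String field) {
        try {
            return Double.valueOf(field);
        } catch (NumberFormatException e) {
            return field;
        }
    }

    /** %-encodes every UTF-8 byte outside letters, digits, '-', '.', '_', '~'. */
    static String percentEncode(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (keep) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String token = unquote("\"23\"\"4\"");
        System.out.println(token);                            // 23"4
        System.out.println(toValue(token) instanceof String); // true
        System.out.println(percentEncode("col 2"));           // col%202
    }
}
```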

Cheers,
Ying Jiang

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/main/java/com/hp/hpl/jena/propertytable/
[2] https://svn.apache.org/repos/asf/jena/Experimental/jena-csv/src/test/java/com/hp/hpl/jena/propertytable/impl/GraphCSVTest.java

On Wed, Jun 4, 2014 at 11:07 PM, Andy Seaborne <an...@apache.org> wrote:
> Ying,
>
> The next part of the project is the property tables which are compact
> storage for CSV files that exploit the regular structure of the data.
>
> These will be useful both for CSV files but potentially for other uses
> storing regular data (outside this project).  These are what an SQL database
> would call .... a "table" :-)
>
> For the "Design PropertyTable" item, it would be really good to write up and
> share a design on this list.
>
>         Andy
>
>
> On 26/05/14 11:06, Andy Seaborne wrote:
>>
>> On 24/05/14 14:32, Ying Jiang wrote:
>>>
>>> Dear Andy,
>>>
>>> I see the discussion of JENA-699 about the CSV/TSV parser. It seems
>>> that Apache Commons CSV would be a better choice for future.
>>> Therefore, I'm not strictly following the project plan in the proposal
>>> [1], which I'm supposed to develop the CSV parser at the beginning of
>>> the project.
>>
>>
>> It looks very good.
>>
>> As I'm finding on the CSV - working group "CSV" is a somewhat broad
>> catch-all piece of terminology, ranging from using ";" for the separator
>> (common in areas of the world where the decimal number separator is ",")
>> to fixed width layout.  We'll stick to RFC 4180 CSV files for now. There
>> is going to be a revised spec at some time but not soon.
>>
>> One of the advantages of Apache Commons CSV, or other parsers, is the
>> ability to cope with the variety out there.  The CSV parser dropped in
>> recently only does comma separated, properly escaped files.
>>
>> (Honestly, it was quicker to write it than to investigate all the
>> existing parsers! It was needed quickly for SPARQL test cases.)
>>
>>> Instead, I'm working on  "2.1 RIOT Reader for CSV Files". Things are
>>> going well until now. I just added the new "LangCSV" and its unit test.
>>> Please check the code committed just now. Any comments are welcome!
>>
>>
>> Slight problem:
>>
>> --------------
>> col1, col2
>> abc,"23""4"
>> --------------
>>
>> "23""4" is a CSV field using quotes and "" is an internal escaped double
>> quote character - the base CSV parser deals with the quotes.
>>
>> So the token is the string 23"4
>>
>> You call LangCSV.parse which in turn invokes the tokenizer for Turtle
>> which then complains as 23"4 is a mess in Turtle.
>>
>> There's no need to parse - either it's a string or a double (for now).
>> It's not an RDF term with language, datatype etc (the SPARQL results
>> in TSV does do that)
>>
>> Fix added - I also abused the parser's use of row/col for CSV errors.
>>
>>>
>>> In the next week, I'd like to complete 2.1, which means Jena can read
>>> ".csv" file into Model.
>>
>>
>> As you do this, more tests to push all the cases are going to be needed,
>> both for more strange cases like the above, and other situations
>> including what happens when a column name has a space in it?  Or other
>> non-URI fragment character in it (answer - %-encode it).
>>
>> For testing the outcome of parsing, you can determine if two models are
>> "the same" by using model.isIsomorphicWith(otherModel)
>>
>> It returns true/false depending on whether there is a consistent
>> renaming of bNodes from one model to the other (that's the isomorphism).
>>
>> So testing can have the right answer as a Turtle model, and compare it
>> to the parsed CSV file.
>>
>>      Andy
>>
>>>
>>> Best regards,
>>> Ying Jiang
>>>
>>> [1]
>>>
>>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>>>
>>
>> https://issues.apache.org/jira/browse/JENA-625
>>
>> (I thought all accepted melange proposals became public automatically
>> when accepted and the programme started.  Maybe it'll happen soon.)
>>
>