Posted to dev@jena.apache.org by Claude Warren <cl...@xenei.com> on 2013/11/04 13:22:12 UTC

proposal to change return type for size() in graph

Currently graph.size() returns an int.  The maximum value for an int
is 2,147,483,647 (about 2.1 billion), whereas model.size() returns a long.

Does it make sense to change the return type for graph.size() to long?

If not, and a graph exceeds 2.1B triples, should size() just return
Integer.MAX_VALUE?
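
To make the overflow concrete, here is a minimal Java sketch. It is not Jena
code: the class SizeClamp and the method clampToInt are hypothetical names
chosen for illustration. It contrasts a plain narrowing cast with a
saturating conversion that caps at Integer.MAX_VALUE.

```java
// Hypothetical helper, not part of Jena: illustrates why a triple count
// above Integer.MAX_VALUE cannot be reported faithfully as an int.
final class SizeClamp {
    private SizeClamp() {}

    // Saturating conversion: counts that do not fit in an int are
    // reported as the nearest int bound instead of wrapping around.
    static int clampToInt(long count) {
        if (count > Integer.MAX_VALUE) return Integer.MAX_VALUE;
        if (count < Integer.MIN_VALUE) return Integer.MIN_VALUE;
        return (int) count;
    }

    public static void main(String[] args) {
        long triples = 2_460_000_000L;            // roughly the figure quoted in this message
        System.out.println((int) triples);        // narrowing cast wraps to a negative value
        System.out.println(clampToInt(triples));  // saturates at Integer.MAX_VALUE
    }
}
```

With saturation a caller can at least treat Integer.MAX_VALUE as "this many
or more", whereas the wrapped value from a plain cast carries no usable
meaning.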

I ask as I am currently working on a project to load all of DBPedia (2.46
billion triples) into a graph.

Claude

-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: proposal to change return type for size() in graph

Posted by Claude Warren <cl...@xenei.com>.
+1 for doing this change in Jena3

I suggest we just change the return type to long and add documentation
that clarifies that this is an estimate of size (the current docs already
imply this) and that, if more than Long.MAX_VALUE triples are present,
size() returns Long.MAX_VALUE.

-1 on delaying package renaming.

0 on git usage.  I haven't used it much, and when I have it drives me crazy,
but that is probably because I don't use it enough.  As long as there is
patience with git check-in screwups, I'm OK with moving to git.

Claude


On Wed, Nov 6, 2013 at 12:41 PM, Rob Vesse <rv...@dotnetrdf.org> wrote:

> Comments inline:
>
> On 06/11/2013 10:48, "Andy Seaborne" <an...@apache.org> wrote:
>
> >We release as a whole so all modules changing at once is doable for us.
> >
> >External implementations don't seem to track versions very closely (years
> >of difference) so all this deprecation cycle stuff can only work on a
> >very long timescale.  Also, they don't allow drop-in later versions of
> >Jena onto old versions of their implementation, which is the killer for
> >smooth changes.
> >
> >So one option is just to make the change.  Any smoothed transition is not,
> >in practice, helping anyone.
>
> I agree, see my later comments on package rename but it would be nice to
> just change the API on the Jena 3 branch and leave those who want to stick
> with Jena 2 to lag behind as they will.  Moving to Jena 3 potentially
> allows us to ignore niceties like deprecation cycles and just simply
> remove/change stuff as necessary.  To aid transition we can always mark
> things as deprecated on the Jena 2 branch with notes that the API is
> changing in Jena 3.
>
> >
> >Or a transition might be:
> >
> >We could mark int/size() @Deprecated, make it return Integer.MAX_VALUE meaning
> >"go ask another method" and add long/size2() that returns the proper
> >answer.  GraphBase implements a not-preferred version where size() calls
> >size2().
> >
> >We switch all our code to use size2() [1].
> >
> >This is still an interface incompatibility but possibly smoother.
> >People using GraphBase have to recompile as they change version of Jena
> >(or maybe not - all the right methods exist and don't change).
> >
> >"Possibly" because of the long lag on versions we see anyway.  Other
> >changes (and we have to have the scope to make other changes somehow)
> >already stop drop-in upgrades of old systems often enough.
> >
> >Or.
> >
> >Jena3.  Interface spring cleaning.  Other changes.
>
> +1
>
> >
> >There is also the data change around xsd:string, which might warrant Jena3.
> >
> >I want to avoid getting into a long trough for Jena3 so I'm looking
> >for how we'd get out of the change phase rather than just how to get
> >into it.
>
> A first pass for Jena 3 would literally be package rename, obvious
> interface changes like this one and then push out an initial release.
>
> >
> >Maybe we start running two codebases in parallel for a while, Jena2
> >being "maintenance only". If we delay package renaming for a while, it's
> >quite easy to roll J3 fixes back into J2.
>
> +1
>
> -1 to delaying package renaming since I feel that makes things trickier
> than they need to be and doesn't help version laggers if they pick up
> 3.0.0 and the APIs are virtually the same and then 3.1.0 changes all the
> package names.
>
> Back-porting to Jena 2 will probably mostly just require a Find/Replace on
> com.hp.hpl.jena to org.apache.jena so I don't see this as a reason to
> delay the package rename if we're going to do Jena 3 anyway.
>
> Moving our source control to git would make maintaining parallel branches
> and back porting changes much easier.  We can then take advantage of
> things like git cherry-pick to aid back porting bug fixes from Jena 3 to
> Jena 2.  So I would suggest we proceed to move to git and set up
> appropriate branches for this workflow.
>
> Rob
>
> >
> >Of course, we have the version-lag to take into account.
> >
> >JIRA is a good place to collect ideas and thoughts:
> >
> >JENA-189 (Jena3/technical)
> >JENA-193 (RDF 1.1)
> >
> >Other JIRA include:
> >
> >JENA-190 (delivery)
> >JENA-191 (module structure)
> >JENA-192 (package naming)
> >
> >       Andy
> >
> >
> >PS Not a double please - a long is large enough and doubles have less
> >precision.  2^63-1 really is a very large number - about 9.2 x 10^18
> >triples (8 exbi-triples).  And in Java 8, 2^64-1 (sort of).
> >
> >[1] Eclipse will do it all in one click.


-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: proposal to change return type for size() in graph

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Comments inline:

On 06/11/2013 10:48, "Andy Seaborne" <an...@apache.org> wrote:

>We release as a whole so all modules changing at once is doable for us.
>
>External implementations don't seem to track versions very closely (years
>of difference) so all this deprecation cycle stuff can only work on a
>very long timescale.  Also, they don't allow drop-in later versions of
>Jena onto old versions of their implementation, which is the killer for
>smooth changes.
>
>So one option is just to make the change.  Any smoothed transition is not,
>in practice, helping anyone.

I agree, see my later comments on package rename but it would be nice to
just change the API on the Jena 3 branch and leave those who want to stick
with Jena 2 to lag behind as they will.  Moving to Jena 3 potentially
allows us to ignore niceties like deprecation cycles and just simply
remove/change stuff as necessary.  To aid transition we can always mark
things as deprecated on the Jena 2 branch with notes that the API is
changing in Jena 3.

>
>Or a transition might be:
>
>We could mark int/size() @Deprecated, make it return Integer.MAX_VALUE meaning
>"go ask another method" and add long/size2() that returns the proper
>answer.  GraphBase implements a not-preferred version where size() calls
>size2().
>
>We switch all our code to use size2() [1].
>
>This is still an interface incompatibility but possibly smoother.
>People using GraphBase have to recompile as they change version of Jena
>(or maybe not - all the right methods exist and don't change).
>
>"Possibly" because of the long lag on versions we see anyway.  Other
>changes (and we have to have the scope to make other changes somehow)
>already stop drop-in upgrades of old systems often enough.
>
>Or.
>
>Jena3.  Interface spring cleaning.  Other changes.

+1

>
>There is also the data change around xsd:string, which might warrant Jena3.
>
>I want to avoid getting into a long trough for Jena3 so I'm looking
>for how we'd get out of the change phase rather than just how to get
>into it.

A first pass for Jena 3 would literally be package rename, obvious
interface changes like this one and then push out an initial release.

>
>Maybe we start running two codebases in parallel for a while, Jena2
>being "maintenance only". If we delay package renaming for a while, it's
>quite easy to roll J3 fixes back into J2.

+1

-1 to delaying package renaming since I feel that makes things trickier
than they need to be and doesn't help version laggers if they pick up
3.0.0 and the APIs are virtually the same and then 3.1.0 changes all the
package names.

Back-porting to Jena 2 will probably mostly just require a Find/Replace on
com.hp.hpl.jena to org.apache.jena so I don't see this as a reason to
delay the package rename if we're going to do Jena 3 anyway.

Moving our source control to git would make maintaining parallel branches
and back porting changes much easier.  We can then take advantage of
things like git cherry-pick to aid back porting bug fixes from Jena 3 to
Jena 2.  So I would suggest we proceed to move to git and set up
appropriate branches for this workflow.

Rob

>
>Of course, we have the version-lag to take into account.
>
>JIRA is a good place to collect ideas and thoughts:
>
>JENA-189 (Jena3/technical)
>JENA-193 (RDF 1.1)
>
>Other JIRA include:
>
>JENA-190 (delivery)
>JENA-191 (module structure)
>JENA-192 (package naming)
>
>	Andy
>
>
>PS Not a double please - a long is large enough and doubles have less
>precision.  2^63-1 really is a very large number - about 9.2 x 10^18
>triples (8 exbi-triples).  And in Java 8, 2^64-1 (sort of).
>
>[1] Eclipse will do it all in one click.





Re: proposal to change return type for size() in graph

Posted by Andy Seaborne <an...@apache.org>.
We release as a whole so all modules changing at once is doable for us.

External implementations don't seem to track versions very closely (years
of difference) so all this deprecation cycle stuff can only work on a 
very long timescale.  Also, they don't allow drop-in later versions of 
Jena onto old versions of their implementation, which is the killer for 
smooth changes.

So one option is just to make the change.  Any smoothed transition is not,
in practice, helping anyone.

Or a transition might be:

We could mark int/size() @Deprecated, make it return Integer.MAX_VALUE meaning
"go ask another method" and add long/size2() that returns the proper 
answer.  GraphBase implements a not-preferred version where size() calls 
size2().

We switch all our code to use size2() [1].

This is still an interface incompatibility but possibly smoother. 
People using GraphBase have to recompile as they change version of Jena 
(or maybe not - all the right methods exist and don't change).
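
The transition described above could look roughly like the following Java
sketch. The types here are hypothetical stand-ins reduced to the two methods
under discussion, not the real Jena Graph and GraphBase, and it takes one
reading of the proposal: the legacy size() delegates to size2() and returns
Integer.MAX_VALUE only when the count does not fit in an int.

```java
// Hypothetical stand-ins for Jena's Graph and GraphBase, reduced to the
// two methods under discussion; this is not the real Jena API.
interface Graph {
    /** @deprecated cannot represent more than Integer.MAX_VALUE triples; use size2(). */
    @Deprecated
    int size();

    /** Triple count as a long (the name size2 comes from the proposal above). */
    long size2();
}

abstract class GraphBase implements Graph {
    // "Not-preferred" bridge: the legacy method delegates to size2()
    // and saturates when the count does not fit in an int.
    @Deprecated
    @Override
    public int size() {
        return (int) Math.min(size2(), (long) Integer.MAX_VALUE);
    }
}

// Toy implementation that only stores a count, for demonstration.
final class CountingGraph extends GraphBase {
    private final long count;

    CountingGraph(long count) { this.count = count; }

    @Override
    public long size2() { return count; }
}

final class TransitionDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        Graph big = new CountingGraph(2_460_000_000L);
        System.out.println(big.size());   // saturated: "go ask another method"
        System.out.println(big.size2());  // the real count
    }
}
```

Subclasses of the base class only need to implement size2(), which is the
sense in which such code might keep working across versions. Anything that
implements the interface directly still has to add the new method.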

"Possibly" because of the long lag on versions we see anyway.  Other 
changes (and we have to have the scope to make other changes somehow)
already stop drop-in upgrades of old systems often enough.

Or.

Jena3.  Interface spring cleaning.  Other changes.

There is also the data change around xsd:string, which might warrant Jena3.

I want to avoid getting into a long trough for Jena3 so I'm looking
for how we'd get out of the change phase rather than just how to get 
into it.

Maybe we start running two codebases in parallel for a while, Jena2 
being "maintenance only". If we delay package renaming for a while, it's 
quite easy to roll J3 fixes back into J2.

Of course, we have the version-lag to take into account.

JIRA is a good place to collect ideas and thoughts:

JENA-189 (Jena3/technical)
JENA-193 (RDF 1.1)

Other JIRA include:

JENA-190 (delivery)
JENA-191 (module structure)
JENA-192 (package naming)

	Andy


PS Not a double please - a long is large enough and doubles have less
precision.  2^63-1 really is a very large number - about 9.2 x 10^18
triples (8 exbi-triples).  And in Java 8, 2^64-1 (sort of).
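
A small Java sketch of the precision point, using only standard JDK calls
(the class name LongVsDouble and the helper viaDouble are made up for this
example). A double has a 53-bit significand, so integers above 2^53 are no
longer all representable. The last line assumes the Java 8 remark refers to
the unsigned helper methods on Long.

```java
// Plain JDK calls only; illustrates the precision point, not any Jena API.
final class LongVsDouble {
    static final long EXACT = (1L << 53) + 1;   // first integer a double cannot hold

    // Converts a long to double and back again.
    static long viaDouble(long v) {
        return (long) (double) v;
    }

    public static void main(String[] args) {
        System.out.println(EXACT);              // 9007199254740993
        System.out.println(viaDouble(EXACT));   // 9007199254740992: the +1 is lost
        System.out.println(Long.MAX_VALUE);     // 2^63 - 1 = 9223372036854775807

        // Java 8 added helpers that treat the 64 bits as unsigned:
        // all bits set then reads as 2^64 - 1.
        System.out.println(Long.toUnsignedString(-1L)); // 18446744073709551615
    }
}
```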

[1] Eclipse will do it all in one click.

On 06/11/13 08:53, Claude Warren wrote:
> On further consideration, perhaps sizeEstimate could return a Numeric
> Literal Node.  This would provide the ability to return very large numbers
> as doubles and smaller numbers as ints and we already have the code to
> convert those values to primitive numbers or Number instances.
>
>
> On Wed, Nov 6, 2013 at 7:32 AM, Claude Warren <cl...@xenei.com> wrote:
>
>> I don't see how to transition unless we change the method name to
>> something like sizeEstimate and return a double.  I think in most cases
>> size is used to determine which side of a join should go on the left for
>> efficiency and for unit tests.  We might want to return a statistical
>> answer X +/- Y (sort of like the delta in the junit
>> assert.equals(double,double,delta) tests )  But this is probably stretching
>> a bit too far.
>>
>> Claude
>>
>>
>> On Tue, Nov 5, 2013 at 10:28 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>>> On 04/11/13 12:22, Claude Warren wrote:
>>>
>>>> Currently graph.size() returns an int.  the maximum value for an int
>>>> is  2,147,483,647 (2.1 billion) though the model.size() returns a long.
>>>>
>>>> Does it make sense to change the return type for graph.size() to long?
>>>>
>>>> If not and a graph exceeds 2.1B triples should size just return
>>>> Integer.MAX_VALUE.
>>>>
>>>> I ask as I am currently working on a project to load all of DBPedia (2.46
>>>> billion triples) into a graph.
>>>>
>>>> Claude
>>>>
>>>>
>>> Good idea.
>>>
>>> How would you see the change being made? (any transition process?)
>>>
>>>          Andy
>>>
>>>
>>
>>
>> --
>> I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
>


Re: proposal to change return type for size() in graph

Posted by Claude Warren <cl...@xenei.com>.
On further consideration, perhaps sizeEstimate could return a Numeric
Literal Node.  This would provide the ability to return very large numbers
as doubles and smaller numbers as ints, and we already have the code to
convert those values to primitive numbers or Number instances.


On Wed, Nov 6, 2013 at 7:32 AM, Claude Warren <cl...@xenei.com> wrote:

> I don't see how to transition unless we change the method name to
> something like sizeEstimate and return a double.  I think in most cases
> size is used to determine which side of a join should go on the left for
> efficiency and for unit tests.  We might want to return a statistical
> answer X +/- Y (sort of like the delta in the junit
> assert.equals(double,double,delta) tests )  But this is probably stretching
> a bit too far.
>
> Claude
>
>
> On Tue, Nov 5, 2013 at 10:28 PM, Andy Seaborne <an...@apache.org> wrote:
>
>> On 04/11/13 12:22, Claude Warren wrote:
>>
>>> Currently graph.size() returns an int.  the maximum value for an int
>>> is  2,147,483,647 (2.1 billion) though the model.size() returns a long.
>>>
>>> Does it make sense to change the return type for graph.size() to long?
>>>
>>> If not and a graph exceeds 2.1B triples should size just return
>>> Integer.MAX_VALUE.
>>>
>>> I ask as I am currently working on a project to load all of DBPedia (2.46
>>> billion triples) into a graph.
>>>
>>> Claude
>>>
>>>
>> Good idea.
>>
>> How would you see the change being made? (any transition process?)
>>
>>         Andy
>>
>>
>
>
> --
> I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>



-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: proposal to change return type for size() in graph

Posted by Claude Warren <cl...@xenei.com>.
I don't see how to transition unless we change the method name to something
like sizeEstimate and return a double.  I think in most cases size is used
to determine which side of a join should go on the left for efficiency, and
in unit tests.  We might want to return a statistical answer X +/- Y (sort
of like the delta in the JUnit assertEquals(double, double, delta) tests),
but this is probably stretching a bit too far.
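
As a rough illustration of the "X +/- Y" idea, here is a hedged Java sketch.
SizeEstimate and JoinOrder are hypothetical types, not Jena API, and putting
the smaller side on the left is an assumption made for the example.

```java
// Hypothetical types for illustration; nothing here is Jena API.
final class SizeEstimate {
    final long value;   // estimated number of triples
    final long delta;   // the estimate is taken to mean value +/- delta

    SizeEstimate(long value, long delta) {
        this.value = value;
        this.delta = delta;
    }

    // True only when this estimate is smaller even in the worst case,
    // i.e. the ranges [value - delta, value + delta] do not overlap.
    // Assumes value + delta and value - delta both fit in a long.
    boolean definitelySmallerThan(SizeEstimate other) {
        return value + delta < other.value - other.delta;
    }
}

final class JoinOrder {
    private JoinOrder() {}

    // Swap only when the right side is clearly the smaller one, so that
    // the smaller side ends up on the left (an assumption of this sketch).
    static boolean swapSides(SizeEstimate left, SizeEstimate right) {
        return right.definitelySmallerThan(left);
    }

    public static void main(String[] args) {
        SizeEstimate a = new SizeEstimate(2_460_000_000L, 1_000_000L);
        SizeEstimate b = new SizeEstimate(50_000L, 500L);
        System.out.println(swapSides(a, b));  // true: b is clearly smaller
    }
}
```

When the two ranges overlap, the sketch leaves the order alone, which is one
way to keep a noisy estimate from flipping the join order back and forth.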

Claude


On Tue, Nov 5, 2013 at 10:28 PM, Andy Seaborne <an...@apache.org> wrote:

> On 04/11/13 12:22, Claude Warren wrote:
>
>> Currently graph.size() returns an int.  the maximum value for an int
>> is  2,147,483,647 (2.1 billion) though the model.size() returns a long.
>>
>> Does it make sense to change the return type for graph.size() to long?
>>
>> If not and a graph exceeds 2.1B triples should size just return
>> Integer.MAX_VALUE.
>>
>> I ask as I am currently working on a project to load all of DBPedia (2.46
>> billion triples) into a graph.
>>
>> Claude
>>
>>
> Good idea.
>
> How would you see the change being made? (any transition process?)
>
>         Andy
>
>


-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: proposal to change return type for size() in graph

Posted by Andy Seaborne <an...@apache.org>.
On 04/11/13 12:22, Claude Warren wrote:
> Currently graph.size() returns an int.  the maximum value for an int
> is  2,147,483,647 (2.1 billion) though the model.size() returns a long.
>
> Does it make sense to change the return type for graph.size() to long?
>
> If not and a graph exceeds 2.1B triples should size just return
> Integer.MAX_VALUE.
>
> I ask as I am currently working on a project to load all of DBPedia (2.46
> billion triples) into a graph.
>
> Claude
>

Good idea.

How would you see the change being made? (any transition process?)

	Andy