Posted to dev@archiva.apache.org by Brett Porter <br...@apache.org> on 2008/12/01 04:13:59 UTC

progress on database decoupling

Hi,

Just a short note - in line with the previous discussion we've had  
about decoupling the database such that Archiva will run without it  
(but can use it for additional stats, etc through a plugin), and  
setting up an extensible metadata format, I've continued the work  
under MRM-1025.

See: http://svn.apache.org/viewvc/archiva/branches/MRM-1025/
and: http://cwiki.apache.org/confluence/display/ARCHIVA/Metadata+storage

Any comments, questions, volunteers? :)

- Brett

--
Brett Porter
brett@apache.org
http://blogs.exist.com/bporter/


Re: progress on database decoupling

Posted by Brett Porter <br...@apache.org>.
On 01/12/2008, at 4:59 PM, Rahul Thakur wrote:

> Hi Brett,
>
> Just had a quick look.
>
> What is the minimum JDK requirement for this - JDK 5.0?

Yep, we're using that actively now, though some of the code is  
catching up.

- Brett

>
>
> I noticed ProjectModelDAO#queryProjectModels(... while similar methods
> ArtifactDAO#queryArtifacts(..)
> RepositoryProblemDAO#queryRepositoryProblems(..)
>
> do not.
>
> Cheers,
> Rahul
>
>
> On 12/1/2008 4:13 PM, Brett Porter wrote:
>> Hi,
>>
>> Just a short note - in line with the previous discussion we've had  
>> about decoupling the database such that Archiva will run without it  
>> (but can use it for additional stats, etc through a plugin), and  
>> setting up an extensible metadata format, I've continued the work  
>> under MRM-1025.
>>
>> See: http://svn.apache.org/viewvc/archiva/branches/MRM-1025/
>> and: http://cwiki.apache.org/confluence/display/ARCHIVA/Metadata+storage
>>
>> Any comments, questions, volunteers? :)
>>
>> - Brett
>>
>> -- 
>> Brett Porter
>> brett@apache.org
>> http://blogs.exist.com/bporter/
>>
>>
>

--
Brett Porter
brett@apache.org
http://blogs.exist.com/bporter/


Re: progress on database decoupling

Posted by Rahul Thakur <ra...@gmail.com>.
Hi Brett,

Just had a quick look.

What is the minimum JDK requirement for this - JDK 5.0?

I noticed ProjectModelDAO#queryProjectModels(... while similar methods
ArtifactDAO#queryArtifacts(..)
RepositoryProblemDAO#queryRepositoryProblems(..)

do not.

Cheers,
Rahul


On 12/1/2008 4:13 PM, Brett Porter wrote:
> Hi,
>
> Just a short note - in line with the previous discussion we've had 
> about decoupling the database such that Archiva will run without it 
> (but can use it for additional stats, etc through a plugin), and 
> setting up an extensible metadata format, I've continued the work 
> under MRM-1025.
>
> See: http://svn.apache.org/viewvc/archiva/branches/MRM-1025/
> and: http://cwiki.apache.org/confluence/display/ARCHIVA/Metadata+storage
>
> Any comments, questions, volunteers? :)
>
> - Brett
>
> -- 
> Brett Porter
> brett@apache.org
> http://blogs.exist.com/bporter/
>
>


Re: progress on database decoupling

Posted by Joakim Erdfelt <jo...@gmail.com>.
One thing I think Brett failed to mention is that this decoupling is
just a step towards having the database as an optional component via
the plugin system being worked on by James.

The database is just moving from being a core component to being an
optional component.
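
A rough sketch of what "optional component" could look like in code, purely
for illustration. None of these names (ArtifactListener, RepositoryScanner,
StatsPlugin) come from Archiva or from the MRM-1025 branch; they are
assumptions made up for this example. The idea shown is only that the core
scan works with no listeners registered, and a database-backed statistics
plugin is one listener that may or may not be present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OptionalDatabaseDemo {

    /** Hypothetical extension point: anything that wants to hear about scanned artifacts. */
    interface ArtifactListener {
        void artifactScanned(String path);
    }

    /** Hypothetical core scanner: it has no knowledge of any database. */
    static final class RepositoryScanner {
        private final List<ArtifactListener> listeners = new ArrayList<>();

        void register(ArtifactListener listener) {
            listeners.add(listener);
        }

        /** Visits every path, notifies whatever plugins are registered, returns the count. */
        int scan(List<String> paths) {
            int visited = 0;
            for (String path : paths) {
                visited++;
                for (ArtifactListener listener : listeners) {
                    listener.artifactScanned(path);
                }
            }
            return visited;
        }
    }

    /** Hypothetical optional plugin standing in for the database: it only keeps a statistic. */
    static final class StatsPlugin implements ArtifactListener {
        private int jars;

        @Override
        public void artifactScanned(String path) {
            if (path.endsWith(".jar")) {
                jars++;
            }
        }

        int jarCount() {
            return jars;
        }
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("g/a/1.0/a-1.0.jar", "g/a/1.0/a-1.0.pom");

        // Core only: runs without any plugin.
        RepositoryScanner core = new RepositoryScanner();
        System.out.println("scanned without plugin: " + core.scan(paths));

        // Same core, with the optional plugin attached.
        StatsPlugin stats = new StatsPlugin();
        core.register(stats);
        core.scan(paths);
        System.out.println("jars counted by plugin: " + stats.jarCount());
    }
}
```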

- Joakim

On Mon, Dec 1, 2008 at 7:25 AM, Brett Porter <br...@apache.org> wrote:
>
> On 01/12/2008, at 7:17 PM, Brett Porter wrote:
>
>> There is one particular reference to a thread at the bottom of the wiki
>> page linked below, but the main reference thread would be the target
>> architecture one [1] (I'm not sure why Markmail has stopped detecting
>> threads though...).
>>
>> It is not so much to remove, but decouple so that it will run with basic
>> functionality without the database.
>>
>> That theme is probably scattered, so I can summarise:
>> - Derby takes quite a lot of memory which is a potential hindrance to
>> running your own instance
>> - the performance of populating the database has been poor on a large
>> repository
>
> just to attempt to quantify this, the preliminary results are (37938
> artifacts):
> - current scan: 10 Minutes 54 Seconds (update database, including generating
> checksums)
> - alternate scan: 35 seconds (not generating checksums), 2 Minutes 55
> Seconds (generating checksums)
>
> Not highly scientific - and once fleshed out the metadata writing might
> increase marginally - but I think the magnitude of difference is clear :)
>
> We can also get a decent percentage win just by deferring all of the bits
> that need to read the entire file contents (checksums, jarinfo) to a later
> time, and generate it all at once if possible.
>
>>
>> - harder to diagnose problems when the database is not in a consistent
>> state
>> - we don't particularly take advantage of the "robustness, reliability and
>> scalability" of the database as it effectively acts as a cache for the local
>> storage, doesn't handle concurrent servers, etc.
>>
>> More importantly, there are a number of things about the current design
>> (not necessarily the database) that are a barrier to contribution IMO. Some
>> parts are quite tightly coupled, and the database code is mixed in to the
>> model. There is a mix of using paths and artifact references which causes a
>> lot of back and forward conversions, and some Maven concepts are baked in
>> that don't make sense for other repository types. The over-reliance on
>> scanning, which is a hangover from the very first code I checked in, is
>> biting us worst of all, I think.
>>
>> I hope this all makes sense :)
>>
>> Cheers,
>> Brett
>>
>> [1] http://markmail.org/message/6o6byzjsccgzgkmr
>>
>>
>> On 01/12/2008, at 2:24 PM, Martin Cooper wrote:
>>
>>> Hey Brett,
>>>
>>> Do you have a handy link to the previous discussions you mention? I'm
>>> curious as to why someone would elect to give up the robustness,
>>> reliability
>>> and scalability of a database, since I would have counted those as assets
>>> rather than something to work to remove.
>>>
>>> Thanks!
>>>
>>> --
>>> Martin Cooper
>>>
>
> --
> Brett Porter
> brett@apache.org
> http://blogs.exist.com/bporter/
>
>

Re: progress on database decoupling

Posted by Brett Porter <br...@apache.org>.
On 01/12/2008, at 7:17 PM, Brett Porter wrote:

> There is one particular reference to a thread at the bottom of the  
> wiki page linked below, but the main reference thread would be the  
> target architecture one [1] (I'm not sure why Markmail has stopped  
> detecting threads though...).
>
> It is not so much to remove, but decouple so that it will run with  
> basic functionality without the database.
>
> That theme is probably scattered, so I can summarise:
> - Derby takes quite a lot of memory which is a potential hindrance  
> to running your own instance
> - the performance of populating the database has been poor on a  
> large repository

just to attempt to quantify this, the preliminary results are (37938  
artifacts):
- current scan: 10 Minutes 54 Seconds (update database, including  
generating checksums)
- alternate scan: 35 seconds (not generating checksums), 2 Minutes 55  
Seconds (generating checksums)

Not highly scientific - and once fleshed out the metadata writing  
might increase marginally - but I think the magnitude of difference is  
clear :)
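
To make the magnitude concrete, the arithmetic on the timings above can be
sketched as follows. The class and method names are made up for this
illustration; only the three timings come from the results quoted above.
The like-for-like comparison (both generating checksums) works out to
roughly 3.7x, and the comparison against the scan that skips checksums to
roughly 18.7x, though that second figure compares different amounts of work.

```java
final class ScanSpeedup {

    /** Converts a "M minutes S seconds" timing to whole seconds. */
    static int toSeconds(int minutes, int seconds) {
        return minutes * 60 + seconds;
    }

    /** How many times faster the candidate is than the baseline. */
    static double ratio(int baselineSeconds, int candidateSeconds) {
        return (double) baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        int current = toSeconds(10, 54);         // current scan, database update plus checksums
        int altWithChecksums = toSeconds(2, 55); // alternate scan, generating checksums
        int altNoChecksums = toSeconds(0, 35);   // alternate scan, not generating checksums

        System.out.println("current scan (s): " + current);
        System.out.println("speedup, checksums in both: "
                + ratio(current, altWithChecksums));
        System.out.println("speedup, alternate without checksums: "
                + ratio(current, altNoChecksums));
    }
}
```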

We can also get a decent percentage win just by deferring all of the  
bits that need to read the entire file contents (checksums, jarinfo)  
to a later time, and generate it all at once if possible.
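
A minimal sketch of that deferral idea, assuming nothing about how the
branch actually does it: the fast scan only records which files still need
their contents read, and a later pass reads each file once and feeds the
same buffer to every digest. The class DeferredChecksums and its methods
are hypothetical names for this example; MessageDigest and Files are the
standard JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DeferredChecksums {
    // Files seen by the fast scan whose contents have not been read yet.
    private final List<Path> pending = new ArrayList<>();

    /** Called during the scan: no file contents are read here. */
    void defer(Path file) {
        pending.add(file);
    }

    /** Called later: reads each deferred file exactly once for all algorithms. */
    Map<Path, Map<String, String>> processPending(String... algorithms) {
        Map<Path, Map<String, String>> results = new LinkedHashMap<>();
        for (Path file : pending) {
            try (InputStream in = Files.newInputStream(file)) {
                results.put(file, digest(in, algorithms));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        pending.clear();
        return results;
    }

    /** Single pass over the stream, updating every requested digest per buffer. */
    static Map<String, String> digest(InputStream in, String... algorithms) {
        try {
            MessageDigest[] digests = new MessageDigest[algorithms.length];
            for (int i = 0; i < algorithms.length; i++) {
                digests[i] = MessageDigest.getInstance(algorithms[i]);
            }
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (MessageDigest d : digests) {
                    d.update(buffer, 0, read);
                }
            }
            Map<String, String> hex = new LinkedHashMap<>();
            for (int i = 0; i < algorithms.length; i++) {
                hex.put(algorithms[i], toHex(digests[i].digest()));
            }
            return hex;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(digest(in, "MD5", "SHA-1"));
    }
}
```

Whether jarinfo could share the same pass depends on how it is extracted,
so it is left out of the sketch.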

>
> - harder to diagnose problems when the database is not in a  
> consistent state
> - we don't particularly take advantage of the "robustness,  
> reliability and scalability" of the database as it effectively acts  
> as a cache for the local storage, doesn't handle concurrent servers,  
> etc.
>
> More importantly, there are a number of things about the current  
> design (not necessarily the database) that are a barrier to  
> contribution IMO. Some parts are quite tightly coupled, and the  
> database code is mixed in to the model. There is a mix of using  
> paths and artifact references which causes a lot of back and forward  
> conversions, and some Maven concepts are baked in that don't make  
> sense for other repository types. The over-reliance on scanning,  
> which is a hangover from the very first code I checked in, is biting  
> us worst of all, I think.
>
> I hope this all makes sense :)
>
> Cheers,
> Brett
>
> [1] http://markmail.org/message/6o6byzjsccgzgkmr
>
>
> On 01/12/2008, at 2:24 PM, Martin Cooper wrote:
>
>> Hey Brett,
>>
>> Do you have a handy link to the previous discussions you mention? I'm
>> curious as to why someone would elect to give up the robustness,  
>> reliability
>> and scalability of a database, since I would have counted those as  
>> assets
>> rather than something to work to remove.
>>
>> Thanks!
>>
>> --
>> Martin Cooper
>>

--
Brett Porter
brett@apache.org
http://blogs.exist.com/bporter/


Re: progress on database decoupling

Posted by Brett Porter <br...@apache.org>.
There is one particular reference to a thread at the bottom of the  
wiki page linked below, but the main reference thread would be the  
target architecture one [1] (I'm not sure why Markmail has stopped  
detecting threads though...).

It is not so much to remove, but decouple so that it will run with  
basic functionality without the database.

That theme is probably scattered, so I can summarise:
- Derby takes quite a lot of memory which is a potential hindrance to  
running your own instance
- the performance of populating the database has been poor on a large  
repository
- harder to diagnose problems when the database is not in a consistent  
state
- we don't particularly take advantage of the "robustness, reliability  
and scalability" of the database as it effectively acts as a cache for  
the local storage, doesn't handle concurrent servers, etc.

More importantly, there are a number of things about the current  
design (not necessarily the database) that are a barrier to  
contribution IMO. Some parts are quite tightly coupled, and the  
database code is mixed in to the model. There is a mix of using paths  
and artifact references which causes a lot of back and forward  
conversions, and some Maven concepts are baked in that don't make  
sense for other repository types. The over-reliance on scanning, which  
is a hangover from the very first code I checked in, is biting us  
worst of all, I think.

I hope this all makes sense :)

Cheers,
Brett

[1] http://markmail.org/message/6o6byzjsccgzgkmr


On 01/12/2008, at 2:24 PM, Martin Cooper wrote:

> Hey Brett,
>
> Do you have a handy link to the previous discussions you mention? I'm
> curious as to why someone would elect to give up the robustness,  
> reliability
> and scalability of a database, since I would have counted those as  
> assets
> rather than something to work to remove.
>
> Thanks!
>
> --
> Martin Cooper
>
>
> On Sun, Nov 30, 2008 at 7:13 PM, Brett Porter <br...@apache.org>  
> wrote:
>
>> Hi,
>>
>> Just a short note - in line with the previous discussion we've had  
>> about
>> decoupling the database such that Archiva will run without it (but  
>> can use
>> it for additional stats, etc through a plugin), and setting up an  
>> extensible
>> metadata format, I've continued the work under MRM-1025.
>>
>> See: http://svn.apache.org/viewvc/archiva/branches/MRM-1025/
>> and: http://cwiki.apache.org/confluence/display/ARCHIVA/Metadata+storage
>>
>> Any comments, questions, volunteers? :)
>>
>> - Brett
>>
>> --
>> Brett Porter
>> brett@apache.org
>> http://blogs.exist.com/bporter/
>>
>>

--
Brett Porter
brett@apache.org
http://blogs.exist.com/bporter/


Re: progress on database decoupling

Posted by Martin Cooper <ma...@apache.org>.
Hey Brett,

Do you have a handy link to the previous discussions you mention? I'm
curious as to why someone would elect to give up the robustness, reliability
and scalability of a database, since I would have counted those as assets
rather than something to work to remove.

Thanks!

--
Martin Cooper


On Sun, Nov 30, 2008 at 7:13 PM, Brett Porter <br...@apache.org> wrote:

> Hi,
>
> Just a short note - in line with the previous discussion we've had about
> decoupling the database such that Archiva will run without it (but can use
> it for additional stats, etc through a plugin), and setting up an extensible
> metadata format, I've continued the work under MRM-1025.
>
> See: http://svn.apache.org/viewvc/archiva/branches/MRM-1025/
> and: http://cwiki.apache.org/confluence/display/ARCHIVA/Metadata+storage
>
> Any comments, questions, volunteers? :)
>
> - Brett
>
> --
> Brett Porter
> brett@apache.org
> http://blogs.exist.com/bporter/
>
>