Posted to droids-dev@incubator.apache.org by Chapuis Bertil <bc...@agimem.com> on 2011/02/08 08:50:06 UTC

Local copies of droids

In previous emails and JIRA comments I saw several people mention the
fact that they have a local copy of droids which has evolved too much to be
merged back into the trunk. This is my case, and I think Paul Rogalinski is
in the same situation.

Since the patches have only been applied periodically to the trunk during
the last months, I'd love to know if someone else is in the same situation
and what kind of changes they made locally.

-- 
Bertil Chapuis
Agimem Sàrl
http://www.agimem.com

Re: Local copies of droids

Posted by Eugen Paraschiv <ha...@gmail.com>.
I'm also using a local branch which I'm now starting to integrate back into
the trunk. I'm mostly creating 0.0.2 issues, in the hope that 0.0.1 is going
to be released soon(ish). Some of the local branches described in this
thread sound very interesting, and it would be cool to see at least the
smaller bullet points committed back into the trunk, both to reduce the
diff between the trunk and these local branches and to make committing
larger improvements possible.


Re: Local copies of droids

Posted by "paul.vc" <pa...@paul.vc>.
Hey Otis,

I am starting a "big" crawl (~50M hosts) this week or next. I am sure
it will bring back some new bugs and issues to solve. Furthermore, there is
still the robots.txt part to be taken care of. I have been contracted by
another company to implement that crawler, and I also have permission to
contribute most of the work back (probably all but the content-extraction
part). So I see no real problem in releasing the sources after a minor
code review.

If you just need some specific pieces now, let's meet on IRC
freenode/#droids (I'm not monitoring the channel actively though - ping me
on ircnet/pulsar) or some IM (see below for my accounts). I'll be able to
pull stuff out and post it online as needed.

Regards,
Paul.

msn: pulsar@highteq.net · aim: pu1s4r · icq: 1177279 · skype: pulsar ·
yahoo: paulrogalinski · gtalk/XMPP: pulzar@gmail.com 



-- 
Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich -
Germany - mailto: paul.rogalinski@highteq.net - Phone: +49-179-3574356 -
msn: pulsar@highteq.net - aim: pu1s4r - icq: 1177279 - skype: pulsar

Re: Local copies of droids

Posted by Otis Gospodnetic <og...@yahoo.com>.
Hi,

Wow, juicy!
I have just 1 question: (when) can you contribute individual pieces of your 
great work? :)

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Local copies of droids

Posted by "paul.vc" <pa...@paul.vc>.
Hey Guys,

I've focused on a very specific web-crawl task when refactoring my copy of
droids:

- Tasks now have a context which can be used to share arbitrary data
- so do HTTP entities (needed to store headers for a longer period of
time)
- Various crawl-related metrics / counters exposed via JMX / MBeans
(+ Munin config)
- robots.txt / domain caching - currently there is a huge problem in the
trunk with that! Solutions provided but no patches... yet :/
- A bunch of new link filters to clean up a URL and prevent processing the
same page over and over again
- Visited-URL tree structure to keep the memory consumption down. I also
plan to use the same tree as a task queue, which should lower the memory
consumption significantly.
- Pluggable DNS resolvers
- Plenty of small bugfixes, some of them really minor. Others I tried to
report and provide solutions for.
- Handling of redirect codes (via meta and/or header) and exposing that
information to the extractors / writers etc.
- Improved encoding detection / handling
- Added TaskExecutionDeciders - simply filtering the links was not
sufficient in some rare cases.
- Simplification: removed Spring dependencies, threw away classes /
functionality not needed by me, nice and easy methods for spawning new
crawlers / droids.
- Thread pool for independent parallel execution of droids, limited by the
load a single node in a cluster can take (see
http://codewut.de/content/load-sensitive-threadpool-queue-java - a rough
sketch follows after this list)
- Managing new droids by polling new hosts to be crawled from an external
source
- Extended delay framework so delays can be computed based upon the
processing / response time of the last page/task
- Proxy support
- Plenty of tweaks to the http-client params to prevent/skip hung sockets,
slow responses (like 1200 baud)
- Mechanism to do a clean exit (shutdown / SIG hooks), finish all items in
the queue, close all writers properly. Alternatively a quick exit can be
triggered to flush remaining items from the queue.
- Stuff I have already forgotten :/
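
About the load-sensitive thread pool: the basic idea (see the codewut.de
link above) is a work queue that refuses new tasks once the node's load
average passes a threshold, so the executor backs off instead of drowning
the machine. A rough sketch using only java.lang.management - class and
parameter names are made up for illustration, this is not the code from my
branch:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.LinkedBlockingQueue;

// Work queue that reports itself "full" while the OS load average is
// above maxLoad, so a ThreadPoolExecutor backed by it stops accepting
// new tasks on a saturated node.
public class LoadSensitiveQueue<E> extends LinkedBlockingQueue<E> {
    private final OperatingSystemMXBean os =
        ManagementFactory.getOperatingSystemMXBean();
    private final double maxLoad;

    public LoadSensitiveQueue(int capacity, double maxLoad) {
        super(capacity);
        this.maxLoad = maxLoad;
    }

    @Override
    public boolean offer(E e) {
        // getSystemLoadAverage() returns -1 when the platform cannot
        // report it; fall back to the plain capacity check in that case.
        double load = os.getSystemLoadAverage();
        if (load >= 0 && load > maxLoad) {
            return false; // executor sees a "full" queue and backs off
        }
        return super.offer(e);
    }
}

Plugged into a ThreadPoolExecutor with a CallerRunsPolicy, a refused offer
makes the submitting thread run the task itself, which naturally throttles
how fast new crawl tasks get produced.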

Maybe I should also mention where I am going with this beaten-up version of
droids: I am building a pseudo-distributed crawler farm - pseudo-distributed
because there is no controlling server and no shared task queue. Each node
in my cluster runs multiple droids, each one crawling one host. Extracted
data is collected from all of the instances per node (not per droid) and fed
into HDFS. Each node has a thread pool which polls new crawl specs from a
master queue (in my case JDBC, although I am thinking about HBase or
Membase). A stripped-down sketch of that polling loop follows below.
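
The per-node polling loop itself is nothing fancy. Something like this -
table, column and class names are invented for illustration, my real schema
and error handling differ:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Polls the master queue for new crawl specs and hands each host to a
// free droid in the node's thread pool. Real code also needs row locking
// so that two nodes cannot claim the same spec.
public class CrawlSpecPoller {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
            "jdbc:mysql://master/crawler", "crawler", "secret");
        while (true) {
            PreparedStatement select = con.prepareStatement(
                "SELECT id, host FROM crawl_specs WHERE state = 'NEW' LIMIT 1");
            ResultSet rs = select.executeQuery();
            if (rs.next()) {
                long id = rs.getLong("id");
                String host = rs.getString("host");
                PreparedStatement claim = con.prepareStatement(
                    "UPDATE crawl_specs SET state = 'TAKEN' WHERE id = ?");
                claim.setLong(1, id);
                claim.executeUpdate();
                claim.close();
                // submit host to the droid thread pool here
                System.out.println("claimed " + host);
            } else {
                Thread.sleep(10000); // queue empty, back off
            }
            rs.close();
            select.close();
        }
    }
}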

So yes, I took a huge step away from the idea of implementing a generic
droids framework and focused instead on a very specific way to crawl the
web. So far I have tried my best to make droids fault tolerant (the internet
is b-r-o-k-e-n, you have no idea how bad it is!) and to produce helpful logs.

What's next for me? Probably rewriting the robots.txt client. The crawlers
themselves passed a first smaller crawl of ~20M pages with pleasing
results.

Fire away if you have questions.

Regards,
Paul.



-- 
Paul Rogalinski - Softwareentwicklung - Jamnitzerstr. 17 - 81543 Munich -
Germany - mailto: paul.rogalinski@highteq.net - Phone: +49-179-3574356 -
msn: pulsar@highteq.net - aim: pu1s4r - icq: 1177279 - skype: pulsar

Re: Local copies of droids

Posted by florent andré <fl...@4sengines.com>.
Hi Droids guys!

+1 also for release.

** I have in my bag:

- Interactive droid = doesn't need a list of links to start a crawl, it can
be driven link by link (implies a refactoring of a droids class whose name
I don't recall right now)

- SAX output: can pass a SAX consumer for output (not a clean
integration though, have to weigh the pros/cons against StAX)

- XML format to parametrize a droid and pass it a todo list [1]: this
implementation is tied to Lenya now, but can be easily extracted. It
could be a nice feature for a droids server and for communication
between droids entities.

++


[1]: here is an example file. This namespace can be included in another
one, so you get the result of the crawl included in your original file. A
small parsing sketch follows the example.
<?xml version="1.0" encoding="UTF-8"?>
<robot xmlns="http://droids.apache.org/droids/0.2">
  <!-- parametrize the droid -->
  <params>
    <!-- TODO : test this configuration -->
    <delay>10</delay>
    <!-- TODO : use the source resolver in code to get the file, and be
         able to use fallback -->
    <filters>
      <resource>fallback://lenya/modules/droidsTransformer/samples/regex-urlfilter-null.txt</resource>
    </filters>
  </params>

  <!-- indicate locations -->
  <locations>
    <location>http://www.zegoal.com/foot/france-ligue1/</location>
    <location>http://localhost:8080/lenya/index.html</location>
  </locations>

</robot>
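
For completeness, a minimal JAXP sketch that pulls the <location> elements
out of such a file (illustrative only - the Lenya-tied implementation does
quite a bit more, and "robot.xml" is just a placeholder file name):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Reads a robot file like the example above and prints the start locations.
public class RobotConfigReader {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // the file uses a default namespace
        Document doc = dbf.newDocumentBuilder().parse("robot.xml");
        NodeList locations = doc.getElementsByTagNameNS(
            "http://droids.apache.org/droids/0.2", "location");
        for (int i = 0; i < locations.getLength(); i++) {
            System.out.println(locations.item(i).getTextContent());
        }
    }
}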



Re: Local copies of droids

Posted by Chapuis Bertil <bc...@agimem.com>.
On 8 February 2011 15:17, Thorsten Scherler <sc...@gmail.com> wrote:

> On Tue, 2011-02-08 at 10:52 +0100, Chapuis Bertil wrote:
> > I agree. We have to release.
>
> So let us plan that in a new thread.
>
> >
> > The changes I'd like to contribute back are the following.
> >
> >    - TaskQueue replaced by java.util.Queue
>
> You committed that in /incubator/droids/branch/bchapuis, right?
>


Yes, I mainly created this branch to perform the changes we discussed
regarding the documentation and the reorganisation of the trunk (DROIDS-107).
I committed some changes to use the Maven site plugin, and I'm thinking about
translating the current xdoc files into APT files. The aim is to simplify
the creation of documentation.





-- 
Bertil Chapuis
Agimem Sàrl
http://www.agimem.com

Re: Local copies of droids

Posted by Thorsten Scherler <sc...@gmail.com>.
On Tue, 2011-02-08 at 10:52 +0100, Chapuis Bertil wrote:
> I agree. We have to release.

So let us plan that in a new thread.

> 
> The changes I'd like to contribute back are the following.
> 
>    - TaskQueue replaced by java.util.Queue

You committed that in /incubator/droids/branch/bchapuis, right?

Still need to review it in detail, but it looks good at first sight.

>    - Handling process reviewed.
>    - Factory pattern only used for Worker
>    - Extractors inherit from Handler => no need to parse the document twice

The three points above I second. I am actually using droids ATM in
combination with Cocoon 3, which completely changes the extractor/handler
pattern. So ATM I only use the queue, the filters and the Cocoon pipeline
parser for parsing/handling (I think you later call them extractors).

In the Cocoon pipeline parser I am using various transformers, like the
Solr one you provided, to inject the parse result into different
third-party servers (Solr, persistence, REST consumers, ...). The nice
thing about the approach is that I can keep the transformation logic from
"page of origin" to extracted data in an XSL stylesheet, since I use the
HTML generator in Cocoon. The data mapping is ATM quite specific to my
use case, but it could be generalized (for some transformers, like the
Solr one, better than for others).

>    - Entity renamed in Identifier
>    - ContentEntity renamed in Resource
>    - Crawler moved to droids-crawler
>    - Parser moved to droids-parser
>    - Walker moved to droids-walker
>    - The walker also uses an Extractor
>    - some minor changes

Sounds like it will not be too difficult to merge/switch to your branch.

salu2


-- 
Thorsten Scherler <thorsten.at.apache.org>
codeBusters S.L. - web based systems
<consulting, training and solutions>
http://www.codebusters.es/


Re: Local copies of droids

Posted by Chapuis Bertil <bc...@agimem.com>.
I agree. We have to release.

The changes I'd like to contribute back are the following.

   - TaskQueue replaced by java.util.Queue (see the sketch after this list)
   - Handling process reviewed.
   - Factory pattern only used for Worker
   - Extractors inherit from Handler => no need to parse the document twice
   - Entity renamed in Identifier
   - ContentEntity renamed in Resource
   - Crawler moved to droids-crawler
   - Parser moved to droids-parser
   - Walker moved to droids-walker
   - The walker also uses an Extractor
   - some minor changes
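
To make the first point concrete: the crawl loop can now be driven by any
java.util.Queue implementation. A toy sketch (Task and the loop body are
placeholders, not the actual classes from my branch):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueLoop {

    // Stand-in for the real task type.
    static final class Task {
        final String uri;
        Task(String uri) { this.uri = uri; }
    }

    public static void main(String[] args) {
        // Any Queue implementation works here.
        Queue<Task> queue = new ConcurrentLinkedQueue<Task>();
        queue.add(new Task("http://www.example.com/"));

        Task task;
        while ((task = queue.poll()) != null) {
            // a worker would fetch task.uri here, extract the links and
            // queue.add(...) the newly discovered tasks
            System.out.println("processing " + task.uri);
        }
    }
}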




Re: Local copies of droids

Posted by Thorsten Scherler <th...@apache.org>.

I am not sure, but I see it the way you describe.

IMO we should release what we have right now and then plan how to merge
all these different versions back into a new droids version.

IMO the next droids version should focus on ease of reuse and a droid
server which starts and monitors the different droids.

To start with:
* who has a version of droids which (s)he is interested in merging back
* what are the main differences between the fork and the "original"
* ...

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
codeBusters S.L. - web based systems
<consulting, training and solutions>
http://www.codebusters.es/