Posted to dev@any23.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/01/13 14:16:33 UTC

[DISCUSS] Questions on Basic-Crawler Module

Hi Guys,

OK, further to my ridiculous question regarding where the module actually
is, I would like to pose some more relevant thoughts.

A while ago I opened NUTCH-1129 [1], based entirely on the suggestion which
was included within the Incubator proposal for a Nutch Any23 plugin. As you
know, crawling in the basic-crawler plugin is currently done via crawler4j.
At Apache we are great believers in eating our own dog food, so if I were
building the Nutch implementation, my proposal would be to remove the
dependency on crawler4j and use Nutch interfaces and functionality instead.
This leads on to my questions:

1) Should the basic-crawler plugin be kept within Any23? My own thoughts
are that it provides a really nice and easy way to test out Any23
functionality. However, should 'crawling' functionality be part of a project
which describes itself as "a library, a web service and a command line tool
that extracts structured data in RDF format from a variety of Web
documents."?
2) The knock-on effect of removing this module and porting it directly to
Nutch would be that to test out Any23 libraries within a crawler you would
need a working knowledge of Nutch... this could put up barriers to
adoption...
3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
any23-core library from the Apache repo and use it. I'm thinking of
deduplicating as much code as possible between projects... Any ideas?

Thanks

[1] https://issues.apache.org/jira/browse/NUTCH-1129

-- 
*Lewis*

Re: [DISCUSS] Questions on Basic-Crawler Module

Posted by Michele Mostarda <mi...@gmail.com>.

On 14 January 2012 17:35, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Michele,
>
> I was thinking about replying to my original thread with some of the points
> you make, as I completely agree with your logic. Simone also mentioned the
> importance of keeping the basic-crawler as a plugin, and I agree with this
> as well.
>

That's great!


> Once we get the Any23 packages changed to o.a.any23 rather than
> o.deri.any23, which will allow us to push it to Apache Nexus, I'll begin
> work on the Nutch-Any23 plugin. We'll take it from there.
>

Really good, I will start on ANY23-21 right now.

>
> Thanks for getting back to me with your thoughts.
>

You're welcome.

The best.

Mic





-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com

Re: [DISCUSS] Questions on Basic-Crawler Module

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Michele,

I was thinking about replying to my original thread with some of the points
you make, as I completely agree with your logic. Simone also mentioned the
importance of keeping the basic-crawler as a plugin, and I agree with this
as well.

Once we get the Any23 packages changed to o.a.any23 rather than
o.deri.any23, which will allow us to push it to Apache Nexus, I'll begin
work on the Nutch-Any23 plugin. We'll take it from there.

Thanks for getting back to me with your thoughts.




-- 
*Lewis*

Re: [DISCUSS] Questions on Basic-Crawler Module

Posted by Michele Mostarda <mi...@gmail.com>.

On 13 January 2012 14:21, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Further to this, the Basic crawler plugin took some 4 mins to download
> dependencies, install and test...
>
> That seems like a lot of overhead for a plugin which is not even mentioned
> in the project description, considering the overall build took some 8 mins
> locally.
>

The Crawler plugin was added in milestone 0.7.0; the documentation has not
yet been written.

Mic





-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com

Re: [DISCUSS] Questions on Basic-Crawler Module

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Further to this, the Basic crawler plugin took some 4 mins to download
dependencies, install and test...

That seems like a lot of overhead for a plugin which is not even mentioned
in the project description, considering the overall build took some 8 mins
locally.

...



-- 
*Lewis*

Re: [DISCUSS] Questions on Basic-Crawler Module

Posted by Michele Mostarda <mi...@gmail.com>.

On 13 January 2012 14:16, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Guys,
>
> OK, further to my ridiculous question regarding where the module actually
> is, I would like to pose some more relevant thoughts.
>
> A while ago I opened NUTCH-1129 [1], based entirely on the suggestion which
> was included within the Incubator proposal for a Nutch Any23 plugin. As you
> know, crawling in the basic-crawler plugin is currently done via crawler4j.
> At Apache we are great believers in eating our own dog food, so if I were
> building the Nutch implementation, my proposal would be to remove the
> dependency on crawler4j and use Nutch interfaces and functionality instead.
> This leads on to my questions:
>
> 1) Should the basic-crawler plugin be kept within Any23? My own thoughts
> are that it provides a really nice and easy way to test out Any23
> functionality. However, should 'crawling' functionality be part of a project
> which describes itself as "a library, a web service and a command line tool
> that extracts structured data in RDF format from a variety of Web
> documents."?
> 2) The knock-on effect of removing this module and porting it directly to
> Nutch would be that to test out Any23 libraries within a crawler you would
> need a working knowledge of Nutch... this could put up barriers to
> adoption...
> 3) I'm assuming that a Nutch plugin would simply use Ivy to pull the
> any23-core library from the Apache repo and use it. I'm thinking of
> deduplicating as much code as possible between projects... Any ideas?
>
>
Trust me Lewis, the ability to crawl the semantic content of a site with a
single command is priceless; a lot of users asked us to add crawler
functionality to Any23.

However, the crawling functionality requires specific (and immature)
dependencies, which is why it has been implemented as a plugin.

I didn't like crawler4j. It required some dirty workarounds to be used in
the plugin, but it was the only library providing exactly what we needed
for the purpose of the Crawl CLI.

I completely agree with the idea of replacing crawler4j with some ASF
alternative, but on the condition that it remains easy to use as a CLI.

The best.

Mic


> Thanks
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1129
>
> --
> *Lewis*
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com