Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/12/13 18:17:25 UTC

[Fwd: Crawler submits forms?]

FYI

This has been fixed in the mapred branch, but that patch is not in 
0.7.1.  This alone might be a reason to make a 0.7.2 release.

Doug

-------- Original Message --------
Subject: Crawler submits forms?
Date: Tue, 13 Dec 2005 16:57:34 -0000
From: Andy Read <An...@Azurite.co.uk>
Reply-To: nutch-agent@lucene.apache.org
Organization: Azurite Systems Ltd.
To: <nu...@lucene.apache.org>

Hi,

I'm using Nutch to create a site search facility for a couple of sites.

I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
users are being registered on my site at the exact times the cron job runs
the crawl tool to re-index the site.  This means that the crawler is now
submitting a post request from the registration form!  Is this a new
'feature' of 0.7 or 0.7.1?  I can't find any mention in changes.txt and I
can't find any config option referring to it.  Surely the crawler should
never submit form input?

Any help appreciated.

Thanks,

Andy Read

www.azurite.co.uk
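
The behaviour reported above is what you would expect if a crawler treats a form's action URL as an ordinary outlink and fetches it. The following is a minimal sketch of that idea, not Nutch's actual code (Nutch is written in Java, and the real patch is not reproduced here). The class name OutlinkExtractor and the function extract_outlinks are hypothetical names chosen for illustration. The sketch assumes that one reasonable policy is to follow plain links and GET forms while skipping any form whose method is POST.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkExtractor(HTMLParser):
    """Hypothetical extractor: collects outlinks but never a POST form's action."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.outlinks.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form" and attrs.get("action"):
            # A form without a method attribute defaults to GET in HTML.
            method = (attrs.get("method") or "get").lower()
            if method == "get":
                self.outlinks.append(urljoin(self.base_url, attrs["action"]))
            # Any other method (notably POST) is skipped: fetching it
            # could change state on the server, e.g. register a user.


def extract_outlinks(html, base_url):
    """Return the absolute URLs a crawler following this policy would fetch."""
    parser = OutlinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

With this policy, a registration form declared with method="post" contributes no outlink, so a scheduled crawl would not trigger registrations.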




Re: [Fwd: Crawler submits forms?]

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Doug Cutting wrote:
> Andrzej Bialecki wrote:
> 
>> Please also don't forget that the trunk/ will soon be invaded by the 
>> code from mapred, I guess some time around the middle of January (Doug?) 
> 
> 
> Thinking about this more, perhaps we should do it sooner.  There's 
> already a branch for 0.7.x releases, so what point is there in not 
> merging mapred to trunk now?  We'd have fewer branches to maintain, and 
> start getting nightly builds of mapred.  Folks who require 0.7.x 
> compatibility can continue to use (and patch) the 0.7.x branch.  
> Objections?
> 
> Doug
> 
+1. Looking at the questions on mailing lists I do not think many people 
use trunk now.

Piotr

Re: mapred merge to trunk

Posted by Zaheed Haque <za...@gmail.com>.
+++3 (However I have no voting rights :-)

Please do!

Cheers

On 12/15/05, Doug Cutting <cu...@nutch.org> wrote:
> Sami Siren wrote:
> > +1. I think this is a good time to merge, now that mapred is fully usable.
>
> Barring objections, I will do this tomorrow morning, Pacific time.
>
> Doug
>


--
Best Regards
Zaheed Haque
Phone : +46 735 000006
E.mail: zaheed.haque@gmail.com

Re: mapred merge to trunk

Posted by Doug Cutting <cu...@nutch.org>.
Doug Cutting wrote:
> Barring objections, I will do this tomorrow morning, Pacific time.

The mapred branch has now been merged to trunk.

Use the following command to switch your mapred working copies to trunk:

svn switch https://svn.apache.org/repos/asf/lucene/nutch/trunk

Doug

mapred merge to trunk

Posted by Doug Cutting <cu...@nutch.org>.
Sami Siren wrote:
> +1. I think this is a good time to merge, now that mapred is fully usable.

Barring objections, I will do this tomorrow morning, Pacific time.

Doug

Re: [Fwd: Crawler submits forms?]

Posted by Sami Siren <s....@sonera.inet.fi>.
Doug Cutting wrote:
> Andrzej Bialecki wrote:
> 
>> Please also don't forget that the trunk/ will soon be invaded by the 
>> code from mapred, I guess some time around the middle of January (Doug?) 
> 
> 
> Thinking about this more, perhaps we should do it sooner.  There's 
> already a branch for 0.7.x releases, so what point is there in not 
> merging mapred to trunk now?  We'd have fewer branches to maintain, and 
> start getting nightly builds of mapred.  Folks who require 0.7.x 
> compatibility can continue to use (and patch) the 0.7.x branch.  
> Objections?
> 
> Doug
> 
+1. I think this is a good time to merge, now that mapred is fully usable.

--
  Sami Siren



Re: [Fwd: Crawler submits forms?]

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Yes, we just need to make sure that all important bits from trunk are on 
> the 0.7 branch, before we start.

I will sync mapred with the trunk prior to the merge, so we should still 
be able to get anything we need after mapred is merged back to trunk.

BTW, we're pretty closely following the recommendations in:

http://svnbook.red-bean.com/en/1.1/ch04s04.html#svn-ch-4-sect-4.4

The mapred branch is a 'feature' branch.  At the end of this section 
they describe how to merge a feature branch back into the trunk.

Doug

Re: [Fwd: Crawler submits forms?]

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> I agree. I just thought that we would prepare the release based on the 
>> code in trunk/, and in that case we would want to hold off on the 
>> merge until after the release.
>
>
> My definition of trunk is that it should be where the majority of 
> development happens.  It is what we should build nightly, etc.
>
> Major versions should be branched from trunk, and point releases 
> created as tags from the version branches.
>
> A development branch (e.g., mapred) should be used when a few 
> developers need to make radical changes and do not want to disrupt 
> other developers.
>
> So if most developers are now comfortable working on mapred, then we 
> no longer need to keep it in a branch.  And we already have a version 
> branch for 0.7, so we don't need to reserve trunk for that.
>
> Does this analysis sound right?


Yes, we just need to make sure that all important bits from trunk are on 
the 0.7 branch, before we start.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Fwd: Crawler submits forms?]

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I agree. I just thought that we would prepare the release based on the 
> code in trunk/, and in that case we would want to hold off on the merge 
> until after the release.

My definition of trunk is that it should be where the majority of 
development happens.  It is what we should build nightly, etc.

Major versions should be branched from trunk, and point releases created 
as tags from the version branches.

A development branch (e.g., mapred) should be used when a few developers 
need to make radical changes and do not want to disrupt other developers.

So if most developers are now comfortable working on mapred, then we no 
longer need to keep it in a branch.  And we already have a version 
branch for 0.7, so we don't need to reserve trunk for that.

Does this analysis sound right?

Doug

Re: [Fwd: Crawler submits forms?]

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Please also don't forget that the trunk/ will soon be invaded by the 
>> code from mapred, I guess some time around the middle of January (Doug?) 
>
>
> Thinking about this more, perhaps we should do it sooner.  There's 
> already a branch for 0.7.x releases, so what point is there in not 
> merging mapred to trunk now?  We'd have fewer branches to maintain, 
> and start getting nightly builds of mapred.  Folks who require 0.7.x 
> compatibility can continue to use (and patch) the 0.7.x branch.  
> Objections?
>
> Doug


I agree. I just thought that we would prepare the release based on the 
code in trunk/, and in that case we would want to hold off on the merge 
until after the release.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Fwd: Crawler submits forms?]

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Please also don't forget that the trunk/ will soon be invaded by the 
> code from mapred, I guess some time around the middle of January (Doug?) 

Thinking about this more, perhaps we should do it sooner.  There's 
already a branch for 0.7.x releases, so what point is there in not 
merging mapred to trunk now?  We'd have fewer branches to maintain, and 
start getting nightly builds of mapred.  Folks who require 0.7.x 
compatibility can continue to use (and patch) the 0.7.x branch.  Objections?

Doug

Re: [Fwd: Crawler submits forms?]

Posted by Piotr Kosiorowski <pk...@gmail.com>.
+1. I wanted to suggest exactly this approach, but we should take care
not to introduce new features without a serious reason (especially
backward-incompatible ones).
Piotr

On 12/14/05, Jérôme Charron <je...@gmail.com> wrote:
>
> > What do people think about collecting a list of issues and holding a
> > voting round?
>
> +1
>
>

Re: [Fwd: Crawler submits forms?]

Posted by Jérôme Charron <je...@gmail.com>.
> What do people think about collecting a list of issues and holding a
> voting round?

+1

Re: [Fwd: Crawler submits forms?]

Posted by Stefan Groschupf <sg...@media-style.com>.
>>
>> http://issues.apache.org/jira/browse/NUTCH-125
>>
>
> On its way ... ;-) I'll add it during this week.

There are some more very small issues, and there are also some patches
from the community.
What do people think about collecting a list of issues and holding a
voting round?

Stefan 
  

Re: [Fwd: Crawler submits forms?]

Posted by Andrzej Bialecki <ab...@getopt.org>.
Zaheed Haque wrote:

>what about the following:
>
>http://issues.apache.org/jira/browse/NUTCH-125
>  
>

On its way ... ;-) I'll add it during this week.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Fwd: Crawler submits forms?]

Posted by Zaheed Haque <za...@gmail.com>.
what about the following:

http://issues.apache.org/jira/browse/NUTCH-125

Cheers

On 12/13/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jérôme Charron wrote:
>
> >+1 for a 0.7.2 release.
> >
> >
>
> +1.
>
> Things are going well on the mapred branch, all basic tools are almost
> in place, so after this release we will probably start merging... so,
> this looks like the last release of the 0.7.x line (from the code in
> trunk/ - I'm sure there will be maintenance releases afterwards).
>
> >I think we can wait for the enhancement proposed by Chris today: Adding an
> >alias in parse-plugin.xml file and use a content-type/extension-id mapping
> >instead of content-type/plugin-id.
> >
> >
>
> IMHO, this needs to be really well tested before going into a release
> ... possibilities for confusion are great.
>
> >For further improvements, the new mime-type repository based on freedesktop
> >mime-type will be needed.
> >I cannot reasonably include this in 0.7.2, but I think it will be in trunk
> >by the end of the year.
> >
> >
> >
>
> Please also don't forget that the trunk/ will soon be invaded by the
> code from mapred, I guess some time around the middle of January (Doug?) ...
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


--
Best Regards
Zaheed Haque
Phone : +46 735 000006
E.mail: zaheed.haque@gmail.com

Re: [Fwd: Crawler submits forms?]

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

>+1 for a 0.7.2 release.
>  
>

+1.

Things are going well on the mapred branch, all basic tools are almost 
in place, so after this release we will probably start merging... so, 
this looks like the last release of the 0.7.x line (from the code in 
trunk/ - I'm sure there will be maintenance releases afterwards).

>I think we can wait for the enhancement proposed by Chris today: Adding an
>alias in parse-plugin.xml file and use a content-type/extension-id mapping
>instead of content-type/plugin-id.
>  
>

IMHO, this needs to be really well tested before going into a release 
... possibilities for confusion are great.

>For further improvements, the new mime-type repository based on freedesktop
>mime-type will be needed.
>I cannot reasonably include this in 0.7.2, but I think it will be in trunk
>by the end of the year.
>
>  
>

Please also don't forget that the trunk/ will soon be invaded by the 
code from mapred, I guess some time around the middle of January (Doug?) ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [Fwd: Crawler submits forms?]

Posted by Jérôme Charron <je...@gmail.com>.
+1 for a 0.7.2 release.
Here are the issues/revisions I can merge to 0.7 branch.
These changes mainly concern the parser-factory changes (NUTCH-88)

http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev
http://svn.apache.org/viewcvs.cgi?rev=355809&view=rev
http://svn.apache.org/viewcvs.cgi?rev=354398&view=rev
http://svn.apache.org/viewcvs.cgi?rev=326889&view=rev
http://svn.apache.org/viewcvs.cgi?rev=321250&view=rev
http://svn.apache.org/viewcvs.cgi?rev=321231&view=rev
http://svn.apache.org/viewcvs.cgi?rev=306808&view=rev
http://svn.apache.org/viewcvs.cgi?rev=293370&view=rev
http://svn.apache.org/viewcvs.cgi?rev=292865&view=rev
http://svn.apache.org/viewcvs.cgi?rev=292035&view=rev

 <pk...@gmail.com>
Piotr, what about the Italian translation?
0.7.2 could be a good candidate for a commit, no?

>> This has been fixed in the mapred branch, but that patch is not in
>> 0.7.1.  This alone might be a reason to make a 0.7.2 release.

http://svn.apache.org/viewcvs.cgi?view=rev&rev=348533

> I would be happy to see more parser-selection problems fixed, but it
> looks like Jerome is already working hard on them, so maybe we can
> wait for that.

I think we can wait for the enhancement proposed by Chris today: Adding an
alias in parse-plugin.xml file and use a content-type/extension-id mapping
instead of content-type/plugin-id.
For further improvements, the new mime-type repository based on freedesktop
mime-type will be needed.
I cannot reasonably include this in 0.7.2, but I think it will be in trunk
by the end of the year.
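
One possible reading of the proposed content-type/extension-id mapping is a two-step lookup: a content type maps to an ordered list of short alias names, and each alias maps to an extension id. The sketch below illustrates only that reading. Every name and value in it (ALIASES, CONTENT_TYPE_TO_ALIASES, resolve_extension_ids, and the example ids) is hypothetical and is not taken from parse-plugin.xml or from any Nutch code.

```python
# Hypothetical data: alias name -> extension id (not from any real config).
ALIASES = {
    "html": "example.parse.HtmlParser",
    "text": "example.parse.TextParser",
}

# Hypothetical data: content type -> ordered list of alias names to try.
CONTENT_TYPE_TO_ALIASES = {
    "text/html": ["html", "text"],
    "text/plain": ["text"],
}


def resolve_extension_ids(content_type, default_alias="text"):
    """Return the extension ids to try, in order, for a content type.

    Unknown content types fall back to the default alias; alias names
    with no entry in ALIASES are silently skipped.
    """
    names = CONTENT_TYPE_TO_ALIASES.get(content_type, [default_alias])
    return [ALIASES[name] for name in names if name in ALIASES]
```

The indirection through aliases means the content-type table never mentions a plugin id directly, which is one way to decouple the mapping from how extensions are packaged into plugins.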

What reasonable target date can we plan for 0.7.2?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [Fwd: Crawler submits forms?]

Posted by Piotr Kosiorowski <pk...@gmail.com>.
If we are going to make a 0.7.2 release, I would like to commit
a patch for http://issues.apache.org/jira/browse/NUTCH-112
and probably patches for some build problems people are reporting (missing
src folder in the nutch-extension plugin).
I will look at them in the next few days.
Regards
Piotr
Stefan Groschupf wrote:
>> This has been fixed in the mapred branch, but that patch is not in  
>> 0.7.1.  This alone might be a reason to make a 0.7.2 release.
> 
> 
> Maybe we can also get some more parser-selection issues fixed in the  
> next few days and get them into a 0.7.2 release.
> I would be happy to see more parser-selection problems fixed, but it  
> looks like Jerome is already working hard on them, so maybe we can 
> wait for that.
> 
> Stefan


Re: [Fwd: Crawler submits forms?]

Posted by Stefan Groschupf <sg...@media-style.com>.
> This has been fixed in the mapred branch, but that patch is not in  
> 0.7.1.  This alone might be a reason to make a 0.7.2 release.

Maybe we can also get some more parser-selection issues fixed in the  
next few days and get them into a 0.7.2 release.
I would be happy to see more parser-selection problems fixed, but it  
looks like Jerome is already working hard on them, so maybe we can  
wait for that.

Stefan