Posted to user@nutch.apache.org by Michael Gang <mi...@gmail.com> on 2013/01/08 13:15:49 UTC
problem with nutch2.1 and redirect
Hi all,
I have the following problem.
I injected this URL:
http://openurl.ebscohost.com/linksvc/linking.aspx?sid=a9h&volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
In Firefox, the URL is redirected to another page on the domain
http://web.ebscohost.com/ehost/detail?...
I want to get the content of the resulting page.
In Nutch I get:
bin/nutch readdb -url 'http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature' -content
key: http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
baseUrl: http://openurl.ebscohost.com/linksvc/linking.aspx?volume=394&date=19980827&spage=839&issn=0028-0836&stitle=&genre=article&issue=6696&title=Nature
status: 4 (status_redir_temp)
fetchInterval: 2592000
fetchTime: 1357644874578
prevFetchTime: 1357644821312
retries: 0
modifiedTime: 0
protocolStatus: TEMP_MOVED, args=[http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409]
parseStatus: (null)
title: null
score: 1.0
markers: {dist=0, _injmrk_=y, _ftcmrk_=1357644850-1310231024, _gnmrk_=1357644850-1310231024}
metadata _csh_ : ?\ufffd
metadata ___rdrdsc__ : y
contentType: text/html
content:start:
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="http://search.ebscohost.com/login.aspx?...
.">here</a>.</h2>
</body></html>
It looks like there is a problem with the redirect.
In nutch-default.xml I set db.ignore.internal.links and
db.ignore.external.links to false, and in conf/regex-urlfilter.txt
I commented out the rule
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
It still does not work.
What did I do wrong?
Which additional file should be changed?
Thanks,
David
Re: problem with nutch2.1 and redirect
Posted by Tejas Patil <te...@gmail.com>.
Hi Michael,
Add this to nutch-site.xml and try a fresh crawl. (Note that you also
need the configuration changes suggested by Sebastian.)
<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
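The limit described in that property can be read as the following rule. This is only a sketch of the wording in the description, not Nutch's actual code; the function name limit_outlinks is made up for illustration.

```python
def limit_outlinks(outlinks, max_outlinks):
    """Apply the limit the way the property description words it.

    A nonnegative limit keeps at most that many outlinks;
    a negative limit keeps all of them.
    """
    if max_outlinks >= 0:
        return list(outlinks[:max_outlinks])
    return list(outlinks)


# Hypothetical outlinks found on one page.
links = ["http://example.org/1", "http://example.org/2", "http://example.org/3"]

print(limit_outlinks(links, 0))        # []
print(len(limit_outlinks(links, -1)))  # 3
```

With a value of 0, no outlinks from a page would be processed under this reading.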
Thanks,
Tejas Patil
On Mon, Jan 14, 2013 at 11:49 PM, Michael Gang <mi...@gmail.com> wrote:
> Hi,
>
> Now I have a question.
> Let's say I want to fetch a list of URLs and follow redirects,
> but I don't want to fetch other outgoing URLs.
> How do I accomplish this with Nutch 2.1?
>
> Thanks,
> David
Re: problem with nutch2.1 and redirect
Posted by Michael Gang <mi...@gmail.com>.
Hi,
Now I have a question.
Let's say I want to fetch a list of URLs and follow redirects,
but I don't want to fetch other outgoing URLs.
How do I accomplish this with Nutch 2.1?
Thanks,
David
On Tue, Jan 8, 2013 at 10:49 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
Re: problem with nutch2.1 and redirect
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi David,
Nutch follows redirects. You should check the URL you are redirected to:
http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409
If it is not blocked
- by URL filters, or
- by db.ignore.external.links (because it is an external link),
the redirect URL is fetched in the next round (cycle).
In Nutch 1.x it is possible to follow redirects immediately
(see http.redirect.max), but this has one disadvantage:
there is no deduplication. Because multiple URLs (even hundreds)
may be redirected to a single document, a crawler should fetch
the redirect target only once.
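The deduplication point can be illustrated with a toy model. This is not Nutch code: the redirect map, the URLs, and both function names are invented for the example.

```python
# Hypothetical redirect map: three source URLs that all point to one target.
redirects = {
    "http://example.org/a": "http://example.org/doc",
    "http://example.org/b": "http://example.org/doc",
    "http://example.org/c": "http://example.org/doc",
}


def fetch_immediately(sources):
    """Follow each redirect right away: one target fetch per source URL."""
    return [redirects[url] for url in sources]


def fetch_next_round(sources):
    """Record targets first, then fetch each distinct target once."""
    pending = []
    for url in sources:
        target = redirects[url]
        if target not in pending:
            pending.append(target)
    return pending


sources = list(redirects)
print(len(fetch_immediately(sources)))  # 3
print(len(fetch_next_round(sources)))   # 1
```

In this model, following redirects immediately fetches the same document three times, while deferring them to the next round fetches it once.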
The property
db.ignore.external.links
and the regex URL filter rule
-[?*!@=]
apply to all kinds of links / URLs, including redirects.
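To see why that filter rule matters for this redirect, here is a small sketch. It assumes a simplified model of regex-urlfilter.txt (not Nutch's implementation): the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. The rule lists and the accepts function are illustrative names.

```python
import re

# Simplified rule lists: (sign, pattern). "-" rejects, "+" accepts.
DEFAULT_RULES = [("-", r"[?*!@=]"), ("+", r".")]
RELAXED_RULES = [("+", r".")]  # the -[?*!@=] rule commented out


def accepts(rules, url):
    """Return True if the first rule found in the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is rejected


redirect_target = (
    "http://search.ebscohost.com/login.aspx"
    "?direct=true&scope=site&db=a2h&AN=84164637&msid=943330409"
)

print(accepts(DEFAULT_RULES, redirect_target))  # False
print(accepts(RELAXED_RULES, redirect_target))  # True
```

Because the redirect target contains "?" and "=", the rule would reject it in this model unless the rule is commented out.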
So, with your configuration changes (nutch-site.xml would be a better place to make them),
redirects should be followed. Look for the redirect targets in the web table; they should be
there.
Sebastian