Posted to user@nutch.apache.org by Amit Sela <am...@infolinks.com> on 2013/11/14 12:56:46 UTC

Get original URL from crawldb in case of redirect

Hi all,

I'm reading the crawldb as CrawledPage and I see the fetched URL, content,
etc.
In case of a redirect (I allow 10 redirects in nutch-site.xml), the
fetched URL is not the original URL the Fetcher requested, and I would like
to get that as well.

Does Nutch store it somewhere? I'm basically looking for a mapping between
the URLs it attempted to fetch and the URLs actually fetched.

Thanks,

Amit.

Re: Get original URL from crawldb in case of redirect

Posted by Amit Sela <am...@infolinks.com>.

I can answer this one myself: it would. _pst_ is still present in the
metadata with db.update.additions.allowed=false.
Thanks.


On Sat, Nov 16, 2013 at 4:52 PM, Amit Sela <am...@infolinks.com> wrote:

> Would _pst_ exist in metadata even if I'm crawling with:
> db.update.additions.allowed=false
>
> (I have a use case where I don't really crawl, but actually just fetch,
> and sometimes the list is too long for one execution so I have to
> re-execute on the same crawlDB but I don't want to crawl outside the seed
> list).
>
> Thanks.

Re: Get original URL from crawldb in case of redirect

Posted by Amit Sela <am...@infolinks.com>.

Would _pst_ exist in metadata even if I'm crawling with:
db.update.additions.allowed=false

(I have a use case where I don't really crawl, but actually just fetch, and
sometimes the list is too long for one execution so I have to re-execute on
the same crawlDB but I don't want to crawl outside the seed list).

Thanks.



Re: Get original URL from crawldb in case of redirect

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Amit,

Here is the answer for Nutch 1.7
(or are you using 2.x?):

Every URL, including the original URL of a redirect, is stored in
CrawlDb even with
  http.redirect.max = 10

For redirects, the target URL is stored in the CrawlDatum's
metadata under the key _pst_ (protocol status):

http://issues.apache.org/jira/browse/NUTCH      Version: 7
Status: 4 (db_redir_temp)
Fetch time: Sun Dec 15 20:38:53 CET 2013
Modified time: Fri Nov 15 20:38:53 CET 2013
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.00941915
Signature: null
Metadata:
        Content-Type=text/html
        _maxdepth_=1000
        _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
        _depth_=2

Sebastian
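
The record above suggests one way to build the mapping Amit asks for: dump
the CrawlDb as text (for example with "bin/nutch readdb <crawldb> -dump <dir>")
and read the redirect target out of each _pst_ line. The following is only a
rough sketch in Python, not part of Nutch. The function name redirect_map is
hypothetical, and the parsing assumes the dump layout matches the example in
this message exactly; other Nutch versions may print records differently.

```python
import re

# Matches a _pst_ line as printed in the example record above, e.g.
#   _pst_=temp_moved(13), lastModified=0: https://example.org/
# The layout is an assumption taken from that single example.
PST_RE = re.compile(
    r"_pst_=(?P<name>\w+)\((?P<code>\d+)\), lastModified=\d+(?:: (?P<target>\S+))?"
)

# A record is assumed to start with "<url><whitespace>Version: <n>".
HEAD_RE = re.compile(r"(\S+)\s+Version: \d+\s*$")


def redirect_map(dump_text):
    """Return {original_url: target_url} for records whose _pst_ is a redirect."""
    mapping = {}
    current_url = None
    for line in dump_text.splitlines():
        head = HEAD_RE.match(line)
        if head:
            current_url = head.group(1)
            continue
        pst = PST_RE.search(line)
        if not pst or current_url is None:
            continue
        # "temp_moved" appears in the example; treating every status name
        # that ends in "moved" as a redirect is an assumption.
        if pst.group("name").endswith("moved") and pst.group("target"):
            mapping[current_url] = pst.group("target")
    return mapping


if __name__ == "__main__":
    sample = (
        "http://issues.apache.org/jira/browse/NUTCH      Version: 7\n"
        "Status: 4 (db_redir_temp)\n"
        "Metadata:\n"
        "        _pst_=temp_moved(13), lastModified=0: "
        "https://issues.apache.org/jira/browse/NUTCH\n"
        "        _depth_=2\n"
    )
    for source, target in redirect_map(sample).items():
        print(source, "->", target)
```

Parsing the text dump keeps the sketch free of Hadoop and Nutch dependencies.
Note that a mail client may wrap a long _pst_ line, so the target can appear
on the following line in a quoted copy; the sketch expects it on the same line.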
