
Posted to user@nutch.apache.org by A Laxmi <a....@gmail.com> on 2013/07/31 16:55:45 UTC

Nutch 1.6 - sequence in which crawler works its way to a URL

Hello,

For example, I have a single *seed* URL, say "http://nutch.apache.org/", and
I crawl it "n" times. At the end of the crawl, I have 1220 new URLs
generated/fetched/updated from that single seed URL. Looking at these 1220
new URLs, I would like to know how a particular site, e.g.
"www.abc/xy.com", was crawled. A better question would be: in what
sequence did the crawler work its way to a particular URL, "www.abc/xy.com"?

Thanks for your help!

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by A Laxmi <a....@gmail.com>.

Thanks Talat! I am using Nutch 1.6. Is Hive good for Nutch 1.6?


On Thu, Aug 1, 2013 at 8:48 AM, Talat UYARER <ta...@agmlab.com> wrote:

> Hi,
>
> I had the same problem and solved it with Hive: I mapped the HBase table
> to Hive and then wrote a small query. If you use Hive, I can help you.
> But your problem is a url-validation plugin problem. You should add the
> plugin in your nutch-site.xml; it does not come by default.

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by Talat UYARER <ta...@agmlab.com>.

Hi,

I had the same problem and solved it with Hive: I mapped the HBase table
to Hive and then wrote a small query. If you use Hive, I can help you.
But your problem is a url-validation plugin problem. You should add the
plugin in your nutch-site.xml; it does not come by default.



On 01-08-2013 13:57, A Laxmi wrote:
> Is there any way to find an *inlink* of a crawled site?

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by A Laxmi <a....@gmail.com>.

Julien - Sure. Thanks for your help! Your response at least gave me a
direction - linkdb.


On Thu, Aug 1, 2013 at 9:40 AM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:

> What about Googling / reading the WIKI / doing a bit of research yourself
> before asking questions on the mailing list?

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by Julien Nioche <li...@gmail.com>.

What about Googling / reading the WIKI / doing a bit of research yourself
before asking questions on the mailing list?

On 1 August 2013 14:34, A Laxmi <a....@gmail.com> wrote:

> Hi Julien. Thanks for the suggestion! Could you please provide a reference
> link on how to use linkdb?



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by A Laxmi <a....@gmail.com>.

Hi Julien. Thanks for the suggestion! Could you please provide a reference
link on how to use linkdb?


On Thu, Aug 1, 2013 at 9:28 AM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:

> Why don't you create a linkdb then read it with the nutch readlinkdb
> command?

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by Julien Nioche <li...@gmail.com>.

Why don't you create a linkdb then read it with the nutch readlinkdb
command?
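
As a rough sketch of how this suggestion could be followed up (not a definitive recipe): in Nutch 1.x the linkdb is normally built with "bin/nutch invertlinks crawl/linkdb -dir crawl/segments" and dumped to text with "bin/nutch readlinkdb crawl/linkdb -dump linkdump", where the crawl/ and linkdump paths are assumed here for illustration. The Python below then walks the dumped inlinks backwards from a target URL to the seed. The dump layout it parses (a "<url> TAB Inlinks:" line followed by " fromUrl: ... anchor: ..." lines) and the function names are assumptions, so check them against your own dump. Depending on configuration (for instance db.ignore.internal.links), same-host links may be missing from the linkdb.

```python
from collections import deque


def parse_linkdb_dump(text):
    """Parse a readlinkdb text dump into {target_url: [source_url, ...]}.

    Assumed layout (verify against your own dump):
        <target>\tInlinks:
         fromUrl: <source> anchor: <anchor text>
    """
    inlinks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("fromUrl:"):
            if current is not None:
                source = line[len("fromUrl:"):].split(" anchor:", 1)[0].strip()
                inlinks[current].append(source)
        elif line.endswith("Inlinks:"):
            current = line[: -len("Inlinks:")].strip()
            inlinks.setdefault(current, [])
    return inlinks


def trace_back(inlinks, target, seed):
    """Breadth-first search over inlinks, from target back towards seed.

    Returns one shortest chain [seed, ..., target], or None if the
    linkdb records no chain between the two URLs.
    """
    parent = {target: None}  # maps a page to the page it links to on the chain
    queue = deque([target])
    while queue:
        url = queue.popleft()
        if url == seed:
            chain = []
            while url is not None:
                chain.append(url)
                url = parent[url]
            return chain
        for source in inlinks.get(url, []):
            if source not in parent:
                parent[source] = url
                queue.append(source)
    return None


if __name__ == "__main__":
    # Hypothetical dump contents, for illustration only.
    sample = (
        "http://example.org/a\tInlinks:\n"
        " fromUrl: http://example.org/ anchor: A\n"
        "\n"
        "http://example.org/a/b\tInlinks:\n"
        " fromUrl: http://example.org/a anchor: B page\n"
    )
    links = parse_linkdb_dump(sample)
    print(trace_back(links, "http://example.org/a/b", "http://example.org/"))
```

The search only finds a chain that the linkdb actually recorded; it does not say in which crawl round each hop was fetched.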


On 1 August 2013 11:57, A Laxmi <a....@gmail.com> wrote:

> Is there any way to find an *inlink* of a crawled site?

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by A Laxmi <a....@gmail.com>.

Is there any way to find an *inlink* of a crawled site?


On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <a....@gmail.com> wrote:

> Thanks for your help, Ahme! I am interested in more than a timestamp.
> I would like to understand how a particular URL was crawled, or, put
> better, the sequence by which Nutch ended up with a particular link in
> its crawldb.
>
> My problem is that I found one URL in the crawled list with a horrible
> format, something like
> "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?".
> As you can see, this link contains a run of backslashes for some reason.
> I tried to reach that URL starting from the landing page
> "www.domainabc.com/level1/level2/", but I could not find any URL with
> such a bad format. So I want to know how Nutch reached that URL. Is
> there some page Nutch crawled that contains the link
> "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
> somewhere? In what sequence did Nutch crawl from the seed URL to reach
> such a URL? I hope I have made this clear. Please let me know if you
> have any questions. Any help is much appreciated.

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by A Laxmi <a....@gmail.com>.

Thanks for your help, Ahme! I am interested in more than a timestamp.
I would like to understand how a particular URL was crawled, or, put
better, the sequence by which Nutch ended up with a particular link in its
crawldb.

My problem is that I found one URL in the crawled list with a horrible
format, something like
"www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?".
As you can see, this link contains a run of backslashes for some reason. I
tried to reach that URL starting from the landing page
"www.domainabc.com/level1/level2/", but I could not find any URL with such
a bad format. So I want to know how Nutch reached that URL. Is there some
page Nutch crawled that contains the link
"www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
somewhere? In what sequence did Nutch crawl from the seed URL to reach
such a URL? I hope I have made this clear. Please let me know if you
have any questions. Any help is much appreciated.
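
For illustration only: a backslash in a URL path is percent-encoded as %5C, which is how mail clients and browsers usually display a link like the one above. The small Python sketch below shows that encoding and a simple check for spotting such URLs in a list of crawled URLs. The function names and the sample list are hypothetical, and this is not how Nutch itself validates URLs.

```python
from urllib.parse import quote


def percent_encode_path(path):
    """Percent-encode a URL path, leaving '/' and '=' untouched."""
    return quote(path, safe="/=")


def looks_malformed(url):
    """Flag URLs that contain a literal or percent-encoded backslash."""
    return "\\" in url or "%5c" in url.lower()


if __name__ == "__main__":
    # Twelve backslashes, as in the URL described in this message.
    bad_path = "/level1/level2/level3" + "\\" * 12 + "/level4_viewid=1"
    print(percent_encode_path(bad_path))

    # Hypothetical list of crawled URLs, e.g. taken from a crawldb dump.
    crawled = [
        "http://www.domainabc.com/level1/level2/",
        "http://www.domainabc.com" + bad_path,
    ]
    for url in crawled:
        if looks_malformed(url):
            print("suspicious:", url)
```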

On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <em...@agmlab.com> wrote:

> Hello,
>
> Does the timestamp give you what you need? There should be a timestamp
> indicating the time of the operation.

Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Posted by Ahme Emre Aladağ <em...@agmlab.com>.

Hello,

Does the timestamp give you what you need? There should be a timestamp indicating the time of the operation.




----- Original Message -----
From: "A Laxmi" <a....@gmail.com>
To: user@nutch.apache.org
Sent: Wednesday, 31 July 2013 17:55:45
Subject: Nutch 1.6 - sequence in which crawler works its way to a URL

Hello,

For example, I have a single *seed* URL, say "http://nutch.apache.org/", and
I crawl it "n" times. At the end of the crawl, I have 1220 new URLs
generated/fetched/updated from that single seed URL. Looking at these 1220
new URLs, I would like to know how a particular site, e.g.
"www.abc/xy.com", was crawled. A better question would be: in what
sequence did the crawler work its way to a particular URL, "www.abc/xy.com"?

Thanks for your help!