Posted to user@nutch.apache.org by Cool Coder <te...@yahoo.com> on 2007/11/27 23:20:49 UTC

How to read crawldb

Hello,
           I am just wondering how I can read the crawldb and get the content of each stored URL. I am not sure whether this is possible or not.
   
  - BR

       

Re: How to read crawldb

Posted by Cool Coder <te...@yahoo.com>.
Thanks for the information. I tried ./bin/nutch readlinkdb; however, I was not able to get all the links. I think I am missing something about the proper usage of the readlinkdb option.
  I tried with
   
  $ ./bin/nutch readlinkdb ./nutch-index/linkdb/
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out
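
  Based on that usage message, I am guessing the dump form would be something
  like the following (the output directory name here is only a guess on my part):

  $ ./bin/nutch readlinkdb ./nutch-index/linkdb -dump linkdb_dump

  which should write the whole link db out as plain text under linkdb_dump.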

  Let me tell you that nutch-index is the location of the Nutch index, and it has the following directories:
  --crawldb
  --index
  --indexes
  --linkdb
  --segments
   
  Can you tell me what usage pattern I should use to view all the links?
   
  - BR
   
  
Andrzej Bialecki <ab...@getopt.org> wrote:
  Cool Coder wrote:
> Hello, I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.

In Nutch 0.8 and later the page information and link information are
stored separately, in CrawlDb and LinkDb. You need to have the linkdb
(see the bin/nutch invertlinks command), and then you can use the
LinkDbReader class to retrieve this information. From the command line this
is bin/nutch readlinkdb.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



       

Re: How to read crawldb

Posted by Andrzej Bialecki <ab...@getopt.org>.
Cool Coder wrote:
> Hello, I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.

In Nutch 0.8 and later the page information and link information are
stored separately, in CrawlDb and LinkDb. You need to have the linkdb
(see the bin/nutch invertlinks command), and then you can use the
LinkDbReader class to retrieve this information. From the command line this
is bin/nutch readlinkdb.
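
For example, something along these lines should work (the paths here are just
placeholders for your own crawl layout):

  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

The -dump option writes the whole link db out as text under the given output
directory, and -url prints the inlinks recorded for a single URL.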


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: How to read crawldb

Posted by Cool Coder <te...@yahoo.com>.
Hi Jian,
           I saw your site and it really looks nice. Thanks for the information; I am still researching how to find the raw data for each link in Nutch. I am currently using JSpider, and I think it is also a good crawler. In case you decide to make your crawler open source, please let me know, and I would love to use it.
   
  - BR

jian chen <ch...@gmail.com> wrote:
  It is a bit convoluted at best.

I found out that the links and their metadata are stored in the crawldb
directory, and the actual raw HTTP content of the links is stored in the
different segments.

The crawldb and the segments are MapFiles or SequenceFiles, I think. So you
could use a MapFile.Reader or SequenceFile.Reader to read them and dump them
out in whatever format you like.

However, so far I haven't figured out how to associate the crawldb links
with their contents. For example, while looping through the crawldb links, I
want to find each link's raw HTTP content. But I don't know how to do it yet.

That said, it is possible to dump the two out into a MySQL database, where
they are both keyed on the link/URL. But that means you need to write to the
MySQL database twice for each URL, which is not good for performance.

That's why I am sticking to my own crawler for now; it works very well
for me.

Take a look at www.coolposting.com, which searches across multiple forums.
The crawler behind it is one I wrote based on the Nutch architecture, storing
each URL's content into MySQL.

If I want to open source my crawler, I will need to add some licensing terms
to the code before releasing it on www.jiansnet.com. Anyway, I will make the
crawler available soon, one way or another (open source, closed source but
free download, etc.).

Cheers,

Jian

On Nov 27, 2007 2:20 PM, Cool Coder wrote:

> Hello,
> I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.
>
> - BR
>
>


       

Re: How to read crawldb

Posted by jian chen <ch...@gmail.com>.
It is a bit convoluted at best.

I found out that the links and their metadata are stored in the crawldb
directory, and the actual raw HTTP content of the links is stored in the
different segments.

The crawldb and the segments are MapFiles or SequenceFiles, I think. So you
could use a MapFile.Reader or SequenceFile.Reader to read them and dump them
out in whatever format you like.
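
For what it's worth, a rough sketch of that idea might look something like the
code below. The crawldb path and part file name are just placeholders for
whatever your own crawl produced, and I have not tested this exact snippet:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.util.NutchConfiguration;

  public class CrawlDbDump {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // The current crawldb lives under <crawldb>/current; the part file
      // name below is a placeholder -- use whatever your crawl wrote.
      Path data = new Path("nutch-index/crawldb/current/part-00000/data");
      // A MapFile's data file is an ordinary SequenceFile, keyed by URL
      // with CrawlDatum values.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        // CrawlDatum.toString() prints status, fetch time, score, etc.
        System.out.println(url + "\t" + datum);
      }
      reader.close();
    }
  }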

However, so far I haven't figured out how to associate the crawldb links
with their contents. For example, while looping through the crawldb links, I
want to find each link's raw HTTP content. But I don't know how to do it yet.
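
One approach that might work, though I have not tried it, is to treat a
segment's content directory as a MapFile keyed by URL and look each crawldb
link up directly. The segment name, part directory and URL below are
placeholders, and I believe bin/nutch readseg -get does essentially this from
the command line:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.util.NutchConfiguration;

  public class SegmentContentLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // Placeholder paths -- substitute your own segment and part directory.
      String contentDir = "nutch-index/segments/20071127232049/content/part-00000";
      MapFile.Reader reader = new MapFile.Reader(fs, contentDir, conf);
      Text url = new Text("http://www.example.com/");  // a URL taken from the crawldb
      Content content = new Content();
      // MapFile lookup by key; returns null if the URL is not in this segment.
      if (reader.get(url, content) != null) {
        System.out.println(content.getContentType());
        // Raw fetched bytes (ignoring the charset here).
        System.out.println(new String(content.getContent()));
      }
      reader.close();
    }
  }

As far as I can tell you would still have to check each segment for a given
URL, since a crawldb entry does not record which segment its content went
into.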

That said, it is possible to dump the two out into a MySQL database, where
they are both keyed on the link/URL. But that means you need to write to the
MySQL database twice for each URL, which is not good for performance.

That's why I am sticking to my own crawler for now; it works very well
for me.

Take a look at www.coolposting.com, which searches across multiple forums.
The crawler behind it is one I wrote based on the Nutch architecture, storing
each URL's content into MySQL.

If I want to open source my crawler, I will need to add some licensing terms
to the code before releasing it on www.jiansnet.com. Anyway, I will make the
crawler available soon, one way or another (open source, closed source but
free download, etc.).

Cheers,

Jian

On Nov 27, 2007 2:20 PM, Cool Coder <te...@yahoo.com> wrote:

> Hello,
>           I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.
>
>  - BR
>
>