Posted to user@nutch.apache.org by Jayadeep Reddy <ja...@ehealthaccess.com> on 2013/08/01 15:03:32 UTC

Way to fetch only new sites

I am using Nutch 2.1. Every time I run a crawl from the dmoz directory, the
pages already crawled and stored in the database are fetched again, which
takes a long time. Is there a way to crawl only new sites?

Thank you

-- 
Jayadeep Reddy.S,
M.D & C.E.O
e Health Access Pvt.Ltd
www.ehealthaccess.com
Hyderabad-Chennai-Banglore
http://www.youtube.com/watch?v=0k5LX8mw6Sk

Re: Way to fetch only new sites

Posted by A Laxmi <a....@gmail.com>.

Thanks Tejas! That was exactly what I was looking for.


Re: Way to fetch only new sites

Posted by Tejas Patil <te...@gmail.com>.

Nutch 2.1 officially supported MySQL as a datastore. A lot of issues were
reported with MySQL, so in the newer versions, i.e. 2.2.x, MySQL support
was removed. I would recommend using HBase, as it's the most stable
backend among all the supported ones.
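
A minimal sketch of what pointing Nutch 2.x at HBase might look like in
conf/nutch-site.xml. It assumes the property name and store class used in the
Nutch 2 tutorial; neither appears in this thread, so check them against the
release you run.

```xml
<!-- conf/nutch-site.xml: assumed setting, not taken from this thread -->
<configuration>
  <property>
    <!-- Tells Nutch which Gora datastore implementation to use -->
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
</configuration>
```

Under the same assumption, conf/gora.properties would carry the matching line
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore, and the
gora-hbase dependency would need to be enabled in ivy/ivy.xml before
rebuilding.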



Re: Way to fetch only new sites

Posted by Jayadeep Reddy <ja...@ehealthaccess.com>.

Thank you Julien,
I will get HBase and try to crawl.



Re: Way to fetch only new sites

Posted by A Laxmi <a....@gmail.com>.

Julien - does what you are saying about Nutch 2.x and SQL also apply to the
recent 2.2.1 release?



Re: Way to fetch only new sites

Posted by Julien Nioche <li...@gmail.com>.

If you are using Nutch 2.x, then you are actually accessing the SQL storage
via Apache GORA. The SQL backend in GORA does not work, and using it is not
advised. If you want to use Nutch 2, use a different backend such as HBase
or Cassandra; otherwise, use Nutch 1.x.




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Way to fetch only new sites

Posted by Jayadeep Reddy <ja...@ehealthaccess.com>.

No Julien, I am using MySQL.



Re: Way to fetch only new sites

Posted by Julien Nioche <li...@gmail.com>.

What GORA backend are you using?



Re: Way to fetch only new sites

Posted by Jayadeep Reddy <ja...@ehealthaccess.com>.

Laxmi, someone in the group should have a solution for skipping the pages
already in the database table while crawling new sites. I searched online
but can't find one.




Re: Way to fetch only new sites

Posted by A Laxmi <a....@gmail.com>.

Jayadeep - I have the same problem as well. When I run a fresh crawl, only
the URLs already in the webpage table are crawled, over and over; the new
URLs in seed.txt are ignored.
