Posted to user@nutch.apache.org by h b <hb...@gmail.com> on 2013/07/09 21:36:35 UTC

Batch id and Fetch list

Hi,
Use case:
* Scrape a given URL, e.g. mydomain.com/movies/general

* Parse this page, extract the URLs that match a certain pattern, and
download the pages for those URLs. Let's say the pages I want to
download have the format mydomain.com/movies/general?id=123

Now the problems I am facing are:
* Pagination: mydomain.com/movies/general/2 and so on
* Links on this page that match the same regex as this page's own URL:
mydomain.com/movies/kids, mydomain.com/movies/english, etc.

So when I fetch mydomain.com/movies/general and this page has links to the
next page as well as to mydomain.com/movies/kids, then for my next fetch I
now have 2 variations of pages.

One way I thought I could deal with this is by using batch_id. When I
fetch mydomain.com/movies/general, I use a batchId, say 'general'.
After a few iterations of these fetches, I end up fetching pages that are
the result of a crawl from the link mydomain.com/movies/kids, which was on
the mydomain.com/movies/general page.

At a later point I crawl mydomain.com/movies/kids with a separate batchId,
say 'kids'.

Now, if 'general' has fetched a movie 123 which is also a 'kids' movie,
then the fetch with the 'kids' batch_id won't have this movie 123. So if I
want a list of movies fetched under 'kids', I have missed this entry.

Sorry for the long email, but I hope this explains my problem.

Re: Batch id and Fetch list

Posted by Bai Shen <ba...@gmail.com>.
You'll have to write a plugin that does that.  Look at the parse and index
plugins.
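
To make the idea concrete, here is a minimal Python sketch of the bookkeeping such a plugin would need: walk the outlinks from each seed and record, for every page, the set of seeds it was reached from. This is not Nutch code. The function name `seeds_by_page` and the toy link graph are made up for illustration, and a real plugin would be written in Java against the Nutch plugin interfaces.

```python
from collections import deque

def seeds_by_page(seeds, outlinks):
    """Map every reachable page to the set of seeds it was reached from."""
    origins = {}
    for seed in seeds:
        queue = deque([seed])
        seen = {seed}
        while queue:
            page = queue.popleft()
            # A page may be reachable from several seeds, so keep a set.
            origins.setdefault(page, set()).add(seed)
            for target in outlinks.get(page, ()):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return origins

# Toy link graph (hypothetical): page3 is linked from both seeds.
outlinks = {
    "url1.com": ["page1", "page2", "page3"],
    "url2.com": ["page3", "page4"],
}
origins = seeds_by_page(["url1.com", "url2.com"], outlinks)
```

Because each page keeps a set of seeds, a page linked from both lists is not lost to whichever crawl reached it first. Note that if one seed page links to the other, everything under the second seed is also attributed to the first, so a depth limit or a URL filter may still be needed.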


Re: Batch id and Fetch list

Posted by h b <hb...@gmail.com>.
It kinda does.
But then what is the best way to tie a seed url to the url list that gets
generated?

So let's say my seed.txt has
url1.com
url2.com

So when fetch has fetched, say, page1, page2, page3 from url1 and
page4, page5, page6 from url2, how do I tell after the crawl that page4 is
from url2.com and page1 is from url1.com?




Re: Batch id and Fetch list

Posted by Bai Shen <ba...@gmail.com>.
Yes, generate marks the urls with the specified batch id.  However, the
next time those urls are generated, a new batch id will be set.  And
updatedb removes the generate batch id marker from the url.

Nutch does not send the batch id to Solr, which is why you are not able
to query it.

If you want to group URLs so that you can query them later in Solr, you
need to write an indexing filter that sets a separate field, which you can
then query in Solr.  Alternatively, you can have Solr look in the URL for
your general/kids/etc. keyword and search that way.

Make sense?
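
As a rough sketch of the "separate field" idea, the Python below derives a category from the URL path and attaches it to a document before indexing. The field name `category`, the helper names, and the plain-dict document are assumptions made for this example. In Nutch the same logic would live in a Java indexing filter, and the field would also need to be declared in schema.xml.

```python
from urllib.parse import urlparse

def category_from_url(url):
    """Return the path segment after 'movies', or None if there is none."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "movies":
        return segments[1]
    return None

def add_category_field(doc):
    """Return a copy of `doc` (a plain dict) with a hypothetical 'category' field."""
    enriched = dict(doc)
    enriched["category"] = category_from_url(doc["url"])
    return enriched
```

This only sees the keyword that is in the URL itself, so a movie page under /movies/general?id=123 is labelled 'general' even if it is also a kids movie.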


Re: Batch id and Fetch list

Posted by h b <hb...@gmail.com>.
My understanding is that when I specify a batch_id with generate, generate
marks a set of URLs to be fetched. So there should be some relation between
the URLs fetched (or marked to be fetched) and the batch_id, is that not so?

In the same context, with SOLR, I set the
    <field name="batchId" type="string" stored="true" indexed="true"/>

in my schema.xml, hoping that I could query Solr by batchId. However,
even after reindexing and restarting Solr, I do not see batchId in the
response. I added fl=batchId to my Solr query and got nothing back.



Re: Batch id and Fetch list

Posted by Bai Shen <ba...@gmail.com>.
This isn't what Batch ID is for.  If you're crawling on only the one server
and only want that specific section, use the regex-urlfilter to accept only
the specific pages you want.
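
For illustration, the Python sketch below mimics how an ordered list of +/- regex rules decides which URLs are kept: the first matching rule wins, and a URL that matches no rule is dropped. The patterns and the `accepts` helper are made up for this example, and the use of a partial (search) match is an assumption. In Nutch the rules themselves would go into conf/regex-urlfilter.txt.

```python
import re

# Hypothetical rules in the spirit of conf/regex-urlfilter.txt:
# '+' accepts, '-' rejects, and the first matching rule wins.
RULES = [
    ("+", r"^https?://mydomain\.com/movies/general(/\d+)?$"),   # listing page and its pagination
    ("+", r"^https?://mydomain\.com/movies/general\?id=\d+$"),  # individual movie pages
    ("-", r"."),                                                # everything else, e.g. /movies/kids
]

def accepts(url, rules=RULES):
    """Return True if the first rule matching `url` is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped
```

With these rules the crawl keeps mydomain.com/movies/general, its numbered pages, and the ?id= movie pages, while links to /movies/kids or /movies/english are filtered out.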

