Posted to user@nutch.apache.org by Max Lynch <ih...@gmail.com> on 2010/06/24 19:08:16 UTC

Indexing only PDFs

Hi,
I would like to crawl a list of pages but only index the PDFs.  From what I
gather, I can add an exclusion for all non-.pdf extensions in
crawl-urlfilter.txt.
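
Something like this is what I have in mind (just a sketch, assuming the
default regex-urlfilter syntax, where the first matching pattern decides):

    # accept URLs ending in .pdf
    +\.pdf$
    # reject everything else
    -.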

However, I would also like to apply an additional restriction: only index
pages that match a certain query.  In my head this doesn't seem like a great
way of doing things, since the documents won't be optimized for searching
until they are in Nutch's Lucene index (AFAIK), but could I do either of the
following?

   1. Write a plugin that does a naive full-text search of each "content"
   field before it is indexed and stops the indexing if my term isn't found
   (see the sketch after this list)
   2. Add a restriction to my solrindex import so that it only grabs
   documents matching a certain query
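
For option 1, I imagine something along these lines (a rough sketch against
the Nutch 1.x IndexingFilter extension point; the class name and the config
property are made up, and the exact interface varies a bit across Nutch
versions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Hypothetical filter: drop any page whose parsed text does not
    // contain a required term. Returning null skips indexing the page.
    public class RequiredTermIndexingFilter implements IndexingFilter {

      private Configuration conf;
      private String requiredTerm;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        String text = parse.getText();  // the parsed "content" text
        if (text == null || !text.toLowerCase().contains(requiredTerm)) {
          return null;  // term not found: do not index this document
        }
        return doc;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
        // "myplugin.required.term" is a made-up property name; it would
        // be set in nutch-site.xml
        requiredTerm = conf.get("myplugin.required.term", "").toLowerCase();
      }

      public Configuration getConf() {
        return conf;
      }
    }

It would presumably also need the usual plugin.xml descriptor and an entry
in plugin.includes to get picked up.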

The last one appeals to me in more ways than one, since my Nutch index isn't
the official index for the rest of my application.  I could see myself
applying a quick filter from Nutch to Solr, only giving Solr what I want.
But that means I waste time crawling stuff I don't need.

Is there a way to accomplish this, especially option #2?

Thanks!

Re: Indexing only PDFs

Posted by Max Lynch <ih...@gmail.com>.
On Thu, Jun 24, 2010 at 3:21 PM, Alexander Aristov <
alexander.aristov@gmail.com> wrote:

> [...] I wrote a plugin that indexes only office documents and skips HTML.


Interesting.  Thanks Alexander.

Do you know if there is a way to filter the docs when I do
$ nutch solrindex etc?
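
(By "etc" I mean the usual arguments; in my copy the usage is roughly

    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

with the Solr URL and the paths being placeholders for my setup.)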

-Max

Re: Indexing only PDFs

Posted by Alexander Aristov <al...@gmail.com>.
Hi

One thing to consider: Nutch adds new links from the documents it fetches,
so if you apply a filter at an early stage you won't get the URLs that lead
to new resources, and those resources won't be fetched in later rounds.

So you would have to specify every resource you want fetched explicitly in
the seed list.

That said, I wrote a plugin that indexes only office documents and skips HTML.
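
The core of it is just a content-type check in the indexing filter (a rough
sketch from memory; this would sit in the filter() method of an
IndexingFilter like the one sketched in your first message, and the metadata
key may differ between versions):

    // Skip HTML; let everything else (office documents, PDFs, ...) through.
    // The content type comes from the parse metadata.
    String type = parse.getData().getContentMeta().get("Content-Type");
    if (type != null && type.toLowerCase().startsWith("text/html")) {
      return null;  // returning null drops the document from the index
    }
    return doc;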

Best Regards
Alexander Aristov


On 24 June 2010 21:08, Max Lynch <ih...@gmail.com> wrote:

> Hi,
> I would like to crawl a list of pages but only index the PDFs. [...]
>
> Is there a way to accomplish this, especially option #2?
>
> Thanks!