Posted to user@nutch.apache.org by George Weller <ge...@markem.com> on 2007/10/22 18:19:05 UTC

PDF problems, inc. documents returned with XLS extension

Hi all,

I'm trying to use Nutch for an intranet search. After much reading on the
FAQs, wikis, and these lists I have it working very well for JSP pages, with
pretty decent quality results. I am however experiencing problems searching
for PDF documents.

First I note in the logs that a large number of PDF documents have been
fetched, and yet only two have been indexed, and indeed only these two
appear in search results. The content limit is set high enough to allow
these documents to be indexed, so I can't think why this should be.

Secondly for those documents that ARE indexed, rather bizarrely, the
document titles in the search results have a '.xls' extension. I can even
search for all PDF documents just by using the query 'xls'. Note that this
suffix is most definitely NOT in the actual title of those files. I also
chanced upon a site that seems to use Nutch (no affiliation, I just googled)
and found the same problem...

http://www.bfm.bm/nutch?query=xls&Submit=Go

I don't see any output from the "more.jsp" include either. I'm not certain
as I've never seen it working, but I imagine it's meant to add a "[PDF]"
chunk to the title.

Can someone explain why I'm having these problems?

Thanks very much,
George
-- 
View this message in context: http://www.nabble.com/PDF-problems%2C-inc.-documents-returned-with-XLS-extension-tf4671286.html#a13344771
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: PDF problems, inc. documents returned with XLS extension

Posted by George Weller <ge...@markem.com>.

Sami Siren-2 wrote:
> 
> George Weller wrote:
>> Hi all,
>> 
>> First I note in the logs that a large number of PDF documents have been
>> fetched, and yet only two have been indexed, and indeed only these two
>> appear in search results. The content limit is set high enough to allow
>> these documents to be indexed, so I can't think why this should be.
> 
> Are there any related errors in the log?
> 
>> Secondly for those documents that ARE indexed, rather bizarrely, the
>> document titles in the search results have a '.xls' extension. I can even
>> search for all PDF documents just by using the query 'xls'. Note that this
>> suffix is most definitely NOT in the actual title of those files. I also
>> chanced upon a site that seems to use Nutch (no affiliation, I just
>> googled)
>> and found the same problem...
> 
> In the examples from your site, the title is extracted from the PDF
> metadata, so it just uses the title stored within the PDF document.
> 
> -- 
>  Sami Siren
> 
> 
Thanks for the reply.

Yes you're absolutely right! I did a sample crawl on our production server
and I notice that it also returns some PDFs with ".doc" in the title.... I
can now see that this is due to whatever software was used to convert the
XLS or DOC documents to PDF format in the first place!
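
For anyone wanting to check this on their own files, here is a rough
sketch that looks for the embedded title in a PDF's raw bytes. It is
only an illustration: the function names are made up for this example,
it uses a simple regular expression rather than a real PDF parser, and
it assumes the title is stored as an uncompressed literal string (hex
strings, escaped parentheses and compressed object streams are not
handled).

```python
import re

# Matches a literal-string /Title entry such as "/Title (Budget.xls)".
# Assumption: the title is stored uncompressed and contains no nested or
# escaped parentheses; anything else is simply not found.
TITLE_RE = re.compile(rb"/Title\s*\(([^()\\]*)\)")


def embedded_title(pdf_bytes):
    """Return the first literal /Title value found in raw PDF bytes, or None."""
    match = TITLE_RE.search(pdf_bytes)
    if match is None:
        return None
    # latin-1 never fails to decode; real titles may use other encodings.
    return match.group(1).decode("latin-1")


def looks_like_source_filename(title, extensions=(".xls", ".doc")):
    """True when the embedded title ends with an office-file extension."""
    return title is not None and title.lower().endswith(extensions)
```

Reading a file with open(path, "rb").read() and passing the bytes to
embedded_title() should show whether the converter left the original
file name (for example one ending in ".xls") in the title field.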

I couldn't spot any other errors in the log, but I think I managed to solve
the other problem too. I had the content limit set to around 1.6MB IIRC,
which after a quick survey of common documents I concluded would be enough to
allow indexing of the main docs that people would search for (most of which
were a couple of hundred kilobytes), but it seems that it wasn't enough. I
have now set it to be unlimited (i.e. -1), and I'm getting proper results.
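
A size survey like the one mentioned above can be scripted. The sketch
below is only an illustration under some assumptions: the documents are
also available on a local file system, the function name is made up for
this example, and the limit is a plain byte count where a negative value
means "unlimited", as with the -1 setting.

```python
import os


def pdfs_over_limit(root, limit_bytes):
    """Return sorted (path, size) pairs for PDFs under root above limit_bytes.

    A negative limit is treated as "no limit", so nothing is reported.
    """
    if limit_bytes < 0:
        return []
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match so "REPORT.PDF" is also counted.
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > limit_bytes:
                    oversized.append((path, size))
    return sorted(oversized)
```

Calling pdfs_over_limit() with the document root and the configured
limit lists the files that would be cut off at that limit, which is a
quick way to see whether a finite value is realistic before falling
back to -1.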

Now I just need to find out what "more.jsp" does, and how to get it going...
Back to the wiki I think!

Thanks again,
George


Re: PDF problems, inc. documents returned with XLS extension

Posted by Sami Siren <ss...@gmail.com>.
George Weller wrote:
> Hi all,
> 
> First I note in the logs that a large number of PDF documents have been
> fetched, and yet only two have been indexed, and indeed only these two
> appear in search results. The content limit is set high enough to allow
> these documents to be indexed, so I can't think why this should be.

Are there any related errors in the log?

> Secondly for those documents that ARE indexed, rather bizarrely, the
> document titles in the search results have a '.xls' extension. I can even
> search for all PDF documents just by using the query 'xls'. Note that this
> suffix is most definitely NOT in the actual title of those files. I also
> chanced upon a site that seems to use Nutch (no affiliation, I just googled)
> and found the same problem...

In the examples from your site, the title is extracted from the PDF
metadata, so it just uses the title stored within the PDF document.

-- 
 Sami Siren