Posted to user@nutch.apache.org by Mónica Lamas González <ml...@teccon.es> on 2007/12/13 15:30:27 UTC

Problem with pdf and word indexing

Hi,
 
I have a system with Nutch configured. All the HTML pages, which are generated dynamically with Servlets and JSP, are indexed correctly by the crawl, but I have a problem with the PDF and Word files. My system saves those files in a database, and my portal has URLs that serve them. However, those URLs are JSP URLs, for example http://www.mydomain.com/myportal/file.jsp?id=xx, and this URL returns a PDF file. The crawl doesn't recognize this content. I tested my system with URLs like http://www.mydomain.com/myportal/file.pdf, and in this case Nutch indexes them correctly.
 
Could you help me?
 
Monica

Re: Problem with pdf and word indexing

Posted by Martin Kuen <ma...@gmail.com>.
Hi,

maybe the following ideas are helpful for you:
> Monica wrote
> I have a system with Nutch configured. All the HTML pages, which are generated dynamically with Servlets and JSP, are indexed correctly by the crawl, but I have a problem with the PDF and Word files. My system saves those files in a database, and my portal has URLs that serve them. However, those URLs are JSP URLs, for example http://www.mydomain.com/myportal/file.jsp?id=xx, and this URL returns a PDF file. The crawl doesn't recognize this content. I tested my system with URLs like http://www.mydomain.com/myportal/file.pdf, and in this case Nutch indexes them correctly.

If I recall correctly, Nutch uses the following procedure (in this
order) to determine a resource's MIME type:
1.) Try to map the file extension to a MIME type (http://..../file.jsp
is possibly mapped to HTML instead of PDF).
2.) Use the MIME type given in the HTTP header.
3.) Use magic number guessing.
If one of these heuristics comes up with a MIME type, the remaining
heuristics are not tried.

An example of case 1.) is the following URL:
"http://en.wikipedia.org/wiki/EMM386.EXE". This page will be mapped to
a MIME type called "dos/x-application" or something similar. Nutch will
report an error stating that it could not find a suitable plugin.
However, your browser will display this page correctly (it uses the
HTTP header's MIME type).
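
As a rough illustration of the first-match-wins ordering described above, here is a minimal Python sketch. It is not Nutch's code: the extension table, the mapping of .jsp to text/html, and the function names are all assumptions made up for this example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension table for illustration only (not Nutch's real mapping).
EXTENSION_TYPES = {
    ".html": "text/html",
    ".jsp": "text/html",   # assumption: a .jsp URL is treated as HTML
    ".pdf": "application/pdf",
}

# Leading bytes ("magic numbers") of two formats.
MAGIC_TYPES = [
    (b"%PDF-", "application/pdf"),
    # OLE2 container used by legacy .doc files (simplified to Word here).
    (b"\xd0\xcf\x11\xe0", "application/msword"),
]


def by_extension(url):
    """Heuristic 1: look at the extension of the URL path, ignoring the query."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return EXTENSION_TYPES.get(ext)


def by_magic(content):
    """Heuristic 3: compare the first bytes against known signatures."""
    for prefix, mime in MAGIC_TYPES:
        if content.startswith(prefix):
            return mime
    return None


def detect_first_match(url, header_type, content):
    """Return the answer of the first heuristic that produces one."""
    return by_extension(url) or header_type or by_magic(content)


if __name__ == "__main__":
    pdf_bytes = b"%PDF-1.4 example"
    print(detect_first_match(
        "http://www.mydomain.com/myportal/file.jsp?id=xx",
        "application/pdf",
        pdf_bytes,
    ))
```

Under these assumptions the file.jsp?id=xx URL is typed by its extension, so the PDF type from the HTTP header and the %PDF- signature in the body are never consulted. That would be consistent with the crawl not recognizing the content.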


On Dec 13, 2007 3:36 PM, Mónica Lamas González <ml...@teccon.es> wrote:
> Sorry, when I test the URL http://www.mydomain.com/myportal/file.pdf, I don't get any results in my searches.

There is a limit on a resource's size. I have found (in my case) that
this limit is set too low for PDF files. Content exceeding this limit
is simply truncated. This works fine for HTML files, but the PDF
parsing plugin fails if it is fed a truncated PDF file. See
nutch-default.xml for a property named http.content.limit (or similar)
and override it in nutch-site.xml. I don't know how the plugin for
Word files behaves in that situation.
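
An override in nutch-site.xml might look like the fragment below. This is a sketch only: the value shown (10 MB, in bytes) is an arbitrary illustration, not a recommendation, and the exact property name and description should be checked against the nutch-default.xml shipped with your version.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value: 10 * 1024 * 1024 bytes. Choose one that
         covers the largest PDF or Word file you expect to fetch. -->
    <value>10485760</value>
  </property>
</configuration>
```

Pages fetched before the change keep their truncated content, so they would need to be fetched again for the new limit to have any effect.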

The points above are based on my experience with the default Nutch
configuration.


Hope it helps,

Martin

RE: Problem with pdf and word indexing

Posted by Mónica Lamas González <ml...@teccon.es>.
Sorry, when I test the URL http://www.mydomain.com/myportal/file.pdf, I don't get any results in my searches.
 
Mónica

________________________________

From: Mónica Lamas González [mailto:mlamas@teccon.es]
Sent: Thu 13/12/2007 15:30
To: nutch-user@lucene.apache.org
Subject: Problem with pdf and word indexing



Hi,

I have a system with Nutch configured. All the HTML pages, which are generated dynamically with Servlets and JSP, are indexed correctly by the crawl, but I have a problem with the PDF and Word files. My system saves those files in a database, and my portal has URLs that serve them. However, those URLs are JSP URLs, for example http://www.mydomain.com/myportal/file.jsp?id=xx, and this URL returns a PDF file. The crawl doesn't recognize this content. I tested my system with URLs like http://www.mydomain.com/myportal/file.pdf, and in this case Nutch indexes them correctly.

Could you help me?

Monica