Posted to user@nutch.apache.org by steven shingler <sh...@gmail.com> on 2006/09/12 18:13:15 UTC

caching - filetypes

Hi all,

I'm trying to find out which filetypes nutch will cache.

For example, it caches HTML but not PDF.

Is there any documentation on how different filetypes are handled?

Is it possible to configure nutch to cache pdfs etc?

Any advice very gratefully received.
Thanks,
Steve

Re: caching - filetypes

Posted by Alvaro Cabrerizo <to...@gmail.com>.
Hi,
Looking at your website, I can see two different kinds of results:

 - The first hit, for example,
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf, has no summary, and
its cache link produces the problem.

 - The third hit belongs to the second group: these hits have a summary, and
their cache link works fine.

So it looks like Nutch can't access the content of the hits in the first
group. Maybe the parse-pdf plugin can't handle this PDF. That can happen, and
it would also explain why the title of the first-group hits is the URL rather
than the title stored inside the PDF document.

If I were you, I would crawl only the first hit (
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf ) and look at the log
file. If parse-pdf can't handle this document, you will see a big ERROR
message.

Hope it helps.

Alvaro C.


Re: caching - filetypes

Posted by Jacob Brunson <ja...@gmail.com>.
>
> I'm not sure I completely understand your email.
> What do you mean by "cache"?

On the standard search results page, each hit has a link to a cached
copy of the page. If the page was HTML, there are no problems. However,
if the page was binary, the link returns an HTTP 500 internal server
error.

You can see this if you click on the "cached" link of any of the pdf
documents in the search results on my search engine:
http://ldssearch.com/search.jsp?lang=en&query=pdf




-- 
http://JacobBrunson.com

Re: caching - filetypes

Posted by Ernesto De Santis <de...@yahoo.com.ar>.
Hi Steven

I'm not sure I completely understand your email.
What do you mean by "cache"?

If you want to crawl PDFs, you need to remove the URL filter that excludes them.

In your crawl-urlfilter.txt there should be a line starting with a minus sign
followed by a list of file extensions. Delete the pdf extension from that list.

Good luck
Ernesto.
PS: I'm a Nutch beginner, but since nobody had responded to you, I'm trying
to help.
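
The filter rule described in this message can be illustrated with a small
sketch. It is not Nutch code: it is a Python approximation that assumes the
filter file is a list of regular expressions prefixed with "+" (accept) or
"-" (reject), and that the first matching rule decides. The extension lists
and the names RULES_WITH_PDF_BLOCKED, RULES_WITH_PDF_ALLOWED and accepts()
are made up for illustration, so check them against your own
crawl-urlfilter.txt.

```python
import re

# Hypothetical rule lists in the style of crawl-urlfilter.txt: each rule is a
# '+' (accept) or '-' (reject) prefix followed by a regular expression.
# The extension lists are illustrative, not copied from any Nutch release.
RULES_WITH_PDF_BLOCKED = [
    r"-\.(gif|jpg|png|css|zip|exe|pdf)$",  # reject URLs ending in these extensions
    r"+.",                                 # accept everything else
]

# Same rules with "pdf" deleted from the rejected extensions.
RULES_WITH_PDF_ALLOWED = [
    r"-\.(gif|jpg|png|css|zip|exe)$",
    r"+.",
]


def accepts(url, rules):
    """Return True if the first rule whose pattern matches `url` is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    pdf_url = "http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf"
    print(accepts(pdf_url, RULES_WITH_PDF_BLOCKED))
    print(accepts(pdf_url, RULES_WITH_PDF_ALLOWED))
```

Removing the extension only lets the URL through the filter. Whether the
fetched PDF is then parsed and summarized still depends on the parse-pdf
plugin, as discussed in the later reply in this thread.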

