Posted to user@nutch.apache.org by Eyeris RodrIguez Rueda <er...@uci.cu> on 2015/04/14 20:06:51 UTC

Re: [MASSMAIL]how to get and index the filename of resources?

Could anyone please help with this problem?


----- Original message -----
From: "Eyeris RodrIguez Rueda" <er...@uci.cu>
To: user@nutch.apache.org
Sent: Wednesday, April 1, 2015 10:47:22
Subject: [MASSMAIL]how to get and index the filename of resources?

Hi all,
I am using Nutch 1.9 (local mode) and Solr 4.10.
I want to index the file names of resources in Solr, but Nutch does not extract this information from the URLs.
This is important because some PDFs don't have a title, and I could use the file name as an alternative.
If I run parsechecker on this URL
http://www.prensa-latina.cu/images/stories/Media/NegociosEnCuba.pdf

I can see that the title is empty, so in this case I must use the file name.
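
To make the request concrete, here is a minimal sketch in plain Java of the fallback I am after. It does not use any Nutch API; the class and method names (FileNameFromUrl, fileName, titleOrFileName) are made up for illustration, and in Nutch this logic would presumably have to live in a custom indexing plugin, which is not shown here.

```java
import java.net.URI;

public class FileNameFromUrl {

    // Hypothetical helper: returns the last path segment of the URL,
    // or an empty string when the URL has no path or ends with '/'.
    public static String fileName(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Hypothetical helper: keeps the parsed title when it is non-blank,
    // otherwise falls back to the file name taken from the URL.
    public static String titleOrFileName(String title, String url) {
        if (title != null && !title.trim().isEmpty()) {
            return title;
        }
        return fileName(url);
    }

    public static void main(String[] args) {
        String url = "http://www.prensa-latina.cu/images/stories/Media/NegociosEnCuba.pdf";
        // The title of this PDF is empty, so the file name is used instead.
        System.out.println(titleOrFileName("", url)); // prints NegociosEnCuba.pdf
    }
}
```
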
This is my plugin.includes property. I have activated index-more, but the problem persists.

<property>
  <name>plugin.includes</name>
  <value>protocol-(ftp|http|httpclient)|urlfilter-(domain|regex)|parse-(html|tika|metatags|zip)|mimetype-filter|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|microformats-customtag|language-identifier|links-extractor</value>
</property>


Is there any way to do that?
Any suggestions or pointers would be appreciated.
Thanks in advance.