Posted to user@nutch.apache.org by paddz <pa...@aufwind.cc> on 2012/08/07 10:37:55 UTC
Parsing/Indexing alt tag
Hi,
I'm running Nutch 1.5.1 + Solr 3.6.1 and would like to crawl/index image *alt
tags/attributes*. Sadly, Google didn't give me an answer. Can I parse these
alt tags with the tika plugin, or do I need to include an extra one?
Patrick
--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-Indexing-alt-tag-tp3999540.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Parsing/Indexing alt tag
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Patrick,
I 'think' you would need to create your own parser implementation for
this. There was a discussion here [0] which I hope you will find
useful.
hth
Lewis
[0] http://www.mail-archive.com/user@nutch.apache.org/msg06758.html
RE: Parsing/Indexing alt tag
Posted by Markus Jelsma <ma...@openindex.io>.
You can write a simple parse filter plugin. With the NodeWalker you can walk all nodes of the DOM and get the alt attribute for img tags.
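import java.util.HashMap;
import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

NodeWalker walker = new NodeWalker(doc);
// Visit every node in the DOM and inspect the img elements.
while (walker.hasNext()) {
  Node currentNode = walker.nextNode();
  if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
    if ("img".equalsIgnoreCase(currentNode.getNodeName())) {
      HashMap<String,String> atts = getAttributes(currentNode);
      String alt = atts.get("alt"); // null if the img has no alt attribute
    }
  }
}

// Copies a node's attributes into a map keyed by lower-cased attribute name.
protected HashMap<String,String> getAttributes(Node node) {
  HashMap<String,String> attribMap = new HashMap<String,String>();
  NamedNodeMap attributes = node.getAttributes();
  for (int i = 0; i < attributes.getLength(); i++) {
    Attr attribute = (Attr) attributes.item(i);
    attribMap.put(attribute.getName().toLowerCase(), attribute.getValue());
  }
  return attribMap;
}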
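For trying the idea outside of Nutch, the same walk can be sketched with only the JDK's org.w3c.dom classes. This is a minimal sketch under assumptions, not the plugin itself: the class AltTextCollector and its methods collectAltTexts and parseXhtml are hypothetical names, a plain recursive traversal stands in for Nutch's NodeWalker, and the sample markup is well-formed XHTML so that the JDK XML parser can read it. In a real parse filter the DOM would come from Nutch's HTML parser instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects img alt text from a W3C DOM tree.
final class AltTextCollector {

    // Returns the non-empty alt values of all img elements, in document order.
    static List<String> collectAltTexts(Node root) {
        List<String> alts = new ArrayList<>();
        collect(root, alts);
        return alts;
    }

    // Recursive depth-first walk; an assumption standing in for Nutch's NodeWalker.
    private static void collect(Node node, List<String> alts) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "img".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns "" when the attribute is absent
            String alt = ((Element) node).getAttribute("alt").trim();
            if (!alt.isEmpty()) {
                alts.add(alt);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, alts);
        }
    }

    // Demo-only convenience: parses well-formed XHTML with the JDK XML parser.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXhtml(
                "<html><body>"
                + "<img src=\"a.png\" alt=\"A cat\"/>"
                + "<p><IMG alt=\"A dog\"/></p>"
                + "<img src=\"c.png\"/>"
                + "</body></html>");
        System.out.println(collectAltTexts(doc)); // prints [A cat, A dog]
    }
}
```

The collected strings could then be joined and stored as parse metadata so that an indexing filter can copy them into a Solr field; that wiring depends on the Nutch version and is left out here.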
-----Original message-----
> From:Alexandre <al...@gmail.com>
> Sent: Mon 01-Oct-2012 15:05
> To: user@nutch.apache.org
> Subject: Re: Parsing/Indexing alt tag
>
> Hi Patrick,
>
> I have the same problem.
> Did you find a way to parse the alt attributes without rewriting a complete
> parse plugin?
>
> Alex.
Re: Parsing/Indexing alt tag
Posted by Alexandre <al...@gmail.com>.
Hi Patrick,
I have the same problem.
Did you find a way to parse the alt attributes without rewriting a complete
parse plugin?
Alex.
--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-Indexing-alt-tag-tp3999540p4011181.html
Sent from the Nutch - User mailing list archive at Nabble.com.