Posted to user@nutch.apache.org by paddz <pa...@aufwind.cc> on 2012/08/07 10:37:55 UTC

Parsing/Indexing alt tag

Hi,

I'm running Nutch 1.5.1 + Solr 3.6.1 and would like to crawl/index image *alt
tags/attributes*. Sadly, Google didn't give me an answer. Can I parse these
alt tags with the Tika plugin, or do I need to include an extra one?

Patrick



--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-Indexing-alt-tag-tp3999540.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Parsing/Indexing alt tag

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Patrick,

I 'think' you would need to create your own parser implementation for
this. There was a discussion here [0] which I hope you will find
useful.

hth
Lewis

[0] http://www.mail-archive.com/user@nutch.apache.org/msg06758.html

On Tue, Aug 7, 2012 at 9:37 AM, paddz <pa...@aufwind.cc> wrote:
> Hi,
>
> I'm running Nutch 1.5.1 + Solr 3.6.1 and would like to crawl/index image *alt
> tags/attributes*. Sadly, Google didn't give me an answer. Can I parse these
> alt tags with the Tika plugin, or do I need to include an extra one?
>
> Patrick



-- 
Lewis

RE: Parsing/Indexing alt tag

Posted by Markus Jelsma <ma...@openindex.io>.
You can write a simple parse filter plugin. With the NodeWalker you can walk all nodes of the DOM and get the alt attribute for img tags.

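        // Assumed imports: java.util.HashMap, org.w3c.dom.Attr, org.w3c.dom.NamedNodeMap,
        // org.w3c.dom.Node, org.apache.nutch.util.NodeWalker.
        // doc is the DOM node handed to the parse filter.
        NodeWalker walker = new NodeWalker(doc);
        while (walker.hasNext()) {
          Node currentNode = walker.nextNode();
          if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
            if ("img".equalsIgnoreCase(currentNode.getNodeName())) {
              HashMap<String,String> atts = getAttributes(currentNode);
              // atts.get("alt") is the alt text, or null if the img has none
            }
          }
        }

  // Copies a node's attributes into a map keyed by lower-cased attribute name.
  protected HashMap<String,String> getAttributes(Node node) {
    HashMap<String,String> attribMap = new HashMap<String,String>();

    NamedNodeMap attributes = node.getAttributes();

    for (int i = 0; i < attributes.getLength(); i++) {
      Attr attribute = (Attr) attributes.item(i);
      attribMap.put(attribute.getName().toLowerCase(), attribute.getValue());
    }

    return attribMap;
  }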
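
To try the DOM walk outside of Nutch, the following is a minimal standalone sketch that uses only the JDK's DOM classes. It is an illustration, not part of any Nutch plugin: the class name AltCollector and the methods collectAlts and altsOf are hypothetical, the recursive walk stands in for Nutch's NodeWalker, and the input is assumed to be well-formed XHTML (the JDK XML parser does not accept tag soup, and its attribute names are case-sensitive).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration only; not a Nutch class.
public class AltCollector {

  /** Depth-first walk that appends the alt text of every img element to out. */
  static void collectAlts(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      Element img = (Element) node;
      // hasAttribute distinguishes a missing alt from an empty alt="".
      if (img.hasAttribute("alt")) {
        out.add(img.getAttribute("alt"));
      }
    }
    // Recurse into children; this stands in for NodeWalker's iteration.
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      collectAlts(child, out);
    }
  }

  /** Parses a well-formed XHTML string and returns its img alt texts in document order. */
  static List<String> altsOf(String xhtml) throws Exception {
    Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    List<String> alts = new ArrayList<>();
    collectAlts(doc, alts);
    return alts;
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body><img src=\"a.png\" alt=\"A cat\"/>"
        + "<p><img src=\"b.png\"/><img src=\"c.png\" alt=\"Logo\"/></p></body></html>";
    System.out.println(altsOf(page));
  }
}
```

In a real parse filter the collected strings would still have to be attached to the parse result and mapped to a Solr field by an indexing filter; that part depends on the Nutch version and is not shown here.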

-----Original message-----
> From:Alexandre <al...@gmail.com>
> Sent: Mon 01-Oct-2012 15:05
> To: user@nutch.apache.org
> Subject: Re: Parsing/Indexing alt tag
> 
> Hi Patrick,
> 
> I have the same problem.
> Did you find a way to parse the alt attributes without rewriting a complete
> parse plugin?
> 
> Alex.

Re: Parsing/Indexing alt tag

Posted by Alexandre <al...@gmail.com>.
Hi Patrick,

I have the same problem.
Did you find a way to parse the alt attributes without rewriting a complete
parse plugin?

Alex.



--
View this message in context: http://lucene.472066.n3.nabble.com/Parsing-Indexing-alt-tag-tp3999540p4011181.html
Sent from the Nutch - User mailing list archive at Nabble.com.