Posted to user@tika.apache.org by Anne Blankert <an...@geodan.nl> on 2010/02/01 10:04:41 UTC

Re: Keep attribute after parsing

Hello,

Changing the HTML handler through the configuration is not so easy. I think I 
had about the same question (see the list thread "How to customize parsing 
html, retrieve <div> content"). The list came up with the following 
solution (setting MyHtmlMapper in the ParseContext should be available as of 
Tika 0.6):

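    class MyHtmlMapper extends DefaultHtmlMapper {
        // Let <div> through as a safe element instead of dropping it.
        // The name may arrive in upper or lower case depending on the
        // Tika version, so compare case-insensitively.
        @Override
        public String mapSafeElement(String name) {
            if ("div".equalsIgnoreCase(name)) return "div";
            return super.mapSafeElement(name);
        }
    }

    Parser parser = ...;
    ParseContext context = new ParseContext();
    // The HTML parser picks up the mapper from the parse context.
    context.set(HtmlMapper.class, new MyHtmlMapper());
    parser.parse(..., context);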
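
For the attribute-copying part of the original question (keeping only id 
and class on <table>), here is a minimal sketch that uses nothing but the 
JDK's SAX classes. TableAttributeFilter and keepIdAndClass are names I made 
up for illustration; they are not part of Tika. Whether the attributes still 
exist by the time a downstream handler sees them depends on the HTML mapping 
of your Tika version, so treat this as the copying logic in isolation, not 
as a tested Tika integration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper (not a Tika class): forwards SAX events unchanged,
// except that <table> start tags keep only their id and class attributes.
class TableAttributeFilter extends XMLFilterImpl {

    TableAttributeFilter(ContentHandler downstream) {
        // XMLFilterImpl forwards every event to this handler.
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("table".equalsIgnoreCase(localName)) {
            super.startElement(uri, localName, qName, keepIdAndClass(atts));
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    // Copies id and class (when present) into a fresh attribute list.
    static Attributes keepIdAndClass(Attributes atts) {
        AttributesImpl kept = new AttributesImpl();
        for (String name : new String[] {"id", "class"}) {
            String value = atts.getValue(name); // lookup by qualified name
            if (value != null) {
                kept.addAttribute("", name, name, "CDATA", value);
            }
        }
        return kept;
    }
}
```

You would wrap your own handler with it, e.g. new 
TableAttributeFilter(myHandler), and pass the wrapper wherever a 
ContentHandler is expected.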


Anne

On 2010-01-30 1:34, florent andré wrote:
> So ok, I found a solution - surely not the optimal one - but I will 
> share my experience with you.
>
> HtmlHandler is not easy to extend because:
> 1 - all fields are private and would have to be protected
> 2 - resolve() is private as well
> 3 - calling super.startElement() is not so easy because of the 
> body/title/discard level counting.
>
> HtmlParser is easier to extend, but the only reason I extend 
> this class is to replace the hardcoded new HtmlHandler in the expression 
> parser.setContentHandler(new XHTMLDowngradeHandler( new 
> HtmlHandler(this, handler, metadata)));
>
> with MyHtmlHandler(...).
>
> Maybe a configuration option for instantiating this class would be 
> useful.
>
> Can you tell me whether I am on the wrong track, and whether a way to 
> "overwrite/extend" the parser's features is on your roadmap?
>
> My two pence...
> have a good day
> ++
>
> Florent André wrote:
>> Hi all,
>>
>> I am working on HTML parsing via the generic AutoDetectParser class.
>>
>> I have to keep some specific attributes (id and class) on <table>
>> elements in order to detect which tables have meaning for my app.
>>
>> So, as far as I understand for now, I have to:
>> - extend HtmlHandler with MyHtmlHandler
>>
>> - in MyHtmlHandler, override public void startElement(...) with something
>> like this:
>>
>> if (bodyLevel == 0 && discardLevel == 0) {
>>   if ("TABLE".equals(name)) {
>>     AttributesImpl attributes = new AttributesImpl();
>>     String id = atts.getValue("id");
>>     // "class" is a reserved word in Java, so use another variable name
>>     String cssClass = atts.getValue("class");
>>     if (id != null) {
>>       attributes.addAttribute("", "id", "id", "CDATA", id);
>>     }
>>     if (cssClass != null) {
>>       attributes.addAttribute("", "class", "class", "CDATA", cssClass);
>>     }
>>     xhtml.startElement("http://www.w3.org/1999/xhtml", "table",
>>         "table", attributes);
>>   } else {
>>     // any element other than table
>>     super.startElement(...)
>>   }
>> } else {
>>   // any other bodyLevel/discardLevel
>>   super.startElement(...)
>> }
>>
>>
>> - And finally pass MyHtmlHandler to the parse() method via the ParseContext.
>> *****
>>
>> * Is this the right way to do such a thing?
>> * How can I use the ParseContext to pass MyHtmlHandler? I can't find any
>> example of it...
>>
>>
>> Any comment will be much appreciated,
>>
>> Have a good day
>>   
>


-- 

Drs. Anne Blankert

Geodan Systems & Research
President Kennedylaan 1
1079 MB Amsterdam (NL)
-------------------------------------
Tel: +31 (0)20 - 5711 311
Fax: +31 (0)20 - 5711 333
-------------------------------------
E-mail: anne.blankert@geodan.nl
Website: www.geodan.nl
Disclaimer: www.geodan.nl/disclaimer
-------------------------------------