Posted to user@tika.apache.org by Anne Blankert <an...@geodan.nl> on 2010/02/01 10:04:41 UTC
Re: Keep attribute after parsing
Hello,
Changing the HTML handler through configuration is not so easy. I think I
had much the same question (see the list thread "How to customize parsing
html, retrieve <div> content"). The list came up with the following
solution (setting a custom HtmlMapper in the ParseContext should be
available as of Tika 0.6):
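    class MyHtmlMapper extends DefaultHtmlMapper {
        @Override
        public String mapSafeElement(String name) {
            // Let <div> through; defer to the default mapping for everything else.
            if ("DIV".equals(name)) return "div";
            return super.mapSafeElement(name);
        }
    }

    Parser parser = ...;
    ParseContext context = new ParseContext();
    // Register the custom mapper so the HTML parser picks it up from the context.
    context.set(HtmlMapper.class, new MyHtmlMapper());
    parser.parse(..., context);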
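
For the attribute part of the original question (keeping only id and class on a table), here is a minimal sketch of the filtering step on its own. It uses only the SAX classes that ship with the JDK, so it runs without Tika. The class name AttributeFilterSketch and the helper keepOnly are hypothetical names made up for this illustration, not part of Tika, and where such a step would be hooked into the parser is left open.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class AttributeFilterSketch {

    // Hypothetical helper: copy only the named attributes into a fresh list.
    static AttributesImpl keepOnly(Attributes atts, String... names) {
        AttributesImpl kept = new AttributesImpl();
        for (String name : names) {
            String value = atts.getValue(name); // lookup by qualified name
            if (value != null) {
                kept.addAttribute("", name, name, "CDATA", value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample input standing in for the attributes of a <table> element.
        AttributesImpl in = new AttributesImpl();
        in.addAttribute("", "id", "id", "CDATA", "prices");
        in.addAttribute("", "class", "class", "CDATA", "data");
        in.addAttribute("", "border", "border", "CDATA", "1");

        AttributesImpl out = keepOnly(in, "id", "class");
        // Prints: 2 prices data
        System.out.println(out.getLength() + " " + out.getValue("id")
                + " " + out.getValue("class"));
    }
}
```

Attributes missing from the input are simply skipped, so a table without an id or class ends up with an empty attribute list.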
Anne
On 2010-01-30 1:34, florent andré wrote:
> So ok, I found a solution - surely not the optimal one - but I will
> share my experience with you.
>
> HtmlHandler is not "extends enabled" because:
> 1 - all its fields are private and would have to be protected
> 2 - the same goes for resolve()
> 3 - calling super.startElement() is not so easy because of the
> body/title/discard level counting.
>
> HtmlParser is more extension-friendly, but the only reason I extend
> this class is to change the hardcoded new HtmlHandler in the expression
> parser.setContentHandler(new XHTMLDowngradeHandler( new
> HtmlHandler(this, handler, metadata)));
>
> to MyHtmlHandler(...).
>
> Maybe a configuration option for this class instantiation would be
> useful.
>
> Can you tell me if I am going about this the wrong way, and whether a
> way to "overwrite/extend" the parser's features is on your roadmap?
>
> My two pence...
> have a good day
> ++
>
> Florent André wrote:
>> Hi all,
>>
>> I am working on HTML parsing via the generic AutoDetectParser class.
>>
>> I have to keep some specific attributes (id and class) on <table>
>> elements in order to detect which tables are meaningful for my app.
>>
>> So, as far as I understand for now, I have to:
>> - extend HtmlHandler with MyHtmlHandler
>>
>> - in MyHtmlHandler, override public void startElement(...) with something
>> like this:
>>
>> if (bodyLevel == 0 && discardLevel == 0) {
>>     if ("TABLE".equals(name)) {
>>         // Copy only the id and class attributes onto the output element.
>>         AttributesImpl attributes = new AttributesImpl();
>>         String id = atts.getValue("id");
>>         // "class" is a reserved word in Java, so use another variable name.
>>         String cssClass = atts.getValue("class");
>>         if (id != null) {
>>             attributes.addAttribute("", "id", "id", "CDATA", id);
>>         }
>>         if (cssClass != null) {
>>             attributes.addAttribute("", "class", "class", "CDATA", cssClass);
>>         }
>>         xhtml.startElement("http://www.w3.org/1999/xhtml", "table",
>>                 "table", attributes);
>>     } else {
>>         // any element other than table
>>         super.startElement(...);
>>     }
>> } else {
>>     // any other bodyLevel / discardLevel
>>     super.startElement(...);
>> }
>>
>>
>> - And finally pass MyHtmlHandler to the parse() method via ParseContext.
>> *****
>>
>> * Is this the right way to do such a thing?
>> * How can I use the ParseContext to pass MyHtmlHandler? I can't find any
>> example of it...
>>
>>
>> Any comment will be much appreciated,
>>
>> Have a good day
>>
>
--
Drs. Anne Blankert
Geodan Systems & Research
President Kennedylaan 1
1079 MB Amsterdam (NL)
-------------------------------------
Tel: +31 (0)20 - 5711 311
Fax: +31 (0)20 - 5711 333
-------------------------------------
E-mail: anne.blankert@geodan.nl
Website: www.geodan.nl
Disclaimer: www.geodan.nl/disclaimer
-------------------------------------