Posted to dev@lucene.apache.org by "Ole Jørgen Brønner (JIRA)" <ji...@apache.org> on 2016/12/21 01:08:58 UTC

[jira] [Commented] (SOLR-7027) ExtractingRequestHandler indiscriminantly dumps all source HTML attributes into the catch-all field when captureAttr=false, but it should be more selective, something like only href, title, alt, etc. attributes

    [ https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765748#comment-15765748 ] 

Ole Jørgen Brønner commented on SOLR-7027:
------------------------------------------

A semi-related issue that caught me off guard: it doesn't seem to be possible to capture both attribute values ({{captureAttr}}) and element content ({{capture=h1}}) while still being able to distinguish the content from the attributes.

Without {{captureAttr}}, the content captured in the {{h1}} field will be very low quality, since h1 tags commonly carry attributes such as {{class}}, and their values are appended to the captured text. With {{captureAttr}}, the attribute values are stored in the same field instead; it doesn't seem possible to map the attributes and the content to different fields. They are stored as separate values in the multivalued field, but I don't think that helps much.
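
To make the mixing concrete, here is a toy model of the {{captureAttr=true}} case. It is not Solr's {{SolrContentHandler}}: the class {{CaptureAttrModel}}, its {{extract}} method, and the single-buffer text handling are all made up for illustration, and only the "attribute values go into the field named after the element" step follows the {{startElement}} code quoted in the issue.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Toy model only (hypothetical class, not Solr code).
class CaptureAttrModel extends DefaultHandler {
    // Field name -> values, standing in for a multivalued Solr field.
    final Map<String, List<String>> fields = new LinkedHashMap<>();
    // Text seen since the most recent start tag (nested markup is ignored here).
    private final StringBuilder text = new StringBuilder();

    private void addField(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0);
        // Mirrors the captureAttr=true branch of the quoted code:
        // every attribute value lands in the field named after the element.
        for (int i = 0; i < attributes.getLength(); i++) {
            addField(localName, attributes.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Stand-in for capture=h1: the element text goes into the same "h1" field.
        if ("h1".equals(localName)) {
            addField(localName, text.toString().trim());
        }
    }

    // Parses a well-formed XHTML string and returns the collected fields.
    static Map<String, List<String>> extract(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            CaptureAttrModel handler = new CaptureAttrModel();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling {{CaptureAttrModel.extract("<html><body><h1 class=\"title\" id=\"top\">Hello</h1></body></html>")}} leaves the {{h1}} list holding {{title}}, {{top}} and {{Hello}} as three sibling values, with nothing to mark which one is the heading text.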

The documentation says that when capturing elements ({{capture=h1}}) the content should also be present in the catch-all content field, but that doesn't align with my observations.

> ExtractingRequestHandler indiscriminantly dumps all source HTML attributes into the catch-all field when captureAttr=false, but it should be more selective, something like only href, title, alt, etc. attributes
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.0
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 5.2, 6.0
>
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contain lots of unwanted cruft: the values of {{class}} and {{style}} attributes, etc.
> It would be much better if only attribute values containing addresses, tooltip text, etc. were dumped into the catch-all field.  Here are a couple of places where this kind of attribute is described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
>     // List of attributes that need to be resolved.
>     private static final Set<String> URI_ATTRIBUTES =
>         new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}
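
A minimal sketch of what a more selective {{captureAttr=false}} branch could look like, assuming an allow-list approach. This is not a patch from the issue: the class {{SelectiveAttributeSketch}}, the set {{KEPT_ATTRIBUTES}} and the method {{appendSelected}} are hypothetical names, and the list itself simply combines Tika's {{URI_ATTRIBUTES}} with the {{title}} and {{alt}} attributes mentioned in the issue title. It also assumes attribute local names can be compared case-insensitively.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch, not the actual SolrContentHandler change.
class SelectiveAttributeSketch {
    // Assumed allow-list: Tika's URI attributes plus "title" and "alt".
    static final Set<String> KEPT_ATTRIBUTES = new HashSet<>(
        Arrays.asList("src", "href", "longdesc", "cite", "title", "alt"));

    // Appends only allow-listed attribute values to the catch-all builder,
    // instead of appending every value as the quoted line 283 does.
    static void appendSelected(StringBuilder catchAll, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String name = attributes.getLocalName(i).toLowerCase(Locale.ROOT);
            if (KEPT_ATTRIBUTES.contains(name)) {
                catchAll.append(' ').append(attributes.getValue(i));
            }
        }
    }

    // Small usage example: "class" is dropped, "href" and "title" are kept.
    static String demo() {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "btn btn-primary");
        attrs.addAttribute("", "href", "href", "CDATA", "https://example.org/");
        attrs.addAttribute("", "title", "title", "CDATA", "Example link");
        StringBuilder catchAll = new StringBuilder();
        appendSelected(catchAll, attrs);
        return catchAll.toString();
    }
}
```

Here {{demo()}} returns {{" https://example.org/ Example link"}}: the styling cruft is filtered out while the address and tooltip text survive. Whether the allow-list should be fixed or configurable is a design question the sketch leaves open.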


