Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/12 23:36:36 UTC

[Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley

The "ExtractingRequestHandler" page has been changed by YonikSeeley.
The comment on this change is: switch examples to multivalued txt field.
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=68&rev2=69

--------------------------------------------------

  
  = Examples =
  == Mapping and Capture ==
- Capture <div> tags separate, and then map that field to a dynamic field named foo_t.
+ Capture <div> tags separate, and then map that field to a dynamic field named foo_txt.
  
  {{{
-  curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div"  -F "tutorial=@tutorial.pdf"
+  curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_txt&capture=div"  -F "tutorial=@tutorial.pdf"
  }}}
  == Mapping, Capture and Boost ==
- Capture <div> tags separate, and then map that field to a dynamic field named foo_t.  Boost foo_t by 3.
+ Capture <div> tags separate, and then map that field to a dynamic field named foo_txt.  Boost foo_txt by 3.
  
  {{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" -F "tutorial=@tutorial.pdf"
  }}}
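The mapped field name appears twice in the request above, once in `fmap.div` and once in the `boost.` parameter, so a rename like this one has to touch both places. A minimal sketch that keeps them in sync with a single variable; `FIELD`, `BOOST`, `PARAMS` and `EXTRACT_URL` are illustrative names, not part of Solr:

```shell
#!/bin/sh
# Illustrative variables (not Solr settings): one field name feeds both
# the fmap.div target and the boost.<field> parameter.
FIELD="foo_txt"
BOOST=3

PARAMS="literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=${FIELD}&boost.${FIELD}=${BOOST}"
EXTRACT_URL="http://localhost:8983/solr/update/extract?${PARAMS}"

# Print the URL so it can be inspected before sending anything.
printf '%s\n' "$EXTRACT_URL"

# To send the request (needs a running Solr and tutorial.pdf in the current directory):
# curl "$EXTRACT_URL" -F "tutorial=@tutorial.pdf"
```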
  == Literals ==
  To add in your own metadata, pass in the literal parameter along with the file:
  
  {{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah"  -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah"  -F "tutorial=@tutorial.pdf"
  }}}
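When several literal fields are needed, the query string can be assembled programmatically. The sketch below uses a hypothetical helper, `build_extract_url` (not part of Solr), and assumes the field values contain no characters that need URL-encoding:

```shell
#!/bin/sh
# Base URL taken from the examples on this page.
SOLR_URL="http://localhost:8983/solr"

# Hypothetical helper: the first argument is the document id, every further
# argument is a field=value pair that becomes literal.<field>=<value>.
# Assumption: values are already URL-safe; no encoding is performed.
build_extract_url() {
  doc_id="$1"
  shift
  url="$SOLR_URL/update/extract?literal.id=$doc_id"
  for pair in "$@"; do
    url="$url&literal.$pair"
  done
  printf '%s\n' "$url"
}

# To send the request (needs a running Solr and tutorial.pdf in the current directory):
# curl "$(build_extract_url doc4 blah_s=Bah)" -F "tutorial=@tutorial.pdf"
```

The other parameters from the example (`captureAttr`, `capture`, `fmap.div`, and so on) are left out here to keep the sketch short; they would be appended to the URL in the same way.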
  == XPath ==
  Restrict the XHTML returned by Tika by passing in an XPath expression:
  
  {{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()"  -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()"  -F "tutorial=@tutorial.pdf"
  }}}
  == Extract Only ==
  {{{
  curl "http://localhost:8983/solr/update/extract?&extractOnly=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'
  }}}
- A the output includes XML generated by Tika (and is hence further escaped by Solr's XML) using a different output format enhance the readability:
+ The output includes XML generated by Tika and is thus further escaped by Solr's XML format. Using a different output format such as json or ruby improves readability:
  
  {{{
  curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'