Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/12 23:36:36 UTC
[Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "ExtractingRequestHandler" page has been changed by YonikSeeley.
The comment on this change is: switch examples to multivalued txt field.
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=68&rev2=69
--------------------------------------------------
= Examples =
== Mapping and Capture ==
- Capture <div> tags separate, and then map that field to a dynamic field named foo_t.
+ Capture <div> tags separate, and then map that field to a dynamic field named foo_txt.
{{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_txt&capture=div" -F "tutorial=@tutorial.pdf"
}}}
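The request above packs five parameters into one URL, which makes it easy to misread. As a minimal sketch, assuming a POSIX shell, the parameters can be listed one per line. The function name `build_extract_url` is hypothetical and is not part of Solr or curl:

```shell
# Hypothetical helper (not part of Solr): assemble the extract URL
# from the same parameters used in the example above.
build_extract_url() {
  doc_id=$1   # value for literal.id
  target=$2   # field that the captured <div> content is mapped to
  printf '%s?%s&%s&%s&%s&%s\n' \
    "http://localhost:8983/solr/update/extract" \
    "literal.id=${doc_id}" \
    "captureAttr=true" \
    "defaultField=text" \
    "fmap.div=${target}" \
    "capture=div"
}
```

It would be used as `curl "$(build_extract_url doc2 foo_txt)" -F "tutorial=@tutorial.pdf"`. The arguments are inserted as-is, so this sketch assumes they need no URL encoding.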
== Mapping, Capture and Boost ==
- Capture <div> tags separate, and then map that field to a dynamic field named foo_t. Boost foo_t by 3.
+ Capture <div> tags separate, and then map that field to a dynamic field named foo_txt. Boost foo_txt by 3.
{{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" -F "tutorial=@tutorial.pdf"
}}}
== Literals ==
To add in your own metadata, pass in the literal parameter along with the file:
{{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.blah_s=Bah" -F "tutorial=@tutorial.pdf"
}}}
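Each piece of metadata is passed as its own `literal.<fieldname>=<value>` parameter, as `literal.id` and `literal.blah_s` are above. As a rough sketch, assuming a POSIX shell, several literals could be appended to a base URL like this. The function name `add_literals` is hypothetical, and the values are assumed to need no URL encoding:

```shell
# Hypothetical helper (not part of Solr): append one
# literal.<field>=<value> parameter for each field=value argument.
add_literals() {
  url=$1   # base URL that already contains a query string
  shift
  for pair in "$@"; do
    url="${url}&literal.${pair}"
  done
  printf '%s\n' "$url"
}
```

For example, `add_literals "http://localhost:8983/solr/update/extract?literal.id=doc4" blah_s=Bah` yields a URL ending in `&literal.blah_s=Bah`.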
== XPath ==
Restrict the XHTML returned by Tika by passing in an XPath expression:
{{{
- curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "tutorial=@tutorial.pdf"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "tutorial=@tutorial.pdf"
}}}
== Extract Only ==
{{{
curl "http://localhost:8983/solr/update/extract?&extractOnly=true" --data-binary @tutorial.html -H 'Content-type:text/html'
}}}
- A the output includes XML generated by Tika (and is hence further escaped by Solr's XML) using a different output format enhance the readability:
+ The output includes XML generated by Tika and is thus further escaped by Solr's XML format. Using a different output format such as json or ruby improves readability:
{{{
curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true" --data-binary @tutorial.html -H 'Content-type:text/html'