Posted to solr-user@lucene.apache.org by AlessandroF <al...@gmail.com> on 2012/06/06 08:57:51 UTC
Extract information from url field
Hi All,
I would like to know if it's possible to set up a field so that Solr, after
a document is posted, automatically extracts part of another field's content
with a regexp and puts the result into that field.
e.g.
Given a URL field containing
http://www.myCompany.Com/Department/Service/index.html
configured as <field name="url" type="url" stored="true" indexed="true"
required="true"/>
after posting, the document should be split like this:
<doc>
....
<str name="url">http://www.myCompany.Com/Department/Service/index.html</str>
<str name="department">Department</str>
....
</doc>
Thanks for helping!
Alessandro
--
View this message in context: http://lucene.472066.n3.nabble.com/Extract-information-from-url-field-tp3987913.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Extract information from url field
Posted by AlessandroF <al...@gmail.com>.
Dear Jack,
Thank you very much, it works fine.
Alessandro
Re: Extract information from url field
Posted by Jack Krupansky <ja...@basetechnology.com>.
Yes, using PatternTokenizerFactory. Here's an example field type: if you
define a "department" field with this type and do a copyField from "url" to
"department", the field will end up with the department name alone. It handles
embedded punctuation (e.g., dot, dash, and underscore) and mixed-case words
(it breaks them into separate words). It is "text" rather than "string", so you
can search on individual name words or a phrase. It also lower-cases the name,
but you can skip that step by removing the LowerCaseFilterFactory filter.
<fieldType name="pat_url_department_text" class="solr.TextField"
           sortMissingLast="true">
  <analyzer>
    <!-- Emit only capture group 1: the first path segment after the host -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="://[^/]*/([^/]*)/" group="1"/>
    <!-- Split the segment on punctuation and on case changes -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <!-- Optional: lower-case the resulting words -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
-- Jack Krupansky
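
For reference, the "department" field and the copyField mentioned in the reply
could be declared in schema.xml roughly as follows. This is only a sketch and
is not taken from the thread: the field name and the copy from "url" follow the
message, while the indexed/stored values are assumptions.

```xml
<!-- Assumed declaration: a "department" field using the field type from the reply -->
<field name="department" type="pat_url_department_text"
       indexed="true" stored="false"/>

<!-- Copy the raw "url" value into "department" at index time;
     the field type's analyzer then reduces it to the department name -->
<copyField source="url" dest="department"/>
```

Analysis only changes the indexed terms, so a stored copy of this field would
most likely return the full URL rather than "Department"; the sketch therefore
leaves the field unstored and relies on it for searching only.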