Posted to solr-user@lucene.apache.org by AlessandroF <al...@gmail.com> on 2012/06/06 08:57:51 UTC

Extract information from url field

Hi All,
I would like to know if it's possible to set up a field so that Solr, after
a document is posted, automatically extracts part of another field's content
with a regexp and puts the result into that field.

e.g. 

Given a URL field containing
http://www.myCompany.Com/Department/Service/index.html
configured as <field name="url" type="url" stored="true" indexed="true"
required="true"/>

after posting, it should be split like this:

<doc>
....
<str name="url">http://www.myCompany.Com/Department/Service/index.html</str>
<str name="department">Department</str>
....
</doc>

Thanks for helping!

Alessandro





--
View this message in context: http://lucene.472066.n3.nabble.com/Extract-information-from-url-field-tp3987913.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Extract information from url field

Posted by AlessandroF <al...@gmail.com>.

Dear Jack,
Thank you very much, it works fine.
Alessandro



Re: Extract information from url field

Posted by Jack Krupansky <ja...@basetechnology.com>.

Yes, using PatternTokenizerFactory. Here's an example field type: if you 
define a "department" field with this type and do a copyField from "url" to 
"department", it will end up with the department name alone. It handles 
embedded punctuation (e.g., dot, dash, and underscore) and mixed-case words 
(it breaks them into separate words). It is "text" rather than "string", so 
you can search on individual name words or a phrase. It also lower-cases the 
name, but you can skip that step.

<fieldType name="pat_url_department_text" class="solr.TextField"
           sortMissingLast="true">
  <analyzer>
    <!-- Emit only capture group 1: the first path segment after the host -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="://[^/]*/([^/]*)/" group="1"/>
    <!-- Split on embedded punctuation and on case changes -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <!-- Optional: remove this filter to keep the original case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
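
The field and copyField wiring described above might look like the following 
in schema.xml. This is only a sketch and not part of the original reply: the 
indexed and stored settings are assumptions. Keep in mind that copyField 
copies the raw "url" value and the analyzer only changes the indexed terms, 
so a stored "department" field would still return the full URL; that is why 
stored="false" is used here.

```xml
<!-- Hypothetical wiring; the indexed/stored values are assumptions -->
<field name="department" type="pat_url_department_text"
       indexed="true" stored="false"/>

<!-- Feed every incoming "url" value through the department analyzer -->
<copyField source="url" dest="department"/>
```

With this in place, a query such as department:department should match the 
example document, since the indexed term is the lower-cased path segment.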






-- Jack Krupansky