Posted to solr-user@lucene.apache.org by "Hanjan, Harinder" <Ha...@calgary.ca> on 2018/06/13 20:29:29 UTC

RE: [EXT] Re: Extracting top level URL when indexing document

Thank you, Alex. I have managed to get this to work via URLClassifyProcessorFactory. If anyone is interested, it can be done easily with the following solrconfig.xml:

<updateRequestProcessorChain name="urlProcessor">
  <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">SolrId</str>
    <str name="domainOutputField">hostname</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">urlProcessor</str>
  </lst>
</requestHandler>
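
For anyone wondering what kind of value lands in the hostname field, here is a rough illustration in plain Java. It is not the processor's code. It assumes that domainOutputField receives the host component of the URL found in inputField, in the way java.net.URI.getHost() returns it; check the processor source for the exact behavior. The class and method names (HostnameSketch, hostOf) are made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HostnameSketch {
    // Hypothetical helper: returns the host part of a URL,
    // or null if the string cannot be parsed as a URI.
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Two of the document URLs mentioned in this thread.
        System.out.println(hostOf("http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf"));
        System.out.println(hostOf("https://liveandplay.calgary.ca/DROPIN/page/dropin"));
    }
}
```

Note that under this assumption the value is the bare host, without the scheme.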

I will look at how to submit a patch to the Javadoc.

Thanks!
Harinder

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafalov@gmail.com] 
Sent: Wednesday, June 13, 2018 12:13 AM
To: solr-user <so...@lucene.apache.org>
Subject: [EXT] Re: Extracting top level URL when indexing document

Try URLClassifyProcessorFactory in the processing chain instead, configured in solrconfig.xml.

There is very little documentation for it, so check the source for exact params. Or search for the blog post introducing it several years ago.

Documentation patches would be welcome.

Regards,
    Alex

On Wed, Jun 13, 2018, 01:02 Hanjan, Harinder, <Ha...@calgary.ca>
wrote:

> Hello!
>
> I am indexing web documents and have a need to extract their top-level 
> URL to be stored in a different field. I have had some success with 
> the PatternTokenizerFactory (relevant schema bits at the bottom) but 
> the behavior appears to be inconsistent.  Most of the time, the top-level
> URL is extracted just fine but for some documents, it is being cut off.
>
> Examples:
>
> URL:           http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf
> Extracted URL: http://www.calgaryarb.ca
> Comment:       Success
>
> URL:           http://www.calgarymlc.ca/about-cmlc/
> Extracted URL: http://www.calgarymlc.ca
> Comment:       Success
>
> URL:           http://www.calgarypolicecommission.ca/reports.php
> Extracted URL: http://www.calgarypolicecommissio
> Comment:       Fail
>
> URL:           https://attainyourhome.com/
> Extracted URL: https://attai
> Comment:       Fail
>
> URL:           https://liveandplay.calgary.ca/DROPIN/page/dropin
> Extracted URL: https://livea
> Comment:       Fail
>
> Relevant schema:
> <copyField dest="hostname" source="SolrId"/>
>
> <field name="hostname" type="hostnameType" stored="true" indexed="false"
>        multiValued="false"/>
>
> <fieldType name="hostnameType" class="solr.TextField" sortMissingLast="true">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory"
>                pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
>                group="0"/>
>   </analyzer>
> </fieldType>
>
>
> I have tested the regex and it matches as expected. Please see
> https://regex101.com/r/wN6cZ7/358.
> So it appears that I have a gap in my understanding of how Solr's
> PatternTokenizerFactory works. I would appreciate any insight on the issue.
> The hostname field will be used in facet queries.
>
> Thank you!
> Harinder
>
>
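
A possible explanation for the truncated values in the original question, offered as an observation and not a confirmed diagnosis: as the pattern is quoted in this thread, its character classes read [^@/n] and [^:/n], which exclude the letter n and not a newline (\n). Each failing host is cut off just before its first n, and the two successful hosts contain no n. The thread does not show whether the backslash was lost in the mail or is missing from the actual schema. The sketch below runs the quoted pattern, and a variant with the assumed intent, through java.util.regex (the regex engine Solr's pattern tokenizer uses). The class name, method name and the "assumed intent" pattern are hypothetical, and the URLs are decoded from the rewritten links in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostPatternCheck {
    // Pattern exactly as quoted in the thread: "/n" inside the
    // character classes excludes '/' and the letter 'n'.
    static final Pattern AS_QUOTED =
        Pattern.compile("^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)");

    // Assumption: the intent was to exclude a newline, and to match
    // a literal dot after "www". Not confirmed by the thread.
    static final Pattern ASSUMED_INTENT =
        Pattern.compile("^https?://(?:[^@/\\n]+@)?(?:www\\.)?([^:/\\n]+)");

    // Returns the whole match (group 0, mirroring group="0" in the
    // schema), or null when the pattern does not match.
    static String wholeMatch(Pattern pattern, String url) {
        Matcher m = pattern.matcher(url);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
            "http://www.calgarypolicecommission.ca/reports.php",
            "https://attainyourhome.com/",
            "https://liveandplay.calgary.ca/DROPIN/page/dropin"
        };
        for (String url : urls) {
            System.out.println(url);
            System.out.println("  as quoted:      " + wholeMatch(AS_QUOTED, url));
            System.out.println("  assumed intent: " + wholeMatch(ASSUMED_INTENT, url));
        }
    }
}
```

If the missing backslash does turn out to be the cause, the copyField approach might also work once the pattern is corrected, though that has not been tested here.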