Posted to user@nutch.apache.org by Tom Potter <to...@orangebus.co.uk> on 2019/02/13 10:51:10 UTC

Difficulty getting data from Nutch parse data into Solr document

I'm not sure how to get some of the data from a crawled PDF document into
my Solr index. When I run the parsechecker tool I can see the date I need
as an attribute of the Parse Metadata (date=2018-08-06T14:14:00Z), but
I'm not sure how to configure solrindex-mapping.xml to map this to a Solr
field.

I tried adding the mapping below, but it didn't work:

<field dest="date" source="date"/>

Below is an example of the parsechecker output showing the date attribute
in the Parse Metadata:
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: XXXXXXX
Outlinks: 1
  outlink: toUrl: https://xxx.zzz anchor:
Content Metadata: Server=Microsoft-IIS/7.5 Connection=close
Last-Modified=Mon, 06 Aug 2018 15:16:28 GMT Date=Wed, 13 Feb 2019 10:36:52
GMT nutch.crawl.score=0.0 nutch.fetch.time=1550054216537
Cache-Control=no-cache, no-store ETag="8727b79f5faf0086a80c86df4cbbac12"
Content-Disposition=inline; filename=xxxxx.pdf" X-AspNet-Version=4.0.30319
Content-Length=81903 Content-Type=application/pdf X-Powered-By=ASP.NET
Parse Metadata: date=2018-08-06T14:14:00Z pdf:PDFVersion=1.5
xmp:CreatorTool=Microsoft Office Word
access_permission:modify_annotations=true
access_permission:can_print_degraded=true dc:creator=XXXXX
dcterms:created=2018-08-06T14:14:00Z Last-Modified=2018-08-06T14:14:00Z
dcterms:modified=2018-08-06T14:14:00Z dc:format=application/pdf;
version=1.5 Last-Save-Date=2018-08-06T14:14:00Z
access_permission:fill_in_form=true meta:save-date=2018-08-06T14:14:00Z
pdf:encrypted=false dc:title=xxxxxxxx modified=2018-08-06T14:14:00Z
Content-Type=application/pdf creator=XXXXXX meta:author=XXXXX
meta:creation-date=2018-08-06T14:14:00Z created=Mon Aug 06 15:14:00 BST
2018 access_permission:extract_for_accessibility=true
access_permission:assemble_document=true xmpTPg:NPages=7
Creation-Date=2018-08-06T14:14:00Z access_permission:extract_content=true
access_permission:can_print=true Author=XXXXXX producer=Aspose.Words for
.NET 16.2.0.0 access_permission:can_modify=true
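
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: solrindex-mapping.xml only maps fields that already exist on the
Nutch document, and parse metadata keys such as date are not copied onto
that document unless an indexing filter does so. A minimal sketch, assuming
Nutch 1.x with the index-metadata plugin available, is to enable that plugin
and list the key in nutch-site.xml. The plugin.includes value shown is
illustrative only and would need to be merged with whatever the existing
configuration already contains:

```xml
<!-- nutch-site.xml (sketch; merge with the existing configuration) -->
<configuration>

  <!-- Illustrative value only: the point is that index-metadata appears
       in the list of indexing filters. Keep your own plugins as they are. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Comma-separated parse metadata keys to copy into the index document.
       "date" is the key shown under Parse Metadata in the output above. -->
  <property>
    <name>index.parse.md</name>
    <value>date</value>
  </property>

</configuration>
```

With the key present on the document, the existing
<field dest="date" source="date"/> mapping would then have a source field to
read from, provided the Solr schema defines a compatible date field. If
another indexing filter already writes a field named date, the two values
may need to be kept apart by using a different destination name.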


-- 


*Tom Potter*
