Posted to user@nutch.apache.org by Tom Potter <to...@orangebus.co.uk> on 2019/02/13 10:51:10 UTC
Difficulty getting data from Nutch parse data into Solr document
I'm not sure how to get some of the data from a crawled PDF document into
my Solr index. When I run the parsechecker tool I can see the date I need
as an attribute of the Parse Metadata (date=2018-08-06T14:14:00Z), but
I'm not sure how to configure solrindex-mapping.xml to map it to a Solr
field.
I tried adding the mapping below, but it didn't work:
<field dest="date" source="date"/>
Below is an example of the parsechecker output showing the date attribute
in the Parse Metadata:
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: XXXXXXX
Outlinks: 1
outlink: toUrl: https://xxx.zzz anchor:
Content Metadata: Server=Microsoft-IIS/7.5 Connection=close
Last-Modified=Mon, 06 Aug 2018 15:16:28 GMT Date=Wed, 13 Feb 2019 10:36:52
GMT nutch.crawl.score=0.0 nutch.fetch.time=1550054216537
Cache-Control=no-cache, no-store ETag="8727b79f5faf0086a80c86df4cbbac12"
Content-Disposition=inline; filename=xxxxx.pdf" X-AspNet-Version=4.0.30319
Content-Length=81903 Content-Type=application/pdf X-Powered-By=ASP.NET
Parse Metadata: date=2018-08-06T14:14:00Z pdf:PDFVersion=1.5
xmp:CreatorTool=Microsoft Office Word
access_permission:modify_annotations=true
access_permission:can_print_degraded=true dc:creator=XXXXX
dcterms:created=2018-08-06T14:14:00Z Last-Modified=2018-08-06T14:14:00Z
dcterms:modified=2018-08-06T14:14:00Z dc:format=application/pdf;
version=1.5 Last-Save-Date=2018-08-06T14:14:00Z
access_permission:fill_in_form=true meta:save-date=2018-08-06T14:14:00Z
pdf:encrypted=false dc:title=xxxxxxxx modified=2018-08-06T14:14:00Z
Content-Type=application/pdf creator=XXXXXX meta:author=XXXXX
meta:creation-date=2018-08-06T14:14:00Z created=Mon Aug 06 15:14:00 BST
2018 access_permission:extract_for_accessibility=true
access_permission:assemble_document=true xmpTPg:NPages=7
Creation-Date=2018-08-06T14:14:00Z access_permission:extract_content=true
access_permission:can_print=true Author=XXXXXX producer=Aspose.Words for
.NET 16.2.0.0 access_permission:can_modify=true