Posted to solr-user@lucene.apache.org by Koorosh Vakhshoori <Ko...@synopsys.com> on 2011/06/15 20:17:42 UTC

Question on XPATH use in Solr Cell.

I am new to both Solr and Cell, so apologies if I misuse some of the terminology. I am trying to index a PDF document using Solr Cell while excluding part of it via XPATH. I am using Solr release 3.1. While searching the user list, I came across a thread on this topic titled 'XPath query support in Solr Cell', which clarified one issue, but I am still having trouble getting what I want.

Here is what I have done so far:

First, I started by executing the following 'CURL' command to see what I would get:

curl "http://localhost:8983/solr/docs/update/extract?literal.id=123&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&extractOnly=true" -F "file=@/docs/test.pdf"

This worked fine. Next I tried getting the first DIV element by modifying the XPATH query as follows:

curl "http://localhost:8983/solr/docs/update/extract?literal.id=123&xpath=/xhtml:html/xhtml:body/xhtml:div\[1\]/descendant:node()&extractOnly=true" -F "file=@/docs/test.pdf"
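For reference, here is a small Python sketch of how the percent-encoded form of that query string can be built. It only shows the encoding of the brackets and says nothing about whether Solr Cell will match the expression. The endpoint and parameters are copied from the curl command above; the choice of which characters to leave unencoded is my own assumption.

```python
from urllib.parse import urlencode, quote

# Endpoint and parameters copied from the curl command above.
BASE_URL = "http://localhost:8983/solr/docs/update/extract"

params = {
    "literal.id": "123",
    "xpath": "/xhtml:html/xhtml:body/xhtml:div[1]/descendant:node()",
    "extractOnly": "true",
}

# Assumption: keep '/', ':', '(' and ')' readable and percent-encode the rest,
# so the square brackets become %5B and %5D.
query = urlencode(params, quote_via=quote, safe="/:()")
url = f"{BASE_URL}?{query}"
print(url)
```

The printed URL can be passed to curl in place of the hand-escaped one, together with the same -F "file=@/docs/test.pdf" argument.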

Note that I am escaping the '[' and ']'; I also tried their percent-encoded values, %5B and %5D. The request ran, but the XPATH did not match anything. Here is what I got:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">627</int></lst>
<str name="test.pdf"/>
<lst name="test.pdf_metadata">
  <arr name="stream_source_info"><str>file</str></arr>
  <arr name="subject"><str>Version A-2007.12</str></arr>
  <arr name="Last-Modified"><str>2009-08-12T17:07:27Z</str></arr>
  <arr name="Author"><str>Test title.</str></arr>
  <arr name="creator"><str>FrameMaker 7.1</str></arr>
  <arr name="xmpTPg:NPages"><str>187</str></arr>
  <arr name="Creation-Date"><str>2009-08-12T17:07:27Z</str></arr>
  <arr name="title"><str>Test Document</str></arr>
  <arr name="stream_content_type"><str>application/octet-stream</str></arr>
  <arr name="created"><str>Wed Aug 12 10:07:27 PDT 2009</str></arr>
  <arr name="stream_size"><str>1372769</str></arr>
  <arr name="stream_name"><str>test.pdf</str></arr>
  <arr name="producer"><str>Acrobat Distiller 7.0.5 (Windows)</str></arr>
  <arr name="Copyright"><str>2007</str></arr>
  <arr name="Content-Type"><str>application/pdf</str></arr>
  <arr name="Keywords"><str>Test</str></arr>
</lst>
</response>

On a different track, I explored what XPATH expression would serve my purpose. Here is one that should get me most of the way there:

//xhtml:body/xhtml:div\[not(contains(p,'EXCLUDE TEXT'))\]

I independently validated this XPATH expression at the following URL, as was suggested in the previously mentioned posting:

http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm
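To make the intent of that expression concrete, here is a minimal Python sketch of the filtering I am after. It does not use Solr Cell or Tika; it emulates the predicate by hand with the standard library. The sample XHTML is made up (real extraction output for my PDF will differ), and it assumes the p elements live in the XHTML namespace like the div elements do.

```python
import xml.etree.ElementTree as ET

NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample document, not actual extraction output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><p>Chapter 1 text</p></div>
    <div><p>EXCLUDE TEXT: legal notice</p></div>
    <div><p>Chapter 2 text</p></div>
  </body>
</html>"""


def keep_divs(root, marker="EXCLUDE TEXT"):
    """Return body/div elements whose first p child does not contain marker."""
    kept = []
    for div in root.findall(".//xhtml:body/xhtml:div", NS):
        first_p = div.find("xhtml:p", NS)
        # XPath contains(p, ...) tests the string value of the first p child;
        # a div with no p yields an empty string and is therefore kept.
        text = "".join(first_p.itertext()) if first_p is not None else ""
        if marker not in text:
            kept.append(div)
    return kept


root = ET.fromstring(SAMPLE)
kept_text = ["".join(d.itertext()).strip() for d in keep_divs(root)]
print(kept_text)
```

With this sample, only the first and third div are kept, which is the behavior I would like to get from the xpath parameter.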

Any suggestions or help would be greatly appreciated.