Posted to solr-user@lucene.apache.org by Phillip Wu <ph...@unsw.edu.au> on 2017/09/25 02:55:45 UTC

Solr fields for Microsoft files, image files, PDF, text files

Hi,
I'm starting out with Solr on a Windows box.

I want to index the following documents:
doc;docx
xls;xlsx
ppt
vsd

pdf
txt

gif;jpeg;tiff

I understand that Solr uses Apache Tika to read these file types and return an XML stream back to Solr.
For Tika image processing, I've loaded Tesseract.

To be able to search the documents, I need to define "fields" in a file called managed-schema.

How do I get a list of all valid field names based on the file type? For example, for *.doc files, which "fields" exist, so I can choose what to store?

I'm assuming that, for example, *.doc files contain metadata put into the file by Microsoft Word (e.g. author, date) as well as "free form" text.

So where is the list of valid fields per file type?

Also, how do I search the "free form" text for a word/pattern in the Solr search tool?

RE: Solr fields for Microsoft files, image files, PDF, text files

Posted by "Allison, Timothy B." <ta...@mitre.org>.

bq: How do I get a list of all valid field names based on the file type

bq: You don't. At least I've never found any. Plus various document formats will allow custom meta-data fields so there's no definitive list.

It would be trivial to add field counts per MIME type to tika-eval. If you're interested in this, please open a ticket on Tika's JIRA.
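
To give a rough idea of what per-MIME field counts could look like, here is a minimal Python sketch. It is not tika-eval code. The input shape (pairs of a MIME type and a metadata dict, as an extractor such as Tika might report them), the function name, and the sample data are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def field_counts_by_mime(records):
    """Tally how often each metadata field name appears per MIME type.

    `records` is an iterable of (mime_type, metadata_dict) pairs; this
    input shape is an assumption made for the example.
    """
    counts = defaultdict(Counter)
    for mime_type, metadata in records:
        # Count each field name once per document it appears in.
        counts[mime_type].update(metadata.keys())
    return counts


# Hypothetical sample data, invented for the example.
sample = [
    ("application/msword", {"Author": "A", "Last-Modified": "2017-09-01"}),
    ("application/msword", {"Author": "B"}),
    ("application/pdf", {"pdf:PDFVersion": "1.4", "Author": "C"}),
]

counts = field_counts_by_mime(sample)
for mime_type, counter in sorted(counts.items()):
    for field, n in counter.most_common():
        print(f"{mime_type}\t{field}\t{n}")
```

Run over a representative set of your own files, a tally like this shows which field names actually occur for each type, which is about as close to a "list of valid fields" as you are likely to get.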

Re: Solr fields for Microsoft files, image files, PDF, text files

Posted by Erick Erickson <er...@gmail.com>.

bq: How do I get a list of all valid field names based on the file type

You don't. At least I've never found any. Plus various document
formats will allow custom meta-data fields so there's no definitive
list.

bq: Also how do I search the "free form" text for a word/pattern in
the Solr search tool?

You put the extracted text (as opposed to meta-data) into an analyzed
field and search that.
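
As a rough sketch of what "search that" means in practice, the Python below builds a URL for Solr's /select handler that queries a single field. The base URL, the core name "files", and the field name "content" are assumptions; substitute whatever your own setup and schema use.

```python
from urllib.parse import urlencode


def build_select_url(solr_base, core, field, term, rows=10):
    """Build a URL for Solr's /select handler that searches one field.

    The term is passed through as-is; characters that are special in
    Solr's query syntax are not escaped by this sketch.
    """
    params = {"q": f"{field}:{term}", "rows": rows, "wt": "json"}
    return f"{solr_base}/{core}/select?{urlencode(params)}"


# "files" and "content" are hypothetical core and field names.
url = build_select_url("http://localhost:8983/solr", "files", "content", "budget")
print(url)
```

For a simple pattern, a wildcard term such as budg* can be passed in place of the plain word. The same query can also be typed into the Query screen of the admin UI.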

NOTE: Solr is a search engine. The closest thing to an OOB "Solr
Search Tool" is the admin UI, which isn't intended to be an end-user
facing app.

Here's some SolrJ code that'll let you explore the meta-data fields in
various document types:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

You can pull out the RDBMS bits pretty easily.
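
If you would rather look at the extracted fields without writing SolrJ, one option is Solr's extracting request handler with extractOnly=true, which returns what Tika extracted instead of indexing it. The Python below is only a sketch under assumptions: it presumes the handler is configured at /update/extract, and the core name "files" is hypothetical. The helper that sends the file is defined but not called here, since it needs a running Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_only_url(solr_base, core):
    """URL for the extracting handler with extractOnly=true (no indexing)."""
    params = {"extractOnly": "true", "wt": "json"}
    return f"{solr_base}/{core}/update/extract?{urlencode(params)}"


def extract_only(url, path, content_type="application/octet-stream"):
    """POST a file's raw bytes and return the response body as text.

    Needs a running Solr with the extraction handler configured, so it
    is defined here but not called.
    """
    with open(path, "rb") as fh:
        request = Request(url, data=fh.read(),
                          headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# "files" is a hypothetical core name.
url = build_extract_only_url("http://localhost:8983/solr", "files")
print(url)
```

Sending a handful of sample .doc, .pdf, and image files through this and reading the responses is a quick way to see which meta-data names show up for each type.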

Best,
Erick


Re: Solr fields for Microsoft files, image files, PDF, text files

Posted by Erik Hatcher <er...@gmail.com>.

Phillip - You may want to start with the example/files configuration that ships with Solr. It is specifically designed as a configuration (and UI!) for indexing rich files, and it does a bit more than the other examples: it pulls out acronyms, e-mail addresses, and URLs from text, and, as you've asked about, it maps content types to friendlier human types (“image” instead of the whole gamut of image/* content-types).
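
To illustrate the idea of mapping content types to friendlier types, here is a minimal Python sketch. It is not the code from example/files; the function name and the labels it returns are made up for illustration.

```python
def friendly_type(content_type):
    """Map a MIME content type to a coarse, human-friendly label."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    # Hypothetical labels for a few specific types.
    specific = {
        "application/pdf": "pdf",
        "application/msword": "word",
        "application/vnd.ms-excel": "spreadsheet",
        "application/vnd.ms-powerpoint": "presentation",
    }
    if mime in specific:
        return specific[mime]
    # Otherwise fall back to the top-level type, e.g. image/* -> "image".
    top_level = mime.split("/", 1)[0]
    if top_level in ("image", "text", "audio", "video"):
        return top_level
    return "other"


for sample in ("image/tiff", "image/gif", "text/plain; charset=UTF-8",
               "application/pdf"):
    print(sample, "->", friendly_type(sample))
```

Storing a label like this in its own field makes it easy to facet or filter on "image" or "pdf" without listing every individual content type.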

	Erik
