Posted to user@manifoldcf.apache.org by Phillip Rhodes <mo...@gmail.com> on 2017/10/14 22:39:16 UTC

How to determine the set of all possible fields in MCF output?

Hi all, I've been working with MCF the past few days and am very happy
with what it lets me do, and I have a pipeline going from my
repository to Solr which works fine.  But there is one point I clearly
don't understand, which is:

How do you know exactly what fields are going to be output in a given
configuration?  I found that I had to resort to trial and error to
tweak my Solr schema to avoid "undefined field xxxxx" errors from
Manifold when trying to write to Solr.  Now to be fair, clearly I
could just ignore any fields I don't specifically know I want, but I'd
like to understand how this works.

Is it the case that the initial set of fields depends on the
repository connector?  I found that I seemed to get some
Alfresco-specific stuff when reading from Alfresco, as opposed to what I got
from a simple dummy file-system repo I was initially experimenting
with.

It also seems that Tika adds some fields (actually a lot of fields)
even when you don't have a Tika transform wired in explicitly.  Is it
the case that you need to put in an explicit Tika transform if you
want to control which fields are contributed by Tika?

And on that point, is there a master list of possible fields that Tika
will emit, or is Tika just transforming the names of metadata fields
in the documents it encounters, and programmatically generating a
field name?


Any and all help on understanding how this works is greatly appreciated...


Phil
~~~~
This message optimized for indexing by NSA PRISM

Re: How to determine the set of all possible fields in MCF output?

Posted by Phillip Rhodes <mo...@gmail.com>.

On Sat, Oct 14, 2017 at 7:50 PM, Steph van Schalkwyk <st...@remcam.net>
wrote:

> When you run TIKA standalone on a file, you can see all the emitted fields
> for that particular document type as well as added metadata.
>


Radical.  Thanks!


Phil

Re: How to determine the set of all possible fields in MCF output?

Posted by Steph van Schalkwyk <st...@remcam.net>.

When you run TIKA standalone on a file, you can see all the emitted fields
for that particular document type as well as added metadata.
<code>
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParserExtraction {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // Assume sample.txt is in your current directory
      File file = new File("sample.txt");

      // parse method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      // parsing the file; try-with-resources closes the stream afterwards
      try (InputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }
      System.out.println("File content : " + handler.toString());

      // the metadata fields Tika emitted for this particular file
      for (String name : metadata.names()) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}
</code>

https://www.tutorialspoint.com/tika/tika_content_extraction.htm




*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    steph@remcam.net   http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Sat, Oct 14, 2017 at 6:17 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Phil,
>
> You are correct in asserting that in MCF it is the sum total of all the
> connections that the document passes through that determine its attribute
> set.  That includes transformation connections as well as the repository
> connection.
>
> Tika is one connection that does add a lot of fields and these depend not
> only on the configuration of the Tika connection, but also on the kind of
> document being extracted.  If you want to figure out the sum total of
> what's possible, you will need to consult the Tika documentation.  And yes,
> the field names Tika generates are created based on what Tika finds in the
> document.
>
> Alternatively, you can configure your job to send output to a null output
> connection.  This connection records all attribute information for each
> document in the simple history, so you can get an idea what to expect.
>
> I'm a little confused about your statement that Tika runs even when it's
> not in a job's pipeline.  That's not actually true, so I'm wondering what
> you are seeing.
>
> Thanks,
> Karl

Re: How to determine the set of all possible fields in MCF output?

Posted by Phillip Rhodes <mo...@gmail.com>.

Hmm... I tried the MetadataAdjuster before and unchecked "Keep all
metadata" and it still seemed to send everything through.   Probably I
just did something wrong... I'll try it again.


Phil



On Tue, Oct 24, 2017 at 3:31 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Phil,
>
> Solr will certainly skip any fields that it doesn't know about and simply
> not save them.  There's little cost to having them pass through MCF; the big
> cost is extraction, which you're stuck with because Alfresco does it no
> matter what.  So I'm not sure what a white-list transformer does for you.
>
> But in any case, there's already a transformer that allows you to map
> metadata around -- the Metadata Adjuster.  See:
>
> http://manifoldcf.apache.org/release/release-2.8.1/en_US/end-user-documentation.html#metadataadjuster
>
> This transformer maps metadata values, allows you to insert new ones, and
> also allows you to ONLY pass through the ones that are explicitly specified
> if you wish.
>
> Thanks,
> Karl

Re: How to determine the set of all possible fields in MCF output?

Posted by Karl Wright <da...@gmail.com>.

Hi Phil,

Solr will certainly skip any fields that it doesn't know about and simply
not save them.  There's little cost to having them pass through MCF; the
big cost is extraction, which you're stuck with because Alfresco does it no
matter what.  So I'm not sure what a white-list transformer does for you.

But in any case, there's already a transformer that allows you to map
metadata around -- the Metadata Adjuster.  See:

http://manifoldcf.apache.org/release/release-2.8.1/en_US/end-user-documentation.html#metadataadjuster

This transformer maps metadata values, allows you to insert new ones, and
also allows you to ONLY pass through the ones that are explicitly specified
if you wish.

Thanks,
Karl


On Mon, Oct 23, 2017 at 9:19 PM, Phillip Rhodes <mo...@gmail.com>
wrote:

> FWIW, I now understand what I was missing that made me think Manifold
> was running Tika when it wasn't.  It turns out that Alfresco uses Tika
> internally and when you get a document from Alfresco (using the
> WebScripts connector anyway) the set of fields you get includes all
> the image metadata and what-not (for image files).  I never realized
> this because I don't typically use Alfresco for images.  But when I
> added extra logging to the Alfresco WebScripts connector code, to spit
> out the incoming field set, I see things like:
>
> Found property exif:yResolution = 72.0
> Found property cm:owner = admin
> Found property exif:isoSpeedRatings = 400
> Found property exif:fNumber = 3.5
> Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
> Found property exif:pixelYDimension = 2048
> Found property exif:resolutionUnit = Inch
> Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
> Found property sys:locale = en_GB
>
> which explains why the Solr connector was trying to save fields like
> exif_fNumber and exif_resolutionUnit.   This came up because the
> Alfresco instance I'm experimenting with has their default sample
> workspace which includes images and things I don't normally touch.
> :-)
>
> As for managing all this so my history doesn't contain all those
> failure messages, I thought about creating a "WhitelistFieldTransform"
> as a transform connection to drop any fields other than the ones that
> are whitelisted.    Two questions:
>
> 1. Does this seem like a reasonable approach, or is there a better way?
>
> 2. If this is reasonable and I create such a filter, would there be
> any interest in having it contributed back to MCF?
>
>
> Cheers,
>
>
> Phil
>
> This message optimized for indexing by NSA PRISM

Re: How to determine the set of all possible fields in MCF output?

Posted by Phillip Rhodes <mo...@gmail.com>.

FWIW, I now understand what I was missing that made me think Manifold
was running Tika when it wasn't.  It turns out that Alfresco uses Tika
internally and when you get a document from Alfresco (using the
WebScripts connector anyway) the set of fields you get includes all
the image metadata and what-not (for image files).  I never realized
this because I don't typically use Alfresco for images.  But when I
added extra logging to the Alfresco WebScripts connector code, to spit
out the incoming field set, I see things like:

Found property exif:yResolution = 72.0
Found property cm:owner = admin
Found property exif:isoSpeedRatings = 400
Found property exif:fNumber = 3.5
Found property sys:node-uuid = 0516a5cc-fc04-4512-a4ed-b595b7c3908b
Found property exif:pixelYDimension = 2048
Found property exif:resolutionUnit = Inch
Found property exif:dateTimeOriginal = 2005-01-09T16:00:55Z
Found property sys:locale = en_GB

which explains why the Solr connector was trying to save fields like
exif_fNumber and exif_resolutionUnit.   This came up because the
Alfresco instance I'm experimenting with has their default sample
workspace which includes images and things I don't normally touch.
:-)
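
The renaming visible in those field names can be sketched in Java as
follows.  This is only a guess at the mapping, inferred from the two
examples in this message (exif:fNumber arriving at Solr as exif_fNumber);
the connector's real logic may differ.  PropertyNameSketch and
toFieldName are hypothetical names, not part of MCF or Alfresco.

```java
public class PropertyNameSketch {

    // Assumption: the ':' between namespace prefix and local name becomes '_'.
    // This is inferred from the examples in the thread, not taken from connector code.
    public static String toFieldName(String propertyName) {
        return propertyName.replace(':', '_');
    }

    public static void main(String[] args) {
        // Property names copied from the log output above.
        String[] properties = {"exif:fNumber", "exif:resolutionUnit", "cm:owner"};
        for (String property : properties) {
            System.out.println(property + " -> " + toFieldName(property));
        }
    }
}
```

Running the property names from the log through a mapping like this
gives a first draft of the field names the Solr schema would need to
accept or ignore.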

As for managing all this so my history doesn't contain all those
failure messages, I thought about creating a "WhitelistFieldTransform"
as a transform connection to drop any fields other than the ones that
are whitelisted.    Two questions:

1. Does this seem like a reasonable approach, or is there a better way?

2. If this is reasonable and I create such a filter, would there be
any interest in having it contributed back to MCF?
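
For illustration, here is a minimal sketch of what such a whitelist step
might do.  It works on a plain Map of field names to values rather than
on the MCF transformation connector API, which is not shown here, so
treat WhitelistSketch and keepOnly as hypothetical names and the whole
thing as a statement of intent rather than a working connector.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class WhitelistSketch {

    // Returns a copy of the fields containing only the whitelisted names,
    // preserving the order in which the fields arrived.
    public static Map<String, String> keepOnly(Map<String, String> fields, Set<String> allowed) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            if (allowed.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Field names follow the examples earlier in this message.
        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("cm_owner", "admin");
        incoming.put("exif_fNumber", "3.5");
        incoming.put("exif_resolutionUnit", "Inch");

        // Only cm_owner is on the whitelist, so the EXIF fields are dropped.
        System.out.println(keepOnly(incoming, Set.of("cm_owner")));
    }
}
```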


Cheers,


Phil



On Sun, Oct 15, 2017 at 10:11 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Phil,
>
> In most cases you can't modify the fields being output by the various
> connectors, but you don't have to use them.  If you have an output connector
> that *insists* on using all of them in a destructive way, we'd like to know
> about that.  Usually extra fields are harmless and only the ones you want in
> your schema are looked for.
>
> Karl

Re: How to determine the set of all possible fields in MCF output?

Posted by Karl Wright <da...@gmail.com>.

Hi Phil,

In most cases you can't modify the fields being output by the various
connectors, but you don't have to use them.  If you have an output
connector that *insists* on using all of them in a destructive way, we'd
like to know about that.  Usually extra fields are harmless and only the
ones you want in your schema are looked for.

Karl


On Sat, Oct 14, 2017 at 8:12 PM, Phillip Rhodes <mo...@gmail.com>
wrote:

> On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <da...@gmail.com> wrote:
> > Hi Phil,
> >
> > You are correct in asserting that in MCF it is the sum total of all the
> > connections that the document passes through that determine its attribute
> > set.  That includes transformation connections as well as the repository
> > connection.
>
> OK, sounds good.
>
> > Tika is one connection that does add a lot of fields and these depend not
> > only on the configuration of the Tika connection, but also on the kind of
> > document being extracted.  If you want to figure out the sum total of what's
> > possible, you will need to consult the Tika documentation.  And yes, the
> > field names Tika generates are created based on what Tika finds in the
> > document.
>
> Gotcha.   So if I want to limit the fields output to *only* a specific
> set that is determined in advance, is there a way to accomplish that?
>
> > Alternatively, you can configure your job to send output to a null output
> > connection.  This connection records all attribute information for each
> > document in the simple history, so you can get an idea what to expect.
>
> Excellent, I'll investigate that.
>
> > I'm a little confused about your statement that Tika runs even when it's not
> > in a job's pipeline.  That's not actually true, so I'm wondering what you
> > are seeing.
>
> It's probable that I'm wrong.  I just thought maybe there was some
> default behavior, because I pointed MCF at a directory full of PDFs
> without explicitly configuring Tika and I saw fields in the output
> that I thought were probably generated by Tika.  Likewise now I am
> running a pipeline with no explicit Tika step and I see output fields
> for EXIF stuff for images and the like, which I assumed came from
> Tika.
>
>
>
> Phil
>

Re: How to determine the set of all possible fields in MCF output?

Posted by Phillip Rhodes <mo...@gmail.com>.

On Sat, Oct 14, 2017 at 7:17 PM, Karl Wright <da...@gmail.com> wrote:
> Hi Phil,
>
> You are correct in asserting that in MCF it is the sum total of all the
> connections that the document passes through that determine its attribute
> set.  That includes transformation connections as well as the repository
> connection.

OK, sounds good.

> Tika is one connection that does add a lot of fields and these depend not
> only on the configuration of the Tika connection, but also on the kind of
> document being extracted.  If you want to figure out the sum total of what's
> possible, you will need to consult the Tika documentation.  And yes, the
> field names Tika generates are created based on what Tika finds in the
> document.

Gotcha.   So if I want to limit the fields output to *only* a specific
set that is determined in advance, is there a way to accomplish that?

> Alternatively, you can configure your job to send output to a null output
> connection.  This connection records all attribute information for each
> document in the simple history, so you can get an idea what to expect.

Excellent, I'll investigate that.

> I'm a little confused about your statement that Tika runs even when it's not
> in a job's pipeline.  That's not actually true, so I'm wondering what you
> are seeing.

It's probable that I'm wrong.  I just thought maybe there was some
default behavior, because I pointed MCF at a directory full of PDFs
without explicitly configuring Tika and I saw fields in the output
that I thought were probably generated by Tika.  Likewise now I am
running a pipeline with no explicit Tika step and I see output fields
for EXIF stuff for images and the like, which I assumed came from
Tika.

Phil

Re: How to determine the set of all possible fields in MCF output?

Posted by Karl Wright <da...@gmail.com>.

Hi Phil,

You are correct in asserting that in MCF it is the sum total of all the
connections that the document passes through that determine its attribute
set.  That includes transformation connections as well as the repository
connection.
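
As a rough mental model (not MCF code), the idea can be sketched in Java
like this: each connection is a step that receives the fields
accumulated so far and returns a possibly larger set.  The class name,
the method names and the content_type field are all made up for the
illustration; only cm_owner echoes a property seen elsewhere in this
thread.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PipelineFieldsSketch {

    // Runs a document's field map through each stage in order.
    // Each stage stands in for one connection (repository, transformation, ...).
    public static Map<String, String> run(Map<String, String> initial,
                                          List<UnaryOperator<Map<String, String>>> stages) {
        Map<String, String> fields = new LinkedHashMap<>(initial);
        for (UnaryOperator<Map<String, String>> stage : stages) {
            fields = stage.apply(fields);
        }
        return fields;
    }

    // Builds a stage that adds the given fields to whatever arrived.
    public static UnaryOperator<Map<String, String>> adding(Map<String, String> extra) {
        return in -> {
            Map<String, String> out = new LinkedHashMap<>(in);
            out.putAll(extra);
            return out;
        };
    }

    public static void main(String[] args) {
        // Hypothetical stages and values, for illustration only.
        UnaryOperator<Map<String, String>> repository = adding(Map.of("cm_owner", "admin"));
        UnaryOperator<Map<String, String>> extractor = adding(Map.of("content_type", "application/pdf"));

        Map<String, String> result = run(Map.of(), List.of(repository, extractor));
        // The output connection sees the fields from every stage combined.
        System.out.println(result.keySet());
    }
}
```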

Tika is one connection that does add a lot of fields and these depend not
only on the configuration of the Tika connection, but also on the kind of
document being extracted.  If you want to figure out the sum total of
what's possible, you will need to consult the Tika documentation.  And yes,
the field names Tika generates are created based on what Tika finds in the
document.

Alternatively, you can configure your job to send output to a null output
connection.  This connection records all attribute information for each
document in the simple history, so you can get an idea what to expect.
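
Once the attribute records are in hand, turning them into a candidate
schema field list is a matter of collecting the distinct names across
documents.  The sketch below assumes the records have already been
copied out of the history into per-document maps (how to export them is
not covered here); ObservedFieldsSketch and distinctFieldNames are
hypothetical names, and the sample field names are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class ObservedFieldsSketch {

    // Collects every field name seen in any document, sorted alphabetically.
    public static SortedSet<String> distinctFieldNames(List<Map<String, String>> documents) {
        SortedSet<String> names = new TreeSet<>();
        for (Map<String, String> document : documents) {
            names.addAll(document.keySet());
        }
        return names;
    }

    public static void main(String[] args) {
        // Two sample documents with overlapping, illustrative field sets.
        List<Map<String, String>> documents = List.of(
            Map.of("cm_owner", "admin", "exif_fNumber", "3.5"),
            Map.of("cm_owner", "admin", "sys_locale", "en_GB"));

        // prints [cm_owner, exif_fNumber, sys_locale]
        System.out.println(distinctFieldNames(documents));
    }
}
```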

I'm a little confused about your statement that Tika runs even when it's
not in a job's pipeline.  That's not actually true, so I'm wondering what
you are seeing.

Thanks,
Karl

On Sat, Oct 14, 2017 at 6:39 PM, Phillip Rhodes <mo...@gmail.com>
wrote:

> Hi all, I've been working with MCF the past few days and am very happy
> with what it lets me do, and I have a pipeline going from my
> repository to Solr which works fine.  But there is one point I clearly
> don't understand, which is:
>
> How do you know exactly what fields are going to be output in a given
> configuration?  I found that I had to resort to trial and error to
> tweak my Solr schema to avoid "undefined field xxxxx" errors from
> Manifold when trying to write to Solr.  Now to be fair, clearly I
> could just ignore any fields I don't specifically know I want, but I'd
> like to understand how this works.
>
> Is it the case that the initial set of fields depends on the
> repository connector?  I found that I seemed to get some
> Alfresco-specific stuff when reading from Alfresco, as opposed to what I got
> from a simple dummy file-system repo I was initially experimenting
> with.
>
> It also seems that Tika adds some fields (actually a lot of fields)
> even when you don't have a Tika transform wired in explicitly.  Is it
> the case that you need to put in an explicit Tika transform if you
> want to control which fields are contributed by Tika?
>
> And on that point, is there a master list of possible fields that Tika
> will emit, or is Tika just transforming the names of metadata fields
> in the documents it encounters, and programmatically generating a
> field name?
>
>
> Any and all help on understanding how this works is greatly appreciated...
>
>
> Phil
> ~~~~
> This message optimized for indexing by NSA PRISM
>