Posted to solr-user@lucene.apache.org by Gian Maria Ricci <al...@nablasoft.com> on 2013/05/25 10:44:07 UTC

Tika: How can I import automatically all metadata without specifying them explicitly

Hi everyone,

I've configured the import of a document folder with FileListEntityProcessor,
and everything went smoothly on the first try, but I have a simple question.
I can map metadata without any problem, but I'd like to import all metadata
into my index, not only the fields I've configured with field nodes. In this
example I've imported Author and title, but I do not know in advance which
metadata a document might have, and I'd like to have all of it in my index.

Here is my import config. It is my first attempt at importing with Tika, and
I'm probably missing something simple.

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- Outer entity: lists the files under baseDir. The fileName alternation
         is grouped so that "\." applies to all three extensions. -->
    <entity name="files"
            dataSource="null"
            rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/temp/docs"
            fileName=".*\.(doc|pdf|docx)"
            onError="skip"
            recursive="true">
      <field column="file" name="id" />
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />

      <!-- Inner entity: runs Tika on each file found by the outer entity -->
      <entity name="documentImport"
              processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}"
              format="text">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

 

 

--

Gian Maria Ricci

Mobile: +39 320 0136949

 <http://mvp.microsoft.com/en-us/mvp/Gian%20Maria%20Ricci-4025635>
<http://www.linkedin.com/in/gianmariaricci>
<https://twitter.com/alkampfer>   <http://feeds.feedburner.com/AlkampferEng>
<skype://alkampferaok/>

Re: How can I import automatically all metadata without specifying them explicitly

Posted by Jack Krupansky <ja...@basetechnology.com>.

Setting the uprefix parameter of SolrCell (ExtractingRequestHandler, ERH) to
something like "attr_" will cause every metadata attribute that is not named
in the Solr schema to be indexed with the "attr_" prefix added to its
attribute name. For example,

curl "http://localhost:8983/solr/update/extract?literal.id=doc-1\
&commit=true&uprefix=attr_" -F "my.pdf=@my.pdf"

Once you have figured out which of the metadata you want to keep, either add
those metadata attribute names to your schema, or add explicit SolrCell field
mappings for each piece of metadata: &fmap.metadata-name=my-field.
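
For the prefixed fields to be accepted at index time, the schema needs a
definition that matches them. A minimal sketch for schema.xml, assuming the
"attr_" prefix from the curl example and assuming a field type named
"text_general" is defined in your schema (both names are illustrative, so
adjust them to your setup):

```xml
<!-- schema.xml: catch-all for metadata renamed by uprefix=attr_.
     "text_general" is an assumed field type; use one your schema defines. -->
<dynamicField name="attr_*" type="text_general"
              indexed="true" stored="true" multiValued="true"/>
```

With a rule like this, a metadata attribute that has no field of its own
should end up in a stored field whose name starts with "attr_", so it can be
inspected later before deciding which attributes deserve an explicit field.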

-- Jack Krupansky

-----Original Message----- 
From: Gian Maria Ricci
Sent: Monday, May 27, 2013 4:21 AM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without 
specifiying them explicitly

Thanks for the help.

@Alexandre: Thanks for the suggestion, I'll try to use an
ExtractingRequestHandler, I thought that I was missing some DIH option :).


RE: Tika: How can I import automatically all metadata without specifying them explicitly

Posted by Gian Maria Ricci <al...@nablasoft.com>.

Thanks a lot for these additional hints; standalone Tika could probably be a solution.

I have another small question: how can I express filters in the DIH configuration so that the import on the server runs incrementally?

I actually have two distinct scenarios.

In the first scenario the documents are stored in a database, so I need to write a DIH configuration that imports data from the database; since I have a timestamp column, this is not a problem.

In the second scenario I need to monitor a folder and do an incremental import every 15 minutes. With a SQL-based DIH I usually use a column as a filter for incremental imports, but I wonder whether it is possible to pass a filter to BinFileDataSource, telling it to process only new files and files modified after a given timestamp (the last run).

Thanks again for all your valuable suggestions.

--
Gian Maria Ricci
Mobile: +39 320 0136949
    


-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafalov@gmail.com] 
Sent: Monday, May 27, 2013 1:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Tika: How can I import automatically all metadata without specifiying them explicitly

Standalone Tika can also run in a network server mode.  That increases data roundtrips but gives you more options. Even in .net .

Regards,
      Alex

RE: Tika: How can I import automatically all metadata without specifying them explicitly

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Standalone Tika can also run in a network server mode. That increases data
round trips but gives you more options, even from .NET.

Regards,
      Alex
On 27 May 2013 04:22, "Gian Maria Ricci" <al...@nablasoft.com> wrote:

> Thanks for the help.
>
> @Alexandre: Thanks for the suggestion, I'll try to use an
> ExtractingRequestHandler, I thought that I was missing some DIH option :).
>
> @Erik: I'm interested in knowing them all to do various form of analysis. I
> have documents coming from heterogeneous sources and I'm interested in
> searching inside the content, but also being able to extract all possible
> metadata. I'm working in .Net so it is useful letting tika doing everything
> for me directly in solr and then retrieve all metadata for matched
> documents.
>

RE: Tika: How can I import automatically all metadata without specifying them explicitly

Posted by Gian Maria Ricci <al...@nablasoft.com>.

Thanks for the help.

@Alexandre: Thanks for the suggestion, I'll try using an
ExtractingRequestHandler; I thought I was missing some DIH option :).

@Erick: I'm interested in knowing them all so that I can do various forms of
analysis. I have documents coming from heterogeneous sources, and I'm
interested in searching inside the content but also in being able to extract
all possible metadata. I'm working in .NET, so it is useful to let Tika do
everything for me directly in Solr and then retrieve all metadata for the
matched documents.

Thanks again to everyone.

--
Gian Maria Ricci
Mobile: +39 320 0136949
    


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Sunday, May 26, 2013 5:30 PM
To: solr-user@lucene.apache.org; Gian Maria Ricci
Subject: Re: Tika: How can I import automatically all metadata without
specifiying them explicitly

In addition to Alexandre's comment:

bq:  ...I'd like to import in my index all metadata

Be a little careful here, this isn't actually very useful in my experience.
Sure
it's nice to have all that data in the index, but... how do you search it
meaningfully?


Re: Tika: How can I import automatically all metadata without specifying them explicitly

Posted by Erick Erickson <er...@gmail.com>.

In addition to Alexandre's comment:

bq:  ...I’d like to import in my index all metadata

Be a little careful here; this isn't actually very useful in my experience.
Sure, it's nice to have all that data in the index, but... how do you search
it meaningfully?

Consider that some doc may have an "author" metadata field. Another may have
a "last editor" field. Yet another may have a "main author" field. If you add
all of these under their own field names, what do you do to search for
"author"? Somehow you have to create a mapping between the various metadata
names and something that's searchable, so why not do this at index time?

Not to mention that I've seen this done, and the result may be literally
hundreds of different metadata fields which are not very useful.

All that said, it may be perfectly valid to index them all, but before going
there it's worth considering whether the result is actually _useful_.

Best
Erick



Re: Tika: How can I import automatically all metadata without specifying them explicitly

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

The Tika support inside DIH does not allow wildcard mapping. If you are not
planning to do any inner-entity content parsing, you might be better off
using ExtractingRequestHandler and its uprefix parameter.
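
As a rough sketch of that alternative, uprefix can be set as a default on the
handler in solrconfig.xml instead of being passed on every request. The
handler path /update/extract follows the usual Solr examples; the "attr_"
prefix and the "text" target field are illustrative assumptions, not
something your setup necessarily uses:

```xml
<!-- solrconfig.xml: ExtractingRequestHandler (SolrCell) with defaults.
     The "attr_" prefix and the "text" target field are illustrative. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Prefix every extracted field that the schema does not define -->
    <str name="uprefix">attr_</str>
    <!-- Move Tika's extracted body ("content") into the "text" field -->
    <str name="fmap.content">text</str>
    <!-- Lower-case field names and replace non-alphanumerics with "_" -->
    <str name="lowernames">true</str>
  </lst>
</requestHandler>
```

Unlike the DIH field nodes, this does not require listing each metadata name
up front; the trade-off is that files have to be posted to the handler rather
than pulled in by FileListEntityProcessor.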

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

