Posted to user@manifoldcf.apache.org by Alexander Stoffers <st...@modell-aachen.de> on 2014/03/28 15:13:22 UTC

Windows-Share to Solr is not working properly

Hi Karl,

We have a problem crawling documents from a Windows share into Solr.

Our Solr schema has a date field that is not multivalued, but the output for a crawled document (e.g. a PDF) contains a date array instead of a single date.

I tried to remove the whole field on the "Solr Field Mapping" tab, using date=>'', but it is not working at all. Is there no way to remove the date metadata?

We figured out that the crawler gets the date metadata field from the binaries, where we found a field called ModDate. If we remove the ModDate field from the binaries, the date metadata field disappears.

Can you explain why the crawler puts the ModDate into the date field array twice?


Thank you in advance
Alex



-- 

Dipl.-Wirt.-Ing. Alexander Stoffers
Head of IT & Product Development
Modell Aachen GmbH - Interaktive Managementsysteme
Dennewartstr. 25-27, 52068 Aachen
phone ++49 176 1011 9752, fax ++49 241 9148 8653
http://www.modell-aachen.de

Managing Director: Dr.-Ing. Carsten Behrens
Registered at Amtsgericht Aachen, HRB 15622

--

You can reach our IT support at
support@modell-aachen.de
+49 (0)241 53808720

Re: Windows-Share to Solr is not working properly

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Alexander,

Which version of Solr are you using?

Please try these steps:

1) Set literalsOverride=true in solrconfig.xml (in the defaults section of the extraction request handler)

2) Set fmap.date=ignored_date in solrconfig.xml (in the defaults section of the extraction request handler)
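
As a rough sketch, steps 1 and 2 together might look like the fragment below. It assumes the extraction handler is registered at /update/extract and that the schema defines an ignored_* dynamic field of a non-indexed, non-stored type, so that anything mapped to ignored_date is dropped. Both are assumptions about your setup, so merge the two entries into whatever defaults section your solrconfig.xml already has.

```xml
<!-- Sketch only: handler name and the ignored_* dynamic field are assumptions -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Step 1: literal.* values override extracted values with the same field name -->
    <str name="literalsOverride">true</str>
    <!-- Step 2: map the extracted "date" metadata to a field the schema ignores -->
    <str name="fmap.date">ignored_date</str>
  </lst>
</requestHandler>
```

After changing solrconfig.xml, the core has to be reloaded before the new defaults take effect.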

If neither of the above works, don't worry; this will work for sure. FirstFieldValueUpdateProcessorFactory reduces a multivalued field to a single value by keeping only the first one.

  <!-- Update chain that keeps only the first value of the "date" field -->
  <updateRequestProcessorChain name="remove">
    <processor class="solr.FirstFieldValueUpdateProcessorFactory">
      <str name="fieldName">date</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <!-- Send extraction requests through that chain by default -->
  <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="update.chain">remove</str>
    </lst>
  </requestHandler>

Ahmet

Re: Windows-Share to Solr is not working properly

Posted by Karl Wright <da...@gmail.com>.
Hi Alexander,

I do understand your problem.  But I assure you that ManifoldCF does not
(and never did) extract metadata fields from binary documents.  Are you
sure this is happening in ManifoldCF?  Perhaps you have a Tika pipeline
configured in Solr?

Karl




Re: Windows-Share to Solr is not working properly

Posted by Alexander Stoffers <st...@modell-aachen.de>.
Hi Karl,

Thank you for your quick response!

I'm sorry for my poor English; let me try to make it clearer:

I actually don't understand where ManifoldCF processes or maps a metadata field "date" after crawling a PDF document. We investigated the issue and found that somewhere in the process the document's own metadata field "ModDate" is mapped to the metadata field "date". Furthermore, this mysterious "date" field ends up as an array.

If we delete the "ModDate" metadata field from the document, the "date" metadata field used in the ManifoldCF process disappears.

If we don't delete the "ModDate" field from the document and instead try to map the "date" field to something else or to blank, the date field is still passed to the Solr output connector, so Solr fails, because the date field is an array and the Solr schema expects a single value for its date field.

I hope this explains our problem a little better :-)

Best regards
Alex

Re: Windows-Share to Solr is not working properly

Posted by Karl Wright <da...@gmail.com>.
Hi Alexander,

It's hard to figure out exactly what you have configured from your email,
but here are a couple of points:

(1) ManifoldCF does not extract dates from binary files; it will only
supply dates from file metadata.  So MCF is supplying the date from the
modification date of the Windows file.
(2) The JCIFS connector provides the same metadata date value in two ways:

    rd.addField("lastModified", lastModifiedDate.toString());
    rd.setModifiedDate(lastModifiedDate);

This was done for backwards compatibility reasons.  You can control which
metadata value name is used for the ModifiedDate field on the Solr
connection's Schema tab.

As for the "lastModified" data, you can either map that to a field you
don't have in your Solr schema, or you can suppress it entirely by creating
an entry for Field Mapping that has "lastModified" on the left and a blank
field on the right, and then clicking the "Add" button.  Bear in mind that
1.5 had a bug in this functionality which was fixed in 1.5.1.

Karl



