Posted to dev@manifoldcf.apache.org by Salih Sen <sa...@dilisim.com> on 2015/01/08 17:13:48 UTC

Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection

Hi,

We've noticed that the metadata of some documents isn't indexed in Solr.

I tried tracking the issue down in the source code and noticed that the
RepositoryDocument has around 25 fields until it reaches the
RepositoryDocumentFactory. The document returned from
factory.createDocument() has only a single field at
IncrementalIngester.java line 3089.



I couldn't follow the logic behind if (iter.hasNext()) in the code below:
although the document has twenty-something fields, it "iterates" over only
the first one. Is this the expected behaviour?

Similar code also exists in the createDocument() method, so I may be
looking in the wrong place, but as far as I can see this part accounts for
the difference between the document that comes from the SharePoint
repository and the one posted to Solr.

Thanks.


RepositoryDocumentFactory.java
---------------------------------------------

public RepositoryDocumentFactory(RepositoryDocument document)
  throws ManifoldCFException, IOException
{
  this.original = document;

  try
  {
    this.binaryTracker = new TempFileInput(document.getBinaryStream());
    // Copy all reader streams
    Iterator<String> iter = document.getFields();
    if (iter.hasNext())
    {
      String fieldName = iter.next();
      Object[] objects = document.getField(fieldName);
      if (objects instanceof Reader[])
      {
        CharacterInput[] newValues = new CharacterInput[objects.length];
        metadataReaders.put(fieldName,newValues);
        // Populate newValues
        for (int i = 0; i < newValues.length; i++)
        {
          newValues[i] = new TempFileCharacterInput((Reader)objects[i]);
        }
      }
    }
  }
  catch (Throwable e)
  {
    // Clean up everything we've done so far.
    if (this.binaryTracker != null)
      this.binaryTracker.discard();
    for (String key : metadataReaders.keySet())
    {
      CharacterInput[] rt = metadataReaders.get(key);
      for (CharacterInput r : rt)
      {
        if (r != null)
          r.discard();
      }
    }
    if (e instanceof IOException)
      throw (IOException)e;
    else if (e instanceof RuntimeException)
      throw (RuntimeException)e;
    else if (e instanceof Error)
      throw (Error)e;
    else
      throw new RuntimeException("Unknown exception type: "+e.getClass().getName()+": "+e.getMessage(),e);
  }
}
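
For illustration only, here is a minimal self-contained sketch of the
difference between guarding an iterator with "if" and looping over it with
"while". It is not ManifoldCF code: the class name, method names and field
names are made up, and a plain List stands in for whatever getFields()
iterates over.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IteratorGuardDemo {
  // Visits at most one element: hasNext() is checked only once.
  static int countWithIf(List<String> fieldNames) {
    int visited = 0;
    Iterator<String> iter = fieldNames.iterator();
    if (iter.hasNext()) {
      iter.next();
      visited++;
    }
    return visited;
  }

  // Visits every element: hasNext() is re-checked until the iterator is exhausted.
  static int countWithWhile(List<String> fieldNames) {
    int visited = 0;
    Iterator<String> iter = fieldNames.iterator();
    while (iter.hasNext()) {
      iter.next();
      visited++;
    }
    return visited;
  }

  public static void main(String[] args) {
    // Hypothetical field names, for illustration only.
    List<String> fields = Arrays.asList("Title", "Author", "Modified");
    System.out.println("if:    " + countWithIf(fields));
    System.out.println("while: " + countWithWhile(fields));
  }
}
```

With an "if" guard the count never exceeds one, however many fields the
list holds, which matches the single-field symptom described above.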



--

Salih Şen

Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd.
Sti.

email: salih@dilisim.com

Tel: 0 222 330 20 21

GSM: 0 507 296 15 51

Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection

Posted by Salih Sen <sa...@dilisim.com>.
Hi Karl,

It turns out we hit this bug because I had left a null output connection in
the job settings before adding the Solr connection.

In any case, it's good to know it'll be fixed in a newer version :)

Thanks.

On Fri, Jan 9, 2015 at 8:25 AM, Karl Wright <da...@gmail.com> wrote:
> There is a fix committed, and a patch available that you can use with 1.7.
>
> Thanks,
> Karl

Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection

Posted by Karl Wright <da...@gmail.com>.
There is a fix committed, and a patch available that you can use with 1.7.

Thanks,
Karl

On Thu, Jan 8, 2015 at 1:27 PM, Karl Wright <da...@gmail.com> wrote:

> I was able to reproduce this using an RSS connection as input.  Any
> bifurcation of the pipeline seems to cause only one metadata field to be
> transmitted to the outputs, for reasons as yet unclear.
>
> CONNECTORS-1138.
>
> Karl

Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection

Posted by Karl Wright <da...@gmail.com>.
I was able to reproduce this using an RSS connection as input.  Any
bifurcation of the pipeline seems to cause only one metadata field to be
transmitted to the outputs, for reasons as yet unclear.

CONNECTORS-1138.

Karl




On Thu, Jan 8, 2015 at 11:36 AM, Karl Wright <da...@gmail.com> wrote:

> Actually, I take some of this back.  Any SharePoint metadata that is
> associated with a parent object rather than a child is represented in
> RepositoryDocument as a Reader[] array.  So you should see
> RepositoryDocumentFactory iterating through all such fields and making a
> TempFileCharacterInput for each member of each field.  If you are seeing
> only one iteration of the getFields() iterator, it means that the
> RepositoryDocument object fields member is not properly being managed.  But
> I'm looking at the RepositoryDocument code, and addField() looks like it does
> the right thing for all variations of data types.
>
> Karl

Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection

Posted by Karl Wright <da...@gmail.com>.
Actually, I take some of this back.  Any SharePoint metadata that is
associated with a parent object rather than a child is represented in
RepositoryDocument as a Reader[] array.  So you should see
RepositoryDocumentFactory iterating through all such fields and making a
TempFileCharacterInput for each member of each field.  If you are seeing
only one iteration of the getFields() iterator, it means that the
RepositoryDocument object fields member is not properly being managed.  But
I'm looking at the RepositoryDocument code, and addField() looks like it does
the right thing for all variations of data types.
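
As a rough illustration of why Reader-valued fields need a copy step at
all: a java.io.Reader is generally consumed as it is read, so handing the
same instance to more than one consumer leaves the later ones with nothing.
The sketch below buffers the content once and mints a fresh Reader per
consumer. It is an assumption-laden stand-in, not the ManifoldCF
implementation: it buffers in memory with a String, whereas the thread
mentions TempFileCharacterInput, and the class and method names are
invented.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderCopyDemo {
  // Reads a Reader to its end and returns the content; the Reader is exhausted afterwards.
  static String drain(Reader reader) {
    try {
      StringBuilder sb = new StringBuilder();
      char[] buffer = new char[1024];
      int n;
      while ((n = reader.read(buffer)) != -1) {
        sb.append(buffer, 0, n);
      }
      return sb.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Buffers the source once (in memory here), then creates one independent Reader per consumer.
  static Reader[] mintReaders(Reader source, int consumers) {
    String buffered = drain(source);
    Reader[] copies = new Reader[consumers];
    for (int i = 0; i < consumers; i++) {
      copies[i] = new StringReader(buffered);
    }
    return copies;
  }

  public static void main(String[] args) {
    // Hypothetical field value, for illustration only.
    Reader original = new StringReader("parent list description");
    Reader[] copies = mintReaders(original, 2);
    System.out.println(drain(copies[0]));
    System.out.println(drain(copies[1]));
    // The original has already been read to the end, so nothing is left in it.
    System.out.println(drain(original).isEmpty());
  }
}
```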

Karl


On Thu, Jan 8, 2015 at 11:24 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Salih,
>
> The code you point at is designed to make copies of fields that are
> represented by Reader objects.  Most SharePoint fields are represented by
> String objects, so this code does not apply to them.

Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection

Posted by Karl Wright <da...@gmail.com>.
Hi Salih,

The code you point at is designed to make copies of fields that are
represented by Reader objects.  Most SharePoint fields are represented by
String objects, so this code does not apply to them.

The place you want to look is:

>>>>>>
    // Copy metadata fields (including minting new Readers where needed)
    Iterator<String> iter = original.getFields();
    if (iter.hasNext())
    {
      String fieldName = iter.next();
      Object[] objects = original.getField(fieldName);
      if (objects instanceof Reader[])
      {
        CharacterInput[] rts = metadataReaders.get(fieldName);
        Reader[] newReaders = new Reader[rts.length];
        for (int i = 0; i < rts.length; i++)
        {
          rts[i].doneWithStream();
          newReaders[i] = rts[i].getStream();
        }
        rd.addField(fieldName,newReaders);
      }
      else if (objects instanceof Date[])
      {
        rd.addField(fieldName,(Date[])objects);
      }
      else if (objects instanceof String[])
      {
        rd.addField(fieldName,(String[])objects);
      }
      else
        throw new RuntimeException("Unknown kind of metadata: "+objects.getClass().getName());
    }

<<<<<<

This code should copy all fields to the new RepositoryDocument object (rd),
and do the necessary special manipulation for Reader fields.
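
To make the intent concrete, here is a hedged sketch of a loop that visits
every field and dispatches on the value type, which is what copying all
fields would require. It is not the ManifoldCF source and not the change
committed for this issue, which is not shown in this thread: a plain Map
stands in for RepositoryDocument, Reader values are passed through
unchanged rather than re-minted, and all names are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldCopyDemo {
  // Copies every field into a new map, rejecting value types it does not recognise.
  static Map<String, Object[]> copyAllFields(Map<String, Object[]> source) {
    Map<String, Object[]> target = new LinkedHashMap<>();
    Iterator<String> iter = source.keySet().iterator();
    while (iter.hasNext()) {
      String fieldName = iter.next();
      Object[] values = source.get(fieldName);
      if (values instanceof Reader[]
          || values instanceof Date[]
          || values instanceof String[]) {
        target.put(fieldName, values);
      } else {
        throw new RuntimeException(
            "Unknown kind of metadata: " + values.getClass().getName());
      }
    }
    return target;
  }

  public static void main(String[] args) {
    // Hypothetical fields, for illustration only.
    Map<String, Object[]> source = new LinkedHashMap<>();
    source.put("Title", new String[] {"Quarterly report"});
    source.put("Modified", new Date[] {new Date(0L)});
    source.put("Description", new Reader[] {new StringReader("parent metadata")});
    System.out.println(copyAllFields(source).keySet());
  }
}
```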

If you'd be willing to send me a screen shot of your job (from your view
job page), I can try to recreate your pipeline here and see what's going on.

Thanks,
Karl


