Posted to solr-user@lucene.apache.org by Mathias Lux <ml...@itec.uni-klu.ac.at> on 2013/12/18 11:03:37 UTC

DataImport Handler, writing a new EntityProcessor

Hi all!

I've got a question regarding writing a new EntityProcessor, in the
same sense as the Tika one. My EntityProcessor should analyze jpg
images and create document fields to be used with the LIRE Solr plugin
(https://bitbucket.org/dermotte/liresolr). Basically I've taken the
same approach as the TikaEntityProcessor, but my setup indexes only
the first of 1000 images. I'm using a FileListEntityProcessor to get
all JPEGs from a directory and then hand them over to my processor
(see [2]). My code for the EntityProcessor is at [1]. I've tried using
the DataSource as well as the filePath attribute, but the result is
the same either way. According to the debug output, the
FileListEntityProcessor does read all the files, so what I'm missing
is the link from the FileListEntityProcessor to the
LireEntityProcessor.
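
For orientation, a minimal sketch of how that link is usually made in DIH: the child entity refers to the parent's current row through placeholders such as `${f.fileAbsolutePath}` in its attributes. DIH resolves these itself (for example through `Context.getResolvedEntityAttribute`). The code below is not DIH code and only imitates the idea with plain Java. The entity name `f`, the class `PlaceholderSketch` and the method `resolve` are assumptions made for illustration; `fileAbsolutePath` is one of the fields FileListEntityProcessor emits. In this sketch, placeholders that cannot be resolved become empty strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for DIH's variable resolution; not the Solr implementation.
class PlaceholderSketch {
    // Matches ${entity.field}
    private static final Pattern TOKEN =
            Pattern.compile("\\$\\{([^.}]+)\\.([^}]+)\\}");

    // Replaces each ${entity.field} with the value from that entity's current row.
    static String resolve(String template, Map<String, Map<String, Object>> rowsByEntity) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Map<String, Object> row = rowsByEntity.get(m.group(1));
            Object value = (row == null) ? null : row.get(m.group(2));
            String text = (value == null) ? "" : value.toString();
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Current row of the (assumed) parent entity "f".
        Map<String, Object> parentRow = new HashMap<>();
        parentRow.put("fileAbsolutePath", "/images/a.jpg");

        Map<String, Map<String, Object>> rows = new HashMap<>();
        rows.put("f", parentRow);

        // The child entity's attribute as it might be written in the config.
        System.out.println(resolve("${f.fileAbsolutePath}", rows));
    }
}
```

The point of the sketch is that the child never talks to the parent processor directly: it only sees the parent's current row, one row at a time.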

I'd appreciate any pointer or help :)

cheers,
  Mathias

[1] LireEntityProcessor http://pastebin.com/JFajkNtf
[2] dataConfig http://pastebin.com/vSHucatJ

-- 
Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec

Re: DataImport Handler, writing a new EntityProcessor

Posted by Mathias Lux <ml...@itec.uni-klu.ac.at>.
Hi!

Thanks for all the advice! I finally got it working. The most annoying
error, which took me the better part of a day to figure out, was that
the state variable here had to be reset:
https://bitbucket.org/dermotte/liresolr/src/d27878a71c63842cb72b84162b599d99c4408965/src/main/java/net/semanticmetadata/lire/solr/LireEntityProcessor.java?at=master#cl-56
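
A minimal sketch of why such a reset matters, assuming the same processor instance is reused for every parent row and re-initialized each time. The classes `OneRowProcessor` and `StateResetSketch` are hypothetical stand-ins written for illustration, not the Solr API and not the code behind the link above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a processor that emits one row per file.
class OneRowProcessor {
    private final boolean resetOnInit;
    private boolean done = false;   // the state variable in question
    private String path;

    OneRowProcessor(boolean resetOnInit) {
        this.resetOnInit = resetOnInit;
    }

    // Assumed to be called once per parent row, always on the same instance.
    void init(String path) {
        this.path = path;
        if (resetOnInit) {
            done = false;           // the fix: forget that the previous file was emitted
        }
    }

    // Returns one row, then null to signal "no more rows for this parent row".
    Map<String, Object> nextRow() {
        if (done) {
            return null;
        }
        Map<String, Object> row = new HashMap<>();
        row.put("id", path);
        done = true;
        return row;
    }
}

class StateResetSketch {
    // Simplified driver: init per parent row, then drain nextRow() until null.
    static List<String> indexAll(List<String> files, boolean resetOnInit) {
        OneRowProcessor processor = new OneRowProcessor(resetOnInit);
        List<String> indexed = new ArrayList<>();
        for (String file : files) {
            processor.init(file);
            Map<String, Object> row;
            while ((row = processor.nextRow()) != null) {
                indexed.add((String) row.get("id"));
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.jpg", "b.jpg", "c.jpg");
        System.out.println("without reset: " + indexAll(files, false)); // [a.jpg]
        System.out.println("with reset:    " + indexAll(files, true));  // [a.jpg, b.jpg, c.jpg]
    }
}
```

Without the reset, the flag stays set after the first file, so every later file yields no row. That matches the symptom of only the first of 1000 images being indexed.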

The EntityProcessor is part of this image search plugin if anyone is
interested: https://bitbucket.org/dermotte/liresolr/

:) It's always the small things that are hard to find

cheers and thanks, Mathias

On Wed, Dec 18, 2013 at 7:26 PM, P Williams
<wi...@gmail.com> wrote:
> Hi Mathias,
>
> I'd recommend testing one thing at a time.  See if you can get it to work
> for one image before you try a directory of images.  Also try running it
> under the Solr test framework from your IDE (I use Eclipse), so you can
> debug there rather than through your browser or print statements.
> Hopefully that will give you more specific knowledge of what's happening
> around your plugin.
>
> I also wrote an EntityProcessor plugin to read from a properties file
> <https://issues.apache.org/jira/browse/SOLR-3928>. Hopefully that'll give
> you some insight into this kind of Solr plugin and how to test it.
>
> Cheers,
> Tricia



-- 
PD Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec

Re: DataImport Handler, writing a new EntityProcessor

Posted by P Williams <wi...@gmail.com>.
Hi Mathias,

I'd recommend testing one thing at a time.  See if you can get it to work
for one image before you try a directory of images.  Also try running it
under the Solr test framework from your IDE (I use Eclipse), so you can
debug there rather than through your browser or print statements.
Hopefully that will give you more specific knowledge of what's happening
around your plugin.

I also wrote an EntityProcessor plugin to read from a properties file
<https://issues.apache.org/jira/browse/SOLR-3928>. Hopefully that'll give
you some insight into this kind of Solr plugin and how to test it.

Cheers,
Tricia

Re: DataImport Handler, writing a new EntityProcessor

Posted by Mathias Lux <ml...@itec.uni-klu.ac.at>.
Unfortunately it is the same in non-debug mode: just the first document.
I also print the params to System.out, but it seems only the first one
ever arrives at my custom class. I have the feeling that I'm doing
something seriously wrong here, based on a complete misunderstanding
:) I basically assume that the nested entity processor will be called
for each of the rows that come out of its parent. I've read
somewhere that the data has to be taken from the data source, and
I've implemented that, but it doesn't seem to change anything.

cheers,
Mathias

On Wed, Dec 18, 2013 at 3:05 PM, Dyer, James
<Ja...@ingramcontent.com> wrote:
> The first thing I would suggest is to try running it outside debug mode.  DIH's debug mode limits the number of documents it will take in, so that might be all that is wrong here.
>
> James Dyer
> Ingram Content Group
> (615) 213-4311



-- 
PD Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec

RE: DataImport Handler, writing a new EntityProcessor

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
The first thing I would suggest is to try running it outside debug mode.  DIH's debug mode limits the number of documents it will take in, so that might be all that is wrong here.

James Dyer
Ingram Content Group
(615) 213-4311