Posted to solr-user@lucene.apache.org by Michael Lackhoff <mi...@lackhoff.de> on 2009/11/08 16:56:52 UTC

Getting started with DIH

I would like to start using DIH to index some RSS feeds and mail folders.

To get started I tried the RSS example from the wiki but, as it stands, Solr
complains about the missing id field. After some experimenting I found
two ways to fill the id:

- <copyField source="link" dest="id"/> in schema.xml
This works but isn't very flexible. Perhaps I have other types of
records with a real id or a multivalued link-field. Then this solution
would break.

- Changing the id field to type "uuid"
Again I would like to keep real ids where I have them and not a random UUID.

What didn't work but looks like the potentially best solution is to fill
the id in my data-config by using the link twice:
  <field column="link"         xpath="/RDF/item/link" />
  <field column="id"           xpath="/RDF/item/link" />
This would be a definition just for this single data source but I don't
get any docs (also no error message). No trace of any inserts whatsoever.
Is it possible to fill the id that way?

Another question, regarding MailEntityProcessor.
I found this example:
<document>
   <entity processor="MailEntityProcessor"
           user="somebody@gmail.com"
           password="something"
           host="imap.gmail.com"
           protocol="imaps"
           folders = "x,y,z"/>
</document>

But what is the dataSource (the tag enclosing document)? That is, what
would a minimal but complete data-config.xml look like to index mails
from an IMAP server?

And finally, is it possible to combine the definitions for several
RSS-Feeds and Mail-accounts into one data-config? Or do I need a
separate config file and request handler for each of them?

-Michael

Re: Getting started with DIH

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
This one is kind of a hack.

So I have opened an issue.

https://issues.apache.org/jira/browse/SOLR-1547

On Mon, Nov 9, 2009 at 12:43 PM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> On 09.11.2009 06:54 Erik Hatcher wrote:
>
>> The brackets probably come from it being transformed as an array.  Try
>> saying multiValued="false" on your <field> specifications.
>
> Indeed. Thanks Erik that was it.
>
> My first steps with DIH showed me what a powerful tool this is but
> although the DIH wiki page might well be the longest in the whole wiki
> there are so many mysteries left for the uninitiated. Is there any other
> documentation I might have missed?
>
> Thanks
> -Michael
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Getting started with DIH

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
The tried and tested strategy is to post the question in this mailing
list w/ your data-config.xml.


On Mon, Nov 9, 2009 at 1:08 PM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> On 09.11.2009 08:20 Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> It just started off as a single page and the features just got piled up
>> and the page just got bigger.  We are thinking of cutting it down to
>> smaller, more manageable pages
>
> Oh, I like it the way it is as one page, so that the browser full text
> search can help. It is just that the features and power seem to grow
> even faster than the wiki page ;-)
> E.g. I couldn't find out how to add a second RSS feed. I tried with a
> second entity parallel to the slashdot one but got an exception:
> "java.io.IOException: FULL" whatever that means, so I must be doing
> something wrong but couldn't find a hint.
>
> -Michael
>




Re: Getting started with DIH

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 09.11.2009 08:20 Noble Paul നോബിള്‍ नोब्ळ् wrote:

> It just started off as a single page and the features just got piled up
> and the page just got bigger.  We are thinking of cutting it down to
> smaller, more manageable pages

Oh, I like it the way it is as one page, so that the browser full text
search can help. It is just that the features and power seem to grow
even faster than the wiki page ;-)
E.g. I couldn't find out how to add a second RSS feed. I tried with a
second entity parallel to the slashdot one but got an exception:
"java.io.IOException: FULL" whatever that means, so I must be doing
something wrong but couldn't find a hint.

-Michael

Re: Getting started with DIH

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Mon, Nov 9, 2009 at 12:43 PM, Michael Lackhoff <mi...@lackhoff.de> wrote:
> On 09.11.2009 06:54 Erik Hatcher wrote:
>
>> The brackets probably come from it being transformed as an array.  Try
>> saying multiValued="false" on your <field> specifications.
>
> Indeed. Thanks Erik that was it.
>
> My first steps with DIH showed me what a powerful tool this is but
> although the DIH wiki page might well be the longest in the whole wiki
> there are so many mysteries left for the uninitiated. Is there any other
> documentation I might have missed?

There is an FAQ page and that is it:
http://wiki.apache.org/solr/DataImportHandlerFaq

It just started off as a single page and the features just got piled up
and the page just got bigger.  We are thinking of cutting it down to
smaller, more manageable pages.
>
> Thanks
> -Michael
>




Re: Getting started with DIH

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 09.11.2009 06:54 Erik Hatcher wrote:

> The brackets probably come from it being transformed as an array.  Try  
> saying multiValued="false" on your <field> specifications.

Indeed. Thanks, Erik, that was it.

My first steps with DIH showed me what a powerful tool this is, but
although the DIH wiki page might well be the longest in the whole wiki,
there are so many mysteries left for the uninitiated. Is there any other
documentation I might have missed?

Thanks
-Michael

Re: Getting started with DIH

Posted by Erik Hatcher <er...@gmail.com>.
The brackets probably come from it being transformed as an array.  Try  
saying multiValued="false" on your <field> specifications.

	Erik

On Nov 9, 2009, at 12:34 AM, Michael Lackhoff wrote:

> On 08.11.2009 16:56 Michael Lackhoff wrote:
>
>> What didn't work but looks like the potentially best solution is to fill
>> the id in my data-config by using the link twice:
>>  <field column="link"         xpath="/RDF/item/link" />
>>  <field column="id"           xpath="/RDF/item/link" />
>> This would be a definition just for this single data source but I don't
>> get any docs (also no error message). No trace of any inserts whatsoever.
>> Is it possible to fill the id that way?
>
> Found the answer in the list archive: use TemplateTransformer:
>  <field column="link"         xpath="/RDF/item/link" />
>  <field column="id"           template="${slashdot.link}" />
>
> Only minor and cosmetic problem: there are brackets around the id field
> (like [http://somelink/]). For an id this doesn't really matter but I
> would like to understand what is going on here. In the wiki I found only
> this info:
>> The rules for the template are same as the templates in 'query', 'url'
>> etc
> but I couldn't find any info about those either. Is this documented
> somewhere?
>
> -Michael


Re: Getting started with DIH

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 08.11.2009 16:56 Michael Lackhoff wrote:

> What didn't work but looks like the potentially best solution is to fill
> the id in my data-config by using the link twice:
>   <field column="link"         xpath="/RDF/item/link" />
>   <field column="id"           xpath="/RDF/item/link" />
> This would be a definition just for this single data source but I don't
> get any docs (also no error message). No trace of any inserts whatsoever.
> Is it possible to fill the id that way?

Found the answer in the list archive: use TemplateTransformer:
  <field column="link"         xpath="/RDF/item/link" />
  <field column="id"           template="${slashdot.link}" />

Only minor and cosmetic problem: there are brackets around the id field
(like [http://somelink/]). For an id this doesn't really matter but I
would like to understand what is going on here. In the wiki I found only
this info:
> The rules for the template are same as the templates in 'query', 'url'
> etc
but I couldn't find any info about those either. Is this documented
somewhere?
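
For reference, a minimal sketch of a complete data-config.xml built around
the two field lines above. Several details are assumptions that this thread
does not confirm: the entity name slashdot and the feed URL are taken to
match the wiki's RSS example, TemplateTransformer is assumed to be switched
on through the entity's transformer attribute, and the data source type is
assumed to be URLDataSource (older releases call it HttpDataSource).

```xml
<dataConfig>
  <!-- Assumption: URLDataSource; older Solr releases use HttpDataSource -->
  <dataSource type="URLDataSource" />
  <document>
    <!-- The entity name is the prefix used in ${slashdot.link} below -->
    <entity name="slashdot"
            processor="XPathEntityProcessor"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            forEach="/RDF/item"
            transformer="TemplateTransformer">
      <field column="link" xpath="/RDF/item/link" />
      <!-- id is built from the value already extracted into link -->
      <field column="id" template="${slashdot.link}" />
    </entity>
  </document>
</dataConfig>
```

The template can only see columns the same entity has already produced,
which is why link is extracted first and id refers to it by entity name.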

-Michael

Re: Getting started with DIH

Posted by "Lucas F. A. Teixeira" <lu...@gmail.com>.
If I'm not wrong, you can have several entities in one document, but just
one datasource configured.
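
A sketch of what one data-config.xml with several entities might look like,
offered as an illustration and not as a tested setup. The entity names, the
second feed URL and the data source name feeds are hypothetical. The DIH
wiki also describes named data sources (a name attribute on dataSource,
referenced from the entity's dataSource attribute), so a single shared data
source is a choice here, not a requirement. This sketch does not explain
the "java.io.IOException: FULL" reported elsewhere in the thread.

```xml
<dataConfig>
  <!-- One named data source shared by both feed entities -->
  <dataSource name="feeds" type="URLDataSource" />
  <document>
    <entity name="slashdot" dataSource="feeds"
            processor="XPathEntityProcessor"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            forEach="/RDF/item"
            transformer="TemplateTransformer">
      <field column="link" xpath="/RDF/item/link" />
      <field column="id" template="${slashdot.link}" />
    </entity>
    <!-- Hypothetical second feed; its URL and xpaths must match the real feed -->
    <entity name="otherfeed" dataSource="feeds"
            processor="XPathEntityProcessor"
            url="http://example.com/feed.rdf"
            forEach="/RDF/item"
            transformer="TemplateTransformer">
      <field column="link" xpath="/RDF/item/link" />
      <field column="id" template="${otherfeed.link}" />
    </entity>
    <!-- Mail entity; how its id gets filled depends on your schema -->
    <entity name="mail"
            processor="MailEntityProcessor"
            user="somebody@gmail.com"
            password="something"
            host="imap.gmail.com"
            protocol="imaps"
            folders="x,y,z" />
  </document>
</dataConfig>
```

With a layout like this, one entity can be imported on its own by adding
entity=slashdot to a command=full-import request, assuming a single
DataImportHandler request handler is registered for this config.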

[]s,


Lucas Frare Teixeira .·.
- lucastex@gmail.com
- lucastex.com.br
- blog.lucastex.com
- twitter.com/lucastex


On Sun, Nov 8, 2009 at 3:36 PM, Michael Lackhoff <mi...@lackhoff.de> wrote:

> On 08.11.2009 17:03 Lucas F. A. Teixeira wrote:
>
> > There is an example of using the mail DIH in the Solr distro.
>
> <blush>Don't know where my eyes were. Thanks!</blush>
>
> While I was at it I looked at the schema.xml for the RSS example and it
> uses "link" as uniqueKey, which is of course good if you only have RSS
> items but not so good if you also plan to add other data sources.
> So I am still interested in a good solution for my id problem:
>
> >> What didn't work but looks like the potentially best solution is to fill
> >> the id in my data-config by using the link twice:
> >>  <field column="link"         xpath="/RDF/item/link" />
> >>  <field column="id"           xpath="/RDF/item/link" />
> >> This would be a definition just for this single data source but I don't
> >> get any docs (also no error message). No trace of any inserts whatsoever.
> >> Is it possible to fill the id that way?
>
> and this one:
>
> >> And finally, is it possible to combine the definitions for several
> >> RSS-Feeds and Mail-accounts into one data-config? Or do I need a
> >> separate config file and request handler for each of them?
>
> Thanks
> -Michael
>

Re: Getting started with DIH

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 08.11.2009 17:03 Lucas F. A. Teixeira wrote:

> There is an example of using the mail DIH in the Solr distro.

<blush>Don't know where my eyes were. Thanks!</blush>

While I was at it I looked at the schema.xml for the RSS example and it
uses "link" as uniqueKey, which is of course good if you only have RSS
items but not so good if you also plan to add other data sources.
So I am still interested in a good solution for my id problem:

>> What didn't work but looks like the potentially best solution is to fill
>> the id in my data-config by using the link twice:
>>  <field column="link"         xpath="/RDF/item/link" />
>>  <field column="id"           xpath="/RDF/item/link" />
>> This would be a definition just for this single data source but I don't
>> get any docs (also no error message). No trace of any inserts whatsoever.
>> Is it possible to fill the id that way?

and this one:

>> And finally, is it possible to combine the definitions for several
>> RSS-Feeds and Mail-accounts into one data-config? Or do I need a
>> separate config file and request handler for each of them?

Thanks
-Michael

Re: Getting started with DIH

Posted by "Lucas F. A. Teixeira" <lu...@gmail.com>.
There is an example of using the mail DIH in the Solr distro.
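
As a rough idea of the shape such a config takes, here is a minimal sketch
based on the snippet from the question. It assumes the enclosing tag is
dataConfig, as in other DIH configs, and that MailEntityProcessor opens its
own connection and so needs no dataSource element. The entity name mail is
hypothetical, and the credentials and folder names are the placeholders
from the question.

```xml
<dataConfig>
  <document>
    <!-- No dataSource element: the processor connects to the mail server itself -->
    <entity name="mail"
            processor="MailEntityProcessor"
            user="somebody@gmail.com"
            password="something"
            host="imap.gmail.com"
            protocol="imaps"
            folders="x,y,z" />
  </document>
</dataConfig>
```

The example shipped with the distribution remains the authoritative
reference for the attributes your Solr version supports.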

[]s,


