Posted to solr-user@lucene.apache.org by khalid y <ke...@gmail.com> on 2009/12/04 23:14:05 UTC
Re: WELCOME to solr-user@lucene.apache.org
Hi,
I have a problem with Solr. I'm indexing some HTML content and Solr crashes
because my id field ends up multivalued.
I found that Tika reads the HTML and extracts metadata such as <meta name="id"
content="12"> from my pages, but my documents already have an id set by
literal.id=10.
I tried to map the id from Tika with fmap.id=ignored_ but that also ignores my
literal.id.
I'm using Solr 1.4 and Tika 0.5.
Can someone explain how I can ignore the Tika id metadata?
Thanks
Re: WELCOME to solr-user@lucene.apache.org
Posted by Chris Hostetter <ho...@fucit.org>.
(FYI: in the future please start a new thread with an appropriate subject
line when you ask questions -- you probably would have gotten a lot more
responses from people interested in Tika and SolrCell if they could tell
that this email was about SolrCell)
: I found that Tika reads the HTML and extracts metadata such as <meta name="id"
: content="12"> from my pages, but my documents already have an id set by
: literal.id=10.
:
: I tried to map the id from Tika with fmap.id=ignored_ but that also ignores my
: literal.id.
Hmmmm, yeah: that seems like an odd order of operations, but it's
documented on the wiki so evidently it's intentional...
http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations
My best suggestions:
* use the capture param to restrict what gets extracted (it's probably
possible to write an XPath query that selects everything *except*
metadata[id])
* change the name of your uniqueKey field to be something other than "id"
so it's less likely to collide with a value from the document.
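The second suggestion can be sketched as a request URL. This is only an illustration under stated assumptions: the uniqueKey has been renamed to a hypothetical "doc_id", the handler is mounted at "/update/extract" on a local Solr at "http://localhost:8983/solr", and the schema defines an "ignored_*" dynamic field. Only the literal.* and fmap.* parameter families come from the thread.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed values, not taken from the thread.
SOLR_BASE = "http://localhost:8983/solr"   # hypothetical local Solr
EXTRACT_PATH = "/update/extract"           # assumed handler path
UNIQUE_KEY = "doc_id"                      # hypothetical renamed uniqueKey


def build_extract_url(doc_id, tika_field_to_ignore="id"):
    """Build a Solr Cell URL whose literal key does not share a name with Tika's "id"."""
    params = [
        # Our own key goes into the renamed uniqueKey field.
        ("literal." + UNIQUE_KEY, str(doc_id)),
        # Tika's <meta name="id"> value is routed to an ignored field.
        ("fmap." + tika_field_to_ignore, "ignored_" + tika_field_to_ignore),
    ]
    return SOLR_BASE + EXTRACT_PATH + "?" + urlencode(params)


url = build_extract_url(10)
query = parse_qs(urlsplit(url).query)
print(url)
```

Because the fmap.id mapping names only the "id" field, the literal supplied under the renamed key should be left alone, though that is worth verifying against your own schema and Solr version.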
I also opened two Jira issues that you may want to post comments in...
https://issues.apache.org/jira/browse/SOLR-1633
https://issues.apache.org/jira/browse/SOLR-1634
-Hoss
Re: WELCOME to solr-user@lucene.apache.org
Posted by khalid y <ke...@gmail.com>.
Thanks a lot for your response!
For the first solution:
I need to index all the content of my websites, and I just want Tika to ignore
<meta name="id"> because I already have an id.
I'll try on Monday and tell you whether it works.
For the second solution:
Are you sure Tika uses the HTML tokenizer? I'll check.
2009/12/5 Raghuveer Kancherla <ra...@aplopio.com>
> 2 ways I can think of ...
>
> - ExtractingRequestHandler (this is what I am guessing you are using now)
>
> Set extractOnly=true while making a request to the ExtractingRequestHandler
> and get the parsed content back. Now make a POST request to the update request
> handler with whatever fields and field values you want.
>
> - Use HTMLStripWhitespaceTokenizerFactory. This article may help
> explain what I mean:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory
>
> - Raghu
>
Re: WELCOME to solr-user@lucene.apache.org
Posted by Raghuveer Kancherla <ra...@aplopio.com>.
2 ways I can think of ...
- ExtractingRequestHandler (this is what I am guessing you are using now)
Set extractOnly=true while making a request to the ExtractingRequestHandler
and get the parsed content back. Now make a POST request to the update request
handler with whatever fields and field values you want.
- Use HTMLStripWhitespaceTokenizerFactory. This article may help
explain what I mean:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory
- Raghu
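The first approach above could be sketched roughly as follows. It only builds the two requests and sends nothing over the network. The base URL "http://localhost:8983/solr", the handler path "/update/extract", and the "text" field name are assumptions for illustration and do not come from the thread.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SOLR_BASE = "http://localhost:8983/solr"   # hypothetical local Solr
EXTRACT_PATH = "/update/extract"           # assumed handler path


def build_extract_only_url():
    """Step 1: URL that asks Solr Cell to parse the file without indexing it."""
    return SOLR_BASE + EXTRACT_PATH + "?" + urlencode({"extractOnly": "true"})


def build_update_xml(fields):
    """Step 2: Solr XML update message containing only the fields we choose."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# "text" is a hypothetical field; its value would come from the step 1 response.
payload = build_update_xml({"id": 10, "text": "extracted page body"})
print(build_extract_only_url())
print(payload)
```

The payload would then be POSTed to the regular update handler, so the only id in the document is the one you chose and Tika's <meta name="id"> value never reaches the index.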