
Posted to solr-user@lucene.apache.org by Michael Clivot <cl...@netmedia.de> on 2014/03/25 11:29:19 UTC

Indexing parts of an HTML file differently

Hello,

I have the following issue and need help:

One HTML file has different parts for different countries.
For example:

<!-- Country: FR, BE -->
....
Address for France and Benelux
....
<!-- Country End -->
<!-- Country: CH -->
....
Address for Switzerland
....
<!-- Country End -->

Depending on a parameter, I show or hide the parts on the website.
Since all parts are in the index, Solr finds matches in every part, whatever the country.
My question is: how can I get only the items for the current country in my result list?

Thanks a lot
Regards
Michael

_______________________________
clivot@netmedia.de
netmedia - the Social Workplace Experts

netmedianer GmbH, Neugrabenweg 5-7, 66123 Saarbrücken, Germany
phone: +49 681 37988-12, fax: +49 681 37988-99, mobile: +49 151 54775197
Managing directors: Boris Brenner, Tim Mik?a | HRB Saarbrücken 13975

https://twitter.com/netmedianer, https://www.facebook.com/netmedianer

Re: Indexing parts of an HTML file differently

Posted by Jack Krupansky <ja...@basetechnology.com>.

There is no Solr feature that will break up your HTML file; you will have to
do that yourself, either before you send the file to Solr or by developing a
custom update processor that extracts the sections and directs each one to a
country-specific field. The former is probably easier, since any generic
processor that extracts text from an HTML file will strip out all HTML
comments, and with them your country markers.
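
A minimal sketch of the first option (splitting the file yourself before it
reaches Solr), assuming the markers look exactly like the ones in the question.
The names split_by_country, SECTION_RE and TAG_RE are made up for illustration,
and the tag stripping is deliberately crude; a real HTML parser may be a better
fit for production pages.

```python
import re
from html import unescape

# Matches "<!-- Country: FR, BE -->" ... "<!-- Country End -->".
# The closing "-->" of each marker tolerates extra dashes (e.g. "--->").
SECTION_RE = re.compile(
    r"<!--\s*Country:\s*(?P<codes>[A-Za-z, ]+?)\s*-+>"
    r"(?P<body>.*?)"
    r"<!--\s*Country End\s*-+>",
    re.DOTALL,
)

# Crude tag stripper: good enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def split_by_country(html):
    """Return a dict mapping each country code to the text of its sections."""
    sections = {}
    for match in SECTION_RE.finditer(html):
        # Drop tags, decode entities, and collapse whitespace.
        text = unescape(TAG_RE.sub(" ", match.group("body")))
        text = " ".join(text.split())
        # A marker may list several countries, e.g. "FR, BE".
        for code in match.group("codes").split(","):
            code = code.strip().upper()
            if code:
                sections.setdefault(code, []).append(text)
    # Join multiple sections for the same country into one string.
    return {code: " ".join(parts) for code, parts in sections.items()}
```

Each entry of the returned dict can then be indexed separately, for example
as its own document or in a per-country field, so that a search can be limited
to the current country.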

-- Jack Krupansky


Re: Indexing parts of an HTML file differently

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Can you get Delivery Server to generate a Solr-style XML or JSON update
file? That might be easier than generating HTML and then re-parsing it.
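
To make the idea concrete, here is a minimal sketch of what such a JSON update
file might contain if each country section becomes its own Solr document. The
field names (id, page_id, country, content), the id scheme, and the helper
names are all assumptions for illustration; they would have to match whatever
your schema actually defines. The input is a plain mapping from country code to
section text, however it is produced.

```python
import json


def build_update_docs(page_id, sections):
    """Build one document per country section (hypothetical field names)."""
    docs = []
    for country, text in sorted(sections.items()):
        docs.append({
            "id": f"{page_id}_{country}",  # assumed unique-key scheme
            "page_id": page_id,            # lets results be grouped per page
            "country": country,            # field used for filtering
            "content": text,
        })
    return docs


def to_update_json(docs):
    """Serialize the documents as a JSON array, the shape Solr's JSON
    update format accepts for adding several documents at once."""
    return json.dumps(docs, ensure_ascii=False, indent=2)


def search_params(user_query, country):
    """Query parameters that restrict results to a single country
    by way of a filter query (fq)."""
    return {"q": user_query, "fq": f"country:{country}"}
```

With documents shaped like this, the search for a Swiss visitor would pass
fq=country:CH alongside the normal query, so sections for other countries never
reach the result list.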

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency



Re: Indexing parts of an HTML file differently

Posted by Michael Clivot <cl...@netmedia.de>.

Thanks for your answer Jack.
@Gora:

> How are you fetching the HTML content, and indexing it into Solr?

We are using Solr with the OpenText Delivery Server. The Delivery Server generates HTML representations of the published pages and writes them to a directory, from which Solr reads the content to index.

> It is probably best to handle this requirement at that point. Haven't used Nutch ( http://nutch.apache.org/ ) recently, but you might be able to use it for this.

Do you mean the web-crawler approach? At first glance, it does not suit us well. In that case we would have to implement the OpenText search layer ourselves. In theory, we could try to teach Delivery Server to understand external indexes. But crawling itself is not our preferred solution: it is less responsive than the Delivery Server approach, and where authorization restrictions exist we would need a separate crawler user for every role, among other issues.


Re: Indexing parts of an HTML file differently

Posted by Gora Mohanty <go...@mimirtech.com>.


How are you fetching the HTML content, and indexing it into Solr?
It is probably best to handle this requirement at that point. Haven't
used Nutch ( http://nutch.apache.org/ ) recently, but you might be
able to use it for this.

Regards,
Gora