Posted to solr-user@lucene.apache.org by Furkan KAMACI <fu...@gmail.com> on 2014/02/23 01:21:40 UTC

Wikipedia Data Cleaning at Solr

Hi;

I want to run an NLP algorithm on Wikipedia data. I used the DataImportHandler
to index the dump and everything is OK. However, some of the indexed texts look
like this:

== Altyapı bilgileri == Köyde, [[ilköğretim]] okulu yoktur fakat taşımalı
eğitimden yararlanılmaktadır.

I think it should look like this instead:

Altyapı bilgileri Köyde, ilköğretim okulu yoktur fakat taşımalı eğitimden
yararlanılmaktadır.

On the other hand, this should be removed entirely:

{| border="0" cellpadding="5" cellspacing="5" |- bgcolor="#aaaaaa"
|'''Seçim Yılı''' |'''Muhtar''' |- bgcolor="#dddddd" |[[2009]] |kazım
güngör |- bgcolor="#dddddd" | |Ömer Gungor |- bgcolor="#dddddd" | |Fazlı
Uzun |- bgcolor="#dddddd" | |Cemal Özden |- bgcolor="#dddddd" | | |}

Also, keeping section titles such as == Altyapı bilgileri == should be optional
(I think they can be removed for some purposes).

My question is: is there an analyzer combination that can clean up Wikipedia
data like this for Solr?
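To make the intent concrete, here is a rough sketch of the kind of cleanup I
mean. It is only for illustration; the regex patterns are guesses based on the
samples above, not something I would rely on for arbitrary wikitext:

import java.util.regex.Pattern;

public class WikiMarkupCleanupSketch {

    // Section titles such as "== Altyapı bilgileri ==": keep or drop the title text.
    private static final Pattern HEADING =
            Pattern.compile("==+\\s*(.*?)\\s*==+");

    // Internal links such as "[[ilköğretim]]" or "[[target|label]]": keep the inner
    // text (for piped links this keeps the target, which is enough for a sketch).
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]*)(?:\\|[^\\]]*)?\\]\\]");

    // Table blocks such as "{| ... |}": drop them completely.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);

    public static String clean(String wikiText, boolean keepTitles) {
        String out = TABLE.matcher(wikiText).replaceAll(" ");
        out = HEADING.matcher(out).replaceAll(keepTitles ? "$1" : " ");
        out = LINK.matcher(out).replaceAll("$1");
        return out.replaceAll("\\s+", " ").trim();
    }
}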

Thanks;
Furkan KAMACI

Re: Wikipedia Data Cleaning at Solr

Posted by Furkan KAMACI <fu...@gmail.com>.
My input is this:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0" |}

Analysis is as follows:

WT

text     raw_bytes             start  end  type        flags  position
style    [73 74 79 6c 65]      3      8    <ALPHANUM>  0      1
text     [74 65 78 74]         10     14   <ALPHANUM>  0      2
align    [61 6c 69 67 6e]      15     20   <ALPHANUM>  0      3
left     [6c 65 66 74]         22     26   <ALPHANUM>  0      4
width    [77 69 64 74 68]      28     33   <ALPHANUM>  0      5
50       [35 30]               35     37   <ALPHANUM>  0      6
table    [74 61 62 6c 65]      40     45   <ALPHANUM>  0      7
layout   [6c 61 79 6f 75 74]   46     52   <ALPHANUM>  0      8
fixed    [66 69 78 65 64]      54     59   <ALPHANUM>  0      9
border   [62 6f 72 64 65 72]   62     68   <ALPHANUM>  0      10
0        [30]                  70     71   <ALPHANUM>  0      11
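For reference, a minimal sketch of how the same token types could be dumped
outside Solr. It assumes the Lucene 4.x-era constructor that takes a Reader;
newer Lucene versions use the no-arg constructor plus setReader():

import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class DumpWikipediaTokens {
    public static void main(String[] args) throws Exception {
        String input =
                "{| style=\"text-align: left; width: 50%; table-layout: fixed;\" border=\"0\" |}";
        // Lucene 4.x style: the tokenizer is constructed directly over a Reader.
        WikipediaTokenizer tokenizer = new WikipediaTokenizer(new StringReader(input));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Prints each token together with its type, e.g. "style  <ALPHANUM>".
            System.out.println(term.toString() + "\t" + type.type());
        }
        tokenizer.end();
        tokenizer.close();
    }
}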




Re: Wikipedia Data Cleaning at Solr

Posted by Furkan KAMACI <fu...@gmail.com>.
I've compared the results when using WikipediaTokenizer as the index-time
tokenizer, but there is no difference?



Re: Wikipedia Data Cleaning at Solr

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Furkan,

There is org.apache.lucene.analysis.wikipedia.WikipediaTokenizer
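If it helps, a minimal sketch of wrapping it in an index-time Analyzer might
look like this. It is untested and assumes the Lucene 4.x API, where
createComponents receives the Reader; newer versions use a different signature:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

// Minimal index-time analyzer built around WikipediaTokenizer.
public class WikipediaAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WikipediaTokenizer(reader);
        // Further filters (lowercasing, stop words, etc.) could be chained here.
        return new TokenStreamComponents(source);
    }
}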

Ahmet

