
Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/03/19 16:22:55 UTC

Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Okay, this works.

$ bin/nutch org.apache.nutch.tools.DBpediaParser \
    /Users/simpatico/Downloads/wikipedia_links_en.nt \
    | grep "http://en.wikipedia.org" | sort -u > wiki/urls

I guess this is the same as implementing logic that outputs only urls not
already encountered, though.
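
To make the comparison concrete, here is a minimal sketch of both approaches. It is not the DBpediaParser logic from this thread, and the function names dedupe_sorted and dedupe_seen are hypothetical. sort -u typically sorts through temporary files on disk, so it should cope with a multi-gigabyte input, whereas remembering every URL (here in an awk array) holds all distinct lines in memory but preserves first-seen order.

```shell
# dedupe_sorted (hypothetical name): filter stdin, then let sort -u drop
# duplicates. Output is sorted, not in input order.
dedupe_sorted() {
    grep -F 'http://en.wikipedia.org' | sort -u
}

# dedupe_seen (hypothetical name): "remember" each line in an awk array and
# print it only the first time. The array grows with the number of distinct
# lines, which is the memory cost discussed above.
dedupe_seen() {
    grep -F 'http://en.wikipedia.org' | awk '!seen[$0]++'
}

# Small demonstration using lines of the kind the parser prints.
printf '%s\n' \
    'http://en.wikipedia.org/wiki/Anarchism' \
    'http://dbpedia.org/resource/Anarchism' \
    'http://en.wikipedia.org/wiki/Anarchism' \
    'http://en.wikipedia.org/wiki/AccessibleComputing' | dedupe_sorted
```

Both functions read standard input, so either can be placed after the parser command in the pipeline.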

On Sat, Mar 19, 2011 at 4:05 PM, Dimitris Kontokostas <ji...@gmail.com>wrote:

> they are files (the original file with duplicates, the new one with unique lines)
>
> if your parser writes to standard output, then it is
> $ your_parser_command | grep "http://en.wikipedia.org" | sort -u > unique_urls_file
>
> otherwise you run your parser separately and filter its output file
> $ cat your_parser_output_file | grep "http://en.wikipedia.org" | sort -u > unique_urls_file
>
> Dimitris
>
>
> On Sat, Mar 19, 2011 at 4:58 PM, Gabriele Kahlout <
> gabriele@mysimpatico.com> wrote:
>
>> You meant this. Now what do I have in youfile and newfile?
>> $ cat youfile | grep "http://en.wikipedia.org" | sort -u > newfile
>>
>> I'm after all the urls in wikipedia, I don't have them.
>>
>>
>> On Sat, Mar 19, 2011 at 3:55 PM, Dimitris Kontokostas <ji...@gmail.com>wrote:
>>
>>> cat youfile | grep "http://..." | sort -u > newfile
>>>
>>>
>>> On Sat, Mar 19, 2011 at 4:40 PM, Gabriele Kahlout <
>>> gabriele@mysimpatico.com> wrote:
>>>
>>>> working example/cmd? I'm not sure we are talking about the same thing.
>>>>
>>>>
>>>> On Sat, Mar 19, 2011 at 3:36 PM, Dimitris Kontokostas <
>>>> jimkont@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> You can grep the output for http://en.wikipedia.org and pipe it to sort -u
>>>>>
>>>>> Cheers,
>>>>> Dimitris
>>>>>
>>>>> On Sat, Mar 19, 2011 at 3:47 PM, Gabriele Kahlout <
>>>>> gabriele@mysimpatico.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout <
>>>>>> gabriele@mysimpatico.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I've downloaded this DBpedia file
>>>>>>> <http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2> and
>>>>>>> written a simple parser to extract Wikipedia URLs from it; sample output
>>>>>>> is shown below. I find the result unsatisfactory because it contains many
>>>>>>> duplicates. Adding logic to the parser to avoid them (by remembering
>>>>>>> URLs already seen) also seems very expensive, since the uncompressed
>>>>>>> file is 3GB. Is there a better way to get Wikipedia URLs, like what is
>>>>>>> done for dmoz with:
>>>>>>>
>>>>>>> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
>>>>>>> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>>>>>> http://dbpedia.org/resource/AfghanistanGeography
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>>>>>> n"@e
>>>>>>> http://dbpedia.org/resource/AfghanistanGeography
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>>>>>> http://en.wikipedia.org/wiki/Anarchism
>>>>>>> http://dbpedia.org/resource/Anarchism
>>>>>>> http://en.wikipedia.org/wiki/Anarchism
>>>>>>> n"@e
>>>>>>> http://dbpedia.org/resource/Anarchism
>>>>>>> http://en.wikipedia.org/wiki/Anarchism
>>>>>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>>>>>> http://dbpedia.org/resource/AccessibleComputing
>>>>>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>>>>>> n"@e
>>>>>>> http://dbpedia.org/resource/AccessibleComputing
>>>>>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>>>>>> http://dbpedia.org/resource/AfghanistanHistory
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>>>>>> n"@e
>>>>>>> http://dbpedia.org/resource/AfghanistanHistory
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>>>>>> http://dbpedia.org/resource/AfghanistanPeople
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>>>>>> n"@e
>>>>>>> http://dbpedia.org/resource/AfghanistanPeople
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>>>>>> http://dbpedia.org/resource/AfghanistanTransportations
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>>>>>> n"@e
>>>>>>> http://dbpedia.org/resource/AfghanistanTransportations
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanCommunications
>>>>>>> http://dbpedia.org/resource/AfghanistanCommunications
>>>>>>> http://en.wikipedia.org/wiki/AfghanistanCommunications
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> K. Gabriele
>>>>>>>
>>>>>>> --- unchanged since 20/9/10 ---
>>>>>>> P.S. If the subject contains "[LON]" or the addressee acknowledges
>>>>>>> the receipt within 48 hours then I don't resend the email.
>>>>>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>>>>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>>>>>
>>>>>>> If an email is sent by a sender that is not a trusted contact or the
>>>>>>> email does not contain a valid code then the email is not received. A valid
>>>>>>> code starts with a hyphen and ends with "X".
>>>>>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧
>>>>>>> y ∈ L(-[a-z]+[0-9]X)).
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dbpedia-discussion mailing list
>>>>>> Dbpedia-discussion@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kontokostas Dimitris
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> K. Gabriele
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>>
>>
>>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>



-- 
Regards,
K. Gabriele


Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

agreed, and lazy is a virtue (at least in CS).

On Sat, Mar 19, 2011 at 4:50 PM, Dimitris Kontokostas <ji...@gmail.com>wrote:

> it is, but it is usually faster and bug free
> (and you don't have to implement it :)
>
>
> On Sat, Mar 19, 2011 at 5:22 PM, Gabriele Kahlout <
> gabriele@mysimpatico.com> wrote:
>
>> Okay, this works.
>>
>> $ bin/nutch org.apache.nutch.tools.DBpediaParser \
>>     /Users/simpatico/Downloads/wikipedia_links_en.nt \
>>     | grep "http://en.wikipedia.org" | sort -u > wiki/urls
>>
>> I guess this is the same as implementing logic that outputs only urls not
>> already encountered, though.
>>



-- 
Regards,
K. Gabriele


Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gora Mohanty <go...@mimirtech.com>.

On Wed, Mar 23, 2011 at 4:04 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:
> -E      Interpret regular expressions as extended (modern) regular
> expressions rather than basic regular expressions (BRE's).  The
>              re_format(7) manual page fully describes both formats.
>
> -e command
>              Append the editing commands specified by the command argument
> to the list of commands.
>
>> awk '/http://simple.wiki/{ gsub( "<|>", "", $1 ); print $1}'
>> wikipedia_links_simple.nt | sort -u
[...]

Ah, from the above, it seems that you are using a different
UNIX. I made the mistake of assuming Linux and GNU. My
sed does not have a -E option, and awk is GNU awk, i.e.,
gawk. You might have GNU awk installed as gawk, and the
above should then work with gawk.

Regards,
Gora

Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

-E      Interpret regular expressions as extended (modern) regular
        expressions rather than basic regular expressions (BRE's).  The
        re_format(7) manual page fully describes both formats.

-e command
        Append the editing commands specified by the command argument
        to the list of commands.

> awk '/http://simple.wiki/{ gsub( "<|>", "", $1 ); print $1}' wikipedia_links_simple.nt | sort -u
>
awk: syntax error at source line 1
 context is
     >>> /http://simple. <<< wiki/{ gsub( "<|>", "", $1 ); print $1}
awk: bailing out at source line 1
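
A possible fix, assuming the syntax error is caused by the unescaped slashes: inside an awk /regex/ literal a bare / ends the pattern, so each slash in the URL has to be written as \/. The sketch below has not been run against the real dump; simple_wiki_subjects is a hypothetical name, and the sample line is one of the sedtest triples posted in this thread.

```shell
# simple_wiki_subjects (hypothetical name): reads N-Triples lines from the
# given files (or from stdin when no file is given) and prints the first
# field, without angle brackets, of every line mentioning http://simple.wiki.
simple_wiki_subjects() {
    # \/ keeps the slashes inside the regex literal; \. matches a literal dot.
    awk '/http:\/\/simple\.wiki/ { gsub(/[<>]/, "", $1); print $1 }' "$@" | sort -u
}

# Demonstration on a single triple.
printf '%s\n' \
    '<http://simple.wikipedia.org/wiki/A> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/A>' \
    | simple_wiki_subjects
```

With a file name as argument, e.g. simple_wiki_subjects wikipedia_links_simple.nt, awk reads the file directly and no cat is needed.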




> This assumes that your URLs are properly escaped, i.e., no
> spaces, but so does your solution above.
>
> Regards,
> Gora
>



-- 
Regards,
K. Gabriele


Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gora Mohanty <go...@mimirtech.com>.

On Wed, Mar 23, 2011 at 3:26 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:
> $ cat wikipedia_links_simple.nt | grep "http://simple.wiki" | awk '{print $1}' | sort -u | sed -E 's/<|>//g'

I have lost track of what you were trying to do, but it really should
not be that difficult. Taking the above at face value (I also presume
that you meant sed -e), this can be simplified to (put everything on
one line if word-wrap splits it up):

awk '/http://simple.wiki/{ gsub( "<|>", "", $1 ); print $1}' wikipedia_links_simple.nt | sort -u

This assumes that your URLs are properly escaped, i.e., no
spaces, but so does your solution above.

Regards,
Gora

Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

$ cat wikipedia_links_simple.nt | grep "http://simple.wiki" | awk '{print $1}' | sort -u | sed -E 's/<|>//g'
http://dbpedia.org/resource/%22The_Take_Over,_the_Breaks_Over%22
http://dbpedia.org/resource/%22Weird_Al%22_Yankovic
http://dbpedia.org/resource/%2703_Bonnie_&_Clyde
...
http://simple.wikipedia.org/wiki/Shangla_District
http://simple.wikipedia.org/wiki/Shangla_Pass
http://simple.wikipedia.org/wiki/Shangrila_Lake
http://simple.wikipedia.org/wiki/Shani
http://simple.wikipedia.org/wiki/Shani_Glacier

Although on sedtest, which contains only the following two lines, it works fine:
<http://en.wikipedia.org/wiki/AfghanistanGeography> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/AfghanistanGeography>
<http://simple.wikipedia.org/wiki/A> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/A>
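
One possible explanation for the dbpedia.org/resource lines, offered as a guess: grep matches the pattern anywhere on the line, while awk always prints the first field. If the full dump also contains triples whose first field is a DBpedia resource and whose third field is the Wikipedia page (an assumption; sedtest has no such line), those lines pass the grep and their DBpedia subject is printed. A sketch that tests the first field only, with the hypothetical name wiki_subjects_only:

```shell
# wiki_subjects_only (hypothetical name): print the first field, without
# angle brackets, only when that field itself is a simple.wikipedia.org URL.
wiki_subjects_only() {
    awk '$1 ~ /^<http:\/\/simple\.wikipedia\.org\// { gsub(/[<>]/, "", $1); print $1 }' "$@" | sort -u
}

# The first line is from sedtest; the second is an ASSUMED shape for a triple
# with the DBpedia resource first (the foaf page predicate is illustrative).
printf '%s\n' \
    '<http://simple.wikipedia.org/wiki/A> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/A>' \
    '<http://dbpedia.org/resource/A> <http://xmlns.com/foaf/0.1/page> <http://simple.wikipedia.org/wiki/A>' \
    | wiki_subjects_only
```

Filtering on the field rather than the whole line means the result no longer depends on which position the Wikipedia URL occupies in other triples.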

On Sun, Mar 20, 2011 at 5:16 PM, Gabriele Kahlout
<ga...@mysimpatico.com>wrote:

> This works:
>
> $ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed -E 's/<|>//g'
> http://en.wikipedia.org/wiki/AfghanistanGeography
>
>
>> for sed, | > are special characters and have to be escaped
>> e.g.
>> cat wikipedia_links_el.nt | grep "http://el.wiki" | awk '{print $1}' | sort -u | sed 's/<\|\>//g'
>
> $ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<\|\>//g'
> <http://en.wikipedia.org/wiki/AfghanistanGeography>
>
> $ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/\>//g'
> <http://en.wikipedia.org/wiki/AfghanistanGeography
>
> On the Mac, > doesn't seem to be special:
>
> $ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/>//g'
> <http://en.wikipedia.org/wiki/AfghanistanGeography
>
>
>
>
>> cheers
>>
>> Dimitris
>>
>>
>> On Sun, Mar 20, 2011 at 3:34 PM, Gabriele Kahlout <
>> gabriele@mysimpatico.com> wrote:
>>
>>> On Sat, Mar 19, 2011 at 5:03 PM, Dimitris Kontokostas <jimkont@gmail.com
>>> > wrote:
>>>
>>>> cat wikipedia_links_en.nt | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<//g' > urls
>>>>
>>>
>>> This is really a sed question. Yours leaves URLs ending with >. The
>>> search pattern regex should be "<|>", but that doesn't work:
>>> $ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<|>//g' > urls
>>>
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
>
>


-- 
Regards,
K. Gabriele


Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

This works:

$ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed -E 's/<|>//g'
http://en.wikipedia.org/wiki/AfghanistanGeography


> for sed, | > are special characters and have to be escaped
> e.g.
> cat wikipedia_links_el.nt | grep "http://el.wiki" | awk '{print $1}' | sort -u | sed 's/<\|\>//g'

$ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<\|\>//g'
<http://en.wikipedia.org/wiki/AfghanistanGeography>

$ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/\>//g'
<http://en.wikipedia.org/wiki/AfghanistanGeography

On the Mac, > doesn't seem to be special:
$ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/>//g'
<http://en.wikipedia.org/wiki/AfghanistanGeography
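
A portable alternative, offered as a sketch rather than an explanation of the exact behaviour seen above: a bracket expression such as [<>] is standard syntax in both basic and extended regular expressions, so it avoids the question of whether | or > needs a backslash on a given sed. strip_angle_brackets is a hypothetical name.

```shell
# strip_angle_brackets (hypothetical name): delete every < and > from stdin.
# [<>] is a bracket expression, valid in basic and extended regexes alike,
# so no -E flag and no backslashes are required.
strip_angle_brackets() {
    sed 's/[<>]//g'
}

# Equivalent without any regex at all: tr deletes the listed characters.
strip_angle_brackets_tr() {
    tr -d '<>'
}

printf '%s\n' '<http://en.wikipedia.org/wiki/AfghanistanGeography>' | strip_angle_brackets
```

Either function can replace the final sed stage of the pipelines above.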




> cheers
> Dimitris
>
>
> On Sun, Mar 20, 2011 at 3:34 PM, Gabriele Kahlout <
> gabriele@mysimpatico.com> wrote:
>
>> On Sat, Mar 19, 2011 at 5:03 PM, Dimitris Kontokostas <ji...@gmail.com>wrote:
>>
>>> cat wikipedia_links_en.nt | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<//g' > urls
>>>
>>
>> This is really a sed question. Yours leaves URLs ending with >. The
>> search pattern regex should be "<|>", but that doesn't work:
>> $ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<|>//g' > urls
>>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>



-- 
Regards,
K. Gabriele


Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

On Sat, Mar 19, 2011 at 5:03 PM, Dimitris Kontokostas <ji...@gmail.com>wrote:

> cat wikipedia_links_en.nt | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<//g' > urls
>

This is really a sed question. Yours leaves URLs ending with >. The
search pattern regex should be "<|>", but that doesn't work:
$ cat sedtest | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<|>//g' > urls

-- 
Regards,
K. Gabriele


Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Dimitris Kontokostas <ji...@gmail.com>.

you could also try this directly

cat wikipedia_links_en.nt | grep "http://en.wiki" | awk '{print $1}' | sort -u | sed 's/<//g' > urls



On Sat, Mar 19, 2011 at 5:50 PM, Dimitris Kontokostas <ji...@gmail.com>wrote:

> it is, but it is usually faster and bug free
> (and you don't have to implement it :)
>
>
> On Sat, Mar 19, 2011 at 5:22 PM, Gabriele Kahlout <
> gabriele@mysimpatico.com> wrote:
>
>> Okay, this works.
>>
>> $ bin/nutch org.apache.nutch.tools.DBpediaParser \
>>     /Users/simpatico/Downloads/wikipedia_links_en.nt \
>>     | grep "http://en.wikipedia.org" | sort -u > wiki/urls
>>
>> I guess this is the same as implementing logic that outputs only urls not
>> already encountered, though.
>>
>>>>>>>>> ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>>>>>>>
>>>>>>>>> If an email is sent by a sender that is not a trusted contact or
>>>>>>>>> the email does not contain a valid code then the email is not received. A
>>>>>>>>> valid code starts with a hyphen and ends with "X".
>>>>>>>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x)
>>>>>>>>> ∧ y ∈ L(-[a-z]+[0-9]X)).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Dbpedia-discussion mailing list
>>>>>>>> Dbpedia-discussion@lists.sourceforge.net
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Kontokostas Dimitris
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> K. Gabriele
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kontokostas Dimitris
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> K. Gabriele
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Kontokostas Dimitris
>>>
>>
>>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>>
>>
>
>
> --
> Kontokostas Dimitris
>



-- 
Kontokostas Dimitris

Re: [Dbpedia-discussion] Get list of Wikipedia URLS for crawling

Posted by Dimitris Kontokostas <ji...@gmail.com>.

It is the same logic, but sort -u is usually faster and bug-free
(and you don't have to implement it :)
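
A minimal sketch of the filter-and-deduplicate step, run on a handful of lines modelled on the parser output quoted further down. The file name sample_parser_output.txt and the LC_ALL=C setting are assumptions added for illustration; they are not part of the original command.

```shell
# Hypothetical sample standing in for the parser output (duplicates included).
cat > sample_parser_output.txt <<'EOF'
http://en.wikipedia.org/wiki/Anarchism
http://dbpedia.org/resource/Anarchism
http://en.wikipedia.org/wiki/Anarchism
n"@e
http://en.wikipedia.org/wiki/AccessibleComputing
http://dbpedia.org/resource/AccessibleComputing
http://en.wikipedia.org/wiki/AccessibleComputing
EOF

# Keep only lines containing the Wikipedia prefix, then sort and drop duplicates.
# LC_ALL=C (an assumption, not in the original command) makes sort compare raw
# bytes, which is typically faster on large inputs.
grep "http://en.wikipedia.org" sample_parser_output.txt | LC_ALL=C sort -u > unique_urls_file

cat unique_urls_file
```

On the sample this leaves two lines, one per distinct Wikipedia url. GNU sort spills to temporary files when the input does not fit in memory, which is why it can cope with a multi-gigabyte file where an in-memory set of already-seen urls might not.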

On Sat, Mar 19, 2011 at 5:22 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:

> Okay, this works.
>
> $ bin/nutch org.apache.nutch.tools.DBpediaParser /Users/simpatico/Downloads/wikipedia_links_en.nt | grep "http://en.wikipedia.org" | sort -u > wiki/urls
>
> I guess this is the same as implementing logic that outputs only urls not
> already encountered, though.
>
> On Sat, Mar 19, 2011 at 4:05 PM, Dimitris Kontokostas <ji...@gmail.com> wrote:
>
>> they are files, (original file with duplicates, new with unique)
>>
>> if your parser writes to output then it is
>> $ your_parser_command | grep "http://en.wikipedia.org" | sort -u > unique_urls_file
>>
>> otherwise you run your parser separately
>> $ cat your_parser_output_file | grep "http://en.wikipedia.org" | sort -u > unique_urls_file
>>
>> Dimitris
>>
>>
>> On Sat, Mar 19, 2011 at 4:58 PM, Gabriele Kahlout <
>> gabriele@mysimpatico.com> wrote:
>>
>>> You meant this. Now what do I have in youfile and newfile?
>>> $ cat youfile | grep "http://en.wikipedia.org" | sort -u > newfile
>>>
>>> I'm after all the urls in wikipedia, I don't have them.
>>>
>>>
>>> On Sat, Mar 19, 2011 at 3:55 PM, Dimitris Kontokostas <jimkont@gmail.com> wrote:
>>>
>>>> cat youfile | grep "http://..." | sort -u > newfile
>>>>
>>>>
>>>> On Sat, Mar 19, 2011 at 4:40 PM, Gabriele Kahlout <
>>>> gabriele@mysimpatico.com> wrote:
>>>>
>>>>> working example/cmd? I'm not sure we are talking about the same thing.
>>>>>
>>>>>
>>>>> On Sat, Mar 19, 2011 at 3:36 PM, Dimitris Kontokostas <
>>>>> jimkont@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> You can grep the output with http://en.wikipedia.org and pipe it to sort -u
>>>>>>
>>>>>> Cheers,
>>>>>> Dimitris
>>>>>>
>>>>>> On Sat, Mar 19, 2011 at 3:47 PM, Gabriele Kahlout <
>>>>>> gabriele@mysimpatico.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout <
>>>>>>> gabriele@mysimpatico.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I've downloaded this dbpedia file
>>>>>>>> <http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2> and
>>>>>>>> written a simple parser to extract Wikipedia urls from it; its output is
>>>>>>>> shown below. I find the result unsatisfactory since it contains many
>>>>>>>> duplicates. Adding logic to the parser to avoid them (by remembering the
>>>>>>>> urls already seen) also seems very expensive, since the uncompressed file
>>>>>>>> is 3GB. Is there a better approach to getting Wikipedia urls, as is done
>>>>>>>> with dmoz in the following commands?
>>>>>>>>
>>>>>>>> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
>>>>>>>> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>>>>>>> http://dbpedia.org/resource/AfghanistanGeography
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>>>>>>> n"@e
>>>>>>>> http://dbpedia.org/resource/AfghanistanGeography
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>>>>>>> http://en.wikipedia.org/wiki/Anarchism
>>>>>>>> http://dbpedia.org/resource/Anarchism
>>>>>>>> http://en.wikipedia.org/wiki/Anarchism
>>>>>>>> n"@e
>>>>>>>> http://dbpedia.org/resource/Anarchism
>>>>>>>> http://en.wikipedia.org/wiki/Anarchism
>>>>>>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>>>>>>> http://dbpedia.org/resource/AccessibleComputing
>>>>>>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>>>>>>> n"@e
>>>>>>>> http://dbpedia.org/resource/AccessibleComputing
>>>>>>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>>>>>>> http://dbpedia.org/resource/AfghanistanHistory
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>>>>>>> n"@e
>>>>>>>> http://dbpedia.org/resource/AfghanistanHistory
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>>>>>>> http://dbpedia.org/resource/AfghanistanPeople
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>>>>>>> n"@e
>>>>>>>> http://dbpedia.org/resource/AfghanistanPeople
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>>>>>>> http://dbpedia.org/resource/AfghanistanTransportations
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>>>>>>> n"@e
>>>>>>>> http://dbpedia.org/resource/AfghanistanTransportations
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanCommunications
>>>>>>>> http://dbpedia.org/resource/AfghanistanCommunications
>>>>>>>> http://en.wikipedia.org/wiki/AfghanistanCommunications
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> K. Gabriele
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kontokostas Dimitris
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> K. Gabriele
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kontokostas Dimitris
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>>
>>>
>>
>>
>> --
>> Kontokostas Dimitris
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
>
>


-- 
Kontokostas Dimitris