Posted to user@any23.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/04/05 00:19:11 UTC

Re: Too many tuples!!

Hi Tim,

I've just picked this up, it got lost in my filters.

2012/3/30 Tim Potter <te...@yahoo-inc.com>

>
> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations
>

Regarding the URL above and the source below: I can't find the snippet
below in that page. Can you please check and confirm for me?

>
> Given the HTML Snippet:
>
> <a href="
> http://toolserver.org/~geohack/geohack.php?pagename=List_of_Nike_missile_locations&amp;params=34_22_41_N_118_09_03_W_&amp;title=LA-04-LS
> " class="external text" rel="nofollow" style="white-space: normal;">
>
> <span class="geo-default">
>
> <span title="Maps, aerial photos, and other data for this location" class=
> "geo-dms">
>
> <span class="latitude">34°22′41″N</span>
>
> <span class="longitude">118°09′03″W</span>
>
> </span>
>
> </span>
>
> <span class="geo-multi-punct">&#65279; / &#65279;</span>
>
> <span class="geo-nondefault">
>
> <span class="vcard">
>
> <span title="Maps, aerial photos, and other data for this location" class=
> "geo-dec">34.37806°N 118.15083°W</span>
>
> <span style="display: none">
>
> &#65279; /
>
> <span class="geo">34.37806; -118.15083</span>
>
> </span>
>
> <span style="display: none">
>
> &#65279; (
>
> <span class="fn org">LA-04-LS</span>
>
> )
>
>  </span>
>
>  </span>
>
>  </span>
>
> </a>
>
>

Re: Too many tuples!!

Posted by Michele Mostarda <mi...@gmail.com>.
Hi Tim,

     another good source for vocab usage / coverage statistics about the
Semantic Web is

       http://sindice.com/stats/basic-stats/

Best.
Mic



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com

Re: Too many tuples!!

Posted by Michele Mostarda <mi...@gmail.com>.
Hi Tim,

   sorry for the delay.

First of all: have you seen this initiative [0]? It looks similar to your
task.

I attempted to reproduce your issue using the latest Any23 trunk version
but I didn't obtain any nesting triples (I'm still investigating this).

The triples with predicate "http://vocab.sindice.net/any23#nesting" are
generated by a post processing phase which adds meta triples describing
how HTML markup elements producing triples are nested together.

This is mostly used when you have a page containing nested Microformats
(like the mf-geo <span class="geo-default">) and you want to keep the
meaning of the metadata expressed by the Microformats.
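
To make "nested" concrete, here is a minimal Python sketch that reports
which Microformat root classes sit inside other ones in an HTML fragment.
It is only an illustration, not Any23's implementation: the NestingFinder
class, the ROOTS set and the VOID set are assumptions made for this
example.

```python
from html.parser import HTMLParser

# Assumed, deliberately small set of Microformat root class names.
ROOTS = {"vcard", "geo", "adr"}
# Elements without a closing tag; they must not be pushed on the stack.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class NestingFinder(HTMLParser):
    """Collect (outer_root, inner_root) pairs for nested Microformats."""

    def __init__(self):
        super().__init__()
        self.stack = []  # per open element: its root class, or None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        root = next((c for c in classes if c in ROOTS), None)
        if root is not None:
            # Look for the closest enclosing element that is also a root.
            for ancestor in reversed(self.stack):
                if ancestor is not None:
                    self.pairs.append((ancestor, root))
                    break
        if tag not in VOID:
            self.stack.append(root)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


def find_nesting(html):
    finder = NestingFinder()
    finder.feed(html)
    return finder.pairs
```

For the snippet in this thread, the class="geo" span sits inside the
class="vcard" span, which is the kind of relation the nesting triples
describe. Classes such as "geo-default" or "geo-dec" are different tokens
and are not treated as roots here.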

For your purpose you don't need the consolidation triples, so you can skip
producing them by setting the flag "any23.extraction.metadata.nesting" to
"off". Specific instructions on how to use flags can be found here [1].

Another flag that can reduce the number of generated meta triples is
"any23.extraction.metadata.domain.per.entity". When set to "off", it
prevents the generation of domain triples such as:

     _:noded11095f6ff16d9464e5e63653734bb <http://vocab.sindice.net/any23#domain> "en.wikipedia.org" .
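
If changing the flags is not an option, the same meta triples can also be
dropped after extraction. The Python sketch below is a rough alternative,
not part of Any23: it assumes the output is N-Triples with one triple per
line, and the function name drop_meta_triples and the DROP_PREDICATES set
are made up for this example.

```python
# Predicates mentioned in this thread; extend the set as needed.
DROP_PREDICATES = {
    "<http://vocab.sindice.net/any23#nesting>",
    "<http://vocab.sindice.net/any23#domain>",
}


def drop_meta_triples(lines, drop=DROP_PREDICATES):
    """Yield the N-Triples lines whose predicate is not in `drop`."""
    for line in lines:
        # subject and predicate contain no whitespace in N-Triples,
        # so the second whitespace-separated token is the predicate.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] in drop:
            continue
        yield line
```

Lines that do not look like a triple (blank lines, for instance) are
passed through unchanged.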

The quantity of triples produced for the HTML code snippet you pasted is
'normal'; the RDF data format tends to be a little 'verbose' :)

You can apply filters to avoid extracting triples generated by CSS
declarations. To do it from the command line, use:

     bin/any23 rover --notrivial  'http://url/to/page'

To do it programmatically, take a look at the TripleHandler implementation
class:

    org.apache.any23.filter.IgnoreAccidentalRDFa

If you notice any unexpected or strange behavior, please feel free to
report an issue at [2].

Hope it helps.

The best.

Mic

[0] http://webdatacommons.org/
[1] http://incubator.apache.org/any23/configuration.html
[2] https://issues.apache.org/jira/browse/ANY23

2012/4/5 Tim Potter <te...@yahoo-inc.com>

> Hi Lewis,
>    Maybe the page has been modified slightly since I copied that snippet.
>  If you search for '118°09′03″W' in the page source you should find the
> entry.    I guess the easiest way to reproduce the problem is to run:
>
>  'any23tools Rover
> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations'
>
> It returns somewhere in the order of a million tuples.
>
> I found switching off the nested triple production ('any23tools Rover –n
> http….') returns a lot less. Like a few thousand.
>
> Like I said, I don't have enough experience with RDF to know if what Any23
> is extracting is correct.  Just seems like a lot of tuples..
>
> Thanks for your help.
>
> Regards,
>   Tim P.


-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com

Re: Too many tuples!!

Posted by Tim Potter <te...@yahoo-inc.com>.
Hi Lewis,
  Having read the page http://incubator.apache.org/any23/dev-microformat-extractors.html I'm beginning to understand what Any23 is doing.  The nesting_original and nesting_structured tuples are being added to show the relation between an hcard and a nested geo element, for example.  Looking at the annotations on the HCardExtractor class, currently the only nested elements that won't generate these additional tuples are adr elements.  Looking at http://microformats.org/wiki/hcard I'm wondering if the geo extractor should also be in the @Includes annotation of HCardExtractor?

Regards,
  Tim P.


Re: Too many tuples!!

Posted by Tim Potter <te...@yahoo-inc.com>.
Hi Lewis,
   Maybe the page has been modified slightly since I copied that snippet.  If you search for '118°09′03″W' in the page source you should find the entry.    I guess the easiest way to reproduce the problem is to run:

 'any23tools Rover http://en.wikipedia.org/wiki/List_of_Nike_missile_locations'

It returns somewhere in the order of a million tuples.

I found switching off the nested triple production ('any23tools Rover –n http….') returns a lot less. Like a few thousand.

Like I said, I don't have enough experience with RDF to know if what Any23 is extracting is correct.  Just seems like a lot of tuples..
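
One way to judge whether the volume is plausible is to count the triples
per predicate and see which ones dominate. The Python sketch below is only
an illustration: it assumes the extraction output was saved as N-Triples
with one triple per line, and the function name count_predicates is made up
for this example.

```python
from collections import Counter


def count_predicates(lines):
    """Return a Counter mapping each predicate to its number of triples."""
    counts = Counter()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # N-Triples comment line
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts
```

With the output in a file (the name out.nt is hypothetical), something like
count_predicates(open("out.nt", encoding="utf-8")).most_common(10) lists the
ten most frequent predicates.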

Thanks for your help.

Regards,
  Tim P.


