Posted to user@nutch.apache.org by Ake Tangkananond <ia...@gmail.com> on 2012/08/14 13:15:34 UTC

Tika's outlink is not as expected

Hi,

I'm seeing unexpected behavior from Nutch's parsing mechanism. Perhaps I
don't understand Nutch well enough. Here is what I find odd; could you
please advise?

I crawl a URL whose MIME type is application/rss+xml. The fetched content is
parsed by Tika's org.apache.tika.parser.feed.FeedParser. I expect it to
yield all the outlinks in the RSS feed, but my command
> `scan 'webpage', {COLUMNS => 'ol'}`
returns only one entry in the ol column family (cf).

I then added logging at TikaParser.java line 192, as follows, to list all the
outlinks:
> …  
> Parse parse = new Parse(text, title, outlinks, status);
> parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
> 
> for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>   LOG.trace(outlink.getToUrl());
> }
> 
> if (metaTags.getNoCache()) { // not okay to cache
>   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
> }
> 
> return parse;

The output is as expected: it prints every link in the content. So I
wonder why only one URL ends up stored in the ol column family. Here is the
log4j output:
> 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser
> org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
> 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for
> http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false,
> noFollow=false, noIndex=false, refresh=false, refreshHref=null
>  * general tags:
>    - description        =       Manager Online Update ตลอด 24 ชม.
>  * http-equiv tags:
> 
> 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
> 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
> 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
> 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in
> http://www.manager.co.th/RSS/Politics/Politics.xml
> 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
> 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843

So why is only one outlink stored in the ol column family? Any advice would
be appreciated.


Regards,
Ake Tangkananond

Re: Tika's outlink is not as expected

Posted by Ake Tangkananond <ia...@gmail.com>.

Thank you. I found it a minute ago and was about to write to the list. This is the rule:

([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)

I must have been too tired yesterday: I thought I had already disabled the
regex normalization, but evidently I had not.
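
The effect of this rule can be reproduced outside Nutch. The sketch below is only an illustration, not Nutch's own code: it compiles the pattern quoted above (with the XML entity `&amp;` decoded to `&`) and assumes the rule's substitution is `$4`, which is not shown in this thread. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class SessionIdRuleDemo {
    // Pattern as quoted in the thread, with &amp; decoded to &.
    // The inline (?i) makes "sid" match case-insensitively, so it also
    // matches the "sID" inside "NewsID".
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule replaces the match with group 4 (the trailing delimiter).
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    // Normalizes every URL and keeps only the distinct results, in order.
    static Set<String> distinctAfterNormalize(List<String> urls) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String url : urls) {
            distinct.add(normalize(url));
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951",
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936",
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929");
        // All three URLs collapse to the same string ending in "?New".
        System.out.println(distinctAfterNormalize(urls));
    }
}
```

Under these assumptions every article URL is cut at "New", so the distinct outlinks reduce to a single entry, which matches what the ol column family showed.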


Regards,
Ake Tangkananond




On 8/15/12 2:13 PM, "Ferdy Galema" <fe...@kalooga.com> wrote:

>Ah, I think you got bitten by the session-ID normalization. There is a
>normalization rule in regex-normalize.xml that removes 'sid=.*' from the URL.
>It looks like a bug that it also strips query parameters whose names merely
>end in 'sid', such as 'newsid=.*'. There is already a Jira issue for this:
>NUTCH-706.
>
>For now, remove the 'sid' value from line 32 in regex-normalize.xml, or
>remove the line altogether, to work around this.

Re: Tika's outlink is not as expected

Posted by Ferdy Galema <fe...@kalooga.com>.

Ah, I think you got bitten by the session-ID normalization. There is a
normalization rule in regex-normalize.xml that removes 'sid=.*' from the URL.
It looks like a bug that it also strips query parameters whose names merely
end in 'sid', such as 'newsid=.*'. There is already a Jira issue for this:
NUTCH-706.

For now, remove the 'sid' value from line 32 in regex-normalize.xml, or
remove the line altogether, to work around this.
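
As a rough check of the first workaround, the sketch below drops the `sid` alternative from the pattern quoted elsewhere in this thread and applies it to two URLs. It is an illustration only: the substitution `$4` is an assumption, the class name and the example.com URL are made up, and this is not the change made for NUTCH-706.

```java
import java.util.regex.Pattern;

public class SessionIdWorkaroundDemo {
    // Pattern from the thread with the bare "sid" alternative removed
    // (&amp; decoded to &). Only phpsessid and sessionid remain.
    static final Pattern SESSION_ID_NO_SID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule replaces the match with group 4 (the trailing delimiter).
    static String normalize(String url) {
        return SESSION_ID_NO_SID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // The NewsID parameter is no longer touched.
        System.out.println(normalize(
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951"));
        // A jsessionid path parameter is still stripped.
        System.out.println(normalize(
            "http://example.com/a;jsessionid=ABC123?x=1"));
    }
}
```

The trade-off is that genuine `sid=`, `lsid=`, `jsid=` and `bv_sid=` session parameters are no longer removed either, so this is a stopgap rather than a proper fix.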

On Tue, Aug 14, 2012 at 6:29 PM, Ake Tangkananond <ia...@gmail.com> wrote:

> Hi Ferdy,
>
> Thanks for your advice. I don't have any special filtering/normalizing
> rules beyond the standard ones. I even tried disabling all the URL
> normalization plugins, but the result was no different.
>
> The URL left over in ol is
> column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New
>
> Yes, it's truncated at "New". Could it be that the URL is truncated to
> fit 49 characters, so that all the truncated URLs are identical and only
> one is left?
>
> If so, what is truncating the URL?
>
>
> Regards,
> Ake Tangkananond

Re: Tika's outlink is not as expected

Posted by Ake Tangkananond <ia...@gmail.com>.

Hi Ferdy,

Thanks for your advice. I don't have any special filtering/normalizing
rules beyond the standard ones. I even tried disabling all the URL
normalization plugins, but the result was no different.

The URL left over in ol is
column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New

Yes, it's truncated at "New". Could it be that the URL is truncated to
fit 49 characters, so that all the truncated URLs are identical and only
one is left?

If so, what is truncating the URL?


Regards,
Ake Tangkananond




On 8/14/12 7:12 PM, "Ferdy Galema" <fe...@kalooga.com> wrote:

>Do you have specific filtering/normalizing rules? Of all the URLs that are
>logged, which one is left over in the 'ol' field?

Re: Tika's outlink is not as expected

Posted by Ferdy Galema <fe...@kalooga.com>.

Do you have specific filtering/normalizing rules? Of all the URLs that are
logged, which one is left over in the 'ol' field?

On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond <ia...@gmail.com> wrote:

> Thanks for the reply, Ferdy.
>
> 'db.max.outlinks.per.page' is set to 100, and I can parse HTML
> fine.
>
>
> Regards,
> Ake Tangkananond

Re: Tika's outlink is not as expected

Posted by Ake Tangkananond <ia...@gmail.com>.

Thanks for the reply, Ferdy.

'db.max.outlinks.per.page' is set to 100, and I can parse HTML
fine.


Regards,
Ake Tangkananond




On 8/14/12 6:43 PM, "Ferdy Galema" <fe...@kalooga.com> wrote:

>Hi,
>
>Judging by your logs, could it be that you have accidentally set
>'db.max.outlinks.per.page' to 1? If not, could you try parsing some
>other document type, for example an HTML page? Please note that
>I'm not using the TikaParser at all; it could be that there is a bug in
>it in Nutch 2.
>
>Ferdy.
Re: Tika's outlink is not as expected

Posted by Ferdy Galema <fe...@kalooga.com>.

Hi,

Judging by your logs, could it be that you have accidentally set
'db.max.outlinks.per.page' to 1? If not, could you try parsing some
other document type, for example an HTML page? Please note that
I'm not using the TikaParser at all; it could be that there is a bug in
it in Nutch 2.

Ferdy.
