Posted to user@tika.apache.org by Joseph Naegele <jn...@grierforensics.com> on 2016/04/05 21:27:27 UTC

script tags in LinkContentHandler

Hi all,

 

I'm using Nutch for crawling the web, and one of its built-in HTML parsers
uses Tika and its LinkContentHandler. I'm interested in collecting *all*
links on a web page, but I'm surprised the LinkContentHandler doesn't parse
<script> tags as links. When a <script> tag contains the "src" attribute,
the attribute should specify a URI and the tag should not contain any
content.

 

Is there any particular reason the LinkContentHandler doesn't parse <script>
tags, or is it just that I'm the first to look for this functionality? I can
ping the dev mailing list too if necessary.
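
For illustration, the behaviour asked for above can be sketched with a plain SAX handler. This is only a minimal sketch, not Tika's LinkContentHandler: the class name ScriptAwareLinkCollector, the "tag:uri" output format and the collect() helper are assumptions made for the example, and the input is assumed to be well-formed XHTML so that the JDK's own parser can read it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a link-collecting SAX handler; not Tika's class.
class ScriptAwareLinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = qName.toLowerCase(Locale.ROOT);
        String attr;
        switch (name) {
            case "a":
            case "link":
                attr = "href";
                break;
            case "img":
            case "iframe":
            case "script": // the case this thread is asking for
                attr = "src";
                break;
            default:
                return;
        }
        String value = atts.getValue(attr);
        if (value != null) {
            // A <script> without "src" carries inline code, not a link, so it is skipped.
            links.add(name + ":" + value);
        }
    }

    // Parses well-formed XHTML and returns the collected links in document order.
    static List<String> collect(String xhtml) {
        try {
            ScriptAwareLinkCollector handler = new ScriptAwareLinkCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.links;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling collect() on a page with a <script src="app.js"> in the head and an inline <script> in the body would report only the first one as a link.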

 

Nutch's other built-in HTML parser collects all "outlinks", including
<script> tags, but I'd prefer to use Tika and Boilerpipe.

 

Thanks,

Joe Naegele

Re: script tags in LinkContentHandler

Posted by Ken Krugler <kk...@transpac.com>.

Hi Joe,

In that case, I’d file a Jira issue with two test docs attached, one with a regular <script> in the body, and another with <script> in the <head> section.

Regards,

— Ken

> On Apr 6, 2016, at 3:01pm, Joseph Naegele <jn...@grierforensics.com> wrote:
> 
> IdentityHtmlMapper solves the problem of elements being discarded. There is another problem with extracting <script> links however:
> 
> HtmlHandler only checks for "META", "BASE", and "LINK" within <head>. See the "if (bodylevel == 0 && discardLevel == 0)" section in HtmlHandler's startElement() method.
> 
> I don't mean to drag out this topic, but I only want to report actual issues. In this case I think HtmlHandler is missing at least a check for "SCRIPT" tags in the HTML header.
> 
> - Joe



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

RE: script tags in LinkContentHandler

Posted by Joseph Naegele <jn...@grierforensics.com>.

IdentityHtmlMapper solves the problem of elements being discarded. There is another problem with extracting <script> links however:

HtmlHandler only checks for "META", "BASE", and "LINK" within <head>. See the "if (bodylevel == 0 && discardLevel == 0)" section in HtmlHandler's startElement() method.

I don't mean to drag out this topic, but I only want to report actual issues. In this case I think HtmlHandler is missing at least a check for "SCRIPT" tags in the HTML header.
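
To make the point concrete, here is a simplified model of that kind of head-level check. It is a sketch under stated assumptions, not HtmlHandler's source: the class HeadFilterSketch, its fields and the "forwarded" list are hypothetical, and the discardLevel bookkeeping is left out entirely, so body elements are simply passed through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified, hypothetical model of a head-section filter; not Tika's HtmlHandler.
class HeadFilterSketch {
    private final Set<String> headElements; // names allowed through while outside <body>
    private int bodyLevel = 0;              // nesting depth inside <body>, 0 means "not in body"
    final List<String> forwarded = new ArrayList<>();

    HeadFilterSketch(Set<String> headElements) {
        this.headElements = headElements;
    }

    void startElement(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        if ("BODY".equals(upper) || bodyLevel > 0) {
            bodyLevel++;
        }
        if (bodyLevel == 0) {
            // Outside <body>, only the listed element names are forwarded.
            if (headElements.contains(upper)) {
                forwarded.add(upper);
            }
            return;
        }
        if (!"BODY".equals(upper)) {
            forwarded.add(upper); // discard handling is omitted in this sketch
        }
    }

    void endElement(String name) {
        if (bodyLevel > 0) {
            bodyLevel--;
        }
    }
}
```

With the set {"META", "BASE", "LINK"}, a <script> in the head never reaches downstream handlers in this model; adding "SCRIPT" to the set lets it through.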

- Joe

From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com] 
Sent: Wednesday, April 06, 2016 5:21 PM
To: user@tika.apache.org
Subject: Re: script tags in LinkContentHandler

Hi,

I'm one of those from the forensics world and, of course, my use case needs to extract everything.

I have already tried IdentityHtmlMapper to extract "value" attributes from "input" elements, with no luck. The value is not extracted by DefaultHtmlMapper even though it is rendered by browsers, so I think DefaultHtmlMapper needs some improvement. But is HtmlMapper the correct place to configure that, or must something be done with HTMLSchema (I've tried that too, but I am not an HTML expert)?

Thanks,
Luis


Re: script tags in LinkContentHandler

Posted by Luís Filipe Nassif <lf...@gmail.com>.

Hi,

I'm one of those from the forensics world and, of course, my use case needs
to extract everything.

I have already tried IdentityHtmlMapper to extract "value" attributes from
"input" elements, with no luck. The value is not extracted by
DefaultHtmlMapper even though it is rendered by browsers, so I think
DefaultHtmlMapper needs some improvement. But is HtmlMapper the correct
place to configure that, or must something be done with HTMLSchema (I've
tried that too, but I am not an HTML expert)?

Thanks,
Luis

2016-04-06 17:33 GMT-03:00 Allison, Timothy B. <ta...@mitre.org>:

> On #2, I'd prefer not skipping elements.  I definitely understand the use
> case to extract what a human can see, but I suspect if your email address
> ends in 'forensics.com', you'd probably like to see everything as well.

Re: script tags in LinkContentHandler

Posted by Ken Krugler <kk...@transpac.com>.

> On Apr 6, 2016, at 1:33pm, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> On #2, I'd prefer not skipping elements.  I definitely understand the use case to extract what a human can see, but I suspect if your email address ends in 'forensics.com', you'd probably like to see everything as well.

I’m not sure I see the issue.

The _default_ implementation is for the parser to be configured to extract what a person can see, which is what you’d typically want.

IdentityHtmlMapper is a way to be more lenient, in that it gets you back more of “stuff that can be rendered as valid XHTML 1.0”.

But if you need access to a very specific element in the HTML, which isn’t text content, then what you really want to do is run the raw data through TagSoup/JSoup, then into Dom4J or equivalent, and use XPath queries to extract specific elements.
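
A rough sketch of that last step, using only the JDK's DOM and XPath classes in place of Dom4J. The TagSoup/JSoup clean-up is not shown, so the input is assumed to be well-formed XHTML already; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls one specific kind of element out of an already cleaned-up page.
class XPathScriptSources {
    static List<String> scriptSources(String xhtml) {
        try {
            // Build a DOM tree from the (assumed well-formed) markup.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Select the "src" attribute of every <script>, wherever it appears.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//script/@src", doc, XPathConstants.NODESET);
            List<String> sources = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                sources.add(nodes.item(i).getNodeValue());
            }
            return sources;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same pattern works for other targeted queries, for example "//input/@value", by changing only the XPath expression.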

— Ken




RE: script tags in LinkContentHandler

Posted by "Allison, Timothy B." <ta...@mitre.org>.

On #2, I'd prefer not skipping elements.  I definitely understand the use case to extract what a human can see, but I suspect if your email address ends in 'forensics.com', you'd probably like to see everything as well.

-----Original Message-----
From: Joseph Naegele [mailto:jnaegele@grierforensics.com] 
Sent: Wednesday, April 06, 2016 4:14 PM
To: user@tika.apache.org
Subject: RE: script tags in LinkContentHandler

Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) <script> missing in LinkContentHandler
2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element so it's discarded during the parse, similarly to <style>.

Does anyone have opinions on #2?

- Joe


RE: script tags in LinkContentHandler

Posted by Joseph Naegele <jn...@grierforensics.com>.

Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) <script> missing in LinkContentHandler
2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element so it's discarded during the parse, similarly to <style>.
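
Problem #2 can be pictured with a small discard policy. This is an illustrative sketch only: DiscardPolicy, DEFAULT_LIKE and IDENTITY_LIKE are hypothetical names, loosely modelled on the idea of a mapper that decides which elements are discardable, and they are not Tika's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical one-method policy deciding whether an element is dropped during the parse.
interface DiscardPolicy {
    boolean isDiscardElement(String name);
}

class DiscardSketch {
    // Mimics a default policy that treats <style> and <script> as discardable.
    static final DiscardPolicy DEFAULT_LIKE = name -> {
        String upper = name.toUpperCase(Locale.ROOT);
        return upper.equals("STYLE") || upper.equals("SCRIPT");
    };

    // Mimics an identity policy that discards nothing.
    static final DiscardPolicy IDENTITY_LIKE = name -> false;

    // Returns the element names that would still reach downstream handlers.
    static List<String> surviving(DiscardPolicy policy, List<String> elements) {
        List<String> kept = new ArrayList<>();
        for (String element : elements) {
            if (!policy.isDiscardElement(element)) {
                kept.add(element);
            }
        }
        return kept;
    }
}
```

Under the default-like policy a link handler never sees <script> at all, so fixing #1 alone would not be enough.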

Does anyone have opinions on #2?

- Joe

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Wednesday, April 06, 2016 9:26 AM
To: user@tika.apache.org
Subject: RE: script tags in LinkContentHandler

Yes indeed! Script is missing and that's a mistake. See discussion at TIKA-1835. We should open a new ticket for it.
Markus

 
 

RE: script tags in LinkContentHandler

Posted by Markus Jelsma <ma...@openindex.io>.

Yes indeed! Script is missing and that's a mistake. See discussion at TIKA-1835. We should open a new ticket for it.
Markus

 
 
-----Original message-----
> From: Ken Krugler <kk...@transpac.com>
> Sent: Tuesday 5th April 2016 22:24
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
>
> Hi Joe,
>
> I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
>
> I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion, as it would have been better to make that support conditional on using a different constructor, as it’s usually not a good idea to surprise consumers of parse output with new types of data (links).
>
> I’ll take this discussion over to TIKA-1835 now.
>
> -- Ken
>
> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaegele@grierforensics.com> wrote:
>
>> Thanks Ken,
>>
>> I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script src="…"> belongs in there with the rest of them. What do you think?
>>
>> Joe
>>
>> From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
>> Sent: Tuesday, April 05, 2016 3:48 PM
>> To: user@tika.apache.org
>> Subject: Re: script tags in LinkContentHandler
>>
>> Hi Joe,
>>
>> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaegele@grierforensics.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content.
>>>
>>> Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.
>>
>> I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what about <link> elements?
>>
>> -- Ken
>>
>>> Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.
>>>
>>> Thanks,
>>> Joe Naegele

Re: script tags in LinkContentHandler

Posted by Ken Krugler <kk...@transpac.com>.

Hi Joe,

I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.

I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion, as it would have been better to make that support conditional on using a different constructor, as it’s usually not a good idea to surprise consumers of parse output with new types of data (links).
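
To make the "conditional on a different constructor" idea concrete, here is a minimal sketch using only the JDK's SAX classes. It is not Tika's LinkContentHandler: the class name SimpleLinkHandler, the includeScripts flag, and the "tag:target" output format are all hypothetical, chosen only to illustrate how existing consumers could keep the old output while new callers opt in to <script src="..."> links.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ScriptLinkSketch {

    /** Hypothetical handler for illustration; NOT Tika's LinkContentHandler. */
    static class SimpleLinkHandler extends DefaultHandler {
        private final boolean includeScripts;
        private final List<String> links = new ArrayList<>();

        /** Default constructor keeps the old behaviour: no <script> links. */
        SimpleLinkHandler() {
            this(false);
        }

        /** Opt-in constructor: callers must ask for <script src> links. */
        SimpleLinkHandler(boolean includeScripts) {
            this.includeScripts = includeScripts;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            String name = qName.toLowerCase(Locale.ROOT);
            String target = null;
            switch (name) {
                case "a":
                case "link":
                    target = atts.getValue("href");
                    break;
                case "img":
                case "iframe":
                    target = atts.getValue("src");
                    break;
                case "script":
                    // Only collected when the caller opted in; inline
                    // scripts have no src attribute and are skipped.
                    if (includeScripts) {
                        target = atts.getValue("src");
                    }
                    break;
                default:
                    break;
            }
            if (target != null) {
                links.add(name + ":" + target);
            }
        }

        List<String> getLinks() {
            return links;
        }
    }

    /** Parses well-formed XHTML text and returns the collected links. */
    static List<String> extract(String xhtml, boolean includeScripts) throws Exception {
        SimpleLinkHandler handler = new SimpleLinkHandler(includeScripts);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"app.js\"></script>"
                + "<link href=\"style.css\"/></head>"
                + "<body><script>var x = 1;</script>"
                + "<a href=\"page.html\">x</a><img src=\"pic.png\"/></body></html>";
        System.out.println(extract(page, false)); // [link:style.css, a:page.html, img:pic.png]
        System.out.println(extract(page, true));  // [script:app.js, link:style.css, a:page.html, img:pic.png]
    }
}
```

The sketch parses well-formed XHTML directly, so it sidesteps the separate question of whether an upstream HTML handler passes <script> elements through at all.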

I’ll take this discussion over to TIKA-1835 now.

— Ken 


> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jn...@grierforensics.com> wrote:
> 
> Thanks Ken,
>  
> I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script src="…"> belongs in there with the rest of them. What do you think?
>  
> Joe
>  
> From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
> Sent: Tuesday, April 05, 2016 3:48 PM
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
>  
> Hi Joe,
>  
>> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaegele@grierforensics.com> wrote:
>>  
>> Hi all,
>>  
>> I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content.
>>  
>> Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.
>  
> I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503
>  
> e.g. what about <link> elements?
>  
> — Ken
> 
> 
>>  
>> Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.
>>  
>> Thanks,
>> Joe Naegele
> 
> ----------------

Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

RE: script tags in LinkContentHandler

Posted by Joseph Naegele <jn...@grierforensics.com>.

Thanks Ken,

 

I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script src="…"> belongs in there with the rest of them. What do you think?

 

Joe

 

From: Ken Krugler [mailto:kkrugler_lists@transpac.com] 
Sent: Tuesday, April 05, 2016 3:48 PM
To: user@tika.apache.org
Subject: Re: script tags in LinkContentHandler

 

Hi Joe,

 

On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaegele@grierforensics.com> wrote:

 

Hi all,

 

I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content.

 

Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.

 

I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503

 

e.g. what about <link> elements?

 

— Ken


Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.

 

Thanks,

Joe Naegele

 

--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr

Re: script tags in LinkContentHandler

Posted by Ken Krugler <kk...@transpac.com>.

Hi Joe,

> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jn...@grierforensics.com> wrote:
> 
> Hi all,
>  
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content.
>  
> Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.

I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503

e.g. what about <link> elements?

— Ken

>  
> Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.
>  
> Thanks,
> Joe Naegele

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr