Posted to dev@nutch.apache.org by "Christian Johnsson (JIRA)" <ji...@apache.org> on 2012/08/28 02:54:07 UTC

[jira] [Created] (NUTCH-1461) Problem with TableUtil

Christian Johnsson created NUTCH-1461:
-----------------------------------------

             Summary: Problem with TableUtil
                 Key: NUTCH-1461
                 URL: https://issues.apache.org/jira/browse/NUTCH-1461
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: nutchgora
         Environment: Debian / CDH3 / Nutch 2.0 Release
            Reporter: Christian Johnsson


Affects parse and updatedb.

I think some malformed URLs got into HBase, but I can't find them.
They generate the error below. If I empty HBase and restart, the crawl runs for a couple of million indexed pages and then the error comes up again. Any tips on how to locate which row in the table generates this error?

2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
	at org.apache.hadoop.mapred.Child.main(Child.java:260)
2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
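
One rough way to look for such rows is to check each row key for the fields that unreverseUrl indexes into. The sketch below is only an illustration under stated assumptions: row keys are taken to look like com.example:http/index.html (reversed host, protocol, optional port, then the path), which is inferred from the splits[1] protocol lookup at TableUtil.java:98 and has not been checked against the Nutch source. ReversedKeyCheck and its methods are hypothetical names, and reading the keys out of HBase is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class ReversedKeyCheck {

  // True when the part of the key before the first '/' has a non-empty
  // host field and a non-empty protocol field (assumed key layout).
  static boolean isWellFormed(String reversedKey) {
    int pathBegin = reversedKey.indexOf('/');
    String hostPart = pathBegin == -1 ? reversedKey : reversedKey.substring(0, pathBegin);
    String[] fields = hostPart.split(":");
    return fields.length >= 2 && !fields[0].isEmpty() && !fields[1].isEmpty();
  }

  // Collects every key that would leave index 1 missing or empty.
  static List<String> findMalformed(Iterable<String> keys) {
    List<String> bad = new ArrayList<>();
    for (String key : keys) {
      if (!isWellFormed(key)) {
        bad.add(key);
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList(
        "com.example:http/index.html", // well formed
        "garbage/index.html",          // no protocol field
        ":file/tmp/a.txt");            // empty host field
    System.out.println(findMalformed(keys));
  }
}
```

Run over the keys of the web table, the returned list would be the rows to inspect or delete.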

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442876#comment-13442876 ] 

Christian Johnsson commented on NUTCH-1461:
-------------------------------------------

I did a small workaround just to keep my stuff going.
It is not the right fix and it still keeps trash in the db, but at least the job keeps running. The 0.001% that is wrong I'll have to live with for now, until I or someone else figures out a neat way to handle it.

in /org/apache/nutch/util/TableUtil.java

Change line 98:

- buf.append(splits[1]); // add protocol

to

if (splits.length == 1) {
  buf.append("http"); // no protocol field in the key: fall back to "http" so it doesn't crash
} else {
  buf.append(splits[1]); // add protocol
}

It would be nice to have a tool that walked through HBase and removed all URLs that aren't legit,
just in case someone like me doesn't bother to edit regex-urlfilter.txt to only allow real URLs.
It might be a good idea to have a rule in the default regex-urlfilter that accepts only valid domains.



                

[jira] [Updated] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Johnsson updated NUTCH-1461:
--------------------------------------

    Attachment: TabelUtil_Fix.patch

Quick fix in case there are some invalid domains in the database.
It prevents the job from crashing.
                

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446396#comment-13446396 ] 

Lewis John McGibbney commented on NUTCH-1461:
---------------------------------------------

Hi Christian, you make some suggestions here but I wonder if you can provide a patch to fix this issue? It would be greatly appreciated :0)
 
                

[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442876#comment-13442876 ] 

Christian Johnsson edited comment on NUTCH-1461 at 8/28/12 12:22 PM:
---------------------------------------------------------------------

I did a small workaround just to keep my stuff going.
It is not the right fix and it still keeps trash in the db, but at least the job keeps running. The 0.001% that is wrong I'll have to live with for now, until I or someone else figures out a neat way to handle it.

in /org/apache/nutch/util/TableUtil.java

Change line 98:

buf.append(splits[1]); // add protocol

to

if (splits.length == 1) {
  buf.append("http"); // no protocol field in the key: fall back to "http" so it doesn't crash
} else {
  buf.append(splits[1]); // add protocol
}

It would be nice to have a tool that walked through HBase and removed all URLs that aren't legit,
just in case someone like me doesn't bother to edit regex-urlfilter.txt to only allow real URLs.
It might be a good idea to have a rule in the default regex-urlfilter that accepts only valid domains.



                

[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] 

Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:29 AM:
--------------------------------------------------------------------

Sure, this one should do the trick.
I also changed regex-urlfilter.txt a little; it should filter out invalid domain names.
I attached that one too.

Good luck :-)
                

[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] 

Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:28 AM:
--------------------------------------------------------------------

Sure, this one should do the trick.
I also changed the following in regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&amp;%\$#\=~])*[^\.\,\)\(\s]$
#accept anything else
#+.


Good luck :-)
                

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488585#comment-13488585 ] 

Sebastian Nagel commented on NUTCH-1461:
----------------------------------------

Cf. NUTCH-1484: same error with file:// URLs which do not contain a host.
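
A small illustration of the file:// case, assuming the row key is built from the host that java.net.URL reports (an assumption about Nutch, not verified here). For a URL such as file:///tmp/a.txt that host is the empty string, so a key derived from it would have nothing in its host field. FileUrlHost is a hypothetical name.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class FileUrlHost {

  // Returns the host component as java.net.URL parses it.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("not a parsable URL: " + url, e);
    }
  }

  public static void main(String[] args) {
    // Brackets make an empty host visible in the output.
    System.out.println("[" + hostOf("file:///tmp/a.txt") + "]");
    System.out.println("[" + hostOf("http://example.com/a.txt") + "]");
  }
}
```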
                

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] 

Christian Johnsson commented on NUTCH-1461:
-------------------------------------------

Sure, this one should do the trick.
I also added the following line in regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&amp;%\$#\=~])*[^\.\,\)\(\s]$

Good luck :-)
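
To get a feel for what the rule above lets through, the expression can be tried against a few sample URLs. This is a sketch only: the pattern is copied verbatim from the comment (without the leading + that marks an accept rule, and including the &amp; sequence as posted), UrlFilterRegexDemo is a hypothetical name, and it uses plain java.util.regex, not the Nutch filter plugin. The sample URLs are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class UrlFilterRegexDemo {

  // Pattern text as posted in the comment, split over two lines for width.
  static final Pattern RULE = Pattern.compile(
      "^(http|https)\\://[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?"
      + "([a-zA-Z0-9\\-\\._\\?\\,\\'/\\\\\\+&amp;%\\$#\\=~])*[^\\.\\,\\)\\(\\s]$");

  static boolean accepts(String url) {
    return RULE.matcher(url).find();
  }

  public static void main(String[] args) {
    String[] samples = {
        "http://example.com/",
        "https://example.org/a?b=c",
        "http://localhost/",      // host has no dot
        "file:///tmp/a.txt",      // not http or https
        "http:///index.html"      // empty host
    };
    for (String url : samples) {
      System.out.println(url + " -> " + accepts(url));
    }
  }
}
```

With this pattern the first two samples are accepted and the last three are rejected, so hostless and non-HTTP URLs would not reach the table through this rule.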
                

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446518#comment-13446518 ] 

Ferdy Galema commented on NUTCH-1461:
-------------------------------------

Added comment in NUTCH-1448.
                

[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] 

Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:27 AM:
--------------------------------------------------------------------

Sure, this one should do the trick.
I also added the following line in regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&amp;%\$#\=~])*[^\.\,\)\(\s]$
# accept anything else
#+.


Good luck :-)
                

[jira] [Updated] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Johnsson updated NUTCH-1461:
--------------------------------------

    Attachment: regex-urlfilter.txt
    
> Problem with TableUtil
> ----------------------
>
>                 Key: NUTCH-1461
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1461
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: nutchgora
>         Environment: Debian / CDH3 / Nutch 2.0 Release
>            Reporter: Christian Johnsson
>         Attachments: regex-urlfilter.txt, TabelUtil_Fix.patch
>
>
> Affects parse and updatedb and parse.
> Think i got some missformated urls into hbase but i can't fin them.
> It generates this error though. If i empty hbase and restart it goes for a couple of million pages indexed then it comes up again. Any tips on how to locate what row in the table that genereates this error?
> 2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.ArrayIndexOutOfBoundsException: 1
> 	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
> 	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
> 	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:260)
> 2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task


[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446490#comment-13446490 ] 

Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:42 AM:
--------------------------------------------------------------------

Quick fix in case there are some invalid domains in the database.
It prevents the job from crashing.
The best approach would be to ignore the bad entry entirely and skip to the next one, but my Java skills are quite limited :-)
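
For illustration only, the "skip to the next entry" idea could be sketched roughly as below. This is not the attached patch and not Nutch's actual TableUtil code. It assumes row keys look like "com.example.www:http[:port]/path" (reversed host, then protocol, then an optional port), and the class and method names (ReversedKeyCheck, looksWellFormed) are hypothetical.

```java
public class ReversedKeyCheck {

  // Returns true when the part of the key before the first '/' has at
  // least two non-empty ':'-separated fields (reversed host and protocol).
  // ASSUMPTION: keys are laid out as "com.example.www:http[:port]/path".
  public static boolean looksWellFormed(String reversedKey) {
    if (reversedKey == null || reversedKey.isEmpty()) {
      return false;
    }
    int pathBegin = reversedKey.indexOf('/');
    String hostPart =
        pathBegin == -1 ? reversedKey : reversedKey.substring(0, pathBegin);
    // A limit of -1 keeps trailing empty fields, so "host:" yields two fields.
    String[] fields = hostPart.split(":", -1);
    return fields.length >= 2 && !fields[0].isEmpty() && !fields[1].isEmpty();
  }

  public static void main(String[] args) {
    String[] keys = {
      "com.example.www:http/index.html", // host and protocol present
      "no-protocol-here/index.html"      // no ':' before the path
    };
    for (String key : keys) {
      if (!looksWellFormed(key)) {
        // Log the key and move on instead of letting the task fail.
        System.out.println("skipping malformed key: " + key);
        continue;
      }
      System.out.println("ok: " + key);
    }
  }
}
```

Logging the rejected key also helps with the original question of locating the offending row in the table.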
                
      was (Author: mr.johnsson):
    Quick fix in case there are some invalid domains in the database.
It prevents the job from crashing.
                  


[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] 

Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:27 AM:
--------------------------------------------------------------------

Sure, this one should do the trick.
I also added the following line to regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$
#accept anything else
#+.
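
To get a feel for what this rule lets through, the pattern can be tried out with plain java.util.regex, as in the rough sketch below. The leading "+" is the filter's accept marker and is not part of the regex. The sketch assumes the filter applies the pattern with find() semantics, and it writes the character class with a plain "&" (an assumption about the intended character). The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class UrlFilterRegexDemo {

  // The accept rule from regex-urlfilter.txt without the leading '+',
  // escaped for a Java string literal.
  private static final Pattern ACCEPT = Pattern.compile(
      "^(http|https)\\://[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?"
      + "([a-zA-Z0-9\\-\\._\\?\\,\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(\\s]$");

  // ASSUMPTION: a URL is accepted when the pattern is found in it.
  public static boolean accepts(String url) {
    return ACCEPT.matcher(url).find();
  }

  public static void main(String[] args) {
    String[] urls = {
      "http://www.example.com/index.html", // dotted host name
      "http://localhost/index.html",       // host without a dot
      "ftp://example.com/"                 // scheme other than http/https
    };
    for (String url : urls) {
      System.out.println((accepts(url) ? "accept " : "reject ") + url);
    }
  }
}
```

Since "#+." is commented out, anything this rule does not accept is no longer picked up by the catch-all, so it is worth checking a few of your own seed URLs against it first.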


Good luck :-)
                
      was (Author: mr.johnsson):
    Sure, this one should do the trick.
I also added the following line to regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$
# accept anything else
#+.


Good luck :-)
                  


[jira] [Comment Edited] (NUTCH-1461) Problem with TableUtil

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ] 

Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:43 AM:
--------------------------------------------------------------------

Sure, this one should do the trick.
I also changed regex-urlfilter.txt a little; it should filter out invalid domain names.
Attached that one too.
                
      was (Author: mr.johnsson):
    Sure, this one should do the trick.
I also changed regex-urlfilter.txt a little; it should filter out invalid domain names.
Attached that one too.

Good luck :-)
                  
