Posted to dev@nutch.apache.org by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org> on 2006/03/06 23:58:02 UTC

[jira] Created: (NUTCH-224) Nutch doesn't handle Korean text at all

Nutch doesn't handle Korean text at all
---------------------------------------

         Key: NUTCH-224
         URL: http://issues.apache.org/jira/browse/NUTCH-224
     Project: Nutch
        Type: Bug
  Components: indexer  
    Versions: 0.7.1    
    Reporter: KuroSaka TeruHiko


I was browsing NutchAnalysis.jj and found that
Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
a Unicode character of the hex value xxxx) are not
part of the LETTER or CJK class.  It seems to me that
Nutch cannot handle Korean documents at all.

I posted the above message to the nutch-user ML, and Cheolgoo Kang [appler@gmail.com]
replied:
------------------------------------------------------------------------------------
There was a similar issue with Lucene's StandardTokenizer.jj.

http://issues.apache.org/jira/browse/LUCENE-444

and

http://issues.apache.org/jira/browse/LUCENE-461

I have almost no experience with Nutch, but you could handle it like
those issues above.
------------------------------------------------------------------------------------

Both fixes should probably be ported back to NutchAnalysis.jj.
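
For illustration, here is a sketch of what such a port might look like:
extending the CJK token class in NutchAnalysis.jj to cover the Hangul
Syllables block, the same way LUCENE-444 extended StandardTokenizer.jj.
The pre-existing ranges below are the StandardTokenizer.jj ones of that
era, not copied from NutchAnalysis.jj, so verify them against the actual
grammar before applying anything:

    | <#CJK:                                 // non-alphabetic CJK characters
        [
          "\u3040"-"\u318f",
          "\u3300"-"\u337f",
          "\u3400"-"\u3d2d",
          "\u4e00"-"\u9fff",
          "\uf900"-"\ufaff",
          "\uac00"-"\ud7af"                  // added: Hangul Syllables
        ]
      >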







RE: No space left on device

Posted by an...@orbita1.ru.
Yes, I use DFS.
How should I configure Nutch to deal with this disk space problem? How can I
control the number of smaller files?

-----Original Message-----
From: Dennis Kubes [mailto:nutch-dev@dragonflymc.com] 
Sent: Wednesday, June 14, 2006 5:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: No space left on device
Importance: High

The tasktracker requires "intermediate" space while performing the map 
and reduce functions.  Many smaller files are produced during the map 
and reduce processes and are deleted when the processes finish.  If you 
are using the DFS, then more disk space is required than is actually used, 
since disk space is grabbed in blocks.

Dennis
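
If the partition holding those intermediate files is the one filling up,
one common remedy is to point the map/reduce local directory at a larger
disk.  A minimal hadoop-site.xml sketch follows; the property name is from
Hadoop 0.x-era releases and the path is hypothetical, so verify both
against the version bundled with your build:

    <property>
      <!-- where map/reduce intermediate files are written; point it at
           a partition with enough free space for the sort phase -->
      <name>mapred.local.dir</name>
      <value>/big-disk/hadoop/mapred/local</value>
    </property>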

anton@orbita1.ru wrote:
> I'm using Nutch v0.8 and have 3 computers.
> One of my tasktrackers always goes down.
> This occurs during indexing (index crawl/indexes). The server with the
> crashed tasktracker currently has 53G of free disk space, with only 11G used.
> How can I solve this problem? Why does the tasktracker require so much free
> space on the HDD?
>
> Piece of Log with error:
>
> 060613 151840 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151841 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151842 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151843 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151844 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151845 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151846 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151847 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151847 SEVERE FSError, exiting: java.io.IOException: No space left on device
> 060613 151847 task_0083_r_000001_0  SEVERE FSError from child
> 060613 151847 task_0083_r_000001_0 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:69)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.j
> 060613 151847 task_0083_r_000001_0      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> 060613 151847 task_0083_r_000001_0      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> 060613 151847 task_0083_r_000001_0      at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> 060613 151847 task_0083_r_000001_0      at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.close(SequenceFile.java:598)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:533)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:519)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.mapred.TaskTracker$Chi
> 060613 151847 task_0083_r_000001_0 Caused by: java.io.IOException: No space left on device
> 060613 151847 task_0083_r_000001_0      at java.io.FileOutputStream.writeBytes(Native Method)
> 060613 151847 task_0083_r_000001_0      at java.io.FileOutputStream.write(FileOutputStream.java:260)
> 060613 151848 task_0083_r_000001_0      at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
> 060613 151848 task_0083_r_000001_0      ... 11 more
> 060613 151849 Server connection on port 50050 from 10.0.0.3: exiting
> 060613 151854 task_0083_m_000001_0 done; removing files.
> 060613 151855 task_0083_m_000003_0 done; removing files.
>
>
>
>   



Re: No space left on device

Posted by Dennis Kubes <nu...@dragonflymc.com>.
The tasktracker requires "intermediate" space while performing the map 
and reduce functions.  Many smaller files are produced during the map 
and reduce processes and are deleted when the processes finish.  If you 
are using the DFS, then more disk space is required than is actually used, 
since disk space is grabbed in blocks.

Dennis

anton@orbita1.ru wrote:
> I'm using Nutch v0.8 and have 3 computers.
> One of my tasktrackers always goes down.
> This occurs during indexing (index crawl/indexes). The server with the
> crashed tasktracker currently has 53G of free disk space, with only 11G used.
> How can I solve this problem? Why does the tasktracker require so much free
> space on the HDD?
>
> Piece of Log with error:
>
> 060613 151840 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151841 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151842 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151843 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151844 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151845 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151846 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151847 task_0083_r_000001_0 0.5% reduce > sort
> 060613 151847 SEVERE FSError, exiting: java.io.IOException: No space left on device
> 060613 151847 task_0083_r_000001_0  SEVERE FSError from child
> 060613 151847 task_0083_r_000001_0 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:69)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.j
> 060613 151847 task_0083_r_000001_0      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> 060613 151847 task_0083_r_000001_0      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> 060613 151847 task_0083_r_000001_0      at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> 060613 151847 task_0083_r_000001_0      at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.close(SequenceFile.java:598)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:533)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:519)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
> 060613 151847 task_0083_r_000001_0      at org.apache.hadoop.mapred.TaskTracker$Chi
> 060613 151847 task_0083_r_000001_0 Caused by: java.io.IOException: No space left on device
> 060613 151847 task_0083_r_000001_0      at java.io.FileOutputStream.writeBytes(Native Method)
> 060613 151847 task_0083_r_000001_0      at java.io.FileOutputStream.write(FileOutputStream.java:260)
> 060613 151848 task_0083_r_000001_0      at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
> 060613 151848 task_0083_r_000001_0      ... 11 more
> 060613 151849 Server connection on port 50050 from 10.0.0.3: exiting
> 060613 151854 task_0083_m_000001_0 done; removing files.
> 060613 151855 task_0083_m_000003_0 done; removing files.
>
>
>
>   

No space left on device

Posted by an...@orbita1.ru.
I'm using Nutch v0.8 and have 3 computers.
One of my tasktrackers always goes down.
This occurs during indexing (index crawl/indexes). The server with the crashed
tasktracker currently has 53G of free disk space, with only 11G used.
How can I solve this problem? Why does the tasktracker require so much free
space on the HDD?

Piece of Log with error:

060613 151840 task_0083_r_000001_0 0.5% reduce > sort
060613 151841 task_0083_r_000001_0 0.5% reduce > sort
060613 151842 task_0083_r_000001_0 0.5% reduce > sort
060613 151843 task_0083_r_000001_0 0.5% reduce > sort
060613 151844 task_0083_r_000001_0 0.5% reduce > sort
060613 151845 task_0083_r_000001_0 0.5% reduce > sort
060613 151846 task_0083_r_000001_0 0.5% reduce > sort
060613 151847 task_0083_r_000001_0 0.5% reduce > sort
060613 151847 SEVERE FSError, exiting: java.io.IOException: No space left on device
060613 151847 task_0083_r_000001_0  SEVERE FSError from child
060613 151847 task_0083_r_000001_0 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:69)
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.j
060613 151847 task_0083_r_000001_0      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
060613 151847 task_0083_r_000001_0      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
060613 151847 task_0083_r_000001_0      at java.io.DataOutputStream.flush(DataOutputStream.java:106)
060613 151847 task_0083_r_000001_0      at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.close(SequenceFile.java:598)
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:533)
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:519)
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
060613 151847 task_0083_r_000001_0      at org.apache.hadoop.mapred.TaskTracker$Chi
060613 151847 task_0083_r_000001_0 Caused by: java.io.IOException: No space left on device
060613 151847 task_0083_r_000001_0      at java.io.FileOutputStream.writeBytes(Native Method)
060613 151847 task_0083_r_000001_0      at java.io.FileOutputStream.write(FileOutputStream.java:260)
060613 151848 task_0083_r_000001_0      at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
060613 151848 task_0083_r_000001_0      ... 11 more
060613 151849 Server connection on port 50050 from 10.0.0.3: exiting
060613 151854 task_0083_m_000001_0 done; removing files.
060613 151855 task_0083_m_000003_0 done; removing files.




free disk space

Posted by an...@orbita1.ru.
I'm using Nutch v0.8 and have 3 computers. Two of them run a datanode and a
tasktracker; the third runs the namenode and jobtracker.  Do I need more disk
space on the machines running the tasktrackers and jobtracker as the number of
pages processed grows along with the size of the database?  Would I be able to
add a third datanode once I run out of free disk space on the computers where
the datanodes are installed?

How much free disk space do I need in order for the task- and jobtrackers to
work properly?
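
For reference, a sketch of the usual procedure for adding a node in
Hadoop 0.x-era clusters; the script and file names vary by release and the
hostname below is hypothetical, so verify against the Hadoop version
bundled with your Nutch build:

    # On the new machine, install the same Hadoop/Nutch build and conf/ files.
    # On the master, list the new host so the cluster start scripts see it:
    echo "node3.example.com" >> conf/slaves      # hypothetical hostname
    # On the new machine, start the daemons:
    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker

New DFS blocks will then be placed on the added datanode; as far as I know,
releases of this era do not automatically rebalance existing blocks onto it.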



[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all

Posted by "Sean Dean (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12416108 ] 

Sean Dean commented on NUTCH-224:
---------------------------------

I'm still using 0.7.1 and also see this problem.

In the Nutch 0.7.2 release they upgraded to Lucene 1.9.1, which included the above fixes for Korean language support.

Have you tried 0.7.2 or 0.8-dev, with any luck?
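
A quick way to check the Lucene side on a given build is to print what the
analyzer does with a Hangul string.  This is a sketch using the Lucene
1.9-era token API; the class names are standard Lucene, but adjust the
analyzer choice and classpath to what your Nutch version actually ships:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class HangulTokenCheck {
      public static void main(String[] args) throws Exception {
        // "\uD55C\uAD6D\uC5B4" is Korean ("hangugeo").  With the
        // LUCENE-444/461 fixes present, it should come back as tokens
        // rather than being silently dropped by the tokenizer.
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("content", new StringReader("\uD55C\uAD6D\uC5B4 test"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
          System.out.println(t.termText() + " [" + t.type() + "]");
        }
      }
    }

The same kind of check against Nutch's own analyzer would show whether the
port to NutchAnalysis.jj is still needed on your version.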

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>          Key: NUTCH-224
>          URL: http://issues.apache.org/jira/browse/NUTCH-224
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.7.1
>     Reporter: KuroSaka TeruHiko
>
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class.  It seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message to the nutch-user ML, and Cheolgoo Kang [appler@gmail.com]
> replied:
> ------------------------------------------------------------------------------------
> There was a similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you could handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.
