Posted to user@nutch.apache.org by Junqiang Zhang <ju...@gmail.com> on 2018/10/28 20:09:46 UTC

After upgrading Mac OS to Mojave 10.14, Nutch is trying to inject from the .DS_Store file inside its seed folder.

Hello,

I recently upgraded the OS of my MacBook Pro to Mojave 10.14. I run
and debug Apache Nutch on this laptop. In macOS, .DS_Store is a file
that stores custom attributes of its containing folder. Before the OS
upgrade, the .DS_Store file inside the Nutch seed folder was not
visible to Nutch, and Nutch did not try to read seed URLs from it.
After the upgrade, Nutch picks up the .DS_Store file as a seed file,
but then reports that this input path does not exist.

The relevant log shown in Terminal after I ran the command "bin/crawl
-i -w 30s -s dealfar/urls dealfar/crawl 1" is copied below. How can I
run and debug Nutch on Mojave 10.14? Can this issue be fixed by
modifying the Nutch source code? Thanks.



XXXXXmbp:apache-nutch-1.15 XXXXX$ bin/crawl -i -w 30s -s dealfar/urls dealfar/crawl 1
Time to wait (--wait) = 30 sec.
Injecting seed URLs
/Users/XXXXX/Documents/apache-nutch-1.15/bin/nutch inject dealfar/crawl/crawldb dealfar/urls
Injector: starting at 2018-10-29 03:01:13
Injector: crawlDb: dealfar/crawl/crawldb
Injector: urlDir: dealfar/urls
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/Users/XXXXX/Documents/apache-nutch-1.15/dealfar/urls/.DS_Store
Injecting seed URL file file:/Users/XXXXX/Documents/apache-nutch-1.15/dealfar/urls/seed.txt
Injector job failed: Input path does not exist: file:/Users/XXXXX/Documents/apache-nutch-1.15/dealfar/urls/.DS_Store
Injector: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/XXXXX/Documents/apache-nutch-1.15/dealfar/urls/.DS_Store
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
    at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:436)
    at org.apache.nutch.crawl.Injector.run(Injector.java:570)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:535)

Error running:
  /Users/XXXXX/Documents/apache-nutch-1.15/bin/nutch inject dealfar/crawl/crawldb dealfar/urls
Failed with exit value 255.

Re: After upgrading Mac OS to Mojave 10.14, Nutch is trying to inject from the .DS_Store file inside its seed folder.

Posted by Junqiang Zhang <ju...@gmail.com>.
Hi,

I made the file invisible again. It works very well now. Thanks.

Best,
Junqiang
On Mon, Oct 29, 2018 at 8:08 PM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:
>
> Hi,
>
> Thanks for the problem report. However, I would argue against handling such
> specific cases inside Nutch: it would make the Nutch code extremely complex and
> require extra effort to stay portable across operating systems.
>
> Why not just make the file invisible again?
>
> Or if this isn't possible:
> - write all seeds into a single file and
> - pass this single seed file to Injector
>   (the seed list can be both - a directory
>    or a single file)
>
> Best,
> Sebastian

Re: After upgrading Mac OS to Mojave 10.14, Nutch is trying to inject from the .DS_Store file inside its seed folder.

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

Thanks for the problem report. However, I would argue against handling such
specific cases inside Nutch: it would make the Nutch code extremely complex and
require extra effort to stay portable across operating systems.

Why not just make the file invisible again?

Or if this isn't possible:
- write all seeds into a single file and
- pass this single seed file to Injector
  (the seed list can be both - a directory
   or a single file)

Best,
Sebastian
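
The single-seed-file workaround above could be scripted roughly as follows.
This is only a minimal sketch, not part of Nutch: the helper name
merge_seed_files and the output path dealfar/seeds.txt are hypothetical, and
it assumes the seed files are UTF-8 text with one seed per line. It
concatenates every regular, non-hidden file in the seed folder into one file,
so dotfiles such as .DS_Store are never handed to the Injector.

```python
from pathlib import Path


def merge_seed_files(seed_dir, merged_file):
    """Concatenate all non-hidden files in seed_dir into merged_file.

    Returns the number of non-empty seed lines written.
    (Hypothetical helper for illustration; not part of Nutch.)
    """
    seed_dir = Path(seed_dir)
    merged_file = Path(merged_file)
    count = 0
    with merged_file.open("w", encoding="utf-8") as out:
        for path in sorted(seed_dir.iterdir()):
            # Skip dotfiles such as .DS_Store and anything that is not a regular file.
            if path.name.startswith(".") or not path.is_file():
                continue
            # Skip the output file itself in case it lives inside seed_dir.
            if path.resolve() == merged_file.resolve():
                continue
            for line in path.read_text(encoding="utf-8").splitlines():
                # Drop blank lines; keep everything else (including tab-separated metadata).
                if line.strip():
                    out.write(line.rstrip() + "\n")
                    count += 1
    return count
```

After calling merge_seed_files("dealfar/urls", "dealfar/seeds.txt"), the
merged file, rather than the folder, would be passed to the Injector, e.g.
"bin/nutch inject dealfar/crawl/crawldb dealfar/seeds.txt".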

On 10/28/18 9:58 PM, Junqiang Zhang wrote:
> If a folder used to hold seed URL files is created after the OS
> is upgraded to Mojave 10.14, the .DS_Store file inside the folder is
> NOT visible to Nutch. If a folder was created before the upgrade, the
> .DS_Store file inside this old folder is visible to Nutch.


Re: After upgrading Mac OS to Mojave 10.14, Nutch is trying to inject from the .DS_Store file inside its seed folder.

Posted by Junqiang Zhang <ju...@gmail.com>.
If a folder used to hold seed URL files is created after the OS
is upgraded to Mojave 10.14, the .DS_Store file inside the folder is
NOT visible to Nutch. If a folder was created before the upgrade, the
.DS_Store file inside this old folder is visible to Nutch.