Posted to dev@nutch.apache.org by "Gabriele Kahlout (JIRA)" <ji...@apache.org> on 2011/03/27 11:12:05 UTC
[jira] [Created] (NUTCH-972) Mergedb doesn't merge with empty
directory, as is the case with merge (for indexes)
Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)
-----------------------------------------------------------------------------------
Key: NUTCH-972
URL: https://issues.apache.org/jira/browse/NUTCH-972
Project: Nutch
Issue Type: Bug
Components: storage
Affects Versions: 1.2
Reporter: Gabriele Kahlout
Priority: Minor
Fix For: 1.3
This is just a case of unexpected behavior. The following series of commands works with bin/nutch merge (for merging indexes) but not with bin/nutch mergedb (for merging crawldbs):
it_crawldb="crawl/crawldb"   # not set in the original snippet; value taken from the command in the output below
allcrawldb="crawl/allcrawldb"
temp_crawldb="crawl/temp_crawldb"
merge_dbs="$it_crawldb $allcrawldb"
# if [[ ! -d $allcrawldb ]]
# then
#   merge_dbs="$it_crawldb"
# fi
# Uncomment the above and mergedb will work fine.
bin/nutch mergedb $temp_crawldb $merge_dbs
rm -r $it_crawldb $allcrawldb crawl/segments crawl/linkdb
mv $temp_crawldb $allcrawldb
This is the exception that occurs:
bin/nutch mergedb crawl/temp_crawldb crawl/crawldb crawl/allcrawldb
CrawlDb merge: starting at 2011-03-27 10:13:06
Adding crawl/crawldb
Adding crawl/allcrawldb
CrawlDb merge: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/simpatico/nutch-1.2/crawl/allcrawldb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.CrawlDbMerger.merge(CrawlDbMerger.java:126)
    at org.apache.nutch.crawl.CrawlDbMerger.run(CrawlDbMerger.java:187)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.CrawlDbMerger.main(CrawlDbMerger.java:159)
Besides the scripting workaround, I've attached a patch that skips adding an empty folder to the collection of dbs to merge. I've also added a log message listing which dbs actually get added, consistent with the merge interface.
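
For anyone staying on 1.2, the scripting workaround could be generalized to any number of crawldbs along the following lines. This is only a sketch and not the attached patch: filter_crawldbs is a hypothetical helper name, and testing for a "current" subdirectory is an assumption based on the path reported in the exception above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): print only those crawldb
# directories that contain a "current" subdirectory, i.e. the path the
# InvalidInputException above reports as missing.
filter_crawldbs() {
  local db
  for db in "$@"; do
    if [ -d "$db/current" ]; then
      echo "Adding $db" >&2        # log to stderr so stdout stays a clean list
      printf '%s\n' "$db"
    else
      echo "Skipping $db (no current/ subdirectory)" >&2
    fi
  done
}

# Usage sketch, reusing the variable names from the script above:
#   merge_dbs=$(filter_crawldbs "$it_crawldb" "$allcrawldb")
#   bin/nutch mergedb "$temp_crawldb" $merge_dbs
```

Note that $merge_dbs is left unquoted in the usage sketch on purpose, so that it splits into separate arguments; this assumes the crawldb paths contain no whitespace.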
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-972) Mergedb doesn't merge with empty
directory, as is the case with merge (for indexes)
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-972.
-------------------------------
Bulk close of resolved issues for 1.3.
> Mergedb doesn't merge with empty directory, as is the case with merge (for indexes)
> -----------------------------------------------------------------------------------
>
> Key: NUTCH-972
> URL: https://issues.apache.org/jira/browse/NUTCH-972
> Project: Nutch
> Issue Type: Bug
> Components: storage
> Affects Versions: 1.2
> Reporter: Gabriele Kahlout
> Priority: Minor
> Labels: patch
> Fix For: 1.3
>
> Attachments: check_empty.diff
[jira] [Updated] (NUTCH-972) Mergedb doesn't merge with empty
directory, as is the case with merge (for indexes)
Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabriele Kahlout updated NUTCH-972:
-----------------------------------
Attachment: check_empty.diff
[jira] [Resolved] (NUTCH-972) Mergedb doesn't merge with empty
directory, as is the case with merge (for indexes)
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche resolved NUTCH-972.
---------------------------------
Resolution: Fixed
Committed revision 1090199.
Thanks, Gabriele. In the future, could you use 'svn diff' to generate patches? See [http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer] for best practices.