Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2015/03/15 05:25:39 UTC
[jira] [Resolved] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann resolved NUTCH-1957.
--------------------------------------
Resolution: Fixed
Fix Version/s: 1.10
- patch applied in r1666777
{noformat}
[chipotle:~/tmp/nutch-1.10-trunk] mattmann% svn commit -m "Fix for NUTCH-1957 FileDumper output file name collisions contributed by Renxia Wang this closes #12"
Sending CHANGES.txt
Sending src/java/org/apache/nutch/tools/FileDumper.java
Adding src/java/org/apache/nutch/util/DumpFileUtil.java
Adding src/test/org/apache/nutch/util/DumpFileUtilTest.java
Transmitting file data ....
Committed revision 1666777.
{noformat}
> FileDumper output file name collisions
> --------------------------------------
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Reporter: Renxia Wang
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: dumper, filename, tools
> Fix For: 1.10
>
> Attachments: NUTCH-1957.patch
>
>
> The FileDumper extracts the file base name and extension from the URL and uses <basename>.<extension> as the name of the dumped file. For example, given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, <basename>.<extension> is project.html.
> Code from FileDumper.java:
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
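> This causes file name collisions and leads to data loss when using bin/nutch dump: URLs that share the same last path segment map to the same file name, and every file after the first is skipped.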
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html
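The collision described in the issue can be reproduced with a minimal sketch. It uses only the Java standard library, so naiveName is a simplified stand-in for the FilenameUtils calls quoted above, not the Commons IO implementation. The class DumpNameSketch and the methods naiveName, urlDigest and uniqueName are hypothetical names chosen for illustration. Prefixing the name with a digest of the full URL is one possible way to disambiguate; this message does not show the contents of the applied patch, so it should not be read as the code in DumpFileUtil.java.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DumpNameSketch {

    // Stand-in for the reported logic: only the last path segment of the
    // URL contributes to the file name, so the rest of the path is lost.
    public static String naiveName(String url) {
        String last = url.substring(url.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        String baseName = dot >= 0 ? last.substring(0, dot) : last;
        String extension = dot >= 0 ? last.substring(dot + 1) : "";
        return baseName + "." + extension;
    }

    // Hex-encoded MD5 digest of the full URL (assumption: any stable
    // digest of the whole URL would serve the same purpose).
    public static String urlDigest(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical disambiguated name: the digest of the whole URL is
    // prepended, so the full path influences the file name.
    public static String uniqueName(String url) {
        return urlDigest(url) + "_" + naiveName(url);
    }

    public static void main(String[] args) {
        String a = "https://www.aoncadis.org/contact/Carin%20Ashjian/project.html";
        String b = "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html";
        // Both URLs reduce to project.html under the naive scheme.
        System.out.println(naiveName(a) + " / " + naiveName(b));
        // With the digest prefix the two names are expected to differ.
        System.out.println(uniqueName(a));
        System.out.println(uniqueName(b));
    }
}
```

The digest prefix makes names long and opaque, but it keeps the original base name and extension readable at the end of the file name.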
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)