Posted to dev@nutch.apache.org by "Giuseppe Totaro (JIRA)" <ji...@apache.org> on 2015/03/11 17:26:38 UTC

[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions

    [ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357127#comment-14357127 ] 

Giuseppe Totaro commented on NUTCH-1957:
----------------------------------------

Hi [~zhique], I agree with your description. With this file-naming scheme, collisions can occur: if two or more files have the same basename but different pathnames, only the first one is written, because all deserialized files are placed in the same outputDir folder. The CommonCrawlDataDumper tool currently works the same way.
I am working on a fix in the CommonCrawlDataDumper tool (the same problem exists in FileDumper). We can either use a unique "key" value as the filename (though it could be very long) or reproduce the structure/hierarchy of the input. In the latter case, each output file keeps the same pathname as the original one.
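To make the two options concrete, here is a rough sketch, not the actual patch. The class and method names (DumpFileNames, hashedName, mirroredPath) are hypothetical, as are the choice of SHA-256 to keep the unique key at a fixed length and the "index.html" fallback for URLs that end in "/".

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only; not part of Nutch.
final class DumpFileNames {

    // Option 1: derive a unique, fixed-length key from the full URL.
    // Hashing (an assumption here) avoids the "very long filename" problem.
    static String hashedName(String url, String extension) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return extension.isEmpty() ? hex.toString() : hex + "." + extension;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Option 2: mirror the input hierarchy as <outputDir>/<host>/<path>.
    static Path mirroredPath(Path outputDir, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/index.html";          // assumed default for bare hosts
        } else if (path.endsWith("/")) {
            path = path + "index.html";    // assumed default for directory URLs
        }
        Path base = outputDir.normalize();
        Path resolved = base.resolve(host + path).normalize();
        // Refuse paths such as "/../.." that would escape outputDir.
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes output dir: " + url);
        }
        return resolved;
    }
}
```

With the first option, two project.html pages from different directories get different names because the whole URL is hashed. With the second, they land in different subdirectories of outputDir, so the parent directories have to be created before writing.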
Please give your feedback.
Thank you,
Giuseppe

> FileDumper output file name collisions
> --------------------------------------
>
>                 Key: NUTCH-1957
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 1.10
>            Reporter: Renxia Wang
>            Priority: Minor
>              Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses <basename>.<extension> as the name of the dumped file (e.g. given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the <basename>.<extension> will be project.html).
> Code from FileDumper.java: 
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using bin/nutch dump.
> Sample logs:
> 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html
