Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2013/05/22 09:21:23 UTC
[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move
data formats used to parse "lastModified" to a config file.
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1190:
-----------------------------------
Fix Version/s: 1.8
> MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-1190
> URL: https://issues.apache.org/jira/browse/NUTCH-1190
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.4
> Environment: jdk6
> Reporter: Zhang JinYan
> Fix For: 2.3, 1.8
>
> Attachments: date-styles.txt, MoreIndexingFilter.patch, NUTCH-1190-trunk.patch
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> Date formats can be diverse, so why not move them to an extra config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
> public void setConf(Configuration conf) {
>   this.conf = conf;
>   MIME = new MimeUtil(conf);
>
>   URL res = conf.getResource("date-styles.txt");
>   if (res == null) {
>     LOG.error("Can't find resource: date-styles.txt");
>   } else {
>     try {
>       List lines = FileUtils.readLines(new File(res.getFile()));
>       for (int i = 0; i < lines.size(); i++) {
>         String dateStyle = (String) lines.get(i);
>         if (StringUtils.isBlank(dateStyle)) {
>           lines.remove(i);
>           i--;
>           continue;
>         }
>         dateStyle = StringUtils.trim(dateStyle);
>         if (dateStyle.startsWith("#")) {
>           lines.remove(i);
>           i--;
>           continue;
>         }
>         lines.set(i, dateStyle);
>       }
>       dateStyles = new String[lines.size()];
>       lines.toArray(dateStyles);
>     } catch (IOException e) {
>       LOG.error("Failed to load resource: date-styles.txt");
>     }
>   }
> }
> {code}
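For illustration only, here is a minimal standalone sketch of the same filtering rule (skip blank lines and lines starting with "#", trim the rest). It is not the code from the attached patch. It assumes Java 7 or later and reads from a Reader instead of a File, so it would also work if the resource lived inside a jar; the class name DateStyleLoader, the method name loadDateStyles, and the sample patterns are hypothetical. Inside the filter, something like Hadoop's Configuration.getConfResourceAsReader("date-styles.txt") could presumably supply the Reader, but that wiring is an assumption and is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or of the attached patch.
final class DateStyleLoader {

    /**
     * Reads one date pattern per line. Blank lines and lines whose first
     * non-whitespace character is '#' are skipped; other lines are trimmed.
     */
    public static String[] loadDateStyles(Reader source) throws IOException {
        List<String> styles = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue; // comment or blank line
                }
                styles.add(trimmed);
            }
        }
        return styles.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // Made-up file content, standing in for conf/date-styles.txt.
        String sample = "# one pattern per line\n"
            + "\n"
            + "  EEE, dd MMM yyyy HH:mm:ss zzz  \n"
            + "yyyy-MM-dd HH:mm:ss\n";
        for (String style : loadDateStyles(new StringReader(sample))) {
            System.out.println(style);
        }
    }
}
```

Collecting the accepted lines into a new list avoids the remove-and-decrement bookkeeping of the loop above, and returning an empty array on an empty file means callers never see a null pattern array.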
> Then parse "lastModified" like this (sample):
> {code}
> private long getTime(String date, String url) {
>   ......
>   Date parsedDate = DateUtils.parseDate(date, dateStyles);
>   time = parsedDate.getTime();
>   ......
>   return time;
> }
> {code}
> This patch also contains the patch for [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.
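The getTime sample in the description hands the whole pattern array to DateUtils.parseDate from Commons Lang, which tries the patterns in turn. As a rough sketch of that idea using only the JDK, the following tries each pattern in order and accepts the first one that consumes the entire input. The class name MultiPatternDateParser, the method name parseTime, the -1 return value for "no pattern matched", and the fixed UTC/English settings are all assumptions made for this example; the elided parts of the original getTime are not reconstructed here.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch or of the attached patch.
final class MultiPatternDateParser {

    /**
     * Tries each pattern in order and returns the epoch time in milliseconds
     * for the first pattern that matches the whole string, or -1 if none does.
     */
    public static long parseTime(String date, String[] patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            // Assumption: values without an explicit zone are treated as UTC.
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date parsed = format.parse(date, pos); // returns null on failure
            if (parsed != null && pos.getIndex() == date.length()) {
                return parsed.getTime();
            }
        }
        return -1L; // assumed sentinel for "unparseable"
    }

    public static void main(String[] args) {
        // Example patterns, as they might appear in date-styles.txt.
        String[] patterns = {
            "EEE, dd MMM yyyy HH:mm:ss zzz",
            "yyyy-MM-dd HH:mm:ss"
        };
        System.out.println(parseTime("2013-05-22 09:21:23", patterns));
        System.out.println(parseTime("not a date", patterns));
    }
}
```

Because the patterns are tried in order, more specific formats should come before looser ones in the config file. A new SimpleDateFormat is created per call because the class is not thread-safe.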