Posted to common-dev@hadoop.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2008/03/17 18:44:27 UTC

[jira] Updated: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat

     [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-449:
---------------------------------

    Attachment: filtering_v2.patch

After spending some time thinking about this patch, I have redesigned the API. The changes are:

* Refactored WritableFilter to Filter, so that a Filter can be applied to non-Writables (in line with the serialization framework)
* Added a Stringifier interface and a default implementation that uses the Hadoop serialization framework. Ordinary objects can now be kept in the configuration. I acknowledge the performance loss in String.equals() comparisons, but we had to either pass the actual objects in the configuration or not use filtering at all.
* Added FilterEngine to evaluate postfix filter expressions
* Added OR, AND, NOT Filters
* Fixed synchronization issue in MessageDigest
* Moved filtering into the core framework instead of a library.
* Changed the API so that JobConf is now used to add filters. This API is better because it hides nearly all the details from the application code. An application just configures the filter by calling JobConf#addFilter().
* Added a counter for filtered-out records
* Added filtering section to the mapred tutorial. 
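
To make the postfix evaluation concrete, here is a minimal, self-contained sketch of how AND, OR and NOT filters could be combined from a postfix expression. It is not the code in filtering_v2.patch: the class name PostfixFilterSketch, the Op enum and the compile() method are hypothetical, and the Filter interface here is only assumed to have a single accept() method like the original WritableFilter.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class PostfixFilterSketch {

    /** Hypothetical generic filter; the Filter interface in the patch may differ. */
    interface Filter<T> {
        boolean accept(T item);
    }

    /** Hypothetical operator tokens of a postfix filter expression. */
    enum Op { AND, OR, NOT }

    /**
     * Builds one composite filter from a postfix token list. Each token is
     * either a Filter (operand) or an Op (operator).
     */
    static <T> Filter<T> compile(List<Object> postfix) {
        Deque<Filter<T>> stack = new ArrayDeque<>();
        for (Object token : postfix) {
            if (token instanceof Op) {
                Op op = (Op) token;
                if (op == Op.NOT) {
                    Filter<T> operand = stack.pop();
                    stack.push(item -> !operand.accept(item));
                } else {
                    // The right operand is on top of the stack.
                    Filter<T> right = stack.pop();
                    Filter<T> left = stack.pop();
                    if (op == Op.AND) {
                        stack.push(item -> left.accept(item) && right.accept(item));
                    } else {
                        stack.push(item -> left.accept(item) || right.accept(item));
                    }
                }
            } else {
                @SuppressWarnings("unchecked")
                Filter<T> operand = (Filter<T>) token;
                stack.push(operand);
            }
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("Malformed postfix expression");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        Filter<String> startsWithA = s -> s.startsWith("a");
        Filter<String> isShort = s -> s.length() < 4;
        // Postfix form of: startsWithA AND (NOT isShort)
        Filter<String> f = compile(List.of(startsWithA, isShort, Op.NOT, Op.AND));
        System.out.println(f.accept("apple")); // true
        System.out.println(f.accept("ant"));   // false
    }
}
```

A stack-based evaluator like this keeps the expression flat, which may be part of why a postfix form suits a string-valued configuration: no parentheses or precedence rules need to be parsed.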



> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in HADOOP-412 so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value) to the RecordReader interface, but that will be addressed in a different issue.
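
The wrapping reader described in the proposal can be sketched without any Hadoop dependency. The example below is an illustration under stated assumptions, not the FilterInputFormat implementation: it uses java.util.Iterator as a stand-in for the underlying RecordReader and java.util.function.Predicate as a stand-in for the filter, and the class name FilteringReaderSketch and the getFilteredOut() method are hypothetical. The counter mirrors the filtered-out records counter mentioned in the update above.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Wraps a source of records and skips those the filter rejects. */
public class FilteringReaderSketch<T> implements Iterator<T> {
    private final Iterator<T> source;   // stand-in for the underlying RecordReader
    private final Predicate<T> filter;  // stand-in for the configured filter
    private long filteredOut = 0;       // number of records rejected so far
    private T nextRecord;
    private boolean hasNextRecord;

    public FilteringReaderSketch(Iterator<T> source, Predicate<T> filter) {
        this.source = source;
        this.filter = filter;
        advance();
    }

    /** Reads ahead until an accepted record is found or the source is exhausted. */
    private void advance() {
        hasNextRecord = false;
        nextRecord = null;
        while (source.hasNext()) {
            T candidate = source.next();
            if (filter.test(candidate)) {
                nextRecord = candidate;
                hasNextRecord = true;
                return;
            }
            filteredOut++;
        }
    }

    @Override
    public boolean hasNext() {
        return hasNextRecord;
    }

    @Override
    public T next() {
        if (!hasNextRecord) {
            throw new NoSuchElementException();
        }
        T result = nextRecord;
        advance();
        return result;
    }

    /** Records rejected so far; includes records skipped while reading ahead. */
    public long getFilteredOut() {
        return filteredOut;
    }

    public static void main(String[] args) {
        FilteringReaderSketch<Integer> reader =
            new FilteringReaderSketch<>(List.of(1, 2, 3, 4, 5, 6).iterator(), n -> n % 2 == 0);
        while (reader.hasNext()) {
            System.out.println(reader.next()); // prints 2, 4, 6 on separate lines
        }
        System.out.println(reader.getFilteredOut()); // 3
    }
}
```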

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.