Posted to dev@mahout.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2009/10/01 13:28:23 UTC

[jira] Updated: (MAHOUT-183) WikipediaXmlSplitter spits one chunk per line

     [ https://issues.apache.org/jira/browse/MAHOUT-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olivier Grisel updated MAHOUT-183:
----------------------------------

    Attachment: MAHOUT-183-wikipedia-xml-splitter.patch

> WikipediaXmlSplitter spits one chunk per line
> ---------------------------------------------
>
>                 Key: MAHOUT-183
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-183
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>             Fix For: 0.2
>
>         Attachments: MAHOUT-183-wikipedia-xml-splitter.patch
>
>
> The Wikipedia XML splitter's inner loop erroneously detects the end of its line iterators, which causes it to create chunks with just one line's worth of page content instead of respecting the --chunkSize CLI option.
> A simple patch to fix this will follow.
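
For illustration only, the intended behaviour described in the issue (accumulate content until a size budget is reached, rather than emitting a chunk at every line boundary) could be sketched roughly as follows. This is not the attached patch and not Mahout's actual code: the class name ChunkSketch, the method splitIntoChunks, and the use of a character count as the size unit are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the WikipediaXmlSplitter implementation.
public class ChunkSketch {

  /**
   * Groups whole pages into chunks of at most maxChunkChars characters.
   * A single page larger than the budget still gets a chunk of its own.
   */
  static List<String> splitIntoChunks(Iterable<String> pages, int maxChunkChars) {
    List<String> chunks = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (String page : pages) {
      // Flush only when adding this page would exceed the budget,
      // never merely because a line or page boundary was reached.
      if (current.length() > 0 && current.length() + page.length() > maxChunkChars) {
        chunks.add(current.toString());
        current.setLength(0);
      }
      current.append(page);
    }
    // Emit whatever is left once the input is exhausted.
    if (current.length() > 0) {
      chunks.add(current.toString());
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> pages = Arrays.asList(
        "<page>a</page>\n", "<page>b</page>\n", "<page>c</page>\n");
    System.out.println(splitIntoChunks(pages, 32).size());
  }
}
```

The point of the sketch is that the flush condition depends on the accumulated size, so a larger budget yields fewer, bigger chunks. The bug reported here corresponds to flushing after every line regardless of that budget.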

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.