Posted to dev@nutch.apache.org by "Antony Bowesman (JIRA)" <ji...@apache.org> on 2007/04/24 14:10:15 UTC
[jira] Updated: (NUTCH-473) ExcelExtractor performance bad due to String concatenation
[ https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antony Bowesman updated NUTCH-473:
----------------------------------
Summary: ExcelExtractor performance bad due to String concatenation (was: ExcepExtractor performance bad due to String concatenation)
> ExcelExtractor performance bad due to String concatenation
> ----------------------------------------------------------
>
> Key: NUTCH-473
> URL: https://issues.apache.org/jira/browse/NUTCH-473
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 0.9.0
> Environment: Tested under Windows, Java 1.5 and 1.6
> Reporter: Antony Bowesman
>
> Using the 0.9 version, ExcelExtractor was still running after 4 hours at 100% CPU while trying to extract the text from a 3MB Excel file containing 26 sheets, half with a matrix of approx 1100 rows x P columns and the others with approx 1000 rows x E columns.
> After changing ExcelExtractor to use StringBuffer, the same extraction took 3 seconds under Java 1.5. The code changes are below. The example allocates a 4K buffer per sheet; this was a completely arbitrary choice, but it keeps the number of StringBuffer expansions low for large files without using too much space for small files.
>
> protected String extractText(InputStream input) throws Exception {
>
>     String resultText = "";
>     HSSFWorkbook wb = new HSSFWorkbook(input);
>     if (wb == null) {
>         return resultText;
>     }
>
>     HSSFSheet sheet;
>     HSSFRow row;
>     HSSFCell cell;
>     int sNum = 0;
>     int rNum = 0;
>     int cNum = 0;
>
>     sNum = wb.getNumberOfSheets();
>
>     // Allow 4K per sheet - seems a reasonable start
>     StringBuffer sb = new StringBuffer(4096 * sNum);
>     for (int i=0; i<sNum; i++) {
>         if ((sheet = wb.getSheetAt(i)) == null) {
>             continue;
>         }
>         rNum = sheet.getLastRowNum();
>         for (int j=0; j<=rNum; j++) {
>             if ((row = sheet.getRow(j)) == null){
>                 continue;
>             }
>             cNum = row.getLastCellNum();
>
>             for (int k=0; k<cNum; k++) {
>                 if ((cell = row.getCell((short) k)) != null) {
>                     /*if(HSSFDateUtil.isCellDateFormatted(cell) == true) {
>                         resultText += cell.getDateCellValue().toString() + " ";
>                     } else
>                     */
>                     if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
>                         sb.append(cell.getStringCellValue());
>                         sb.append(' ');
>                         // resultText += cell.getStringCellValue() + " ";
>                     } else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
>                         Double d = new Double(cell.getNumericCellValue());
>                         sb.append(d.toString());
>                         sb.append(' ');
>                         // resultText += d.toString() + " ";
>                     }
>                     /* else if(cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
>                         resultText += cell.getCellFormula() + " ";
>                     }
>                     */
>                 }
>             }
>         }
>     }
>     return sb.toString();
> }
>
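For readers who want to see the effect in isolation, here is a minimal sketch that does not use POI at all. The class name ConcatDemo, both method names, and the synthetic "cellN" values are hypothetical and exist only for illustration; they are not part of Nutch or of the patch above. The first method repeats the `resultText += ...` pattern from the old code, where each concatenation copies everything accumulated so far into a new String, so total work grows roughly with the square of the output length. The second appends into one pre-sized StringBuffer, as the changed code does.

```java
public class ConcatDemo {

    // Old pattern: every += builds a new String and copies the whole
    // accumulated text, so the cost grows roughly quadratically.
    static String joinWithConcat(String[] cells) {
        String result = "";
        for (String cell : cells) {
            result += cell + " ";
        }
        return result;
    }

    // New pattern: append into a single growable buffer. The initial
    // capacity is a hint only; the buffer still expands when needed.
    static String joinWithBuffer(String[] cells, int initialCapacity) {
        StringBuffer sb = new StringBuffer(initialCapacity);
        for (String cell : cells) {
            sb.append(cell);
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Synthetic stand-in for spreadsheet cell values (an assumption,
        // not the 3MB file from the report).
        String[] cells = new String[10000];
        for (int i = 0; i < cells.length; i++) {
            cells[i] = "cell" + i;
        }

        long t0 = System.nanoTime();
        String a = joinWithConcat(cells);
        long t1 = System.nanoTime();
        String b = joinWithBuffer(cells, 4096);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("concat ms: " + (t1 - t0) / 1000000);
        System.out.println("buffer ms: " + (t2 - t1) / 1000000);
    }
}
```

Both methods return identical text; only the running time differs, and the gap widens as the number of cells grows. Since the buffer here is local to one method call, java.lang.StringBuilder (available from Java 1.5) would probably serve equally well without StringBuffer's synchronization, though that is a suggestion rather than something measured in this issue.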