Posted to dev@tika.apache.org by "Keith R. Bennett (JIRA)" <ji...@apache.org> on 2007/09/13 23:20:32 UTC

[jira] Updated: (TIKA-17) Need to support URL's for input resources.

     [ https://issues.apache.org/jira/browse/TIKA-17?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith R. Bennett updated TIKA-17:
---------------------------------

    Attachment: tika-17.patch

I apologize for the large patch, but it was nearly impossible to avoid.  Here are the issues addressed by this patch:

==================================

LiusConfig:

1) Changed to use URLs instead of Files.

2) Created a constructor that takes a Document parameter; this was how the object was being created anyway.

3) In getParserConfig(), added a check for a null object in the list.

4) Added to the error message the URL that was being processed when the error occurred.

5) a) Changed:
  static void populateConfig(Document doc, LiusConfig tc)
to:
  void populateConfig(Document doc)

... and called it in the LiusConfig(Document) constructor.

5) b) Removed static member 'tc'; it was no longer necessary and, given the above change, leaving it in would have been confusing.
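
As a rough illustration of items 2 and 5, the sketch below shows the general shape of a config class whose constructor takes a Document and populates itself through an instance method. It is not the patch itself: the class name ConfigSketch, the parseXml helper, and the <parser name="..."> layout are all assumptions made up for this example, and the real config.xml schema may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for LiusConfig. The <parser name="..."> layout is
// assumed for illustration and is not the real config.xml schema.
class ConfigSketch {
    private final List<String> parserNames = new ArrayList<>();

    // The constructor populates the new instance itself, so callers no
    // longer pass a config object in and no static 'tc' holder is needed.
    ConfigSketch(Document doc) {
        populateConfig(doc);
    }

    // Instance method: fills in this object's own state from the document.
    private void populateConfig(Document doc) {
        NodeList nodes = doc.getElementsByTagName("parser");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element parser = (Element) nodes.item(i);
            parserNames.add(parser.getAttribute("name"));
        }
    }

    List<String> getParserNames() {
        return parserNames;
    }

    // Helper for the demo only: parses an XML string into a DOM Document.
    static Document parseXml(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml(
                "<config><parser name=\"pdf\"/><parser name=\"html\"/></config>");
        ConfigSketch config = new ConfigSketch(doc);
        System.out.println(config.getParserNames());
    }
}
```

In this sketch populateConfig is private so that the constructor does not call an overridable method; the signature in the patch is package-private.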

==================================

ParserFactory:

1) Changed to use URLs instead of Files.

2) Added:
  public static Parser getParser(URL url, LiusConfig tc).

Removed:
  public static Parser getParser(File file, String tcPath)
  public static Parser getParser(String str, String tcPath)
... since this could easily be accomplished by instantiating the LiusConfig object and passing it instead of tcPath.  Or do we really need them?

3) Changed the worker method to throw an exception if a parser configuration cannot be found for a MIME type.  Currently, I think execution would continue and a NullPointerException would be thrown when 'parser' is dereferenced.

4) Added an error log message for the case where no parser configuration is found.
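
The fail-fast behavior described in items 3 and 4 could look roughly like the sketch below. Everything here is hypothetical: ParserLookupSketch, register, parserFor, and toUrl are names invented for this example, the registry maps MIME types to plain strings rather than real parser objects, and java.util.logging is used in place of log4j only to keep the example self-contained.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical stand-in for the lookup inside ParserFactory.
class ParserLookupSketch {
    private static final Logger LOG =
            Logger.getLogger(ParserLookupSketch.class.getName());

    // Assumed registry: MIME type -> parser name.
    private final Map<String, String> parsersByMimeType = new HashMap<>();

    void register(String mimeType, String parserName) {
        parsersByMimeType.put(mimeType, parserName);
    }

    // Logs and throws a descriptive exception instead of returning null,
    // so the caller never hits a NullPointerException later on.
    String parserFor(URL url, String mimeType) {
        String parser = parsersByMimeType.get(mimeType);
        if (parser == null) {
            String message = "No parser configuration found for MIME type '"
                    + mimeType + "' (URL: " + url + ")";
            LOG.severe(message);
            throw new IllegalStateException(message);
        }
        return parser;
    }

    // Demo helper: builds a URL without exposing a checked exception.
    static URL toUrl(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        ParserLookupSketch lookup = new ParserLookupSketch();
        lookup.register("application/pdf", "PdfParser");
        URL url = toUrl("file:///tmp/sample.pdf");
        System.out.println(lookup.parserFor(url, "application/pdf"));
        try {
            lookup.parserFor(url, "application/unknown");
        } catch (IllegalStateException e) {
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

Including the URL in the message mirrors the LiusConfig change that adds the URL being processed to error messages.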

==================================

LiusLogger:

1) Changed to use URLs instead of Files.

==================================

TestParsers:

1) Changed to use URLs instead of Files.

2) Renamed method testWORDxtraction() to testWORDExtraction().

3) Added output that lists on one line all the content objects, such as:
  
  Structured Content contains the following 12 items: fullText, title, author, creator, 
  summary, keywords, producer, subject, trapped, creationDate, modificationDate,   
  outLinks

I added this because some of the content pieces were many lines long, which made it difficult to see the full set of content pieces found.

4) A message is printed to stdout if either the config.xml or the log4j.properties file cannot be found.

5) log4j.properties is in the repository in src/test/resources/log4j.  I changed the source code to look for it there.

6) config.xml is in the repository in src/test/resources.  I changed the source code to look for it there.

7) When exception stack traces are printed, the URL that caused the error is printed immediately afterward:
  "Exception getting parser for URL file://...."
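
For items 4 through 6, one common way to locate test resources as URLs is through the class loader. The sketch below is only an illustration under assumptions: ResourceLookupSketch and findResource are hypothetical names, and the resource names passed in main are guesses at how the files under src/test/resources would appear on the test classpath, not paths taken from the patch.

```java
import java.net.URL;

// Hypothetical helper showing a class-loader-based resource lookup.
class ResourceLookupSketch {

    // Returns the resource's URL, or null (after printing a message to
    // stdout) if the resource is not on the classpath.
    static URL findResource(String name) {
        URL url = ResourceLookupSketch.class.getClassLoader().getResource(name);
        if (url == null) {
            System.out.println("Resource not found on classpath: " + name);
        }
        return url;
    }

    public static void main(String[] args) {
        // Assumed classpath-relative names; adjust to the actual layout.
        URL config = findResource("config.xml");
        URL logConfig = findResource("log4j/log4j.properties");
        System.out.println("config.xml -> " + config);
        System.out.println("log4j.properties -> " + logConfig);
    }
}
```

Because the result is a URL rather than a File, the same lookup works whether the resource sits in a directory or inside a jar.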


> Need to support URL's for input resources.
> ------------------------------------------
>
>                 Key: TIKA-17
>                 URL: https://issues.apache.org/jira/browse/TIKA-17
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.1-incubator
>            Reporter: Keith R. Bennett
>             Fix For: 0.1-incubator
>
>         Attachments: tika-17.patch
>
>
> It would be extremely helpful to support URL's instead of just File's for input resources.  This would enable us to use class loaders to find resources, and in general support resources that are not available via the filesystem.
> Patch coming...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.