Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2018/05/22 09:47:00 UTC

[jira] [Comment Edited] (OAK-7353) oak-run tika extraction should support getting assistance from stored indexed data from a lucene index

    [ https://issues.apache.org/jira/browse/OAK-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480330#comment-16480330 ] 

Thomas Mueller edited comment on OAK-7353 at 5/22/18 9:46 AM:
--------------------------------------------------------------

Worked on the idea a bit more with [~rma61870@adobe.com] and here are a few thoughts.
h5. Constraints
h6. Consistency between csv and index

The CSV provides us with a mapping of blob id to path. The dumped index would contain the extracted binary output for some of those paths.

Combining these 2 implies that we're assuming a binary at a given path hasn't changed after it was indexed.

One way to make sure that this assumption holds is to tie csv generation and the dumping of indexed data into a single command (hence using the same state of the repository for both ends). But that poses at least 2 problems:
 * while generating the csv we'd often want to use a fake DS to avoid reaching out to a remote DS just to read in blob ids. BUT, the index dump requires the real blob (not just its id). So, we'd need to improve the fake DS impl to fall back to the real DS for the index path (a rough sketch of that fallback follows this list).
 * coupling these 2 steps means that csv generation needs to take place whenever we want to dump the index for this usage (csv generation requires a repository traversal and hence can be slow)
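For the first bullet, a minimal sketch of the fallback idea is below. The {{SimpleBlobReader}} interface is a made-up simplification for illustration only (Oak's real {{BlobStore}} API has more methods); the point is just that reads the fake store cannot serve would be delegated to the real one.
{code:java}
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: a deliberately simplified stand-in, not Oak's BlobStore API.
interface SimpleBlobReader {
    InputStream getStream(String blobId) throws IOException;
}

// Fake DS that is enough for id-only usage (csv generation) but falls back to the
// real, possibly remote DS when actual content is needed (e.g. for the index dump).
class FallbackBlobReader implements SimpleBlobReader {
    private final SimpleBlobReader fake;
    private final SimpleBlobReader real;

    FallbackBlobReader(SimpleBlobReader fake, SimpleBlobReader real) {
        this.fake = fake;
        this.real = real;
    }

    @Override
    public InputStream getStream(String blobId) throws IOException {
        try {
            return fake.getStream(blobId); // cheap local answer, fine when only ids matter
        } catch (IOException e) {
            return real.getStream(blobId); // real content needed, e.g. the index files
        }
    }
}
{code}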

On the other hand, binaries are not updated very often in real-world cases - so we can simply add a disclaimer such as "Please ensure that no binaries are updated between the csv generation and index dump steps". ([~rma61870@adobe.com] and I tend to lean towards this option.)
h6. Which index is suitable for such optimization

Extracted text for a binary is indexed as the stored field {{:fulltext}}. Aggregate rules or {{nodeScopeIndex}}-ed property definitions would be the ones that get this prepared in most cases. It's possible to have an index definition with multiple aggregate rules (combined with different nodetypes as well) which would extract a binary from a relative path under the indexed node. There's no way to distinguish which part of the {{:fulltext}} data is coming from which relative path.

So, to simplify things, we'd only support indexes extracting a binary on the same path where the binary is stored. In other words, we'd only pick stored text from the index if a path in the csv fetches a stored {{:fulltext}} field for the same path in the index.
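As a rough illustration of what "picking the stored field for the same path" would mean, here is a minimal sketch that looks up the stored {{:fulltext}} value for a given path in a dumped Lucene index. It assumes the Lucene 4.x API Oak uses and Oak's {{:path}} stored field; it's a sketch of the idea, not the actual oak-run implementation.
{code:java}
import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class StoredFulltextLookup {

    // Returns the stored :fulltext for the node at 'nodePath', or null if the
    // dumped index has no document for that exact path.
    public static String lookup(File indexDumpDir, String nodePath) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDumpDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term(":path", nodePath)), 1);
            if (hits.scoreDocs.length == 0) {
                return null; // not indexed here -> caller falls back to classic extraction
            }
            Document doc = searcher.doc(hits.scoreDocs[0].doc);
            return doc.get(":fulltext"); // only present if the definition stores it
        }
    }
}
{code}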

A couple of examples of indexes which could be used would look something like:
{noformat}
+ /oak:index/usableIndex1
  ...
  + indexRules
    ...
    + nt:resource
      + properties
        ...
        + binary
          - name="jcr:data"
          - nodeScopeIndex=true

+ /oak:index/usableIndex2
  ...
  + aggregates
    ...
    + nt:resource
      + include0
        - path="*"
{noformat}
[~chetanm], can you please double-check whether the aggregate rule in {{usableIndex2}} is indeed what we'd expect?
h5. Steps
 # Prepare the CSV using Step 2 in [0] (rough command outlines for these steps are sketched below)
 # Dump some compatible index (as described above) using [1]. Use the {{--index-paths}} option to dump only the required index.
 # Use the feature from this issue to prepare the text store by pulling in data from the index for the blobs pointed to in the csv
 # Run classic tika-based text extraction for the binaries which might not be part of the index (currently Step 3 in [0])
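For illustration, the flow could look roughly like the commands below. These are reconstructed from memory of [0] and [1]; the exact options and jar versions should be checked against those pages, so treat this as an assumption rather than exact syntax.
{noformat}
# 1. generate the csv of blob id -> path (Step 2 in [0])
java -jar oak-run.jar tika --fds-path /path/to/datastore /path/to/segmentstore \
     --data-file oak-binary-stats.csv --generate

# 2. dump only the index we want to reuse ([1])
java -jar oak-run.jar index --fds-path /path/to/datastore /path/to/segmentstore \
     --index-paths /oak:index/usableIndex1 --index-dump

# 3. (this issue) populate the text store from the dumped index for blobs listed in the csv

# 4. classic extraction for whatever is still missing (Step 3 in [0])
java -cp oak-run.jar:tika-app.jar org.apache.jackrabbit.oak.run.Main tika \
     --data-file oak-binary-stats.csv --store-path ./store --fds-path /path/to/datastore extract
{noformat}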

h5. Extra notes

Classic tika-based text extraction prepares an FDS-like structure to store the extracted data. Along with that, it also outputs 2 metadata files - {{blobs_error.txt}} and {{blobs_empty.txt}} - marking which blobs threw an error or produced empty output during extraction, respectively. This is done to save time when we prepare the text extraction store incrementally.
 In the approach used by this issue, we would populate {{blobs_empty.txt}} along the same lines as classic extraction BUT we'd avoid populating {{blobs_error.txt}}, because it could be that a given binary is simply not indexed by the index feeding in the extracted text, OR that the index being used doesn't quite comply with the constraints outlined above. Populating {{blobs_error.txt}} would prevent even classic text extraction from extracting genuine binaries that aren't present in the provided indexed data.
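A minimal sketch of that bookkeeping (hypothetical names, not oak-run's actual implementation):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of the rules above; not oak-run's code.
public class IndexAssistedBookkeeping {

    // 'extractedText' is whatever the index lookup returned; null means the path
    // was not found in the dumped index at all.
    static void record(Path textStoreFile, Path blobsEmptyFile,
                       String blobId, String extractedText) throws IOException {
        if (extractedText == null) {
            // Not in the index: record nothing (in particular, not blobs_error.txt),
            // so the classic tika pass can still attempt this blob later.
            return;
        }
        if (extractedText.trim().isEmpty()) {
            // Same semantics as classic extraction: remember the empty result.
            Files.write(blobsEmptyFile,
                    (blobId + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return;
        }
        // Otherwise populate the pre-extracted text store entry for this blob.
        Files.write(textStoreFile, extractedText.getBytes(StandardCharsets.UTF_8));
    }
}
{code}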

[~rma61870@adobe.com], [~tmueller], [~chetanm], please share your thoughts.

[0]: [https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html]
 [1]: [https://jackrabbit.apache.org/oak/docs/query/oak-run-indexing.html#async-index-data]


> oak-run tika extraction should support getting assistance from stored indexed data from a lucene index
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OAK-7353
>                 URL: https://issues.apache.org/jira/browse/OAK-7353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, oak-run
>            Reporter: Vikas Saurabh
>            Assignee: Vikas Saurabh
>            Priority: Major
>
> oak-run supports pre-text-extraction \[0] which does a great job of doing text extraction in parallel so that the result can be ingested later during indexing.
> But:
> * it still reaches out to the datastore, which, in the case of s3, could be very slow
> * it still does extraction (duh!) - which is expensive
> A common case where we want to get pre-extracted text is reindexing - say on an update of the index definition which doesn't impact the data extracted from binaries (basically updates which don't change the tika configuration)
> In those cases, it's often possible that there is a version of indexed data from an older version of the index def that can supply the extracted text (as its binary properties are indexed as stored fields)
> So, essentially, it would be nice to have tika-based pre-text-extraction be able to consult an index and pick extracted text from there to fill up the text extraction store. Of course, if the index doesn't have data for a given binary, it should still fall back to extracting it.
> \[0]: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)