Posted to oak-issues@jackrabbit.apache.org by "Chetan Mehrotra (JIRA)" <ji...@apache.org> on 2016/02/05 12:54:39 UTC
[jira] [Resolved] (OAK-3989) Add S3 datastore support for Text Pre Extraction
[ https://issues.apache.org/jira/browse/OAK-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chetan Mehrotra resolved OAK-3989.
----------------------------------
Resolution: Fixed
Added support with 1728642. The extraction tool now supports the argument {{--s3-config-path}}, which points to a property file containing the configuration required for S3DataStore to work.
*S3 Config*
{noformat}
# AWS access key ID
accessKey=
# AWS secret key
secretKey=KRMG6TVTSNqjMfr3S3Ojexwrg32EYAs9JMIzb5BN
# AWS bucket name
s3Bucket=aem
# AWS bucket region
# Mapping of S3 regions to their constants
#   US Standard                      us-standard
#   US West (Oregon)                 us-west-2
#   US West (Northern California)    us-west-1
#   EU (Ireland)                     eu-west-1
#   Asia Pacific (Singapore)         ap-southeast-1
#   Asia Pacific (Sydney)            ap-southeast-2
#   Asia Pacific (Tokyo)             ap-northeast-1
#   South America (Sao Paulo)        sa-east-1
s3Region=us-west-2
# S3 endpoint to be used. This parameter is optional
# and takes precedence over the endpoint derived
# from the S3 region.
s3EndPoint=s3.amazonaws.com
connectionTimeout=120000
socketTimeout=120000
maxConnections=40
writeThreads=30
maxErrorRetry=10
# Cache size must be set to 0 so that files are
# not unnecessarily copied locally
cacheSize=0
# Path to a folder used for temporary work files
path=/home/chetanm/data/oak/s3/tika-work-dir
{noformat}
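Since a wrong {{cacheSize}} or an empty credential only shows up once the tool is running, a small pre-flight check on the property file may be handy. The following is only a sketch: {{prop_value}} and {{check_s3_config}} are hypothetical helper names, and the set of keys treated as mandatory ({{accessKey}}, {{secretKey}}, {{s3Bucket}}, {{path}}) is an assumption, not something enforced by oak-run itself.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key from a key=value property file.
# $1 = property file, $2 = key. Comment lines never match because they start with '#'.
prop_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Hypothetical pre-flight check. Prints one line per problem found and
# returns non-zero if any problem was found.
check_s3_config() {
    cfg=$1
    status=0
    # Assumption: these keys must be non-empty for S3DataStore to connect.
    for key in accessKey secretKey s3Bucket path; do
        if [ -z "$(prop_value "$cfg" "$key")" ]; then
            echo "missing value for $key"
            status=1
        fi
    done
    # From the config comments: cacheSize must be 0 to avoid local copies.
    if [ "$(prop_value "$cfg" cacheSize)" != "0" ]; then
        echo "cacheSize must be 0"
        status=1
    fi
    return "$status"
}
```

It could be called as {{check_s3_config /path/to/s3config-property-file}} before invoking the tool.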
*Usage*
To generate the CSV file:
{noformat}
java -cp jars/*:oak-run-1.4-SNAPSHOT.jar:tika-app-1.9.jar org.apache.jackrabbit.oak.run.Main tika \
--nodestore path/to/segmentstore \
--s3-config-path /path/to/s3config-property-file \
--data-file dump.csv \
generate
{noformat}
To perform the actual text extraction:
{noformat}
java -cp jars/*:oak-run-1.4-SNAPSHOT.jar:tika-app-1.9.jar org.apache.jackrabbit.oak.run.Main tika \
--nodestore path/to/segmentstore \
--s3-config-path /path/to/s3config-property-file \
--data-file dump.csv \
--store-path ./store \
extract
{noformat}
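The two invocations above differ only in the subcommand and the extra {{--store-path}} option, so they can be chained in a small wrapper. This is a sketch rather than part of the tool: {{run_tika}}, {{pre_extract}} and the {{JAVA}} override are hypothetical names introduced here, while the flags and classpath are taken from the commands above.

```shell
#!/bin/sh
# JAVA can be overridden, for example to point at a specific JVM.
JAVA=${JAVA:-java}
# Quoted so the shell passes jars/* through unchanged for the JVM to expand.
CP='jars/*:oak-run-1.4-SNAPSHOT.jar:tika-app-1.9.jar'

# Hypothetical wrapper around the tika command.
# $1 = segment store, $2 = S3 config file, $3 = CSV file,
# remaining arguments = extra options followed by the subcommand.
run_tika() {
    nodestore=$1; s3cfg=$2; csv=$3
    shift 3
    "$JAVA" -cp "$CP" org.apache.jackrabbit.oak.run.Main tika \
        --nodestore "$nodestore" \
        --s3-config-path "$s3cfg" \
        --data-file "$csv" \
        "$@"
}

# Generate the CSV, then extract text only if generation succeeded.
# $4 = directory for the extracted text store.
pre_extract() {
    run_tika "$1" "$2" "$3" generate &&
    run_tika "$1" "$2" "$3" --store-path "$4" extract
}
```

For example: {{pre_extract path/to/segmentstore /path/to/s3config-property-file dump.csv ./store}}.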
*Classpath*
This requires the following jars to be placed in the {{jars}} folder:
{noformat}
aws-java-sdk-osgi-1.10.27.jar
crx-ext-s3-1.0.50.jar
httpclient-osgi-4.3.4.jar
httpcore-osgi-4.3.2.jar
jackson-annotations-2.6.1.jar
jackson-core-2.6.1.jar
jackson-databind-2.6.1.jar
joda-time-2.8.1.jar
{noformat}
In addition, jackrabbit-data-2.11.3 is required.
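A missing jar typically surfaces only as a class loading error at run time, so it may help to verify the folder first. The sketch below uses a hypothetical helper, {{missing_jars}}, and checks only the file names listed in the block above; jackrabbit-data is left out because its exact file name is not given here.

```shell
#!/bin/sh
# Jar file names copied from the classpath list above.
REQUIRED_JARS="aws-java-sdk-osgi-1.10.27.jar
crx-ext-s3-1.0.50.jar
httpclient-osgi-4.3.4.jar
httpcore-osgi-4.3.2.jar
jackson-annotations-2.6.1.jar
jackson-core-2.6.1.jar
jackson-databind-2.6.1.jar
joda-time-2.8.1.jar"

# Hypothetical helper: print each required jar that is absent
# from the directory given as $1 (defaults to ./jars).
missing_jars() {
    dir=${1:-jars}
    for jar in $REQUIRED_JARS; do
        [ -f "$dir/$jar" ] || echo "$jar"
    done
}
```

Running {{missing_jars jars}} prints nothing when every listed jar is present.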
> Add S3 datastore support for Text Pre Extraction
> ------------------------------------------------
>
> Key: OAK-3989
> URL: https://issues.apache.org/jira/browse/OAK-3989
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Minor
> Labels: docs-impacting
> Fix For: 1.3.16
>
>
> The text pre-extraction feature introduced in OAK-2892 only supports FileDataStore. For files present in S3, we should add support for S3DataStore.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)