Posted to solr-user@lucene.apache.org by Mark <ja...@gmail.com> on 2015/01/26 22:34:44 UTC

SimplePostTool with extracted Outlook messages

I'm looking to index some messages extracted from Outlook (*.msg).

I noticed that msg isn't one of the default file types, so I tried the following:

java -classpath dist/solr-core-4.10.3.jar -Dtype=application/vnd.ms-outlook
org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

That didn't work

However curl did:

curl "http://localhost:8983/solr/update/extract?commit=true&overwrite=true&literal.id=000000006252671B765A1748992DF1A6403BDF81A4A15E00" -F "myfile=@000000006252671B765A1748992DF1A6403BDF81A4A15E00.msg"

My question is why does the second work and not the first?

Re: SimplePostTool with extracted Outlook messages

Posted by Mark <ja...@gmail.com>.
In the end I didn't find a way to add a new file/MIME type when recursing a
folder.

So I added msg to the static string and the MIME map.

private static final String DEFAULT_FILE_TYPES =
"xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log,msg";

mimeMap.put("msg", "application/vnd.ms-outlook");
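
For anyone following along, the effect of that change can be sketched roughly as below. This is not the actual SimplePostTool source: the class and method names (FileTypeFilter, accepts, mimeTypeFor) are made up for illustration, and the pdf and txt entries are only there as examples. Only the type list and the msg mapping come from the edit above. The idea is that a file is skipped unless its extension is in the comma-separated list, and the map then supplies the content type to send.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the real SimplePostTool code.
final class FileTypeFilter {
    // Same list as the patched constant above, with "msg" appended.
    static final String DEFAULT_FILE_TYPES =
        "xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log,msg";

    private static final Set<String> ACCEPTED =
        new HashSet<>(Arrays.asList(DEFAULT_FILE_TYPES.split(",")));

    private static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        // Only a few entries are shown; the msg mapping is the one added above.
        MIME_MAP.put("pdf", "application/pdf");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("msg", "application/vnd.ms-outlook");
    }

    // Returns the lower-cased extension, or "" if the name has none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True when the file's extension is in the accepted list.
    static boolean accepts(String fileName) {
        return ACCEPTED.contains(extensionOf(fileName));
    }

    // Returns null when no MIME type is registered for the extension.
    static String mimeTypeFor(String fileName) {
        return MIME_MAP.get(extensionOf(fileName));
    }
}
```

With msg missing from either the list or the map, a .msg file would be rejected or posted without a usable content type, which matches the "Unsupported file type for auto mode" warnings seen earlier in the thread.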

Regards

Mark


On 27 January 2015 at 18:39, Mark <ja...@gmail.com> wrote:

> Hi Alex,
>
> On an individual file basis that would work, since you could set the ID on
> an individual basis.
>
> However, when recursing a folder it doesn't work, and worse still the server
> complains, unless on the server side you can use the UpdateRequestProcessor
> chains with a UUID generator as you suggested.
>
> Thanks for everyone's suggestions.
>
> Regards
>
> Mark

Re: SimplePostTool with extracted Outlook messages

Posted by Mark <ja...@gmail.com>.
Hi Alex,

On an individual file basis that would work, since you could set the ID on
an individual basis.

However, when recursing a folder it doesn't work, and worse still the server
complains, unless on the server side you can use the UpdateRequestProcessor
chains with a UUID generator as you suggested.

Thanks for everyone's suggestions.

Regards

Mark

On 27 January 2015 at 18:01, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> Your IDs seem to be the file names, which you are probably also getting
> from parsing the file. Can't you just set (or copyField) that as an ID
> on the Solr side?
>
> Alternatively, if you don't actually have good IDs, you could look into
> UpdateRequestProcessor chains with a UUID generator.
>
> Regards,
>
>    Alex.

Re: SimplePostTool with extracted Outlook messages

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Your IDs seem to be the file names, which you are probably also getting
from parsing the file. Can't you just set (or copyField) that as an ID
on the Solr side?

Alternatively, if you don't actually have good IDs, you could look into
UpdateRequestProcessor chains with a UUID generator.

Regards,

   Alex.
On 27/01/2015 12:24 pm, "Mark" <ja...@gmail.com> wrote:

> Thanks Eric
>
> However
>
> java -classpath dist/solr-core-4.10.3.jar -Dauto=true
> org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg
>
> Fails with:
>
> Posting files to base url http://localhost:8983/solr/update..
> Entering auto mode. File endings considered are
> xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> SimplePostTool: WARNING: Skipping 000000006252671B765A1748992DF1A6403BDF81A4A02A00.msg. Unsupported file type for auto mode.
> SimplePostTool: WARNING: Skipping 000000006252671B765A1748992DF1A6403BDF81A4A02B00.msg. Unsupported file type for auto mode.
> SimplePostTool: WARNING: Skipping 000000006252671B765A1748992DF1A6403BDF81A4A02C00.msg. Unsupported file type for auto mode.
>
> That's where I started looking into extending or adding support for
> additional types.
>
> Looking into the code as it stands, passing your own URL as well as asking it
> to recurse a folder means that it requires an ID strategy, which I believe
> is lacking.
>
> Regards
>
> Mark
>
>

Re: SimplePostTool with extracted Outlook messages

Posted by Mark <ja...@gmail.com>.
Thanks Eric

However

java -classpath dist/solr-core-4.10.3.jar -Dauto=true
org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg

Fails with:

Posting files to base url http://localhost:8983/solr/update..
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Skipping 000000006252671B765A1748992DF1A6403BDF81A4A02A00.msg. Unsupported file type for auto mode.
SimplePostTool: WARNING: Skipping 000000006252671B765A1748992DF1A6403BDF81A4A02B00.msg. Unsupported file type for auto mode.
SimplePostTool: WARNING: Skipping 000000006252671B765A1748992DF1A6403BDF81A4A02C00.msg. Unsupported file type for auto mode.

That's where I started looking into extending or adding support for
additional types.

Looking into the code as it stands, passing your own URL as well as asking it
to recurse a folder means that it requires an ID strategy, which I believe
is lacking.

Regards

Mark



On 27 January 2015 at 10:57, Erik Hatcher <er...@gmail.com> wrote:

> Try adding -Dauto=true and take away setting url.  The type probably isn't
> needed then either.
>
> With the new Solr 5 bin/post it sets auto=true implicitly.
>
>     Erik

Re: SimplePostTool with extracted Outlook messages

Posted by Erik Hatcher <er...@gmail.com>.
Try adding -Dauto=true and take away setting url.  The type probably isn't needed then either. 

With the new Solr 5 bin/post it sets auto=true implicitly.  

    Erik


> On Jan 26, 2015, at 17:29, Mark <ja...@gmail.com> wrote:
> 
> Fantastic - that explains it.
>
> Adding -Durl="http://localhost:8983/solr/update/extract?commit=true&overwrite=true"
>
> gets me a little further:
> 
> POSTing file 000000006252671B765A1748992DF1A6403BDF81A4A22E00.msg
> SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url:
> http://localhost:8983/solr/update/extract?commit=true&overwrite=true
> SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">400</int><int name="QTime">44</int></lst>
> <lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></lst>
> </response>
> 
> However, that is not much use when recursing a directory, since the URL
> essentially has to change for each file to pass the document ID.
>
> I think I may just extend SimplePostTool, or perhaps look at using Solr Cell?

Re: SimplePostTool with extracted Outlook messages

Posted by Mark <ja...@gmail.com>.
Fantastic - that explains it.

Adding -Durl="http://localhost:8983/solr/update/extract?commit=true&overwrite=true"

gets me a little further:

POSTing file 000000006252671B765A1748992DF1A6403BDF81A4A22E00.msg
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url:
http://localhost:8983/solr/update/extract?commit=true&overwrite=true
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">44</int></lst>
<lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></lst>
</response>

However, that is not much use when recursing a directory, since the URL
essentially has to change for each file to pass the document ID.

I think I may just extend SimplePostTool, or perhaps look at using Solr Cell?
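
To make the per-file URL idea concrete, here is a minimal sketch of how an extension might build the URL for each file. It assumes the ID is simply the file name without its extension, as in the curl example earlier in the thread. The class and method names (ExtractUrlBuilder, idFromFileName, urlFor) are hypothetical and not part of Solr or SimplePostTool.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the class and method names are not part of Solr.
final class ExtractUrlBuilder {
    // Strips the last extension, e.g. "ABC.msg" -> "ABC".
    static String idFromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot <= 0 ? fileName : fileName.substring(0, dot);
    }

    // Appends literal.id=<id> to the base extract URL, using "&" if the
    // URL already has a query string and "?" otherwise.
    static String urlFor(String baseUrl, String fileName) {
        String id;
        try {
            id = URLEncoder.encode(idFromFileName(fileName), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "literal.id=" + id;
    }
}
```

A folder walk would then call urlFor(baseUrl, file.getName()) once per file before posting, so each document gets its own literal.id instead of sharing one fixed URL.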



On 26 January 2015 at 22:14, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> Well, you are NOT posting to the same URL.....
>
>
> On 26 January 2015 at 17:00, Mark <ja...@gmail.com> wrote:
> > http://localhost:8983/solr/update
>
>
>
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>

Re: SimplePostTool with extracted Outlook messages

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Well, you are NOT posting to the same URL.....


On 26 January 2015 at 17:00, Mark <ja...@gmail.com> wrote:
> http://localhost:8983/solr/update



----
Sign up for my Solr resources newsletter at http://www.solr-start.com/

Re: SimplePostTool with extracted Outlook messages

Posted by Mark <ja...@gmail.com>.
A little further

This fails

 java -classpath dist/solr-core-4.10.3.jar
-Dtype=application/vnd.ms-outlook org.apache.solr.util.SimplePostTool
C:/temp/samplemsg/*.msg

With:

SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 415 for URL:
http://localhost:8983/solr/update
POSTing file 000000006252671B765A1748992DF1A6403BDF81A4A22C00.msg
SimplePostTool: WARNING: Solr returned an error #415 (Unsupported Media Type) for url: http://localhost:8983/solr/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">415</int><int name="QTime">0</int></lst>
<lst name="error"><str name="msg">Unsupported ContentType: application/vnd.ms-outlook  Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json]</str><int name="code">415</int></lst>
</response>

However, calling the extract handler directly works:

curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "myfile=@000000006252671B765A1748992DF1A6403BDF81A4A22C00.msg"
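
The difference between the two results can be summarised in a small sketch: the 415 response lists the content types that /update accepts, and anything else has to go to /update/extract. The class and method names below (EndpointChooser, endpointFor) are hypothetical, and the code assumes the extract handler is configured at /update/extract, as it is in the curl examples in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of Solr or SimplePostTool.
final class EndpointChooser {
    // Content types listed in the 415 response above as accepted by /update.
    static final Set<String> UPDATE_TYPES = new HashSet<>(Arrays.asList(
        "application/xml", "text/csv", "text/json", "application/csv",
        "application/javabin", "text/xml", "application/json"));

    // Sends structured update formats to /update and everything else
    // (for example Outlook .msg files) to the extract handler.
    static String endpointFor(String solrBase, String contentType) {
        return UPDATE_TYPES.contains(contentType)
            ? solrBase + "/update"
            : solrBase + "/update/extract";
    }
}
```

For example, endpointFor("http://localhost:8983/solr", "application/vnd.ms-outlook") yields the /update/extract URL, which is the endpoint the working curl command uses.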

Regards

Mark

On 26 January 2015 at 21:47, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> Seems like an apples-to-oranges comparison here.
>
> I would try giving an explicit end point (.../extract), a single
> message, and a literal id for the SimplePostTool and seeing whether
> that works. Not providing an ID could definitely be an issue.
>
> I would also specifically look on the server side in the logs and see
> what the messages say to understand the discrepancies. Solr 5 is a bit
> more verbose about what's going on under the covers, but that's not
> available yet.
>
> Regards,
>    Alex.
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>

Re: SimplePostTool with extracted Outlook messages

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Seems like an apples-to-oranges comparison here.

I would try giving an explicit end point (.../extract), a single
message, and a literal id for the SimplePostTool and seeing whether
that works. Not providing an ID could definitely be an issue.

I would also specifically look on the server side in the logs and see
what the messages say to understand the discrepancies. Solr 5 is a bit
more verbose about what's going on under the covers, but that's not
available yet.

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 26 January 2015 at 16:34, Mark <ja...@gmail.com> wrote:
> I'm looking to index some messages extracted from Outlook (*.msg).
>
> I noticed that msg isn't one of the default file types, so I tried the following:
>
> java -classpath dist/solr-core-4.10.3.jar -Dtype=application/vnd.ms-outlook
> org.apache.solr.util.SimplePostTool C:/temp/samplemsg/*.msg
>
> That didn't work
>
> However curl did:
>
> curl "http://localhost:8983/solr/update/extract?commit=true&overwrite=true&literal.id=000000006252671B765A1748992DF1A6403BDF81A4A15E00" -F "myfile=@000000006252671B765A1748992DF1A6403BDF81A4A15E00.msg"
>
> My question is why does the second work and not the first?