Posted to user@nutch.apache.org by Philippe EUGENE <ph...@neuf.fr> on 2006/03/03 11:10:48 UTC

Jpeg and Exif Plugin

Hi,
What do you think about a plug-in for indexing Exif metadata in JPEG files?
Do you think it's a good idea?
-- 
Philippe


Re: Jpeg and Exif Plugin

Posted by Nutch Newbie <nu...@gmail.com>.
Hi Philippe:

Any progress? Do you need any help?

On 3/6/06, Ivan Sekulovic <se...@net.yu> wrote:
> [...]

Re: Jpeg and Exif Plugin

Posted by Ivan Sekulovic <se...@net.yu>.
I think that licence is OK.

Using that library for a plugin is really simple. I did some tests some
time ago.

All you have to do is something like this (content is a byte[]):

        Metadata metadata = JpegMetadataReader.extractMetadataFromJpegSegmentReader(
                new JpegSegmentReader(content));

And then you can read all EXIF and IPTC data you need:

        Directory exifDirectory = metadata.getDirectory(ExifDirectory.class);
        String exifCameraMake = exifDirectory.getString(ExifDirectory.TAG_MAKE);
        String exifCameraModel = exifDirectory.getString(ExifDirectory.TAG_MODEL);
        String exifCopyright = exifDirectory.getString(ExifDirectory.TAG_COPYRIGHT);
        String exifArtist = exifDirectory.getString(ExifDirectory.TAG_ARTIST);
        String exifSubjectLocation = exifDirectory.getString(ExifDirectory.TAG_SUBJECT_LOCATION);
        String exifSubjectLocation2 = exifDirectory.getString(ExifDirectory.TAG_SUBJECT_LOCATION_2);
        String exifUserComment = exifDirectory.getString(ExifDirectory.TAG_USER_COMMENT);
        String exifWinTitle = exifDirectory.getString(ExifDirectory.TAG_WIN_TITLE);
        String exifWinComment = exifDirectory.getString(ExifDirectory.TAG_WIN_COMMENT);
        String exifWinAuthor = exifDirectory.getString(ExifDirectory.TAG_WIN_AUTHOR);
        String exifWinKeywords = exifDirectory.getString(ExifDirectory.TAG_WIN_KEYWORDS);
        String exifWinSubject = exifDirectory.getString(ExifDirectory.TAG_WIN_SUBJECT);


        Directory iptcDirectory = metadata.getDirectory(IptcDirectory.class);
        String iptcCaption = iptcDirectory.getString(IptcDirectory.TAG_CAPTION);
        String iptcWriter = iptcDirectory.getString(IptcDirectory.TAG_WRITER);
        String iptcHeadline = iptcDirectory.getString(IptcDirectory.TAG_HEADLINE);
        String iptcKeywords = iptcDirectory.getString(IptcDirectory.TAG_KEYWORDS);
        String iptcCredit = iptcDirectory.getString(IptcDirectory.TAG_CREDIT);
        String iptcCopyrightNotice = iptcDirectory.getString(IptcDirectory.TAG_COPYRIGHT_NOTICE);
        String iptcObjectName = iptcDirectory.getString(IptcDirectory.TAG_OBJECT_NAME);
        String iptcCategory = iptcDirectory.getString(IptcDirectory.TAG_CATEGORY);
        String iptcSupplementalCategories = iptcDirectory.getString(IptcDirectory.TAG_SUPPLEMENTAL_CATEGORIES);
        String iptcSource = iptcDirectory.getString(IptcDirectory.TAG_SOURCE);
        String iptcCity = iptcDirectory.getString(IptcDirectory.TAG_CITY);
        String iptcState = iptcDirectory.getString(IptcDirectory.TAG_PROVINCE_OR_STATE);
        String iptcCountry = iptcDirectory.getString(IptcDirectory.TAG_COUNTRY_OR_PRIMARY_LOCATION);
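
Once the tag values are in plain strings, a plugin would still have to decide which of them become index fields. The following is only a minimal sketch of that step, using nothing but the JDK. The class name ImageFields and the field names used in the note below are hypothetical, and the sketch assumes that a missing tag shows up as a null or blank string.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Collects extracted tag values under index field names, skipping absent ones. */
final class ImageFields {
    private final Map<String, String> fields = new LinkedHashMap<>();

    /**
     * Adds a field unless the value is null or blank.
     * Assumption of this sketch: null or blank means "tag not present".
     */
    ImageFields put(String fieldName, String value) {
        if (value != null) {
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                fields.put(fieldName, trimmed);
            }
        }
        return this;
    }

    /** Read-only view of the collected fields, in insertion order. */
    Map<String, String> asMap() {
        return Collections.unmodifiableMap(fields);
    }
}
```

With the variables from the snippet above, a call could look like new ImageFields().put("exif.make", exifCameraMake).put("iptc.city", iptcCity).asMap(), so that only tags actually present in the image end up in the index.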


But I think that a JPEG plugin should have some additional search
criteria, such as image height and width and dominant colors (e.g. the
dominant color search on http://www.ifimages.com/). What would it take
to have Lucene range queries in Nutch? Something like: "height:[500
TO 600] width:[300 TO 400]".
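
As a rough illustration of the height and width criteria, here is a sketch that reads the pixel dimensions from the raw image bytes with the JDK's javax.imageio, without decoding the whole picture. The class and method names are made up for the example. The rangeTerm helper rests on the fact that Lucene range queries of that time compared terms as strings, so numbers are zero-padded to make string order agree with numeric order; the width of five digits is an arbitrary choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.Locale;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Reads image dimensions from raw bytes; a sketch, not part of any Nutch plugin. */
final class ImageDimensions {

    /** Returns {width, height}, or null if no ImageIO reader recognises the bytes. */
    static int[] read(byte[] content) {
        try (ImageInputStream in =
                 ImageIO.createImageInputStream(new ByteArrayInputStream(content))) {
            if (in == null) {
                return null;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                // seekForwardOnly = true, ignoreMetadata = true: only the header is needed
                reader.setInput(in, true, true);
                return new int[] { reader.getWidth(0), reader.getHeight(0) };
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Zero-pads a non-negative value so that string order matches numeric order. */
    static String rangeTerm(int value) {
        return String.format(Locale.ROOT, "%05d", value);
    }

    /** Inclusive range check, mirroring a query such as height:[500 TO 600]. */
    static boolean inRange(int value, int low, int high) {
        return value >= low && value <= high;
    }
}
```

An indexing plugin could store rangeTerm(width) and rangeTerm(height) as untokenized fields; how such fields would be exposed through the Nutch query syntax is the open question raised above and is not answered by this sketch.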


Sekula


Philippe EUGENE wrote:

>
>> I think it makes sense.
>> For a general search engine it will allow to search on image comments 
>> for
>> instance.
>> For an image search engine it will allow to search on technical metadata
>> (exposure time, date, ...)
>>   
>
>
> Ok. I can try to make this plug-in next week.
> I can use this java library :
> http://www.drewnoakes.com/code/exif/
>
> I hope there is no Licensing problem using this library inside Nutch 
> Project.
> -- 
> Philippe
>
>
>


Re: limit fetching by using crawl-urlfilter.txt

Posted by Ravi Chintakunta <ra...@gmail.com>.
You can specify the inclusion and exclusion URL regexes on
separate lines or combine them with alternation (|). It does not make
much difference. Make sure that you have this line at the end:

-.

This makes sure that no other sites are crawled.
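
To make the effect of rule order concrete, here is a toy model of such a rule list in Java. It is not Nutch's filter code. It assumes that rules are tried top to bottom, that the first rule whose regex is found in the URL decides, and that a URL matching no rule is dropped; the class name UrlRuleList is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Toy model of an ordered list of "+regex" / "-regex" rules. */
final class UrlRuleList {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines such as "+^http://..." or "-."; blank lines and '#' comments are skipped. */
    static UrlRuleList parse(String... lines) {
        UrlRuleList list = new UrlRuleList();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            list.rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return list;
    }

    /** Assumption: the first rule whose pattern is found in the URL decides. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumption of this sketch: unmatched URLs are dropped
    }
}
```

With the "+" rule first and "-." last, only the listed hosts pass. Putting "-." first rejects everything, because "." is found in any non-empty URL. The explicit "-." line keeps the result independent of whatever the real filter does with URLs that no rule matches.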

- Ravi

On 3/3/06, Jack Tang <hi...@gmail.com> wrote:
> [...]

Re: limit fetching by using crawl-urlfilter.txt

Posted by Jack Tang <hi...@gmail.com>.
On 3/3/06, Michael Ji <fj...@yahoo.com> wrote:
> hi,
>
> I tried this, actually in my case, one site ends with
> .net and the other is .org
>
> so I modified it to
>
> +^http://([a-z0-9]*\.)*(abc.net|def.org)/
I guess '.' is a metacharacter in regexps, so please try
+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/
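
The difference between the two patterns above can be checked with java.util.regex. This is only a small sketch: the host www.abcxnet is a made-up example, and the patterns are written without the leading "+", which belongs to the filter file syntax rather than to the regex.

```java
import java.util.regex.Pattern;

/** Shows why an unescaped '.' in a URL filter pattern is too permissive. */
final class DotEscapeDemo {

    // Pattern as first tried: '.' matches any single character.
    static final Pattern UNESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc.net|def.org)/");

    // Suggested fix: "\\." in Java source is the regex \. and matches only a literal dot.
    static final Pattern ESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc\\.net|def\\.org)/");

    /** True if the pattern is found in the URL (both patterns are anchored with '^'). */
    static boolean matches(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }
}
```

Both patterns accept http://www.abc.net/index.html, but only the unescaped one also accepts http://www.abcxnet/, since its '.' happily matches the 'x'. Escaping tightens the rule; it does not by itself explain why a completely unrelated site was fetched.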

Good luck!

> and I run another testing, seems doesn't work, coz I
> saw a site other than abc and def is being fetched,
>
> any hints?
>
> thanks,
>
> Michael,


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: limit fetching by using crawl-urlfilter.txt

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

I tried this. In my case, one site ends with
.net and the other with .org,

so I modified it to

+^http://([a-z0-9]*\.)*(abc.net|def.org)/

I ran another test and it doesn't seem to work, because I
saw a site other than abc and def being fetched.

Any hints?

Thanks,

Michael

--- sudhendra seshachala <su...@yahoo.com> wrote:

>
> Hi,
>   Try the following pattern
>   +^http://([a-z0-9]*\.)*(abc|def).com/
>
>   I was able to search couple of sites using similar
> pattern.
>   If this is what you are asking ?

Re: limit fetching by using crawl-urlfilter.txt

Posted by sudhendra seshachala <su...@yahoo.com>.
Hi,
  Try the following pattern
  +^http://([a-z0-9]*\.)*(abc|def).com/
   
  I was able to search a couple of sites using a similar pattern.
  Is this what you are asking?
  
  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

limit fetching by using crawl-urlfilter.txt

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

I searched the mailing list archive, but I still have a problem
running my test.

I want my crawl to be limited to two sites only,

such as *.abc.com/*
and     *.def.com/*

so I put two lines in crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*.abc.com/
+^http://([a-z0-9]*\.)*.def.com/

But after running the test, the crawl is not limited
to the above two sites.

In the log, I found "not found ...urlfilter-prefix"

I wonder whether the failure is due to not including
crawl-urlfilter.txt in my configuration XML, or whether there is
a syntax error in the rules above.

thanks,

Michael



Re: Jpeg and Exif Plugin

Posted by Philippe EUGENE <ph...@neuf.fr>.
> I think it makes sense.
> For a general search engine it will allow to search on image comments for
> instance.
> For an image search engine it will allow to search on technical metadata
> (exposure time, date, ...)
>   

OK. I can try to make this plug-in next week.
I can use this Java library:
http://www.drewnoakes.com/code/exif/

I hope there is no licensing problem with using this library inside the
Nutch project.
--
Philippe


CBIR (Re: Jpeg and Exif Plugin)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:
>> What do you think about a plug-in for indexing Exif metadata in JPEG files?
>> Do you think it's a good idea?
>>
>
> I think it makes sense.
> For a general search engine it will allow searching on image comments, for
> instance.
> For an image search engine it will allow searching on technical metadata
> (exposure time, date, ...)
> But what about images without comments, for instance? How can they be
> retrieved in a general search engine?
> The more plugins Nutch has, the more useful it will be for many purposes,
> and so for a wide variety of users.
> +1
>   

I agree, it would be a useful addition.

Also, I think it would be great if someone familiar with CBIR could 
contribute a plugin for indexing & searching for images by their 
fingerprints - there are several known techniques for doing this (look 
at imgSeek for inspiration). Nutch would require only minimal changes to 
support a suitable front-end.
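
To give an idea of what an image fingerprint can look like, here is a sketch of an "average hash" in Java. This is a deliberately simple stand-in and not the technique imgSeek uses: the image is shrunk to 8x8 grey pixels and each pixel contributes one bit, set when it is brighter than the mean. Similar images then tend to have fingerprints that differ in only a few bits. The class name is invented for the example.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** 64-bit average-hash fingerprint of an image; a sketch of one simple CBIR technique. */
final class AverageHash {

    private static final int SIZE = 8; // 8 x 8 = 64 bits

    /** Computes the fingerprint: one bit per pixel of the shrunken grey image. */
    static long of(BufferedImage image) {
        BufferedImage small = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        try {
            // Scale the source down to 8x8 and convert it to grey in one step.
            g.drawImage(image, 0, 0, SIZE, SIZE, null);
        } finally {
            g.dispose();
        }

        int[] grey = new int[SIZE * SIZE];
        int sum = 0;
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                int value = small.getRaster().getSample(x, y, 0);
                grey[y * SIZE + x] = value;
                sum += value;
            }
        }

        long hash = 0L;
        for (int i = 0; i < grey.length; i++) {
            // "brighter than the mean", written without integer division
            if (grey[i] * grey.length > sum) {
                hash |= 1L << i;
            }
        }
        return hash;
    }

    /** Hamming distance: the number of bits in which two fingerprints differ. */
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

A search front-end could then rank indexed images by distance(queryHash, storedHash). How such a distance query would be mapped onto a Lucene index is left open here.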



-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Jpeg and Exif Plugin

Posted by Jérôme Charron <je...@gmail.com>.
> What do you think about a plug-in for indexing Exif metadata in JPEG files?
> Do you think it's a good idea?

I think it makes sense.
For a general search engine it will allow searching on image comments, for
instance.
For an image search engine it will allow searching on technical metadata
(exposure time, date, ...)
But what about images without comments, for instance? How can they be
retrieved in a general search engine?
The more plugins Nutch has, the more useful it will be for many purposes,
and so for a wide variety of users.
+1

Jérôme