Linked Data for better image search on the Web

Today, searching the Web for an image that you’re allowed to use in public (either at no cost or after paying a license fee) is a suboptimal experience. General-purpose Web search engines like Google or Bing turn up images with unclear rights or of poor quality. Specialized “silos” like Getty Images or iStockphoto work well for professionals but only find images that were submitted to them, on their terms.

(An interesting alternative approach is the German i-picturemaxx (APIS) network, which allows distributed searches across a network of servers but is closed (“pay to play”) and based on proprietary technology.)

I think the future lies in publishing better image metadata on the Web, and in better image search engines that make use of that metadata. Whether you’re a pro photographer, a hobbyist or a news agency – make sure there’s a simple HTML page on the Web for each of your images, with essential metadata (license or offer, description, your contact information) embedded in the HTML source code as semantic RDFa markup. Then let the search engine crawlers do their job. If they don’t pick up and make good use of that metadata, let’s build a new image search engine that does!
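
To make the idea of “a simple HTML page per image” concrete, here is a minimal Python sketch that renders such a page. It rests on assumptions that are not part of the proposal itself: the schema.org vocabulary (ImageObject, license, creator) is just one possible choice, and the function name, parameters and example.org URLs are made up for illustration.

```python
from html import escape


def render_image_page(image_url, description, license_url, creator_name, creator_email):
    """Return a minimal HTML page for one image with RDFa attributes.

    Assumption: schema.org terms are used; any shared vocabulary would do.
    """
    # html.escape() also escapes quotes, so values are safe inside attributes.
    return f"""<!DOCTYPE html>
<html lang="en">
<head><title>{escape(description)}</title></head>
<body vocab="http://schema.org/" typeof="ImageObject">
  <img property="contentUrl" src="{escape(image_url)}" alt="{escape(description)}">
  <p property="description">{escape(description)}</p>
  <p>License: <a property="license" href="{escape(license_url)}">{escape(license_url)}</a></p>
  <p property="creator" typeof="Person">
    <span property="name">{escape(creator_name)}</span>,
    <span property="email">{escape(creator_email)}</span>
  </p>
</body>
</html>
"""


if __name__ == "__main__":
    # Hypothetical example values.
    print(render_image_page(
        "https://example.org/rose.jpg",
        "A red rose in morning light",
        "https://creativecommons.org/licenses/by/4.0/",
        "Jane Doe",
        "jane@example.org",
    ))
```

The page stays readable for humans in any browser, while the vocab, typeof and property attributes carry the same information for crawlers.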

Sounds too simple? I’m actually a Semantic Web skeptic. Cory Doctorow’s 2001 criticism is still very much valid and explains why the “SemWeb” hasn’t taken off yet. But I think it could work here: Image licensing is an existing market with some money on the table. There is an incentive for both producers and consumers of digital images; finding the right photo is hard, and copyright and licensing are becoming increasingly important. (Plus it helps that it’s potentially a global market with few barriers: If you find the perfect photo of a rose, it shouldn’t matter that it was taken by an amateur who lives on a different continent and doesn’t speak English.)

What is difficult, and will remain so, is getting content creators to take the time to add meaningful, structured metadata. And to make their metadata play along well with other creators’. People describe things in different words: There’ll never be perfect alignment. But some common usage should evolve once the benefits become obvious (think folksonomy and SEO).

These things are also difficult, but we can do something about them: Reusing and improving common vocabularies and combining them with our own, custom terms. Building and spreading software that makes metadata editing and vocabulary juggling easy, or even fun. Agreeing on the protocols and formats to be used for publishing metadata on the Web, and having software support them. Getting existing or new image search engines to use the metadata. And helping creators and customers make transactions.

Lots of work to do. But I think publishing and crawling metadata on the open Web are the critical first step.

The protocol and format should be HTTP and HTML with RDFa: HTTP and HTML (and the ecosystem of browsers and search engines) have proven to work well at “Web scale”, with millions of producers and billions of consumers of information. HTML is readable by any human with a Web browser, which is its killer feature. And RDFa seems to be winning the race against microdata for semantic markup within HTML. (The current discussion on embedded metadata in image files is important as well, but metadata in HTML is so much easier to access and modify that I see HTML as the primary data source.)
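
As a rough illustration of how little a crawler needs to get started, the following Python sketch pulls property/value pairs out of an HTML page using only the standard library. It is deliberately crude and not a conforming RDFa processor: it ignores vocab and prefix resolution and does not track which subject a property belongs to. The class name, function name and sample markup are invented for this example.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from RDFa-style `property` attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._stack = []  # one [tag, property-or-None, text chunks] per open element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("property")
        # Prefer an explicit value attribute over the element's text content.
        value = next((a[n] for n in ("content", "resource", "href", "src")
                      if a.get(n) is not None), None)
        wants_text = False
        if prop is not None:
            if value is not None:
                self.pairs.append((prop, value))
            elif "typeof" not in a:  # typeof marks a nested resource, not a literal
                wants_text = True
        if tag not in VOID_TAGS:
            self._stack.append([tag, prop if wants_text else None, []])

    def handle_data(self, data):
        for entry in self._stack:
            if entry[1] is not None:
                entry[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest matching element, plus anything left unclosed inside it.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                for _, prop, chunks in reversed(self._stack[i:]):
                    if prop is not None:
                        self.pairs.append((prop, " ".join("".join(chunks).split())))
                del self._stack[i:]
                return


def extract_properties(html_text):
    parser = PropertyExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


SAMPLE_HTML = """
<div vocab="http://schema.org/" typeof="ImageObject">
  <img property="contentUrl" src="https://example.org/rose.jpg" alt="A red rose">
  <p property="description">A red rose in morning light</p>
  <a property="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>
</div>
"""

if __name__ == "__main__":
    for prop, value in extract_properties(SAMPLE_HTML):
        print(prop, "=", value)
```

A real crawler would use a full RDFa parser instead, but the point stands: the same few dozen lines work against any publisher’s pages, with no per-site integration.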

Note that I don’t care whether image distributors offer an API. As a developer, I’m getting tired of APIs (at least for read access). Imagine you have three sources for image metadata; one offering a CMIS API, one implementing OAI-PMH and the third being the Getty Images API. How many pages of documentation are you going to read, how much development time are you going to spend until you can do a simple keyword search and list essential metadata from each? (And once you’re done, how about the other 215 photo Web service APIs?)
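
For comparison, here is what the “simple keyword search” could look like once metadata from different sites has been crawled into plain records. This is only a sketch: the record fields, the search function and all example URLs are hypothetical, and a real search engine would obviously need indexing and ranking on top.

```python
def search(records, keyword):
    """Return the records whose description contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [r for r in records if needle in r.get("description", "").casefold()]


# Hypothetical records, as a crawler might collect them from three unrelated sites.
records = [
    {"page": "https://photos.example.org/rose",
     "description": "A red rose in morning light",
     "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"page": "https://agency.example.com/img/123",
     "description": "City skyline at dusk",
     "license": "https://agency.example.com/terms"},
    {"page": "https://hobby.example.net/garden",
     "description": "White roses against a brick wall",
     "license": "https://creativecommons.org/publicdomain/zero/1.0/"},
]

if __name__ == "__main__":
    for hit in search(records, "rose"):
        print(hit["page"], hit["license"])
```

The same code handles all three sources, because they share one publishing format rather than three different APIs.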

What do you think – am I aiming too high, missing something, or am I on the right track? I’d love to hear your thoughts.

Update: Ralph Windsor replies – Applying Linked Data Concepts To Derive A Global Image Search Protocol. My follow-up: Linked Data for public, siloed, and internal images. And an in-depth article by Ralph Windsor: The Building Blocks Of Digital Asset Management Interoperability.

Update (2020-02-21): The IPTC announces Google’s “Licensable Images”, see Image License Metadata in Google Images (BETA).