Feb 08

Linked data: weaving the web of libraries, museums and archives – Eric Miller

The web is the most successful commerce and communication platform ever conceived. No other technology has become so pervasive or so universal in such a short time. It has quickly become one of the most widely used data management and integration platforms ever imagined. And no-one owns it.

It has moved from being only a communication tool to being a data tool. Most of the web currently is pages and links – it's things pointing at other things, via a common platform, which can be accessed from a variety of devices. The web as a protocol has been a very effective way of wrapping other protocols which are required for specific purposes. It's a very lightweight infrastructure – a very powerful unifying principle. It has enabled people to make connections on the web, record the connection and make it available for others to follow. And it was done by us!

Most of the web is for humans, but opaque to machines. We understand relationships, but to machines it's just code. We add the meaning.

Most of the web is connected, but compartmentalised. It's page-granular – one page pointing to another. Not much is being done with the underlying data. But there are sites like Expedia.com and retrievr which grab the data from other sites.

Remix

  • mix data from different sites to provide added value

  • the mixed sources don’t need to be involved

  • hybrid client-server mode

Problems:

  • data is mostly locked up in pages

  • each website is different

  • and keeps changing

  • very blurry lines between use and fair-use

  • even after extraction, data needs to be modeled so that it can be mixed

  • a remixed website looks like another website (so difficult for further mixing)

Remixing is extremely useful, but it is hard and doesn’t cascade well.

Success story: News!

Whether it's RSS or Atom, a feed describes a chronology of news items. Consumers poll and receive new items, the items can easily be mixed by websites and applications, and they cascade. A wide range of applications can also be built on that, e.g. Pulse.

This is achieved by using XML instead of HTML, with extensibility provided through XML namespaces and granularity at the news item level.
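
A minimal sketch of why item-level granularity makes feeds easy to mix. It uses Python's standard library to read two Atom feeds and merge their entries into one chronology. The feed contents and the function names (`entries`, `remix`) are invented for illustration and were not part of the talk; only the Atom namespace URI is real.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The Atom namespace; ElementTree addresses namespaced tags as "{uri}tag".
ATOM = "{http://www.w3.org/2005/Atom}"

# Two made-up feeds standing in for two independent publishers.
FEED_A = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Library news</title>
  <entry><title>New reading room opens</title><updated>2012-02-06T09:00:00Z</updated></entry>
  <entry><title>Archive digitised</title><updated>2012-02-08T10:30:00Z</updated></entry>
</feed>"""

FEED_B = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Museum news</title>
  <entry><title>Exhibition launch</title><updated>2012-02-07T14:00:00Z</updated></entry>
</feed>"""


def entries(feed_xml):
    """Yield (updated, source, title) for every entry in one Atom feed."""
    root = ET.fromstring(feed_xml)
    source = root.findtext(ATOM + "title")
    for entry in root.findall(ATOM + "entry"):
        updated = datetime.strptime(
            entry.findtext(ATOM + "updated"), "%Y-%m-%dT%H:%M:%SZ"
        )
        yield (updated, source, entry.findtext(ATOM + "title"))


def remix(*feeds):
    """Merge the entries of several feeds, newest first."""
    merged = [item for feed in feeds for item in entries(feed)]
    return sorted(merged, reverse=True)


if __name__ == "__main__":
    for updated, source, title in remix(FEED_A, FEED_B):
        print(updated.date(), source, "-", title)
```

Because the merged result is still just a list of dated items, it could itself be republished as a feed and mixed again, which is the cascading property that page scraping lacks.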

But it's not enough. Limitations include no standard way of representing relationships between items (it's all temporal and chronological), no way of joining similar items and no standard way to query the web other than polling (you can only get the most recent items).

How do we solve these issues? Linked data – ways to integrate data in a huge range of ways. Databases are set up for the types of queries you expect to receive. Because it is not known in advance what sort of queries will be received, linked data had to be built for flexibility.

Linked data is a term used to describe a recommended best practice for exposing, sharing and connecting pieces of data, information and knowledge on the semantic web using URIs and RDF. (Wikipedia) This allows us to get down to the level of relating things, not just pointing to other things.

This web of data is about making it easier to publish, remix, cascade this data and empower people to do new and interesting things with this data, at a reduced cost.

Many organisations are looking at this as a framework to expose their data, not just libraries, museums and archives. He showed backstage.bbc, the New York Times, NPR, The World Bank, Data.gov, HM Government and many national libraries as examples.

We are no longer matching on the string, but on the identifier. These organisations are creating identifiers for the concepts that they are concerned about sharing. These identifiers can be reused, rethought or new ones can be created.
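
A small sketch of the difference between matching on strings and matching on identifiers. The records, field names and `example.org` URIs are all hypothetical; the point is only that two institutions can label the same person differently and still connect their data if they share an identifier.

```python
# Hypothetical records from two institutions. The URIs are invented.
library_records = [
    {
        "title": "Adventures of Huckleberry Finn",
        "creator_label": "Twain, Mark",
        "creator_id": "http://example.org/person/42",
    },
]

museum_records = [
    {
        "object": "Portrait photograph",
        "maker": "Samuel Langhorne Clemens",
        "maker_id": "http://example.org/person/42",
    },
    {
        "object": "Membership badge",
        "maker": "Mark Twain Society",
        "maker_id": "http://example.org/org/7",
    },
]


def match_by_string(books, objects):
    """Pair records whose name strings are exactly equal."""
    return [
        (b["title"], o["object"])
        for b in books
        for o in objects
        if b["creator_label"] == o["maker"]
    ]


def match_by_identifier(books, objects):
    """Pair records that point at the same identifier."""
    return [
        (b["title"], o["object"])
        for b in books
        for o in objects
        if b["creator_id"] == o["maker_id"]
    ]


if __name__ == "__main__":
    print(match_by_string(library_records, museum_records))
    print(match_by_identifier(library_records, museum_records))
```

The string comparison finds nothing because the name is written three different ways, while the identifier comparison connects the book to the photograph and correctly ignores the society.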

Integration is not achieved by heaping everything into centralised repositories or apps, but rather by leaving data where it naturally resides and making it easy to connect to.

There is power in human computing – OCR correction, captchas. The power of identifiers – Creative Commons – the licences are identifiers. We are assigning these relationships, making it easier for the search engines to bring back things that we can re-use.

Power of recombinant data – Lego works. Lego can be recombined to create new things. It works for Eric’s kids and it has its own meaning, which is understood and done quickly.

RDF – Resource Description Framework – is a common model for identifying and linking data. It can link a wide variety of types of data that we didn’t traditionally see as linkable. If the data can be surfaced, it doesn’t matter what format it's in; it can be referenced and linked.
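
To make the model concrete, here is a rough sketch of RDF's core idea, the subject-predicate-object triple, using plain Python tuples rather than an RDF library. The `example.org` resources are invented; the Dublin Core Terms and FOAF namespace URIs are real vocabularies, used here only as an assumption about how such data might be described. The `match` helper is hypothetical and far simpler than a real query language such as SPARQL.

```python
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Each statement is a (subject, predicate, object) triple.
# Subjects and predicates are URIs; objects are URIs or literal values.
triples = {
    ("http://example.org/book/1", DCT + "title", "Adventures of Huckleberry Finn"),
    ("http://example.org/book/1", DCT + "creator", "http://example.org/person/42"),
    ("http://example.org/photo/9", DCT + "subject", "http://example.org/person/42"),
    ("http://example.org/person/42", FOAF + "name", "Mark Twain"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


if __name__ == "__main__":
    # Everything, from any source, that points at the same person.
    for subject, predicate, _ in match(triples, o="http://example.org/person/42"):
        print(subject, predicate)
```

The book and the photograph could live in different institutions and different formats; once both are surfaced as statements about the same URI, they can be found with one query.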

What’s the catch? It takes the big step of fundamentally rethinking applications and their integration. Not applications on the web, but in the web, using the web's existing architecture. I want your data, in my way!

Example: where to stay? He asked for accommodation recommendations and was sent a website which listed local hotels and motels. He was able to scrape the data and encode it as addresses, prices etc., and then display it on a map. He also built wrappers and scrapers to extract data from his calendar, so that he could match where his meetings were to be held against potential accommodation.
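
Once the scraped data is structured, the matching step becomes a small calculation. The sketch below is not the speaker's code: the hotel names, prices and coordinates are invented, and it assumes the addresses have already been turned into latitude and longitude. It picks the closest hotel to a meeting within a budget, using the haversine great-circle distance.

```python
from math import asin, cos, radians, sin, sqrt

# Invented data standing in for scraped hotel listings and a calendar entry.
hotels = [
    {"name": "Hotel A", "price": 220, "location": (-33.871, 151.206)},
    {"name": "Motel B", "price": 120, "location": (-33.920, 151.250)},
    {"name": "Hotel C", "price": 150, "location": (-33.875, 151.215)},
]
meeting = {"title": "Conference session", "location": (-33.870, 151.210)}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km is the mean Earth radius


def nearest_within_budget(hotels, meeting, budget):
    """Return the closest hotel whose price does not exceed the budget."""
    affordable = [h for h in hotels if h["price"] <= budget]
    if not affordable:
        return None
    return min(
        affordable,
        key=lambda h: distance_km(h["location"], meeting["location"]),
    )


if __name__ == "__main__":
    choice = nearest_within_budget(hotels, meeting, budget=200)
    print(choice["name"] if choice else "nothing within budget")
```

The hard part in the talk's example was getting the data out of pages and a calendar in the first place; with data that is already exposed in a linkable form, only this last step would be needed.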

LOC Digital Preservation Program:

  • 180+ partners (NDIIPP)

  • located across the globe

  • each with different charters, goals, budgets

  • benefits for sharing and connecting their data

  • but it exists in disconnected silos

In order to facilitate the sharing, they created “ViewShare – interfaces to our heritage”. http://www.viewshare.org

Using identifiers, we can specify data and then contribute more data, e.g. once a field has been assigned the address type, latitude and longitude can then be added. He was able to do a search of Powerhouse and narrow down by height of the title, as this data is surfaced by them.

The solution is to empower users to create their own views of data and build a community around the data.

Linked data gives us simple conventions for expressing context, a mechanism for collaborating despite different points of view and a mechanism for recording agreements as they evolve. It's about building on how people communicate to mature the way systems interact.

Adoption: the schema.org effort by Google, Microsoft and Yahoo, and the LOC MARC efforts.

Libraries have the opportunity to use our trust, brand and skills to be involved in making these connections. It's not far from where we are to where we need to go. We need to expose what we have, build the policies that enable this and empower our users to build off it.