Warwick Cathro and Susan Collier – Developing Trove: the policy and technical challenges
Trove is a free discovery service for the public. It lets both the casual user and the researcher discover and annotate content. It is part of Australian infrastructure, not a purchased product. It's all of the NLA's services rolled into one, with more added.
Two imperatives for the NLA – streamlining and integrating the proliferation of national collection discovery tools and, as per their Direction Statement, developing online spaces for user interaction.
Trove comes from treasure trove – the latter coming from the French trouver (to find), so the name combines the content and finding it.
It benefits from their experiences with Libraries Australia, Pandora, ARO and more.
Small team of five developed Trove. Started September 2008, prototype in May 2009, nine versions of prototype and released version 1.0 in November 2009. Three updates since then.
Challenges: collection views, works and versions, what is online?
Collection views: search results are grouped into collection views. Need to decide what they would be. Newspapers and people were easy, the rest was not so easy. Realised that they were working from a library view – recruited a group of students, teachers, family historians and general public to card sort the different types into groups and got them to name the groups. Then used the group names to get people to put types into them. The results were: books-journals etc, pictures and photos, Australian newspapers, diaries-letters, and much more.
Creating metadata for these groups was very difficult. Rules are not perfect, so they know that there are items which are in the wrong groups. Hopefully in future, users will be able to suggest alternatives.
Trove is FRBRish. It has a similar structure, with some variations. Trove takes old MARC records and makes them do new things.
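As a rough illustration of the works-and-versions idea, here is a minimal sketch that clusters flat catalogue records under a shared "work". The matching rule (normalised title plus author) and all field names and records are assumptions made up for the example; the talk did not describe Trove's actual grouping rules.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case and collapse punctuation so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def work_key(record):
    """Hypothetical grouping key: normalised title plus normalised author."""
    return (normalise(record["title"]), normalise(record["author"]))


def group_into_works(records):
    """Cluster flat records (versions) under a shared work key."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record["id"])
    return dict(works)


# Hypothetical records; two are versions of the same work.
records = [
    {"id": 1, "title": "A Sample Novel", "author": "Author, Ann"},
    {"id": 2, "title": "A sample novel.", "author": "Author, Ann"},
    {"id": 3, "title": "Another Book", "author": "Writer, Bob"},
]

works = group_into_works(records)
for key, version_ids in works.items():
    print(key, version_ids)
```

Real records are messier than this (variant author forms, subtitles, translations), which is presumably why the result is described as FRBRish rather than FRBR.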
Issues with determining online access. Easy to discover a resource is online, but hard to discover what the item is and whether access is free. Three types identified: available online, available online (access condition), possibly online.
Want users to add value – they can tag, split and merge records, fix the OCR on the newspapers. Enhancements are included in a separate layer. It improves the quality, as evidenced by the Australian newspapers project.
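The separate-layer idea can be sketched as follows: user contributions are stored apart from the source record and only merged when a view is built, so the original data is never overwritten. The class, function and field names here are hypothetical, not Trove's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserLayer:
    """User contributions held separately from the source record."""
    tags: set = field(default_factory=set)
    corrected_text: Optional[str] = None


def display_view(source, layer):
    """Merge the source record with the user layer without mutating the source."""
    view = dict(source)
    if layer.corrected_text is not None:
        view["text"] = layer.corrected_text
    view["tags"] = sorted(layer.tags)
    return view


# Hypothetical newspaper article with OCR errors.
source = {"id": "article-1", "text": "Tbe quick hrown fox"}

layer = UserLayer()
layer.tags.add("foxes")
layer.corrected_text = "The quick brown fox"

view = display_view(source, layer)
print(view["text"])    # corrected text shown to users
print(source["text"])  # original OCR output is still intact
```

Keeping the layers apart means a bad correction can be rolled back without touching the underlying record.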
They can monitor what users are doing online, in terms of interaction with the content. Comments have been added to Trove by users. E.g. a photo had a comment from the person’s grandmother, giving more biographical detail; newspapers have been corrected and more information provided.
Future developments: currently working on RSS feeds, enhanced sorting, more external targets, more full text, an API. Then – search and delivery of NLA digitised journals, inclusion of journal article indexing data from partner vendors, more goals for obtaining data from archives and museums.
Trove release comes after three years of discussion and development. Takes resource discovery to a new level. There are other products out there that will do the same. Trove is different, includes more unique content and is national.
Paul Hagon – Everything I know about cataloguing I learned from watching James Bond.
Senior web person at the secret society of librarians at Canberra – also known as NLA.
Newspapers used to be papers in metal filing drawers, all carefully labelled with metadata – then fed into a microfilm reader. Services like Trove allow discovery down to deep content. Content now rocks and the metadata has been relegated to the rear.
All full text searching of the newspapers is made possible through OCR. Deep content searching is possible with text, but what about images? Computers are good at identifying mathematical markers within images. Begin with facial recognition: can we use this on our collections on a global scale? Chose a series of photos of a range of Australian Prime Ministers and ran them through iPhoto. It was a laborious process, and it didn’t do too well at identifying people accurately – 32%. OpenCV (from Intel) was tried next – it didn’t try to identify people, just tried to identify a face. When it did, it boxed it. It was very successful in identifying two photos of the same person, regardless of context. It didn’t do so well with people in profile or with poor quality images. It was successful 85% of the time.
What could it be used for? If you do a search on Parks, you get people, town and feature. If you click on portraits, you would get images as well.
Also did work on colours. Broke down images into colours, recognising both the colours and the % of the image that had each colour. Some colours can be lost, however, as there is not enough of that value in the image for it to show up. Can go up to 64 colours (from 8) to pick those up, but then data storage requirements grow dramatically.
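To make the 8-versus-64 trade-off concrete, here is a minimal sketch of colour quantisation. It assumes that 8 colours means 2 levels per RGB channel and 64 means 4 levels per channel; the talk did not say how the palette was built, and the pixel data is invented.

```python
from collections import Counter


def quantise(pixel, levels):
    """Map an (r, g, b) pixel with 0-255 channels to a coarse bucket."""
    step = 256 // levels
    return tuple(channel // step for channel in pixel)


def colour_shares(pixels, levels=2):
    """Return {bucket: % of pixels}; there are levels**3 possible buckets."""
    counts = Counter(quantise(p, levels) for p in pixels)
    total = len(pixels)
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}


# Hypothetical 10-pixel image: mostly red, some blue, a little orange.
pixels = [(250, 10, 10)] * 6 + [(10, 10, 250)] * 3 + [(150, 100, 20)] * 1

coarse = colour_shares(pixels, levels=2)  # 8-colour palette
fine = colour_shares(pixels, levels=4)    # 64-colour palette

print(len(coarse), "buckets with 8 colours")
print(len(fine), "buckets with 64 colours")
```

With 8 colours the orange pixel is absorbed into the red bucket and disappears; with 64 it gets a bucket of its own, at the cost of more possible values to store per image.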
Did more testing with ImageMagick – which can analyse an image – shows the RGB values which can be stored in the database. You can then search the database just by colour. Can end up with different types of images depending on which colours you search.
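Searching by colour could look something like the sketch below. It assumes the dominant RGB values and their share of each image have already been extracted (for instance with ImageMagick) and stored in a table; the table name, columns, tolerance-based matching and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_colours "
    "(image_id TEXT, r INTEGER, g INTEGER, b INTEGER, share REAL)"
)

# Hypothetical pre-extracted colour data: one row per dominant colour.
rows = [
    ("img-1", 200, 30, 30, 0.60),
    ("img-1", 240, 240, 240, 0.40),
    ("img-2", 20, 40, 200, 0.75),
    ("img-3", 210, 40, 25, 0.15),
]
conn.executemany("INSERT INTO image_colours VALUES (?, ?, ?, ?, ?)", rows)


def search_by_colour(conn, rgb, tolerance=40, min_share=0.1):
    """Return image ids with a colour near rgb, largest share first."""
    r, g, b = rgb
    cursor = conn.execute(
        """
        SELECT image_id, MAX(share) AS best_share
        FROM image_colours
        WHERE ABS(r - ?) <= ? AND ABS(g - ?) <= ? AND ABS(b - ?) <= ?
          AND share >= ?
        GROUP BY image_id
        ORDER BY best_share DESC
        """,
        (r, tolerance, g, tolerance, b, tolerance, min_share),
    )
    return [row[0] for row in cursor]


matches = search_by_colour(conn, (205, 35, 30))
print(matches)
```

Ordering by share puts images dominated by the colour ahead of images that merely contain a little of it.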
http://1104.nla.gov.au – go and play, and send feedback to Paul.
Why research? Computer applications are already using this technology. iPhone – the Shazam app identifies music that is being played and gives you more info about it. The Etsy craft store lets you search by colour. Google Goggles – take a photo and it analyses a feature and brings back info on it. Pattern recognition in an item, no metadata required.