Web archiving in a Web 2.0 world – Edgar Crook – NLA
The NLA has three main methodologies for web archiving.
Pandora Archive has developed a world-class archive of Australian websites, using PANDAS, their digital archiving system. PANDAS is a distributed system, so their partners can also use it. Other international library archiving systems are based on or similar to PANDAS. They have developed a persistent naming scheme and have arrangements with archiving and indexing agencies. As of 1st July 2008, it contained 19,307 titles comprising over 53 million files and 2.2 TB of data (now over 2.4 TB). A title can be anything from a single PDF to an entire website. Over 50% of their files are government publications, but they also archive academic journals, blogs, podcasts and more. The archive is selective because staff resources and other constraints limit what can be collected. They choose their titles carefully and try to pick sustainable sources.
Domain name harvests – run once a year, for between 3 and 6 weeks, in conjunction with the Internet Archive. In 2008, they are looking at crawling a billion files. Copyright is a major drawback. The websites are crawled by the Internet Archive and the files are then sent to the NLA. The harvests have gaps, because some website publishers ban bots and the crawler can't follow embedded links. There are also issues with Australian websites that do not have .au in their name. The data is not publicly available at this time, although it is being used by researchers.
Archive-It – an Internet Archive product, where you can pay to have your website archived. Sites archived using this process include the PNG governmental and research institute websites, the 2007 general election (including content from YouTube and MySpace), the Cambodian election 2008, the Burmese monk uprising 2007 and more. There are restrictions, in that you can't recapture missed files and can't present the archive the way you want.
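Two of the domain harvest gaps mentioned above (publishers banning bots, and Australian sites without .au in their name) can be illustrated with a small sketch. This is not how the NLA or Internet Archive crawlers are implemented; it only uses Python's standard `urllib.robotparser` to show the idea. The function names, the `example-harvester` user agent and the sample URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_au_suffix(hostname: str) -> bool:
    """Naive scoping rule: treat a host as Australian only if it ends in .au."""
    return hostname.lower().rstrip(".").endswith(".au")


# A publisher that bans all bots: a polite crawler would skip the site entirely.
BAN_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical user agent and URL, for illustration only.
blocked = not allowed_by_robots(
    BAN_ALL, "example-harvester", "http://example.gov.au/report.pdf"
)

# An Australian site hosted on a .com name would be missed by the suffix rule.
missed = not has_au_suffix("www.example.com")
```

A site with no robots.txt rules at all is treated as crawlable, while a suffix-only scoping rule will skip any Australian site on a .com, .org or similar domain unless it is added to the harvest by hand.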
The NLA is still working on arrangements for other Web 2.0 content, e.g. Bebo, Flickr, Facebook etc.
Librarians should tell Pandora about resources it should be archiving. Take responsibility for your web presence: make sure it remains online or is archived elsewhere.
The NLA will not be making PANDAS version 4, but in future will be working with international partners to develop a new backbone for the system.