
BitCurator workshop at the KB


Published: Tue 01 Jul 2014

Still have floppy disks and CDs on the premises whose contents have not yet made it into sustainable storage? Many heritage institutions struggle with this problem, often because they do not know how to tackle it. The NCDD therefore organised a BitCurator workshop, taught by Porter Olsen of the Maryland Institute for Technology in the Humanities (MITH) in the US.

What is BitCurator?
The BitCurator project, an initiative of MITH and the School of Information and Library Science at the University of North Carolina, Chapel Hill (SILS), brings together professionals who want to apply tools developed for digital forensics in digital preservation workflows. Building on that expertise, a software suite called BitCurator was created, with which the information on physical carriers can be secured by making a "disk image".


Twelve participants brought along laptops with the BitCurator software installed, and backpacks full of digital carriers of all varieties on which to try out the software. Porter produced a small box from his own backpack: a "hardware write blocker", needed to observe the first golden rule of digital forensics: make sure nothing is accidentally written to the original (this applies to floppy disks in particular).

Hardware write blocker

Only once this is in place can you start using the BitCurator software. It enables you to make a "forensic disk image": a faithful copy of the original carrier that includes every zero and one originally stored on it. You can also make a "logical copy", which leaves out, for example, system files and files that were deleted at some point; the forensic disk image does include those deleted files and system files.
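
To make the difference concrete, here is a minimal sketch (in Python) of what bit-for-bit imaging amounts to, greatly simplified. BitCurator itself wraps dedicated imaging tools for this; the device path and output filename below are assumptions, and the source device must of course sit behind the write blocker.

```python
import hashlib

# Minimal sketch of raw, sector-by-sector imaging. Assumes a source device
# at /dev/sdb (hypothetical) that is connected through a write blocker.
DEVICE = "/dev/sdb"
IMAGE = "floppy.img"
CHUNK = 512 * 1024  # read in 512 KiB chunks

digest = hashlib.sha256()
with open(DEVICE, "rb") as src, open(IMAGE, "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)      # every bit is copied, deleted files included
        digest.update(block)  # fixity checksum computed during the copy

print(f"{IMAGE} sha256={digest.hexdigest()}")
```

A logical copy, by contrast, would walk the mounted file system and copy only the files that are visible to it.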

Porter Olsen

Alongside the copy itself, BitCurator also produces metadata in the form of PREMIS event metadata in a separate XML file. Various tools can then help you analyse the information on the disk image: Bulk Extractor, for example, makes it possible to pull out the PDF files or to search for a particular pattern (such as a telephone number or citizen service number), and another tool, FSLint, makes it possible to identify duplicate files.
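
As an illustration, here is a minimal sketch of what one such PREMIS event record can look like, generated with Python's standard library. Element names follow the PREMIS 2.x data dictionary; the exact structure of the XML that BitCurator writes may differ.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Sketch of a single PREMIS event describing a disk-imaging action.
# Namespace and element names follow PREMIS 2.x; BitCurator's own
# output may be organised differently.
NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("premis", NS)

def sub(parent, name, text=None):
    node = ET.SubElement(parent, f"{{{NS}}}{name}")
    if text is not None:
        node.text = text
    return node

event = ET.Element(f"{{{NS}}}event")
ident = sub(event, "eventIdentifier")
sub(ident, "eventIdentifierType", "UUID")
sub(ident, "eventIdentifierValue", "f81d4fae-7dec-11d0-a765-00a0c91e6bf6")  # example UUID
sub(event, "eventType", "capture")
sub(event, "eventDateTime", datetime.now(timezone.utc).isoformat())
sub(event, "eventDetail", "Forensic disk image created from write-blocked carrier")

print(ET.tostring(event, encoding="unicode"))
```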

Within the BitCurator project, work is also being done on integrating these tools into the workflow: although making a disk image still requires a manual action (someone has to put the disk into a drive), the remaining steps can be automated, so that all preparations are complete before ingest into the digital archive.


Many of the tools BitCurator offers come originally from the digital forensics world and will not be relevant to every heritage institution. But for collections containing privacy-sensitive information, it is important to know that you can analyse the contents of your disk image in detail. Think, for example, of an author who offers his archive without realising that, with the tools above, the files he deleted back in the day can be read from the disk again. Is that a desirable situation, and would it be acceptable from an ethical point of view? For institutions that start applying these techniques, it will be very important to develop policies at the same time: policies that describe how the institution applies these technical capabilities in relation to its collection policy and privacy policy, so that donors know where they stand. Because the fact that something is technically possible does not always mean you should do it!

Text: Barbara Sierman; photos: Marcel Ras

First Preservation Metadata Workshop: knowledge sharing that leaves us wanting more


Published: Thu 01 May 2014

Preservation metadata is currently attracting a great deal of attention. The Preservation Metadata workshop organised on 4 March by Annemieke de Jong, senior policy advisor for digital preservation at the Instituut voor Beeld en Geluid (B&G), and Marcel Ras, programme manager at the Nationale Coalitie Digitale Duurzaamheid (NCDD), drew so much interest that it will be repeated on 19 June; that day, too, is already fully booked. It was the first time such a meeting was organised by an institution in close cooperation with the NCDD, a direct result of the new course the NCDD set out on last year.

Two key questions
In her opening remarks, Annemieke de Jong put one question centre stage: what actually is preservation metadata? Is it something new, or something old in a new guise? In practice there turn out to be many different views on this, which was one of the reasons for organising this day. She expressed the hope that by the end of the workshop it would be clear not only what preservation metadata is, but also what it can be used for.

As far as I am concerned, both questions were answered straight away by the first speaker, Rebecca Guenther. Rebecca can draw on a total of 35 years of metadata experience at various libraries. She currently heads the Editorial Committee for preservation metadata at the Library of Congress and stood at the cradle of the PREMIS (Preservation Metadata: Implementation Strategies) data dictionary. Since she will not be present at the second workshop on the 19th, I can go into her presentation in somewhat more depth here.

If you did not know much about preservation metadata beforehand, you could not in good conscience remain that way after following Rebecca's presentation. She gave an excellent overview of the challenges of preserving digital information for the long term, and of the role that various categories of metadata in general, and preservation metadata in particular, can play in a trusted repository. And then, of course, she discussed the PREMIS data dictionary at length: what it is, what it is not, and what it is for.

What is it?
To start with Annemieke's first question: according to Rebecca, preservation metadata is metadata about all the aspects and actions involved in keeping digital files accessible for the long term. Think of technical information, and of information about provenance, authenticity and intellectual property. Alongside technical preservation metadata there is, for example, also Intellectual Property Rights (IPR) preservation metadata, but the latter only concerns the IPR aspects of a file in relation to actions that have to be carried out within a repository. A repository may, for instance, have a protocol requiring that deposited files be stored in duplicate or triplicate; usually, permission from the rightsholder(s) is needed to copy the files for that purpose.

And what do you use it for?
Annemieke's second question, what is preservation metadata used for, was also answered by Rebecca: preservation metadata is used to ascertain that a file really is what it claims to be, which in the digital world is by no means self-evident. That is why assigning metadata, documenting actions, making sound agreements and drawing up good procedures are so important. According to Rebecca, this aspect of digital information still receives far too little attention, which will cause major problems now and in the future. She argues that it is not only important to preserve the data carefully, but just as important to preserve the metadata about the data. And what about the metadata about the metadata about the data? Yes indeed, that must be preserved too, because it is also very important. In itself that is logical, but from time to time I could not shake the question of where all this must end. We can barely handle the current amount of digital data, and on top of that comes this enormous, ever-expanding volume of metadata. Will the world eventually drown in data?

Case studies

After Rebecca's presentation came four case studies from major institutions and partners of the NCDD: Daniel Steinmeier of the Instituut voor Beeld en Geluid, Margriet van Gorsel of the Nationaal Archief, Lian Wintermans of the Koninklijke Bibliotheek and Ben Companjen of DANS. Where Rebecca Guenther sketched something of an ideal picture, the case studies confronted us with unruly practice, which nicely balanced the day.

Because the case studies will be presented again on 19 June, I will confine myself here to the advice and bottlenecks that struck me most:

* The impact of data loss on an institution is exceptionally large. It is therefore very important to have the long-term accessibility of your digital information in good order.
* Do not try to arrange everything at once, but work towards your end goal step by step. That is the most feasible approach.
* It is a major challenge for preservation institutions to position themselves at the front of the information-creation process. That is the only way to ensure that the information supplied can be ingested into the e-depot relatively easily.
* It is also necessary to make firm agreements with users about how data must be delivered.
* It is already a serious problem to ingest data that was poorly managed in the (recent) past into an e-depot.
* We are in a transitional phase between paper and digital, and that will remain the case for the time being.
* Metadata must be preserved just as sustainably as the objects it describes.
* Durable access to digital information is not only a matter of technology but most certainly also a matter of organisation.

A need for knowledge sharing
The workshop closed with a panel discussion, during which the audience also had the opportunity to ask questions.
And yes, just when we thought consensus had been reached on the meaning and use of the term preservation metadata, it was called into question once more and opinions again turned out to diverge. That discussion, then, is far from over. The professionals themselves agree: from the audience came the question whether more meetings of this kind could be organised, because there is a great need for knowledge sharing and discussion. Fortunately, the NCDD's multi-year plan amply provides for this. In any case, this initiative is a good start. I look forward to the next meeting!

Jeanine Tieleman (quality assurance officer)

Stichting Digitaal Erfgoed Nederland

Roles and responsibilities in guaranteeing permanent access to the records of science


Published: Sun 02 Mar 2014

APE conference 2014

On Tuesday 28 and Wednesday 29 January the annual Academic Publishing in Europe (APE) conference was held in Berlin, under the title "Redefining the Scientific Record".

Dutch politics set on “golden road” to Open Access

During the first day the focus was on Open Access, starting with a presentation on Open Access by the Dutch State Secretary for Education, Culture and Science. In his presentation, called "Going for Gold", Sander Dekker outlined his policy with regard to providing open access to research publications and how that practice will continue to evolve. Open access is "a moral obligation" according to Sander Dekker: access to scientific knowledge is for everyone. It promotes knowledge sharing and knowledge circulation and is essential for the further development of society.

Open access means having electronic access to research publications, articles and books, free of charge. This is an international issue. Every year, approximately two million articles appear in the 25,000 journals that are published worldwide; the Netherlands accounts for some 33,000 articles annually. Having unrestricted access to research results can help disseminate knowledge, move science forward, promote innovation and solve the problems that society faces.

The first steps towards open access were taken twenty years ago, when researchers began sharing their publications with one another on the Internet. In the past ten years, various parties in the Netherlands have been working towards creating an open access system. A wide variety of rules, agreements and options for open access publishing have emerged in the research community. The situation is confusing for authors, readers and publishers alike, and the stakeholders would like this confusion to be resolved as quickly as possible.

The Dutch Government will provide direction so that the parties know what to expect and can make arrangements with one another. It will promote "golden" open access: publication in journals that make research articles available online free of charge. The State Secretary's aim is to have switched entirely to the golden road to open access within ten years, in other words by 2024. To achieve this, at least 60 per cent of all articles will have to be available in open access journals within five years. A true switch will only be possible if we cooperate and coordinate with other countries.

Further reading: http://www.government.nl/issues/science/documents-and-publications/parliamentary-documents/2014/01/21/open-access-to-publications.html or http://www.rijksoverheid.nl/ministeries/ocw/nieuws/2013/11/15/over-10-jaar-moeten-alle-wetenschappelijke-publicaties-gratis-online-beschikbaar-zijn.html

Do researchers even want Open Access?

The two other keynote speakers, David Black and Wolfram Koch, presented their concerns about the transition from the current publishing model to open access. Researchers are increasingly using subject repositories to share their knowledge, and there is an urgent need for organization and standards in this field. But who takes the lead? Furthermore, we must not forget the systems for quality assurance and peer review. These are under pressure, as huge numbers of articles are being published and peer review tends to take place more and more after publication. Open access should lower the barriers to accessing research for users, but what about the barriers for scholars publishing their research? Koch stated that the traditional model worked fine for scientists: they don't want to change. However, there do not seem to be any figures to support this assertion.

It was interesting to note that digital preservation was mentioned, one way or another, in almost all presentations on the first day of APE. The vocabulary differed, but it is acknowledged as an important topic: long-term accessibility of scientific publications is a necessity, regardless of the publishing model.

KB and NCDD workshop on roles and responsibilities

On the second day of the conference the focus was on innovation (the future of the article, dotcoms) and on preservation!

The National Library of the Netherlands (KB) and the Dutch Coalition for Digital Preservation (NCDD) organized a session on the preservation of scientific output: "roles and responsibilities in guaranteeing permanent access to the scholarly record". The session was chaired by Marcel Ras, program manager for the NCDD.

The trend towards e-only access for scholarly information is continuing at a rapid rate, and a growing amount of data is 'born digital', with no print counterpart. As for scholarly publications, half of all serial publications will be online-only by 2016. For researchers and students there is a huge benefit: they now have online access to journal articles to read and download anywhere, any time, and they are making use of it to an increasing extent. The downside, however, is an increasing dependency on access to digital information: without permanent access to information, scholarly activities are no longer possible. For libraries there are many benefits associated with publishing and accessing academic journals online. E-only access has the potential to save the academic sector a considerable amount of money: library staff resources required to process printed materials can be reduced significantly, libraries potentially save money on the management, storage and end-user access of print journals, and suppliers are willing to provide discounts for e-only access.

Publishers may not share post-cancellation and preservation concerns

However, there are concerns that what is now in digital form may not always be available, due to rapid technological developments or developments within the publishing industry; this, and how to ensure post-cancellation access to paid-for content, are key barriers to institutions making the move to e-only. There is a danger that e-journals become "ephemeral" unless we take active steps to preserve the bits and bytes that increasingly represent our collective knowledge. We are all familiar with examples of hardware becoming obsolete: 8-inch and 5.25-inch floppy discs, Betamax video tapes, and probably soon CD-ROMs. Software, too, is not immune to obsolescence.

On top of this threat of technical obsolescence there is the change in the role of libraries. Libraries have in the past assumed preservation responsibility for the material they collect, while publishers have supplied the material libraries need. This well-understood division of labour does not work in a digital environment, especially when dealing with e-journals. Libraries buy licenses that enable their users to gain network access to a publisher's server. The only original copy of an issue of an e-journal is not on the shelves of a library, but tends to be held by the publisher. Yet the long-term preservation of that copy matters to the library and research communities rather than to the publisher.

Can third-party solutions ensure safe custody?

So we may need new models, and sometimes new organizations, to ensure safe custody of these objects for future generations. A number of initiatives have emerged to address these concerns. Research and development in digital preservation has matured, and tools and services are being developed to help plan and perform digital preservation activities. Furthermore, third-party organizations and archiving solutions have been established to help the academic community preserve publications and advance research in sustainable ways. These trusted parties can be called upon by users when strict conditions (trigger events or post-cancellation) are met. In addition, publishers are adapting to changing library requirements, participating in the different archiving schemes and increasingly providing options for post-cancellation access.

In this session the problem was presented from the viewpoints of the different stakeholders in this game, focusing on their roles and responsibilities.

Neil Beagrie explained the problem in depth: technical, organisational and financial. He highlighted the distinction between perpetual access and digital preservation. In the case of perpetual access, an organisation has a license or subscription for an e-journal and either the publisher discontinues the journal or the organisation stops its subscription; keeping e-journals available in this case is called "post-cancellation" access. This differs from long-term preservation, where the e-journal is generally preserved for users whether they ever subscribed or not. Several initiatives for the latter situation were mentioned, as well as the benefits that organisations like LOCKSS, CLOCKSS, Portico and the e-Depot of the KB bring to publishers. More details about his vision can be read in the DPC Tech Watch report Preservation, Trust and Continuing Access to e-Journals. (Presentation: APE2014_Beagrie)

Susan Reilly of the Association of European Research Libraries (LIBER) sketched the changing role of research libraries. It is essential that the scholarly record, which encompasses e-journal articles, research data, e-books, digitized cultural heritage and dynamic web content, is preserved. Libraries are a major player in this field and can be seen as an intermediary between publishers and researchers. (Presentation: APE2014_Reilly)

Eefke Smit of the International Association of Scientific, Technical and Medical Publishers (STM) explained to the audience why digital preservation is especially important in the playing field of STM publishers. Publishers are keen to contribute to digital preservation and from the very start have given substantial support to initiatives like Portico, CLOCKSS and the KB e-Depot. STM is among the founding members of the Alliance for Permanent Access (APA) and an active partner in various EU projects on digital preservation. Publishers have also put in place best-practice examples for preservation arrangements and post-cancellation policies. Many preservation services are available or under development, but more collaboration between all stakeholders of the scholarly communication chain is needed, from research communities to libraries, publishers and archives. The APARSEN project is an important step in this direction, focusing on aspects like trust, persistent identifiers and cost models, but a wide range of challenges remains to be solved, as traditional publication models will keep changing, from text and documents to "multi-versioned, multi-sourced and multi-media". (Presentation: APE2014_Smit)

As Peter Burnhill from EDINA, University of Edinburgh explained, continued access to the scholarly record is under threat now that libraries are no longer the custodians of the scholarly record in e-journals. As he phrased it nicely: libraries no longer have e-collections, only e-connections. His KEEPERS Registry is a global registry of e-journal archiving that offers an overview of who is preserving what. Organisations like LOCKSS, CLOCKSS, the e-Depot, the Chinese National Science Library and, since recently, the Library of Congress send their holding information to the KEEPERS Registry. Impressive as that is, it was also emphasized that this covers only a small part of the existing e-journals (currently about 19% of the e-journals with an ISSN assigned). More support for the preserving libraries, and collaboration with publishers, is needed to preserve the e-journals of smaller publishers and improve the coverage. (Presentation: APE2014_Burnhill)

(By Marcel Ras and Barbara Sierman)

Online scholarly communications: Van de Sompel and Treloar sketch the future playing field of digital archives


Published: Thu 23 Jan 2014

The Dutch data archive DANS invited two ‘great thinkers and doers’ (quote by Kevin Ashley on Twitter) in scholarly communications to do some out-of-the-box thinking about the future of scholarly communications – and the role of the digital archive in that picture. The joint efforts of DANS visiting fellows Herbert van de Sompel (Los Alamos) and Andrew Treloar (ANDS) made for a really informative and inspiring workshop on 20 January 2014 at DANS. (Re-Blog from KB Research blog, by Inge Angevaare, KB Research)


(a copy of) Rembrandt’s 17th-century scholar Dr. Tulp overseeing Herbert van de Sompel outlining the research world of the 21st century

Life used to be so simple. Researchers would do their research and submit their results in the form of articles to scholarly journals. The journals would filter out the good stuff, print it, and distribute it. Libraries around the world would buy the journals and any researcher wishing to build upon the published work could refer to it by simple citation. Years later and thousands of miles away, a simple citation would still bring you to an exact copy of the original work.

Van de Sompel and Treloar [the link brings you to their workshop slides] quoted Roosendaal & Geurts (1998) in summing up the functions this ‘journal system’ effectively performed:

  • Registration: allows claims of precedence for a scholarly finding (submission of manuscript)
  • Certification: establishes validity of claim (peer review, and post-publication commentary)
  • Awareness: allows actors in the system to remain aware of new claims (discovery services)
  • Archiving: preserves the scholarly record (libraries for print; publishers and special archives like LOCKSS, Portico and the KB for e-journals).
  • (A last function, that of academic recognition and rewards, was not discussed during this workshop.)

So far so good.

But then we went digital. And we created the world-wide web. And nothing was the same ever again.


Andrew Treloar (at the back) captivating his audience

Future scholarly communications: diffuse and ever-changing

Van de Sompel and Treloar went online to discover some pointers to what the future might look like – and found that the future is already here, ‘just not evenly distributed’. In other words: one discipline is moving into the digital reality at a faster pace than another, and geographically there are many differences too. But van de Sompel and Treloar found many pointers to what is coming and grouped them in Roosendaal & Geurts’s functional framework:

  • Registration is increasingly done on (discipline-specific) online platforms such as BioRxiv, ideacite (where one can register mere ‘ideas’!) and GitHub, a collaborative platform for software developers (also used by the KB research team).
    Common characteristics include:
    – Decoupling registration from certification
    – Timestamping, versioning
    – Registration of various types of objects
    – Machines also function as creators and contributors.
    (We’ll discuss below what these features mean for digital archiving)
  • Certification is also moving to lots of online platforms, such as PubMed Commons, PubPeer, ZooUniverse and even Slideshare, where the number of views and downloads is an indication of the interest generated by the contents.
    Common characteristics include:
    – Peer-review is decoupled from the publication process
    – Certification of various types of objects (not just text)
    – Machines carry out some of the validating
    – Social endorsement
  • Awareness is facilitated by online platforms such as the Dutch ‘gateway to scholarly information’ NARCIS, myExperiment and a really advanced platform such as eLabNotebook RSS, where malaria research is being documented as it happens and completely in the open.
    Common characteristics include:
    – Awareness for various types of objects (not just text)
    – Real time awareness
    – Awareness support targeted at machines
    – Awareness through social media.
  • Archiving is done by library consortia such as CLOCKSS, data archives such as DANS Easy, and, although not mentioned during the presentation, I may add, our own KB e-Depot.
    Common characteristics include:
    – Archiving for various types of objects
    – Distributed archives
    – Archival consortia
    – Audit for trustworthiness (see, e.g., the European Framework for Audit and Certification of Digital Repositories).

Very few seats remained unoccupied

Fundamental changes

Here’s how van de Sompel and Treloar summarise the fundamental changes going on. (The fact that the arrows point both ways is, to my mind, slightly confusing. The changes are from left to right, not the other way around.)

[Slide: the fundamental changes]

Huge implications for digital libraries and archives

The above slide merits some study, because the implications for libraries and digital archives are huge. In the words of Van de Sompel and Treloar:

[Slide: implications for digital libraries and archives]

From the ‘journal system’ we are moving towards what Van de Sompel and Treloar call a ‘Web of Objects’, which is much more difficult to organise in terms of archiving, especially because the ‘objects’ now include ever-changing software & operating systems, as well as data which are not properly handled and thus prone to disappear (notice on a student cafe door: ‘If you have stolen my laptop, you may keep it if you just let me download my PhD thesis’).


It’s like web archiving – ‘but we have to do better’

Van de Sompel and Treloar compared scholarly communications to websites – ever-changing content, lots of different objects (software, text, video, etc.), links that go all over the place. Plus, I may add, an enormous variety of producers on the internet. Van de Sompel and Treloar concluded: ‘We have to do better than present web-archiving methods if we are to preserve the scholarly record in any meaningful way.’


Two ‘great thinkers and doers’ confer – Herbert van de Sompel (left) and Andrew Treloar

‘The web platforms that are increasingly used for scholarship (Wikis, GitHub, Twitter, WordPress, etc.) have desirable characteristics, such as versioning, timestamping and social embedding. Still, they record rather than archive: they are short-term, without guarantees, read/write and reflect the scholarly process, whereas archiving concerns longer terms, is trying to provide guarantees, is read-only and results in the scholarly record.’

The slide below sums it all up – and it is with this slide that van de Sompel and Treloar turned the discussion over to their audience of some 70 digital data experts, mostly from the Netherlands:

[Slide: summary]

Group discussions about the digital archive of the future

So, what does all of this mean for digital libraries and digital archives? One afternoon obviously was not enough to analyse the situation in full, but here are some of the comments reported from the (rather informal) break-out sessions:

  • One thing is certain: it is a playing field full of uncertainties. Velocity, variety and volume are the key characteristics of the emerging landscape. And everybody knows how difficult these are to manage.
  • The ‘document-centred’ days, where only journal and book publications were rated as First Class Scholarly Objects, are over. Treloar suggested a move to a ‘researcher-centric’ approach, where First Class Objects include publications and data and software.
  • To complicate matters: the scholarly record is not all digital – there are plenty of physical objects to deal with.
  • How do we get stuff from the recording platforms to the archives? Van de Sompel suggested a combination of approaches. Some of it we may be able to harvest automatically. Some of it may come in because of rules and regulations. But Van de Sompel and Treloar both figured that rules and regulations would not be able to cover all of it. That is when Andrea Scharnhorst (workshop moderator, DANS) suggested that we will have to allow for a certain degree of serendipity (‘toeval’ in Dutch).
  • Whatever libraries and archives do, time-stamped versioning will become an essential feature of any archival venture. This is the only way to ensure that scientists can adequately cite anything and verify any research (‘I used version X of software Y at time Z – which can be found in a fixed form in Archive D’). A sketch of what such a citation record could look like follows after this list.
  • The archival community introduced the concept of persistent identifiers (PIDs) to manage the uncertainties of the web. But perhaps the concept’s usefulness will be limited to the archival stage. Should we distinguish between operational use cases and archival use cases?
  • Lots of questions remain about roles and responsibilities in this new picture, and who is to pay for what. Looking at the Netherlands, the traditional distribution of tasks between the KB National Library (books, journals) and the data archives (research data) certainly merits discussion in the framework of the NCDD (Netherlands Coalition for Digital Preservation); the NCDD’s new programme manager, Marcel Ras, attended the workshop with interest.
  • Who or what will filter the stuff that is worth keeping from the rest?
  • Interoperability is key in this complex picture. And thus we will need standards and minimal requirements (as, e.g., in the Data Seal of Approval).
  • Perhaps baffled by so much uncertainty in the big picture, some attendants suggested that we first concentrate on what we have now and/or are developing now, and at least get that right. In other words, let’s not forget that there are segments of the scientific landscape that are being covered even now. The rest of the landscape was characterised as ‘the Wild West’ by Laurents Sesink (DANS).

In this breakout session, discussions clearly focussed on the role of the archive. Selection: when and by whom? Roles and responsibilities?

  • What if the Internet fails? What if it succumbs to hacks and abuse? This possibility is not wholly unimaginable. But the workshop decided not to go there. At least not today.
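
Picking up the versioning point from the list above: here is a minimal, hypothetical sketch of what such a time-stamped, versioned citation record could look like. The field names and the identifier are purely illustrative, not an existing standard or any archive's actual API.

```python
# Hypothetical sketch of a fixed, citable record: "I used version X of
# software Y at time Z, preserved in archive D". Field names and the
# identifier scheme below are illustrative only.
import hashlib
import json
from datetime import datetime, timezone

def make_citation_record(pid: str, version: str, payload: bytes) -> dict:
    """Freeze identifier, version, timestamp and a fixity checksum."""
    return {
        "pid": pid,           # persistent identifier within the archive
        "version": version,   # the exact version that was used
        "used_at": datetime.now(timezone.utc).isoformat(),  # time Z
        "sha256": hashlib.sha256(payload).hexdigest(),      # for later verification
    }

record = make_citation_record("hdl:21.T99/example", "2.4.1", b"object bytes")
print(json.dumps(record, indent=2))
```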

In his concluding remarks Peter Doorn, Director of DANS, admitted that there had been doubts about organising this workshop. Even Herbert van de Sompel and Andrew Treloar asked themselves: ‘Do we know enough?’ Clearly, the answer is: no, we do not know what the future will bring. And that is maybe our biggest challenge: getting our minds to accept that we will never again ‘know enough’ at any time. While yet having to make decisions every day, every year, on where to go next. DANS is to be commended for creating a very open atmosphere and for allowing two great minds to help us identify at least some major trends to inspire our thinking.

See also:

  • tweets #rtwsaf Riding the Wave and the Scholarly Archive of the Future (the title referring to the 2010 European Commission report on scholarly communications, which was the last major report on the issue available).
  • Blog post by Simon Hodson

Where do we go from here? Peter Doorn asked his two visiting fellows in Alice-in-Wonderland fashion

 

ANADP 2013 meeting in Barcelona


Published: Sat 30 Nov 2013

The second meeting of Aligning National Approaches to Digital Preservation (ANADP) took place in Barcelona last week. The first meeting, in Tallinn, Estonia in 2011, resulted in an interesting publication (www.educopia.org/publications) with an overview of the state of the art, plus a series of recommendations for further discussion (6 in the short version and 47 in the extended one). To give a brief impression, I will mention a few important topics that kept coming up during these three days, in the panel sessions, the active working groups and the lectures by Clifford Lynch (Coalition for Networked Information), who gave the opening lecture, and Adam Farquhar (British Library), who gave the closing one.

Clifford Lynch looked back at what had been achieved since 2011. Plenty of "collaboration" ("often just a lot of talking"), certainly, but he warned that such collaboration can also lead to interdependency, which carries a risk: if something goes wrong elsewhere, you suffer too. So think carefully in advance about how far the collaboration should go. Another point concerned the boundaries of digital preservation. Are they perhaps too narrow? Shouldn't we worry about more than safe storage alone? About new forms of access, such as Europeana offers; about information that will be lost if we do nothing; about changed usage and different expectations among users. Adam Farquhar observed that most of the systems we currently use for digital preservation are designed to retrieve one object at a time, whereas the new generation of researchers sees our collections as "big data" and wants to do research across large numbers of objects.

A "devaluation of public goods" was felt not only in the US, reinforced by shrinking budgets. "Making the case for digital preservation" will become ever more important. That can be done in various ways: not only by showing what we preserve, but also by drawing attention to what is being lost right now, unnoticed. Do policymakers actually know what is at stake? Who cares about small local newspapers? Or about preserving public broadcasting, which in some countries hardly happens at all, even though it is an essential source for future researchers? Which research questions will no longer be answerable in the future? One example given: how does someone find out, ten years from now, how long travelling from A to B took? There are no printed railway timetables any more, and nobody preserves the railways' databases. It may help us that the general public is slowly beginning to realise that the traditional way of transferring ownership no longer works for digital objects. You no longer own your favourite music on Spotify or your favourite books on your Kindle, and you cannot bequeath them to your children.

A common objection is that we are hampered by copyright legislation. Lynch readily conceded this, but argued that as a digital preservation community we should try to reach agreement on "some sweeping statements" with which we can demonstrate the need for change directly, rather than losing ourselves in details.

And how do we show that we live up to our promises? Small organisations sometimes claim to be "sustainable" for a certain period, but who verifies that? Lynch remarked that every branch of industry suffers data loss, yet in our (library) world it never seems to happen. Setting up a "registry of failures" was discussed several times during the conference. But there is already a place where the horror stories of lost digital material can be told: www.atlasofdigitaldamages.info

"Economics, the nightmare of sustainability" (where, according to Lynch, "sustainability" is all too often interpreted as "somebody else needs to pay for this") was another recurring subject. Our answer may be related to the fact that we preserve public goods: people are entitled to continued access, it was created with public money, and losing it would be an enormous waste of investment. On the other hand, it is questionable whether we should put a great deal of energy into detailed cost models.

Luciana Duranti (InterPARES/CICRA) pointed out the importance of finding the right allies (cloud storage providers, for instance, should also belong to our digital preservation community, as should suppliers of systems and services) and of being in the right places, at UNESCO and at vendors' conferences, to tell our story and reinforce one another. Chris Greer (Research Data Alliance) likewise argued for closer ties with other disciplines, citing biomedical researchers who are now starting to store their collections sustainably; they could benefit from our knowledge.

Adam Farquhar summarised the trends in his closing lecture. We will be flooded with data, and yet we must manage to integrate digital preservation into our daily activities. We can no longer do that alone; it will lead to partnerships and to (healthy) competition with external parties that provide services. Researchers will use our digital collections in different ways, which requires adjustments to our systems (and, in my view, possibly to the OAIS model as well). But above all, the digital preservation community must radiate one consistent message: our activities are aimed not only at the use of digital material in the future, but also in the present.

Where do we go from here? It was unanimously felt that there is no need to set up yet another new organisation to promote "alignment"; there are plenty of existing partnerships we can use to work out the points above. (Have a look at cdb.io/17laZbO for collaborations.) There was, however, an appetite for talking about digital preservation in this strategic way again in a few years' time. I am looking forward to it!

Barbara Sierman, Koninklijke Bibliotheek

iPRES2013


Published: Fri 25 Oct 2013

The iPRES2013 conference took place in beautiful Lisbon, together with the Dublin Core 2013 conference. In total there were around 400 people from 38 countries. Each conference had its own program, but the three (shared) keynote speakers drew the attention of both the bibliographic people and the digital preservation people in the room, sketching their views on important challenges we need to work on collaboratively.

Gildas Illien (BnF) strongly advocated closer cooperation between bibliographic people and digital preservation people, as both are trying to make collections accessible, just from different angles. User expectations should be leading in both fields and, if so, will require more collaboration within organizations. Management needs to be convinced of this.
Paul Bertone from the European Bioinformatics Institute explained the recent breakthrough of storing data in DNA, which might be a solution for the massive storage of data.
And finally Carlos Morais Pires, from the European Commission, talked about Horizon 2020 and data infrastructures (and here, as libraries, we need to point out again and again that data is not restricted to scientific data generated by instruments, but also includes the big data collections in libraries and in data centres for the social sciences! Carlos Morais Pires immediately agreed and changed his slide).
All presentations, covering a wide range of aspects, can be found at http://purl.pt/24107.

There are simply so many aspects related to digital preservation (web archiving, preservation policies, open-source preservation systems, trust, storage, and so on) that I can only advise you to have a look at the URL mentioned above. Is there a trend to be discovered in all these presentations? To me, they demonstrate that there is a lot of national and international collaboration nowadays. European projects like Blog4Ever, SCAPE, APARSEN, ENSURE and Timbus, national initiatives like Goportis and international collaboration in the 4C project all bring together people from various disciplines. No longer is it only about libraries, archives and data centers: institutional repositories, health care and business are now also tackling the problem and presenting their views. The presentations reflect a greater self-confidence in the digital preservation community; we don't have answers to all the challenges, but we are developing a methodical way of dealing with them: developing standards, life-cycle models and cost models, monitoring the environment, borrowing from other communities to create tools, and so on. And most important of all, we know how to find each other.

But there was also another topic, raised mainly in discussions and during breaks: our own organisations. The elephant in the room is the fact that our organisations will need to deal with both analogue and digital material, while the expertise in dealing with analogue material is far more developed than the competence in dealing with digital material. Someone said to me, "these are different people". Maybe that is the case. Look at the sometimes heated debates about reading e-books versus preferring paper ones. I like both and don't think the paper book will disappear. So as a reader I will integrate both worlds and sometimes prefer a paper book over an e-book. This is the world we need to deal with, and organizations need to integrate both worlds. It will take training to have employees who are familiar with digital as well as print collections. This is a management challenge, but as digital preservation people we cannot close our eyes to it. We need to convince our management, and as keynote speaker Gildas Illien said (paraphrased by me): "We need to show our added value. Use the rest of the world to convince your management." This is how we as digital preservation people can exploit our existing collaboration structures!

Author: Barbara Sierman

Digital preservation: how are we doing as a community? – #iPRES2012 (7)


Published: Sat 06 Oct 2012


 

The question of how we are doing as a DP community was expressly posed by Paul Wheatley of Leeds University (and avowed digital preservation geek), but it was echoed in other presentations, panels and workshops throughout the conference. How are we doing as a community? Are we making good progress or getting stuck in old issues? Are we closing the gap between theory and practice? Are we pulling in knowledge and tools from other disciplines, especially IT? Are we getting things done? Obviously, these are not simple yes-or-no questions. But let me share some opinions that were voiced this week at iPRES2012. – by Inge Angevaare

Chairs were hurried in to accommodate the last-arriving participants at Wednesday's plenary session. Eventually, even the first row was completely filled up.

Paul Wheatley spoke at the first plenary session on Wednesday. Our host, Seamus Ross, had welcomed the participants by saying that he was impressed by the papers at this iPRES. “They show how far we have come as a field. This is rigorous, thoughtful work. It is all grounded.”

A show of hands indicated that about one third of the audience were from libraries, one third from archives, and one third from other organizations

Knight: “10 Years on we are still pretty much talking about the same things”

In the following plenary keynote, Steve Knight of the National Library of New Zealand took a different view. He pulled out his notes from previous conferences, especially iPRES2008, and concluded: "We are still pretty much talking about the same things. Tools like DROID and PRONOM etc. didn't work properly then, and they still don't work properly now." The wish list from this year's Future Perfect conference (New Zealand) did not differ that much from the wish lists of four or ten years ago. Knight noted a "mismatch" between standard documents such as OAIS and "what we have to do now", and quoted conclusions from the Aligning National Approaches to Digital Preservation (ANADP) conference that our present preservation systems are pretty much untested (see also my blog post then).

Steve Knight: “OAIS got us started, but it is not carrying us further.”

Wheatley: “We are duplicating efforts”

In his subsequent cRIsp presentation with Maureen Pennock, Paul Wheatley pointed to lots of duplication in digital preservation research. Now, a certain amount of duplication – or rather: trying out different models and pathways – is useful and necessary. But when you look at the list of initiatives on cost modelling shown by Wheatley …


… you do begin to wonder if all of that is really effective and necessary.

From hobbyists to artisans to industrialists

On the previous day, during the "research challenges workshop", Knight had presented a paper by Peter McKinney describing the various stages of development in digital preservation, from hobbyist to artisan to industrialist. McKinney argues that the transitions are not clear-cut: some organizations may still be in the hobbyist phase while others are in the artisan phase. But – and this is me speaking – it seems that very few of us are in the industrialist stage yet. Given the amount of data we are dealing with, though, that is where we need to go. McKinney proposes massive digital preservation "war games" based on real data to establish exactly where we stand and what research challenges we have. Now that would be a true international testing effort!

Michael Day of UKOLN and Sheila Morrissey of Ithaka at the research challenges workshop

At the workshop not everybody agreed with terms like hobbyism and artisanship. Sheila Morrissey of Ithaka (the organization behind the Portico digital archive) argued that there has been a lot of “industrialist” work – it just hasn’t been done by “us”, the digital preservation community at this conference. It’s the likes of Google and Amazon and major cloud service providers that are doing the “industrialist” work. Are “we” reinventing the wheel?

Are commercial services maturing?

Most memory institutions are not buying into cloud solutions from large commercial partners. I need only remind you of the recent ICA2012 speech by Michael Carden of the National Archives of Australia (see blog post): "this is core business, and we have to do it in-house." Library and Archives Canada Director General and CIO Ron Surette disagrees. "It is all about control. And you can still have control over information stored by commercial partners." What about "trust", then? Surette: "When those commercial companies fail, they have a much bigger problem than non-profit institutions. In that way, I tend to trust them more. But of course you should never rely on one supplier; you should always have redundancy."

Ron Surette (at right) talking with Amsterdam City Archives’ colleagues Jacob Takema (left) and Sander Ujzanovitch.

Perhaps wanting to do all the work ourselves is about hating to lose our jobs, Surette speculated. Interestingly, there was a poster at the conference about the US Chronopolis distributed preservation network, based at the San Diego Supercomputer Center (SDSC), entering into a behind-the-scenes collaboration with the not-for-profit cloud services provider DuraCloud. And during the APARSEN session on trust (see post), Raivo Ruusalepp suggested that, much as we may distrust commercial service providers' solutions, for many small organizations they may be the only way to achieve some type of preservation without massive investments. Ruusalepp is a consultant for the preservation of business archives in Estonia.

Raivo Ruusalepp during the APARSEN session

Lots of modelling and frameworking in European projects

On a personal note, I may add that the likes of DuraCloud are at least providing very practical solutions for redundant storage and data health checks – even though they may not go all the way in terms of trust and preservation planning. A lot of the work coming out of the major European projects (which accounted for something like 70-80% of the presentations!), by contrast, seemed to focus on more theoretical work: capacity models, evaluation frameworks, etc. These look impressive on a slide but sometimes – I admit it – go over my head (and not only mine).

Let 1000 flowers bloom, or more design?

At the end of his presentation Steve Knight asked whether we should let the proverbial 1000 flowers bloom or have more design in digital preservation, and showed some of the answers from the Future Perfect conference.

Knight also quoted a suggestion by Laura Campbell of the Library of Congress at Aligning National Approaches to Digital Preservation (2011, p. 29) to establish "an international preservation body or association that would focus on policy aspects of digital preservation. Such a coordinating body might be aided by an advisory group of experts to help identify what is most at risk and most important to preserve. This group could focus on content and changing forms of communication or trends in certain disciplines. Establishing a common index of already preserved content in a virtual international collection, regardless of where it is housed, could be a valuable service of such a coordinating body. It's not about the preservation body itself, it's about the results. Second, we might expand the notion of a national digital collection to an international digital collection. I think it's worth talking about how such a collection might be made accessible broadly."

To my mind, that is an interesting idea to develop. As Knight suggested, perhaps we can take our lead from the requirements for organizing digital preservation which were developed at ANADP.

For more info, see the Aligning volume, pp. 89-115.

During the closing session in the ballroom, we finally had all the space we wanted.

“Preservation is knowledge” revisited: promising work by the #SCAPE project – #iPRES2012 (6)


Published: Fri 05 Oct 2012

Others said it yesterday at iPRES2012, and Christoph Becker of TU Wien and the SCAPE project reiterated it on Friday: much of preservation and preservation planning is about knowledge. Nobody can predict the future, but as a community we can get a lot better at harnessing all the knowledge that is out there: experiences from colleagues, successes and failures, emerging trends, etc. All that knowledge can help us weigh alternatives and make informed decisions about our collections. We call it preservation watch, and the SCAPE project is doing promising work to make it easier. Full paper available here. – by Inge Angevaare

There were three presenters at Friday morning’s session, but the organizers made sure that all the other authors were there too – at least in spirit.

Any process to design a preservation plan basically looks something like the workflow Becker showed on his slides.

At every stage of this process we need information from the outside world: what is happening out there? What alternatives do we have?

The answers are not easy to find, as they are scattered all over the place.

The SCAPE project has set out on the ambitious road of bringing the information from all these sources together and providing an automated monitoring service.

The word "automated" is crucial here, because the tool's forerunner, the PLANETS PLATO tool, required too much manual work to be usable at any scale. The whole purpose of the SCAPE project is to make the PLANETS tools scalable, which inevitably involves automation. Becker listed the goals of SCOUT, the monitoring service, on a slide.

One of those goals is a very interesting feature of the tool: automatic alerts when something happens in the world that affects your collections. It is the stuff most digital archives are dreaming about!
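
To make the idea tangible, here is a small, purely hypothetical sketch of what such a watch check could boil down to. This is not SCOUT's actual code or API; the format identifiers and the "risk feed" below are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical preservation-watch check: compare a collection's format
# profile against an external feed of at-risk formats and raise alerts.
# Identifiers and feed contents are invented for illustration.

@dataclass
class Alert:
    format_id: str
    reason: str

# Format profile of the collection (format id -> file count), e.g.
# aggregated from characterisation output.
collection_profile = {"fmt/example-1": 12000, "fmt/example-2": 300}

# External knowledge, e.g. harvested from a format registry or community feed.
at_risk_formats = {"fmt/example-2": "no current rendering software known"}

def watch(profile: dict, risks: dict) -> list:
    """Return an alert for every at-risk format present in the collection."""
    return [Alert(fid, reason) for fid, reason in risks.items() if fid in profile]

for alert in watch(collection_profile, at_risk_formats):
    print(f"ALERT: {alert.format_id} is at risk: {alert.reason}")
```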

Personally, I am impressed by this work, but I also think that it is very ambitious. So I asked Becker if this was going to be the umpteenth project to do a lot of work on setting something up, only to drop everything when funding runs out. Becker is counting on the Open Planets Foundation to support the tool after the project is finished. That will certainly help. But Becker agreed that much – if not everything – depends on involvement from the community to actually supply all the information needed. On the face of it, it would be worth our while: wouldn't we all want this?


In the iPRES2012 audience, not everybody was immediately convinced. Peter Doorn of the DANS data archive would really like to see some applications of the service before deciding on its suitability for DANS. Perla Innocenti of HATII asked how flexible the tool would be. Becker promised that SCAPE is targeting its tools at real users, so they have to be practical. Let's hope that SCAPE will deliver on that!

C3PO, a policy model and “MyExperiment” are some of the other tools SCAPE is developing. See the website for more details (I missed those presentations [those darn parallel sessions] but the papers will be forthcoming in the conference proceedings in the next few weeks.)

Conference host Seamus Ross manning the Lost and Found desk during the closing session.

PS: the conference may be over, but my coverage will continue for another week or so. I have plenty of notes & pictures, but it will take a little time to process all of the information. Stay tuned.

“Preservation is knowledge” – or: cRIsp it! – #iPRES2012 (5)


Published: Fri 05 Oct 2012

All of us together know a whole lot about file formats, data structures and relevant standards, and about tools to interpret digital objects – in other words: representation information. But the information is scattered in many places. The formal registries do not seem to mature quickly enough – there is duplication, lack of engagement, lack of content, and lack of use. “That is a big fail for our community,” said Paul Wheatley on Wednesday. Together with Maureen Pennock (and Andrew Jackson) he presented a new, very light-weight crowdsourcing tool that is intended to bring that knowledge together and make it more accessible: cRIsp. – by Inge Angevaare

The nice thing about cRIsp is that you can contribute to the pool of knowledge through social media such as Twitter or through a Google form. You simply send the URL with relevant information to bit.ly/crisp-dp.

The idea, Pennock said, is "first to get the data, then to make it useful and make it powerful. It is a bottom-up approach. We first want to get the prose, and then we hope to move on to RDF and linked data."

Here is how it works: the data that come in are manually curated and then picked up by web archives for long-term preservation.

In a jam-packed Giovanni room, some questioned the amount of trust such a crowdsourcing approach could generate.

A jam-packed room on Wednesday morning for cRIsp

But Pennock was confident: “We rely on the wisdom of the crowd to correct any inaccurate input.” Kevin Ashley of the UK Digital Curation Centre agreed with that assessment. He told the room that he had good experiences with crowdsourcing initiatives.

Paul Wheatley and Maureen Pennock: “This is perhaps not a complete solution, but it is a beginning. We need the community to make it a success.”

With little or no funding behind it, the success of this project will depend on: