Digital preservation is not a single tool, but an array of methods and tools. Most digital preservation work orients itself around the NDSA's Levels of Digital Preservation. The key foundation of digital preservation is keeping many copies of files stored in many locations, but other practices help as well, such as good record keeping (metadata), checksums (checks for file changes), and file migration.
The first part of this guide focuses on website preservation. Website preservation includes best development practices, because practitioners have learned that the most essential foundation of good website preservation is following development standards.
The second, more general part of this guide covers backing up files. The two parts inform and support each other, because WARC (Web ARChive) files, as well as HTML, PHP, and MySQL files, all need to be backed up.
Files should be platform agnostic whenever possible, so check out the Library of Congress's Recommended Formats page.
Keeping files in multiple locations can mean keeping files on your computer, on an external hard drive, or in a digital storage service like Google Drive, Dropbox, or even the Internet Archive (archive.org) (please see the following for more info). Enterprise-level solutions (discussed below) often use Amazon cloud storage as well.
Checksums are akin to digital fingerprints for a file. A software product can create a checksum for a file, and then, after the file is moved or changed in some way, an analysis can be performed to confirm that the checksum is the same as it was prior to the move or change. This confirms that the file has not been altered or corrupted. Checksums can be created and monitored using many tools; the Library of Congress's BagIt is possibly the most reliable, and there are also Bagger and AVP's Fixity Pro.
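Creating checksums can also be scripted. Below is a minimal sketch using the bagit-python library (installed with pip install bagit); the folder name and contact name are placeholders.

import bagit

# Turn the folder into a "bag" in place: the files move into a data/ subfolder,
# and manifest files listing a checksum for every file are written alongside it.
bag = bagit.make_bag("my_project", {"Contact-Name": "Your Name"}, checksums=["sha256"])

# The manifest maps each file to its recorded checksum.
print(bag.entries)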
More robust, enterprise-level solutions for digital preservation include Preservica, Rosetta, and Archivematica (coupled with digital storage). These products perform many digital preservation tasks (such as generating and verifying checksums, performing file normalization, generating duplicate copies, etc.) in one user-friendly product.
Websites
Website preservation can mean many things.
The ability to archive a website is partially based upon how well a site has been developed, so it is suggested that you review our best practices web development guidelines.
The Internet Archive might capture your website, but there is no guarantee that its crawler will find your site, and even if it does, there are known fidelity issues. Using the Save Page Now feature, users can point the Internet Archive to their web pages, but unlike users of the paid subscription service Archive-It, users of Save Page Now have limited control over scoping the crawl. Institutions that opt for a paid subscription to Archive-It can select a range of websites they would like to crawl and save, as well as the frequency of those crawls, in order to capture updated content.
If your project is a website, a Web ARChive (WARC) capture of your website is a standard approach to archiving. Using Conifer, visit each page that you want to capture. Download the capture as a WARC file, then test it using ReplayWeb.page before including it as part of your deposit in CUNY Academic Works. Note: you must play the entire recording of a video or audio file during the Conifer capture if you want it to be captured in the WARC file.
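The workflow above only requires Conifer and ReplayWeb.page, but if you would like a quick scripted check of which pages ended up in your WARC file, the warcio Python library (an extra tool, not part of the deposit workflow) can list the captured URLs; the file name below is a placeholder.

from warcio.archiveiterator import ArchiveIterator

# Print the URL of every captured response in the WARC file.
with open("my-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))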
If you want to archive your website beyond merely creating a WARC file, for fuller reproducibility, here is a list of the most common platforms and the platform-specific files and directories you should archive.
WordPress:
Download the MySQL database
Download the /wordpress/wp-content/themes/ directory
Download the /wordpress/wp-content/uploads/ directory
Make a list of all plugins (with version numbers and hyperlinks) and include it in the documentation
Zip the database, themes, and uploads directories with the documentation
Include the MySQL, PHP, and WordPress (include link https://wordpress.org/about/history/) versions in the documentation (a scripted sketch of these steps follows this list)
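As a rough illustration of the WordPress steps above, here is a minimal sketch using Python's standard library. The database name, credentials, and paths are placeholders, mysqldump must be available on the server, and the exact wp-content location depends on your installation.

import pathlib
import shutil
import subprocess

backup = pathlib.Path("wordpress_backup")
backup.mkdir(exist_ok=True)

# 1. Export the MySQL database (mysqldump will prompt for the password).
with open(backup / "wordpress.sql", "w") as out:
    subprocess.run(["mysqldump", "-u", "wp_user", "-p", "wordpress_db"],
                   stdout=out, check=True)

# 2. Copy the themes and uploads directories.
shutil.copytree("/var/www/wordpress/wp-content/themes", backup / "themes")
shutil.copytree("/var/www/wordpress/wp-content/uploads", backup / "uploads")

# 3. Zip everything (add your plugin list and version documentation first).
shutil.make_archive("wordpress_backup", "zip", backup)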
Omeka:
Download the MySQL database
Download the /omeka/themes/ directory
Download the /omeka/files/ directory
Make a list of all plugins (with version numbers and hyperlinks) and include it in the documentation
Include the MySQL, PHP, and Omeka (https://github.com/omeka/Omeka/releases) versions in the documentation
Zip the database, themes, and files directories with the documentation
Scalar:
Download the MySQL database
Download the /scalar/yourdirectory/media directory
Include the MySQL, PHP, and Scalar versions in the documentation
Zip the database and media directory with the documentation
HTML:
Use HTTrack to capture the site, then zip all HTML, CSS, and media files (jpg, wav, mp4, etc.); see the sketch below.
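For the HTML case, a minimal sketch follows, assuming HTTrack is installed and on your PATH; the URL and directory names are placeholders.

import shutil
import subprocess

# Mirror the site (HTML, CSS, images, and other media) into ./site_mirror.
subprocess.run(["httrack", "https://example.org/", "-O", "site_mirror"], check=True)

# Zip the mirror for deposit alongside your documentation.
shutil.make_archive("site_mirror", "zip", "site_mirror")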
Best Practices for Web Development
The ability of web crawlers to capture your pages depends in part on web creators following best development practices.
Here are a few things to think about:
The Internet Archive's crawler preserves web pages and sites, but for the best results:
Delete or modify the robots.txt file to allow crawling, and test it with Google's robots.txt tester (see the example at the end of these tips).
Use local copies of fonts, CSS, and JavaScript to fully encapsulate the site and avoid external dependencies.
Make sure every page and media element has a unique URL (avoid platforms such as Wix and Squarespace).
Avoid orphaned pages and link rot by maintaining stable URLs.
Avoid proprietary formats.
Include an XML sitemap.
Avoid architecture that relies on redirects; redirects are often not scoped by most web crawlers.
Architecture that depends upon dynamic content is typically not captured by crawlers. For example, Scalar content has many URLs that contain '?path='. Crawlers do not capture elements that require a user's input; searches are inherently dynamic and require user interaction.
Embedded JavaScript is often hard to archive, especially if it generates links without including the full name in the page. Also, if JavaScript needs to contact the originating server in order to work, an attempt at archiving might fail. An interesting example.
Crawlers can only crawl the publicly accessible web, so avoid password protection unless you absolutely need it.
Using responsive design ensures that archive users will continue to have a comparable experience of the original website, regardless of the platform they use for access.
Providing equivalent text for non-textual content can facilitate both search-crawler indexing and later full-text search in the archive. Here are some useful tools for determining accessibility:
Future proof: To increase the chance that future browsers will be able to interpret today's code, validate against current web standards: http://validator.w3.org/
In addition to enabling proper rendering of the web page, setting the character encoding in the HTTP header (for example, Content-Type: text/html; charset=utf-8) allows for successful capture and rendering of the archived copy, ensuring readability of the displayed text. It informs the browser of the character set being used. See:
https://www.w3.org/International/questions/qa-what-is-encoding and https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
Social media feeds, calendars, and other 'infinite scrolling' gadgets can create a structural issue within a website that causes crawlers to find a virtually infinite number of irrelevant URLs. In theory, a crawler could get stuck in one part of a website, never escape, and never finish crawling the pages that actually matter.
Because they are interactive, complex interactive maps are typically not good candidates for web archiving. ArcGIS, StoryMaps, and Flash compositions are difficult for the Internet Archive to preserve.
Whenever possible, rather than embedding links, host media (multimedia, video, audio) content locally. Or, host media on the Internet Archive (archive.org) and embed the Internet Archive URL in your current website.
Streaming media platforms (YouTube, Vimeo, and SoundCloud) are not built for long-term preservation. YouTube videos are easier to preserve with the Internet Archive crawler than Vimeo videos. Each YouTube video can appear only once on the entire site, or the crawler will not capture either instance of the same video. Vimeo embeds can be preserved with the Internet Archive, but only one Vimeo video can be embedded on each page.
WebRecorder preserves Scalar better than the Internet Archive does.
To archive searches, collect the URLs of popular search result pages and add them to a page on your site. The Internet Archive crawler might be able to capture these searches.
Interactivity is not easily preserved by the Internet Archive. Build a static rather than a dynamic site, capture the interactive aspects of the site as a screen recording, and post the video on the site. WebRecorder captures interactivity better than the Internet Archive does.
WebRecorder does not archive Tableau visualizations.
WebRecorder does not archive Vega-Lite.
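To illustrate the robots.txt and sitemap suggestions above, the most permissive robots.txt simply allows all crawlers and points them to your sitemap (the sitemap URL below is a placeholder for your own site):

User-agent: *
Disallow:

Sitemap: https://example.org/sitemap.xml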
Tools:
Guidelines for Preserving New Forms of Scholarship
Is your site ArchiveReady?
More:
Known Web Archiving Challenges
Stanford Libraries Best Practices
How-to Guides
The Digital Preservation Coalition Data Type Technology Watch Guidance Notes:
A few simple practices can help save your digital data:
1. Keep many copies (software, databases, data, WARC files) in different locations (local hard drives, Google Drive, Dropbox, etc.).
2. Use BagIt to create checksums (digital fingerprints); monitor them and replace a file with one of your backups when BagIt reports bit corruption.
3. Migrate file formats if and when they become obsolete.
4. Repeat steps 1 and 2.
Understanding Checksums
Each file will have a unique checksum (fingerprint) after running BagIt.
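A minimal sketch of checking a bag with the bagit-python library follows; the directory name is a placeholder for a bag you created earlier.

import bagit

bag = bagit.Bag("my_project")
try:
    # Recompute every checksum and compare it to the manifests.
    bag.validate()
    print("All files match their recorded checksums.")
except bagit.BagValidationError as err:
    # A file changed or went missing: restore it from one of your backup copies.
    print("Corruption detected:", err)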
Tools for creating checksums:
Software Standards
Successful archiving of software and data is dependent upon good data management practices. Here are a few things to remember:
Every project should have its own root directory.
Use separate folders for data, images, and scripts.
Comment the code. Include a readme with a file manifest and codebook; installation instructions with requirements and dependencies; operating instructions; copyright and licensing info; contact info; known bugs; troubleshooting; acknowledgments; and news. A file manifest is a simple listing of files and directories.
Include a Codebook with the following info:
Variable Names - the short column header in the data. No spaces, no symbols, maybe some numbers.
Variable Labels - the full description of a variable, clarifying the variable name and allowable value ranges.
Missing Data - how do you know whether data is omitted versus missing? Common conventions include NA, ., #N/A, -88, -99, and -999. Be consistent.
Date Variables - use standard date-time formats, such as ymd_hms (2019-11-05 13:15:00 UTC) or YYYYMMDD. Be consistent (see the sketch at the end of this section).
See the following:
Use one variable per column
Copyright considerations: if hosting or distributing, ensure that you 'own' or have permission (for example, a Creative Commons license) to use others' work.
Do not encrypt or compress files.
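To illustrate the codebook conventions above (consistent missing-data codes and date formats), here is a minimal sketch using pandas; the file name, column name, and missing-data codes are placeholders for your own project.

import pandas as pd

# Treat the missing-data codes documented in the codebook as missing values.
df = pd.read_csv("survey_data.csv", na_values=["NA", "#N/A", -99])

# Parse a date column stored in a consistent YYYY-MM-DD format.
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%Y-%m-%d")

print(df.dtypes)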
Digital Services Librarian, CUNY Graduate Center
Digital Initiatives Librarian, Baruch College
Digital Archivist, Centro: Center for Puerto Rican Studies, Hunter College