By Christopher Campbell

He put down his tablet and took a gulp of coffee. Startled by how cold it had become, he glanced at the clock. He had been researching for hours, and he had finally found what he sought: an email, a smoking gun that proved his theory about political complicity in the 2007 corporate scandal he was studying. The problem was that the evidence was too good. The scandal was still a contentious topic, even now in 2022. Having the email was not sufficient; he would need to prove its veracity. As a late twentieth century historian, he often worked with born digital documents from before the millennium, and through that research he had faced the many challenges posed by digital sources. The digital documents he sought had often been deleted, or the single surviving copies had been corrupted by backup tape degradation. Even among the recoverable files, many documents were in proprietary formats created by old software such as WordStar. It was also difficult to determine whether the artifacts he had were final documents or earlier drafts. Finally, even when he had a document or an email in hand, it was hard to authenticate it beyond the chain of custody of the storage medium. Just thinking about the problems with digital sources from the late twentieth century made his head ache. Would finding and authenticating a born digital document from 2007 be just as difficult?

The challenges presented by digital artifacts are well known. In June 2003, the digital historian Roy Rosenzweig published “Scarcity or Abundance? Preserving the Past in a Digital Era” in the American Historical Review. In the article, Rosenzweig provided a high-level analysis of the challenges of preserving and authenticating born digital artifacts and discussed the preservation solutions that existed at the time. In the years since its publication, the digital humanities have progressed, and practitioners such as Johanna Drucker have encouraged the adoption of complex digital solutions without being mindful of digital preservation issues. Although other historians have voiced concerns about the need for authentication and preservation, there has been little practical analysis of the technical complications of achieving either. In the meantime, the landscape of modern computing has changed, and it will continue to change; technology is, by its nature, an evolutionary process. Some of these changes have addressed concerns with preservation and authentication, while others have introduced new issues. This article examines preservation in the modern computing environment and discusses the types of data that future historians will have access to and the issues they will face in employing these resources.

Adoption of Common Data Formats

Proprietary data formats have been one of the largest issues in artifact preservation. When the software industry first emerged, there were no standards for file formats; manufacturers developed software that stored its data in proprietary formats. As a result, digital artifacts exist in a myriad of data formats. Past solutions to this issue have ranged from preserving the operating environment able to open a format (the program, the operating system, and possibly even the hardware) to converting the data files to modern formats. For programs such as WordStar, which are already nearly four decades old, conversion is more practical than attempting to preserve or emulate forty-year-old hardware. There are many options for conversion. In the case of WordStar, conventional office suites such as Microsoft Office have included converter packs for backwards compatibility. There are also software packages called ETL (Extract-Transform-Load) tools, which extract data from older formats and convert it into newer ones. Some historians oppose the conversion of digital sources because they prefer to see content in its intended form, but with WYSIWYG interfaces, as long as the formatting remains and the document looks the same, it is in its intended form. Opposition to conversion also ignores the fact that modern computing relies on interoperability. Data conversion is common and has long been a key practice in computing: data processing service bureaus have existed since the mainframe era and have provided conversion services to foster inter-system compatibility. For example, with the System/360 line, IBM introduced the Extended Binary Coded Decimal Interchange Code (EBCDIC), an 8-bit character encoding. Other computing platforms used the American Standard Code for Information Interchange (ASCII), a 7-bit character encoding. Conversions between the two were a common and necessary part of working across platforms.
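
As a simple illustration of that kind of character-set conversion, the minimal sketch below decodes EBCDIC bytes and re-encodes them as UTF-8 using only Python's built-in codecs. It assumes the source uses IBM code page 500; a real archive might use another EBCDIC variant such as cp037.

```python
# A minimal sketch of EBCDIC-to-modern conversion using Python's standard codecs.
# Assumes IBM code page 500; real archives may use another EBCDIC variant.
ebcdic_bytes = "HELLO, WORLD".encode("cp500")   # stand-in for bytes read from an old data file

text = ebcdic_bytes.decode("cp500")             # interpret the EBCDIC bytes as text
utf8_bytes = text.encode("utf-8")               # re-encode in a modern, widely supported encoding

print(text)        # HELLO, WORLD
print(utf8_bytes)  # b'HELLO, WORLD'
```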

Since the widespread adoption of the Internet, modern computing has begun to coalesce around standardized file formats. For word processing and spreadsheet documents, vendors have adopted standards such as ECMA-376 (Office Open XML), published by Ecma International (formerly the European Computer Manufacturers Association). This format stores documents as text-readable XML (Extensible Markup Language), which greatly reduces format preservation concerns. Some companies, such as Apple and Microsoft, have introduced proprietary aspects into newer formats, but these are largely for the purposes of Digital Rights Management (DRM). Despite new technologies and industries, many of the core formats have remained in place. Introduced in the early 1990s, the MPEG media formats remain in use for audio (MP3) and video (MPEG-2, MPEG-4). Introduced by Adobe in 1993, the Portable Document Format (PDF) is now an open standard commonly used for document interchange. Instead of introducing new formats, market competitors tend to adopt and support existing ones. This makes it much more likely that digital artifacts stored in these common formats will remain accessible through a variety of software applications.
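
Because ECMA-376 documents are simply ZIP containers full of XML, their contents can be inspected with general-purpose tools rather than a particular office suite. The sketch below, using a hypothetical filename, extracts the text of a .docx body with nothing beyond Python's standard library.

```python
# A minimal sketch: peeking inside an ECMA-376 (Office Open XML) .docx file.
# The filename is hypothetical; the document body lives in word/document.xml
# as plain, text-readable XML.
import zipfile
import xml.etree.ElementTree as ET

WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

with zipfile.ZipFile("example.docx") as docx:      # hypothetical file
    xml_body = docx.read("word/document.xml")      # the document body is ordinary XML

root = ET.fromstring(xml_body)
# Collect the text runs (<w:t> elements) without any proprietary tooling.
text = "".join(node.text or "" for node in root.iter(f"{WORD_NS}t"))
print(text)
```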

For data interchange, flat text files remain popular. The Comma-Separated Values (CSV) format, first introduced in 1967, remains a common data interchange format. Newer approaches, such as Extensible Markup Language (XML), and industry-specific interchange formats, such as healthcare's HL7, are still built around flat text. This means that the file is essentially human readable in its base form. No specific programs need to be maintained for it; at worst, the data can be easily parsed in any programming language.
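
As a quick illustration of that point, the sketch below parses a CSV file with nothing beyond Python's standard library; the filename and column names are hypothetical.

```python
# A minimal sketch: flat text formats such as CSV need no special software.
# The filename and column names below are hypothetical.
import csv

with open("ledger.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):        # the first row is treated as the header
        print(row["date"], row["amount"])     # hypothetical column names
```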

The adoption of standard file formats has lessened the likelihood that born digital documents will be unreadable in the future. Proactive practices like file conversion can further alleviate these digital preservation issues.

Legal Changes

Since the appearance of digital computing, governments have struggled to develop legal constructs that adequately account for technology. These laws have often had monumental effects on the technological landscape itself. For example, the Copyright Act of 1976 was a factor in the commoditization of software. The law defined copyright protections for software while also affording software consumers a number of rights, including the right to modify or resell programs they purchased. To circumvent these consumer rights, software vendors created self-executing End User License Agreements (EULAs), creating an industry in which developers sold software licenses rather than the programs themselves. Just as this law helped create a technology market segment, recent laws have had a pronounced effect on technology and, with it, on the preservation and authentication of digital artifacts.

Regulatory Compliance

As information technologies have become more commonplace, so have issues related to privacy and security. To address these issues, and to provide better corporate oversight in the wake of major corporate scandals such as Enron, governments have created regulations that govern the handling of certain types of data. Each of these regulations covers a different aspect of industry, but they require similar steps to achieve compliance. For example, in the United States, the 1996 Health Insurance Portability and Accountability Act (HIPAA) governs the accessibility, security, and retention of protected health information (PHI). Similarly, the Sarbanes-Oxley Act of 2002 (SOX) sought to provide oversight of publicly traded companies and set requirements for the security, auditing, and preservation of financial data. Other jurisdictions, such as the European Union, have adopted broader, non-industry-specific measures, such as the Data Protection Directive (Directive 95/46/EC), to achieve similar goals. These laws specified broad implementation timelines that required organizations to come into compliance progressively.

These laws have altered the handling of industry-specific data, but they have also fundamentally changed how organizations handle general-purpose systems such as email. These alterations in turn affect how historians will be able to access and authenticate email artifacts from this era.

In its early days on UNIX-based systems, email was stored in mbox or maildir format within each individual's home directory. This made the messages extremely easy to read and to modify (a minimal reading sketch appears after the list below). With the adoption of personal computers, it became more common to access email remotely using clients that support POP (Post Office Protocol) and IMAP (Internet Message Access Protocol). With POP, messages were downloaded to the user's computer, placing the user directly in the chain of custody for the artifact and making it susceptible to user modification. With IMAP, messages were kept on the mail server, with the local client holding a cached copy. This approach increased the options for preserving emails, as each message existed in multiple locations:

  • The sender's outbox
  • The inbox of each recipient
  • The local client of each recipient (header or full message)
  • The local client of the sender (header or full message)
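
To illustrate how exposed those early UNIX mail stores were, here is a minimal sketch that reads an mbox file with Python's standard mailbox module (the path is hypothetical); because the store is an ordinary text file in the user's home directory, altering it is no harder than reading it.

```python
# A minimal sketch: an mbox store is just a plain text file, readable (and
# editable) with ordinary tools. The path below is hypothetical.
import mailbox

inbox = mailbox.mbox("/home/alice/mail/inbox")   # hypothetical mbox file
for message in inbox:
    print(message["From"], message["Subject"])   # headers are plain RFC 822 text
```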

Historians looking at email artifacts on these systems authenticate the messages by comparing the contents of the sender's and the recipient's mailboxes. If the content of an email is the same on both systems, it suggests that the artifact is unaltered. To authenticate that a message came from a specific mail system, a historian reviews the message's headers and compares them to the mail system's delivery log. If both sources agree, it substantiates the metadata about the transmission: which account sent the message, which mail servers it passed through, when it was sent, when it was received, and so on. There are two problems with this, however. First, messages can be deleted with no record made of the expunged data. Second, mail server logs grow very large very quickly and as a result are usually retained only briefly. Without this information, authenticating email artifacts is difficult.
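
As a rough sketch of the header side of that comparison, the snippet below uses Python's standard email package to pull the Message-ID, Date, and Received headers from a stored message. The filename is hypothetical; in practice these values would be matched against the server's delivery log.

```python
# A minimal sketch: extracting the headers a historian would compare against
# a mail server's delivery log. The filename is hypothetical.
from email import policy
from email.parser import BytesParser

with open("smoking_gun.eml", "rb") as handle:              # hypothetical message file
    msg = BytesParser(policy=policy.default).parse(handle)

print("Message-ID:", msg["Message-ID"])
print("Date:", msg["Date"])
for hop in msg.get_all("Received", []):                    # one header per mail server traversed
    print("Received:", " ".join(hop.split()))
```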

New compliance requirements increase the likelihood that future historians will have the information they need to authenticate sources such as email. Laws such as SOX require companies to track email activity, archive emails, and retain deleted content for at least seven years. As a result, audit trail features have become common in most commercial mail systems, and email archiving appliances have become a popular way to address these compliance concerns. By increasing the likelihood of artifact preservation through prescribed retention periods, and by providing authentication mechanisms such as audit trails, these regulations are a boon to historians who seek to work with email sources.

Compliance laws have also created new challenges for historians who seek to access regulated artifacts such as email. For example, HIPAA requires that emails containing PHI be encrypted, both in transmission and at rest. As a result, emails with PHI may be inaccessible to historians who seek their contents. That said, within the United States this HIPAA control is less of a concern, as the U.S. government has launched programs to encourage the adoption of electronic medical records (EMRs). One of the compliance measures for the meaningful use of these systems requires the adoption of secure messaging within the EMR itself. This takes the encryption requirement largely out of the email arena and places it within the EMR systems, which already encrypt stored data.

Copyright and Regulated Data

Beyond the authentication and preservation of digital artifacts, legal changes have introduced complications for historians who seek to use and interpret digital sources, often limiting their access to and use of digital documents.

The 1998 Digital Millennium Copyright Act (DMCA) sought to update United States copyright law to make it compatible with international intellectual property agreements. In practice, the law significantly changed the copyright handling of digital media. Where traditional copyright law protected content, the DMCA restricts the use of content and limits the mechanisms by which it may be accessed. As a result, it may be unlawful for historians to use digital sources even when they have access to the information. For example, Princeton University professor Edward William Felten was pursued by the Recording Industry Association of America (RIAA) and the Secure Digital Music Initiative (SDMI), which sought to prohibit him from publishing a paper about his successful participation in the SDMI's public decryption contest. The SDMI had invited efforts to circumvent its encryption, yet it still moved to suppress the research that discussed the results.

In addition, laws such as HIPAA prohibit access to patient data without the explicit permission of the patient. Typically, the only parties that can legally access this protected information are the patient's healthcare providers and their business associates. Under HIPAA, in order to use this data for research, a historian would need to complete a HIPAA Data Use Agreement, which grants access only to a limited, de-identified data set.

The USA PATRIOT Act and Government Surveillance

In the weeks after the September 11 attacks, the U.S. Congress passed the USA PATRIOT Act, which vastly expanded the nation's surveillance laws while removing oversight and accountability from the intelligence agencies. Following the law, the U.S. government created a mass surveillance apparatus that tracked the communication activities of the nation's citizens. Revelations from whistleblowers such as Edward Snowden have shown the breadth of the state's surveillance, including the ease of warrantless surveillance via the PRISM program. Since this initial disclosure, public understanding of the scope of the government's monitoring has continued to grow. Because of this extra-constitutional spying, organizations and individuals have become more interested in protecting themselves from government surveillance. This has led to increased adoption of encryption technologies such as PGP (Pretty Good Privacy). The desire for security has also increased interest in anonymous networks called “dark nets”; solutions such as Tor (The Onion Router) provide anonymous web access as well as hidden networks.

Historians are going to have a difficult time with digital sources from individuals and groups who embrace these technologies. The same tools that hide activities from governmental intrusion also secure digital artifacts from future review. Dark nets eliminate the metadata that could be used to authenticate sources, and encryption prevents access to the digital sources themselves. So far, adoption of these technologies has been limited by their technical complexity. Should concerns over government surveillance continue to grow, the demand for more accessible privacy technologies will increase, and the availability of usable digital sources will decrease with greater adoption of these surveillance countermeasures.

Between government monitoring, highly regulated information such as HIPAA-protected patient data, and digital sources subject to restrictive laws such as the DMCA, historians must be mindful of how these laws affect their ability to access and use digital sources.

Technological Advances

In addition to these new laws, there have also been technological advances that affect the authentication and preservation of digital sources. As with the laws, these changes both help and hinder historians in the use of digital sources.

Disk Space Increases

The first hard drive, introduced in 1956, held a paltry 3.75 MB, and drives have increased in capacity steadily since then. By the early 2000s, the computer industry struggled to produce hard drives with usable space beyond 128 GB. (Early personal computers used Integrated Drive Electronics (IDE) hard drives limited to 28 bits of block address space, which maps to roughly 128 GB of usable capacity.) Newer standards, including the Serial AT Attachment (SATA) interface and 48-bit addressing, overcame this limitation by allowing far larger address spaces. By 2014, it had become common for even home computers to have hard drives holding more than a terabyte of data.
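
As a quick back-of-the-envelope check of that ceiling, 28 bits can address 2^28 sectors of 512 bytes each, or 128 GiB (about 137 GB in decimal units):

```python
# The 28-bit IDE/ATA addressing ceiling: 2**28 addressable sectors of 512 bytes each.
sectors = 2 ** 28
total_bytes = sectors * 512
print(total_bytes)              # 137438953472 bytes (~137 GB in decimal units)
print(total_bytes / 2 ** 30)    # 128.0 GiB
```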

This growth has led to several changes for digital historians.

  • The preservation of digital artifacts by copying them to new hard drives has become much easier. Transferring data from media such as floppy disks is slow and laborious, but once the data has been moved to a hard drive, copying it to newer drives is simple. This eliminates much of the slowness and effort previously required to retain digital artifacts on newer platforms.
  • The larger storage space increases the likelihood that data will be available. When storage is plentiful, system administrators are less concerned with storage quotas (and with them, the deletion of digital artifacts).
  • Multiple copies will likely be available. Plentiful disk space has encouraged disk-to-disk backups as well as data replication for redundancy. Software such as Microsoft's Distributed File System provides high availability for network shares by replicating data between servers. As a result, multiple copies of each digital artifact will exist (a checksum sketch for verifying that such copies match follows this list).
  • Authentication logs will likely be available. With increased disk space, the quick deletion of access logs has become less common. Instead, logs are now centralized, stored, and analyzed using software such as Splunk or Nagios Log Server, allowing better detection of performance and security issues.
  • Iteration tracking concerns are reduced. Part of the concern with using digital artifacts is that they are point-in-time representations: a file may have had different forms previously, and files are commonly modified rather than copied. Plentiful disk space has lessened this concern by making practical backup software that takes point-in-time snapshots of the data, such as Microsoft's Volume Shadow Copy and Apple's Time Machine. These systems can provide multiple point-in-time copies of the data, enabling digital historians to see the iterations of a document leading to its final form.
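
Where multiple copies of an artifact do exist, a historian can confirm that they are bit-for-bit identical by comparing cryptographic checksums, one common way of verifying replicated data. The minimal sketch below does so with SHA-256; the file paths are hypothetical.

```python
# A minimal sketch: verifying that replicated copies of an artifact are
# bit-for-bit identical by comparing SHA-256 checksums. Paths are hypothetical.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):  # read in 64 KB chunks
            digest.update(chunk)
    return digest.hexdigest()

copies = [Path("server_a/report.doc"), Path("server_b/report.doc")]  # hypothetical copies
checksums = {path: sha256(path) for path in copies}
print("Copies match" if len(set(checksums.values())) == 1 else "Copies differ")
```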

These advances address a number of preservation issues that digital historians face with older digital artifacts. At the same time, the increased storage means that there are more artifacts for a historian to vet to find what they are looking for.

Virtualization and High Speed Networks

As personal computers have become more powerful, it has become more common for computers to act as hypervisors: physical hosts that run multiple virtual computer environments, each performing different tasks. A side effect of this change was a fundamental shift in institutional disaster recovery planning. Traditionally, disaster recovery plans involved a recovery contract with an organization such as SunGard, which would provide servers, storage, and tape devices; in an emergency, organizations would recreate their operating environment at the SunGard facility by restoring all of their data from backup tapes. This was time consuming and unreliable. With virtualized environments, this approach became unnecessary. For example, if a company had 150 physical servers, its recovery contract would be quite expensive and the recovery would take days or weeks. With virtualization, those 150 servers might reside on three physical hosts clustered to provide redundancy. With the smaller hardware requirements and the ability to fail over between physical devices, recovery contracts no longer make sense. For a considerably lower cost, organizations can lease space in a remote data center, keep a copy of the virtualized environment there, and fail over to it should a disaster strike. High-speed networks combined with increased storage capacity mean that data can be replicated almost instantly to the disaster recovery site using software such as Microsoft's Distributed File System or storage networking products such as NetApp's SnapMirror or EMC's RecoverPoint.

For digital historians, this shift means that copies of digital artifacts are also stored remotely and synchronized with iterative tracking software. In addition, since virtual machines exist as data files, entire copies of the computers themselves are kept in multiple locations, which theoretically increases the chance of artifact recovery.

Smart Phones, Ubiquitous Internet, and Software as a Service (SAAS)

With the introduction of smart phones with wireless Internet access, consumers of computer technologies have become increasingly detached from traditional computing models. Instead of working on a computer in a dorm room or office, computer users increasingly use tablets and phones to send emails, review web pages, and collaborate on documents. This change in work habits requires a change in where digital materials are kept. Documents stored on a hard drive in a dorm room are not accessible to a student working in the university library; likewise, documents on corporate servers are not readily available to executives traveling on business.

These remote working habits have been fostered by the introduction of the “Cloud,” an Internet-based Software as a Service model in which software vendors provide hosted software environments for their clients. Instead of hosting application servers locally, consumers pay a monthly fee and outsource software administration to the vendor. The SAAS model leverages the increase in storage to hold client data while using virtualization and high-speed networks to provide globally distributed, fault-tolerant application instances. The result is easy access from anywhere for the consumer, whether that consumer is an individual storing personal data on Google Drive or a company keeping its email and office documents on Microsoft 365.

For digital historians, this trend affects the future availability of digital artifacts considerably:

  • There will be more copies of digital artifacts. Unless accessed solely through a web client, storage services such as Google Drive and Microsoft's OneDrive create cached copies of the user's data on every machine on which the sync client is configured. Individuals who use the client on a tablet, work PC, home PC, and laptop will have copies on all of those machines as well as in the “Cloud.”
  • The formats of these sources will be actively maintained. Format obsolescence will not be an issue with email, for example, as cloud email vendors such as Microsoft are responsible for keeping these systems up to date. This eliminates concerns over backwards compatibility.
  • Corporations such as Google and Microsoft hold the audit trails and copies of all of these artifacts. With the growing adoption of this approach, these companies will have a large concentration of digital artifacts and the means by which to authenticate them. Internet-based resources require authentication and authorization for access to shared resources, and the access logs of these systems provide an extremely authoritative audit trail.
  • There are no traditional backups of these digital sources. Unless explicitly contracted to do so, cloud providers provide redundancy only by synchronizing data between many geographically distributed data centers. Deleted data is available only for the retention period specified in the hosting contract, after which point the artifacts are gone. That said, many cloud providers have service offerings that accommodate the compliance standards set by laws such as HIPAA and SOX.

The Internet-based SAAS model has changed the way society works and, in the process, addresses many concerns with the preservation and authentication of digital sources. However, unless these organizations choose to introduce terms of service that make these artifacts available to future historians, this concentration of data could pose a significant access issue. The cloud service providers have the data and the authentication information that digital historians will need to work with these sources, but there is no mechanism to give future historians access to these artifacts.

Conclusion

Born digital artifacts introduce complexity to the processes of preserving and authenticating sources. Technology is not a finite state but an ever-evolving process. As a result, digital historians need to understand how changes in technology affect their ability to access and work with born digital materials. Understanding the challenges presented by this complex intersection of technology and history is important for historians of late twentieth- and twenty-first-century digital culture, who will need and want to authenticate artifacts, sources, and texts as they would other, non-digital material.


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.