Recommended Web Archive Formats

This specification covers our preferred format for archived web content, or web archives. Websites, including blogs, social media, and the other web content they comprise, are created and presented in formats intended for viewing in a web browser, and these often differ from the standard format recommended for preservation and long-term access. Because this document is concerned with preservation and long-term access, the format preferences below favor those outcomes.

Formats

Preferred
The Library and other organizations involved in web archiving preserve web content in the Web ARChive (WARC) format using record-at-a-time GZIP compression, as described in Appendix A of the WARC Standard.
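
As an illustration of what record-at-a-time compression means in practice, the following is a minimal sketch in Python using only the standard library. Each WARC record is compressed as its own GZIP member and the members are concatenated, so a reader can decompress a single record without touching the rest of the file. The helper names (build_record, write_record_at_a_time, read_members) and the example URIs are hypothetical and chosen for this sketch; production tools should follow the WARC Standard itself.

```python
import gzip
import uuid
import zlib
from datetime import datetime, timezone


def build_record(warc_type: str, payload: bytes, target_uri: str) -> bytes:
    """Serialize one simplified WARC/1.0 record: header lines, a blank
    line, the content block, then two CRLFs that end the record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        f"WARC-Type: {warc_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def write_record_at_a_time(records: list) -> bytes:
    """Compress every record as a separate GZIP member and concatenate."""
    return b"".join(gzip.compress(record) for record in records)


def read_members(data: bytes) -> list:
    """Split concatenated GZIP members back into the original records."""
    records = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=31)
        records.append(member.decompress(data))
        # Bytes after the end of this member belong to the next one.
        data = member.unused_data
    return records


if __name__ == "__main__":
    recs = [
        build_record("resource", b"first capture", "http://example.org/a"),
        build_record("resource", b"second capture", "http://example.org/b"),
    ]
    blob = write_record_at_a_time(recs)
    print(len(read_members(blob)), "records recovered")
```

Because a sequence of GZIP members is itself a valid GZIP stream, a file written this way can still be decompressed in one pass by ordinary GZIP tools.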
Acceptable

Delivery method

Preferred
Capture using tools that produce non-proprietary output conforming to standard formats and requirements.
Acceptable
Transmission of WARC or ARC_IA files created by web content producers or other archiving organizations.

Metadata

Preferred
  1. Refer to the WARC ISO-standard specification for mandatory and recommended metadata fields
  2. When displaying archived content, the following should be clearly indicated: the archiving institution, the date and time of capture, and statements about functionality within the archive that distinguish it from the live site.
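
To make item 1 concrete, here is a small, hedged sketch that checks a record's header block for the four named fields the WARC Standard requires on every record (WARC-Record-ID, Content-Length, WARC-Date, WARC-Type). The function names are hypothetical, and the parser is deliberately simplified: it assumes one field per line and does not handle continuation lines.

```python
# Named fields that the WARC Standard requires on every record.
MANDATORY_FIELDS = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")


def parse_warc_headers(header_block: bytes) -> dict:
    """Parse a WARC header block into a dict keyed by lowercased field name.

    Simplified sketch: one field per line, no continuation-line support.
    """
    lines = header_block.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("header block does not start with a WARC version line")
    fields = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so normalize before storing.
        fields[name.strip().lower()] = value.strip()
    return fields


def missing_mandatory_fields(header_block: bytes) -> list:
    """Return the mandatory field names absent from the header block."""
    fields = parse_warc_headers(header_block)
    return [name for name in MANDATORY_FIELDS if name.lower() not in fields]


if __name__ == "__main__":
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: 0\r\n"
        b"\r\n"
    )
    print(missing_mandatory_fields(sample))
```

A check like this only confirms that the fields are present; it does not validate their values or cover the recommended fields, for which the standard remains the reference.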
Acceptable
The ARC_IA file should be named in a manner that easily identifies the archiving institution (see the WARC standard for recommended naming conventions).

Technological measures

Preferred
Currently available tools cannot capture all web content, so certain types of web content may not be preservable through web capture at this time. These include rich multimedia content, streaming media, deep web content, and databases.

Referencing

Preferred
Web materials in any web archive can be referred to persistently using the URN Namespace Registration for Persistent Web IDentifiers (PWID).
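
As a rough illustration, a PWID can be assembled from four parts: an identifier for the archive, the UTC time of capture, a precision specifier, and the archived item (typically its URI). The sketch below assumes the layout urn:pwid:archive-id:archival-time:precision-spec:archived-item and a particular set of precision values; both are assumptions of this sketch and should be checked against the PWID namespace registration. The function name make_pwid is hypothetical.

```python
from datetime import datetime, timezone

# Assumed set of precision specifiers; verify against the PWID registration.
PRECISION_SPECS = {
    "part", "page", "subsite", "site",
    "collection", "recording", "snapshot", "other",
}


def make_pwid(archive_id: str, archival_time: datetime,
              precision: str, archived_item: str) -> str:
    """Build a PWID URN string from its four components (assumed layout)."""
    if precision not in PRECISION_SPECS:
        raise ValueError(f"unknown precision specifier: {precision!r}")
    if archival_time.tzinfo is None:
        raise ValueError("archival_time must be timezone-aware")
    # Normalize the capture time to UTC with a trailing 'Z'.
    stamp = archival_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_id}:{stamp}:{precision}:{archived_item}"


if __name__ == "__main__":
    captured = datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc)
    print(make_pwid("archive.org", captured, "page", "http://www.dr.dk"))
```

Requiring a timezone-aware timestamp avoids silently treating local times as UTC, which would produce an identifier that points at the wrong capture.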