Recommended Web Archive Formats
This format specification covers our preferred formats for archived web content, or web archives. We are aware that websites, including blogs, social media, and the other web content that makes up websites, are created and presented in formats intended for viewing in a web browser, and that these formats often differ from the standard format recommended for preservation and long-term access. Because the focus of this document is preservation and long-term access, the following format preferences favor those outcomes.
Formats
- Preferred
- The Library, and other organizations involved in web archiving, are preserving web content in the Web Archive (WARC) format using record-at-a-time GZIP compression, as described in Appendix A of the WARC Standard.
- Acceptable
- Internet Archive’s ARC_IA format, a precursor to the WARC format
- Web Archive Collection Zipped (WACZ), as used in the Webrecorder project
- CDX as a component file for WARC file content
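As an illustration of what record-at-a-time GZIP compression means in practice, the following Python sketch compresses each record as its own GZIP member and concatenates the members into one file. It uses only the standard library; the helper names (`build_record`, `write_compressed`, `read_record_at`) and the example URIs are hypothetical, and the record contains only a minimal set of header fields. It is a sketch of the idea, not a complete or validated WARC writer.

```python
import gzip
import uuid
import zlib
from datetime import datetime, timezone


def build_record(warc_type: str, target_uri: str, payload: bytes) -> bytes:
    """Serialize one minimal, illustrative WARC record.

    Layout: version line, named header fields, blank line, block,
    then two CRLFs that terminate the record.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        f"WARC-Type: {warc_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8")
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


def write_compressed(records: list) -> tuple:
    """Compress each record as a separate GZIP member and concatenate them.

    Returns the file bytes and the byte offset at which each member starts.
    """
    out = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(out))
        out += gzip.compress(record)  # one complete GZIP member per record
    return bytes(out), offsets


def read_record_at(data: bytes, offset: int) -> bytes:
    """Decompress only the GZIP member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a GZIP header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; any later members
    # are left untouched in decompressor.unused_data.
    return decompressor.decompress(data[offset:])


if __name__ == "__main__":
    records = [
        build_record("resource", "http://example.com/a", b"first page"),
        build_record("resource", "http://example.com/b", b"second page"),
    ]
    data, offsets = write_compressed(records)
    # Random access: read the second record without touching the first.
    print(read_record_at(data, offsets[1]).decode("utf-8"))
```

Because every record is an independent GZIP member, a reader that knows a record's byte offset can decompress that record alone. This is what makes an offset index such as a CDX file useful alongside compressed WARC files.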
Delivery method
- Preferred
- Capture using tools that produce non-proprietary output, to conform with standard formats and requirements
- Acceptable
- Transmission of WARC or ARC_IA files created by web content producers or other archiving organizations.
Metadata
- Preferred
- Refer to the WARC ISO-standard specification for mandatory and recommended metadata fields
- When displaying archived content, the following should be clearly indicated: the archiving institution, the date and time of capture, and a statement about functionality within the archive that distinguishes the archived copy from the live site.
- Acceptable
- ARC_IA files should be named in a manner that easily identifies the archiving institution (see the WARC standard for recommended naming conventions).
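A file name that identifies the archiving institution can be generated programmatically. The Python sketch below assumes a `Prefix-Timestamp-Serial-Crawlhost` pattern of the kind described in the naming guidance of the WARC standard; the exact pattern should be confirmed against the standard itself. The function name `archive_filename`, the five-digit serial width, and the example prefix and host are hypothetical choices made for illustration.

```python
from datetime import datetime, timezone


def archive_filename(prefix: str, serial: int, crawl_host: str,
                     when: datetime, extension: str = "warc.gz") -> str:
    """Build a file name of the form Prefix-Timestamp-Serial-Crawlhost.ext.

    prefix     -- short label identifying the institution or project (assumed)
    serial     -- increasing sequence number within the crawl
    crawl_host -- name of the machine that produced the file
    when       -- timezone-aware creation time, converted to UTC
    """
    # 14-digit UTC timestamp: YYYYMMDDhhmmss
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Zero-padding the serial to five digits is an assumption of this sketch.
    return f"{prefix}-{timestamp}-{serial:05d}-{crawl_host}.{extension}"


if __name__ == "__main__":
    created = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    print(archive_filename("EXAMPLE", 1, "crawler01.example.org", created))
    print(archive_filename("EXAMPLE", 2, "crawler01.example.org", created,
                           extension="arc.gz"))
```

Placing the institutional prefix first keeps files from one organization grouped together when sorted by name, and the timestamp and serial number make accidental duplicate names unlikely.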
Technological measures
- Preferred
- Currently available tools cannot capture all web content, so certain types of web content may not be preservable through web capture at this time. These include rich multimedia content, streaming media, deep web content, and databases.
Referencing
- Preferred
- Web materials in any web archive can be referred to persistently using the URN Namespace Registration for Persistent Web IDentifiers (PWID).
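To show what such a reference might look like, the Python sketch below assembles a PWID string from its parts. It assumes the general form `urn:pwid:<archive-id>:<archival-time>:<precision>:<archived-item>` with a UTC timestamp; the authoritative syntax and the permitted precision values are defined in the PWID URN namespace registration and should be checked there. The function name `make_pwid` and the example archive identifier and URL are hypothetical.

```python
from datetime import datetime, timezone


def make_pwid(archive_id: str, captured_at: datetime,
              precision: str, archived_item: str) -> str:
    """Assemble a PWID URN from its components (assumed syntax).

    archive_id    -- identifier of the web archive holding the material
    captured_at   -- timezone-aware capture time, converted to UTC
    precision     -- what is being referenced, e.g. a single page
    archived_item -- the archived URI or other item identifier
    """
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"urn:pwid:{archive_id}:{timestamp}:{precision}:{archived_item}"


if __name__ == "__main__":
    captured = datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc)
    print(make_pwid("archive.example.org", captured, "page",
                    "http://www.example.com/"))
```

The identifier carries the archive, the capture time, and the archived item together, so a reference of this kind does not depend on the access URL pattern of any one replay service.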