GitHub Archive Program

Preserving open source software for future generations

Get your code into the GitHub Arctic Code Vault

02/02/2020

The world is powered by open source software.

It is a hidden cornerstone of modern civilization, and the shared heritage of all humanity. The mission of the GitHub Archive Program is to preserve open source software for future generations.

GitHub is partnering with the Long Now Foundation, the Internet Archive, Software Heritage, Arctic World Archive, Microsoft Research, the Bodleian Library, and Stanford Libraries to ensure the long-term preservation of the world's open source software. We will protect this priceless knowledge by storing multiple copies, on an ongoing basis, across various data formats and locations, including a very-long-term archive designed to last at least 1,000 years.

Partners

  • Internet Archive
  • Software Heritage
  • The Long Now
  • Piql
  • Stanford University
  • Bodleian Libraries
  • GHTorrent
  • GH Archive
  • Microsoft Research

Our Approach

Library of Alexandria

Hundreds of thousands of texts, comprising an enormous amount of classic literature, science, and culture, were lost with the Library of Alexandria.

Saturn V

After the Challenger disaster, a hunt for the blueprints of the abandoned Saturn V rocket ensued. They were largely recovered, thanks to the work of archivists.

Why we use multiple forms of storage

As today’s vital code becomes yesterday’s historical curiosity, it may be abandoned, forgotten, or lost. Worse, albeit much less likely, in the case of global catastrophe, we could lose everything stored on modern media in a few generations. Archiving software across multiple organizations and forms of storage will help ensure its long-term preservation: online archivists call this "LOCKSS," for Lots Of Copies Keeps Stuff Safe.
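The "lots of copies" idea can be illustrated with a small sketch: when several parties hold what should be the same file, a majority vote over checksums points to the copies that have been damaged. This is an illustration of the principle only, not a description of how the Archive Program or its partners audit their holdings; the function name and the holder names are hypothetical.

```python
import hashlib
from collections import Counter


def find_damaged_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the majority SHA-256 digest and the holders that disagree with it."""
    digests = {
        holder: hashlib.sha256(data).hexdigest()
        for holder, data in copies.items()
    }
    # The digest shared by the most holders is treated as the good copy.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(h for h, d in digests.items() if d != majority_digest)
    return majority_digest, damaged


# Three hypothetical holders; one copy has suffered simulated bit rot.
copies = {
    "archive_a": b"print('hello')\n",
    "archive_b": b"print('hello')\n",
    "archive_c": b"print('hellp')\n",
}
majority, damaged = find_damaged_copies(copies)
print(damaged)  # ['archive_c']
```

With only two copies a disagreement cannot be resolved by voting, which is one reason more independent copies are better.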

A worrying amount of the world's knowledge is currently stored on ephemeral media: hard drives, SSDs, CDs that are good for a few decades, and backup tapes whose notional 30-year lifespans assume strictly controlled heat and humidity. Because some hardware can be much longer-lived, there exists a range of possible futures in which working modern computers exist but their software has largely been lost to bit rot. The GitHub Archive Program will include much longer-term media to address the risk of data loss over time.

How the future might use our code

Future historians will be able to learn about us from open source projects and metadata. They might regard our age of open source ubiquity, volunteer communities, and Moore’s Law as historically significant. We are already partnering with Stanford Libraries to help archive curated repositories along with the cultural and other context in which they are set, as key elements of wide-ranging historical and social research and analysis.

Because hardware, especially older machines and those with mask ROM, can be much longer-lived than most of today’s storage media, there exists a range of possible futures in which working modern computers exist but their software has largely been lost to bit rot. The Archive Program will preserve that software.

Even in the near future, storing data with multiple partners provides options to people whose access might otherwise be restricted. If GitHub were to become unavailable in any location, for example due to an internet routing issue, those affected could access public code for their projects using the Internet Archive and Software Heritage.

There is a long history of lost technologies from which the world would have benefited, as well as abandoned technologies that found unexpected new uses: Roman concrete, the anti-malarial DFDT, and the mothballed Saturn V blueprints that were hunted down after the Challenger disaster. It is easy to envision a future in which today’s software is seen as a quaint and long-forgotten irrelevancy, until an unexpected need for it arises. Like any backup, the GitHub Archive Program is also intended for currently unforeseeable futures.

Pantheon

Rome’s 1800-year-old Pantheon remains the largest unreinforced concrete dome in the world, thanks to Roman concrete, a technology whose properties were only rediscovered in 2014.

Pace Layers

A flexible, durable strategy for archiving code

We've adopted a “pace layers” strategy for archiving code, inspired by Long Now founder Stewart Brand. This approach is designed to maximize both flexibility and durability by providing a range of storage solutions, from real-time to long-term storage. The Archive Program is partitioned into three tiers: hot, warm, and cold.

  • Hot: Near real-time
  • Warm: Updated monthly to yearly
  • Cold: Updated every 5+ years

GitHub

On every push to GitHub, we replicate your Git data to multiple datacenters around the world. Additionally, we store backups of Git data, Issues, Pull Requests, and all of your data on GitHub in multiple locations. All of this data is available live via the GitHub API.

GHTorrent

GHTorrent monitors the GitHub public event timeline, archives those events, and recursively crawls and archives their contents and dependencies. Those archives are made available for download on a daily or monthly basis.

GH Archive

GH Archive monitors the GitHub public event timeline, archives those events, and makes them queryable using BigQuery. You can also download snapshots by hour, day, or month.
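As a rough sketch of what downloading hourly snapshots could look like, the helper below builds one URL per hour of a given day. The base URL and the file-name pattern (date, then an unpadded hour, then .json.gz) are assumptions based on GH Archive's public documentation, not something this page specifies, and the function names are hypothetical; check the GH Archive site before relying on them.

```python
from datetime import datetime, timedelta

# Assumption: hourly files live at <BASE_URL>/YYYY-MM-DD-H.json.gz (hour not zero-padded).
BASE_URL = "https://data.gharchive.org"


def hourly_archive_url(hour: datetime) -> str:
    """Build the assumed URL of the snapshot covering the given hour."""
    return f"{BASE_URL}/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def urls_for_day(day: datetime) -> list[str]:
    """Return the 24 assumed hourly snapshot URLs for one calendar day."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return [hourly_archive_url(start + timedelta(hours=h)) for h in range(24)]


for url in urls_for_day(datetime(2020, 2, 2))[:3]:
    print(url)
```

Each file would then be fetched with any HTTP client and read as gzip-compressed, newline-delimited JSON events.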

Internet Archive

The Internet Archive’s well-known Wayback Machine will crawl GitHub’s public repositories—including new repositories, issues, pull requests, wikis, and more—and store copies on hard drives in San Francisco and other locations. These archives will be publicly available via git and https.

Software Heritage

Software Heritage will crawl GitHub on a regular basis and add its public repos to their archive, to which they provide public API access.

Bodleian Library

Oxford University’s Bodleian Library will provide redundancy for the Arctic Code Vault by keeping GitHub’s 10,000 most-starred and most-depended-upon repositories in their depository as duplicate Piql film reels.

Arctic World Archive

On February 2, 2020, GitHub will capture a snapshot of every active public repository, to be preserved in the GitHub Arctic Code Vault. This data will be stored on 3,500-foot film reels, provided and encoded by Piql, a Norwegian company that specializes in very-long-term data storage. The film technology relies on silver halides on polyester. This medium has a lifespan of 500 years as measured by the ISO; simulated aging tests indicate Piql’s film will last twice as long.

Project Silica from Microsoft Research

The GitHub Archive Program is partnering with Microsoft’s Project Silica to ultimately archive all active public repositories for over 10,000 years, by writing them into quartz glass platters using a femtosecond laser.

The GitHub Arctic Code Vault

The GitHub Arctic Code Vault is a data repository preserved in the Arctic World Archive (AWA), a very-long-term archival facility 250 meters deep in the permafrost of an Arctic mountain. The archive is located in a decommissioned coal mine in the Svalbard archipelago, closer to the North Pole than the Arctic Circle. GitHub will capture a snapshot of every active public repository on 02/02/2020 and preserve that data in the Arctic Code Vault.

How the cold storage will last 1,000 years

Svalbard is regulated by the international Svalbard Treaty as a demilitarized zone. Home to the world’s northernmost town, it is one of the most remote and geopolitically stable human habitations on Earth.

The AWA is a joint initiative between Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and very-long-term digital preservation provider Piql AS. AWA is devoted to archival storage in perpetuity. The film reels will be stored in a steel-walled container inside a sealed chamber within a decommissioned coal mine on the remote archipelago of Svalbard. The AWA already preserves historical and cultural data from Italy, Brazil, Norway, the Vatican, and many others.

While Svalbard is affected by climate change, warming is likely to affect only the outermost few meters of permafrost in the foreseeable future and is not expected to threaten the stability of the mine. The mine’s proximity to the famous Global Seed Vault, only a mile away, reinforces Svalbard’s status as a stable, very-long-term archive site for humanity’s collective knowledge.

What’s in the 02/02/2020 snapshot

The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will sweep up every active public GitHub repository, in addition to significant dormant repos as determined by stars, dependencies, and an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, excluding any binaries larger than 100 KB. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.
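The packaging rule described above can be sketched in a few lines. This is a minimal illustration, not GitHub's actual tooling: it assumes the default branch has already been checked out to a local directory, treats "100 KB" as 100 × 1024 bytes, and uses a common heuristic (a NUL byte in the first 8 KB) to decide whether a file is binary. All names here are hypothetical.

```python
import tarfile
from pathlib import Path

# Assumption: "100 KB" means 100 * 1024 bytes; the page does not say which unit is meant.
MAX_BINARY_BYTES = 100 * 1024


def looks_binary(path: Path, sample_size: int = 8192) -> bool:
    """Heuristic (an assumption): a NUL byte in the first 8 KB marks a file as binary."""
    with path.open("rb") as fh:
        return b"\0" in fh.read(sample_size)


def package_checkout(checkout: Path, tar_path: Path) -> list[str]:
    """Write the checkout into one TAR file, skipping large binaries.

    Returns the relative paths that were skipped. tar_path should live
    outside the checkout so the archive does not try to include itself.
    """
    skipped = []
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(checkout.rglob("*")):
            rel_parts = path.relative_to(checkout).parts
            # Only regular files are archived; Git's own metadata is left out.
            if not path.is_file() or ".git" in rel_parts:
                continue
            rel = "/".join(rel_parts)
            if path.stat().st_size > MAX_BINARY_BYTES and looks_binary(path):
                skipped.append(rel)
                continue
            tar.add(path, arcname=rel)
    return skipped
```

Calling package_checkout(Path("my-repo"), Path("my-repo.tar")) would produce the archive and return the list of skipped files, which could feed the kind of index the guide describes.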

Tools for the future

How we’re ensuring the future can use our software

Voyager Golden Record

We’re convening a GitHub Archive Program advisory panel, including experts in anthropology, archaeology, history, linguistics, archival science, futurism, and more, to advise us on what content should be included in the archive and how to best communicate with its inheritors.

A thousand years is a very long time. Ancient ruins such as Angkor Wat, Great Zimbabwe, and Machu Picchu had not yet been built a thousand years ago. Nevertheless, we can consider and plan for a broad range of possibilities over the next 1,000 years. This program builds on the best ideas we have today.

The introduction to the archive will include technical guides to QR decoding, file formats, character encodings, and other critical metadata so that the raw data can be converted back into source code for use by others in the future. The archive will also include a Tech Tree—a roadmap and Rosetta Stone for future curious minds inheriting the archive’s data.

An overview of the archive and how to use it, the Tech Tree will serve as a quickstart manual on software development and computing, bundled with a user guide for the archive. It will describe how to work backwards from raw data to source code and extract projects, directories, files, and data formats.

Inspired by (and including elements of) the Long Now’s Manual for Civilization, the archive will also include information and guidance for applying open source, with context for how we use it today, in case future readers need to rebuild technologies from scratch. Like the golden records of Voyager 1 and 2, it will help to communicate the story of our world to the future.

In the range of possible futures in which humanity has working modern computers, but no software to run on them, the archive and its Tech Tree could be extremely valuable. However, the value is more likely to be historical, perhaps ensuring that today’s technology is not lost by a tomorrow that carelessly considers it irrelevant—until an unexpected use for our software is discovered.

Long Now's Manual for Civilization

Pioneer plaque

Archive Program Advisors

Guidance from experts in technology and the humanities.

  • Shannon Lee Dawdy

    Archaeologist / Anthropologist / Historian

  • Brewster Kahle

    Internet Archive

  • John McWhorter

    Linguist

  • Alexander Rose

    Executive Director, Long Now Foundation

  • Ada Palmer

    Historian / Science Fiction Author

  • Hussein Bassir

    Archaeologist / Egyptologist / Director of the Antiquities Museum at the Library of Alexandria

  • Christine Moran

    Computational Astrophysicist / Security Engineer

Our primary mission is to preserve open source software for future generations. We also intend the GitHub Archive Program to serve as a testament to the importance of the open source community. It’s our hope that it will, both now and in the future, further publicize the worldwide open source movement; contribute to greater adoption of open source and open data policies worldwide; and encourage long-term thinking.