What is the GitHub Archive Program?
The GitHub Archive Program is a GitHub initiative to ensure that open source software is preserved for future generations. It includes both very-long-term storage such as the GitHub Arctic Code Vault and ongoing backups of public repositories by nonprofit organizations such as Software Heritage and the Internet Archive.
What are the “Greatest Hits”?
The “Greatest Hits” are archives of GitHub’s 17,000 most-popular and most-depended-upon repositories, written to hardened film designed to last for 1,000 years, enclosed in beautiful museum-quality cases, and donated to three of the world’s great libraries, on three different continents.
Which repositories are in the Greatest Hits?
A list of all of them is available here.
How were the Greatest Hits repositories chosen?
A combination of popularity (star count) and dependency extent. The selection was purely algorithmic; GitHub did not and does not pass judgement on the significance or quality of the selected repositories.
What is the GitHub Arctic Code Vault?
The Arctic Code Vault is a snapshot of every active public repository on GitHub. These millions of repos were written to the same 1,000-year hardened film as the Greatest Hits and stored in the Arctic World Archive, a very-long-term storage facility in a decommissioned coal mine in Svalbard, Norway.
What public repositories are archived in the Arctic Code Vault?
On February 2, 2020, we took a snapshot of all of GitHub’s public repositories that had been active in recent months. The archive includes every repo with any commits between the announcement at GitHub Universe on November 13, 2019 and February 2, 2020; every repo with at least 1 star and any commits in the year before the snapshot (February 2, 2019 to February 2, 2020); and every repo with at least 250 stars. It also includes gh-pages for any repository that meets these criteria.
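The three inclusion rules above can be read as a simple predicate. The following is a minimal sketch in Python, not GitHub's actual selection code. The `Repo` type, its field names, and the `in_arctic_snapshot` function are hypothetical. It also assumes that the date ranges are inclusive at both ends and that `last_commit` holds the date of the most recent commit made on or before the snapshot date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Dates taken from the answer above.
ANNOUNCEMENT = date(2019, 11, 13)   # GitHub Universe announcement
YEAR_BEFORE = date(2019, 2, 2)      # one year before the snapshot
SNAPSHOT = date(2020, 2, 2)         # day the snapshot was taken


@dataclass
class Repo:
    """Hypothetical summary of a public repository at snapshot time."""
    stars: int
    # Date of the most recent commit on or before SNAPSHOT, or None if
    # the repository has no commits (assumption, see lead-in).
    last_commit: Optional[date]


def in_arctic_snapshot(repo: Repo) -> bool:
    """Return True if the repo meets any of the three stated criteria."""
    # Rule 3: at least 250 stars, regardless of recent activity.
    if repo.stars >= 250:
        return True
    if repo.last_commit is None or repo.last_commit > SNAPSHOT:
        return False
    # Rule 1: any commit between the announcement and the snapshot.
    if repo.last_commit >= ANNOUNCEMENT:
        return True
    # Rule 2: at least 1 star and any commit in the year before the snapshot.
    return repo.stars >= 1 and repo.last_commit >= YEAR_BEFORE
```

For example, under these assumptions a repository with no stars whose last commit was in December 2019 qualifies, while one with no stars whose last commit was in June 2019 does not.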
What public repositories are getting archived in the partner archives in the Archive Program?
All active public repositories on GitHub are continuously archived by the organizations that take part in the GitHub Archive Program.
How do I create something new to ensure it’s added to the archive?
Create a new public repository, and add the software or content you’d like to have archived.
What does it mean for my project to be archived?
There are multiple layers to the GitHub Archive Program. First, community efforts like GH Archive and GHTorrent provide “hot storage,” archiving GitHub events (like new Issues, updated pull requests, and much more) in near real time. Then, we work with partners such as the Internet Archive and the Software Heritage Foundation. These layers of “warm storage” crawl every public repository, complete with its history, issues, pull requests, and other metadata, archive them, and update them on the interval required by their layer (days, months, or years).
In addition to the warm layers, your repository will be stored in the GitHub Arctic Code Vault at the Arctic World Archive in Svalbard. The GitHub Arctic Code Vault stores the latest revision of your project’s default branch (potentially excluding large binary files depending on the overall size of the repository) at the time of capture. This “cold storage” is designed to last for 1,000 years.
What if I want to opt out?
GitHub will archive only public repositories, so opting out is as simple as making your repository private. Alternatively, go to your repository’s Settings tab, scroll down to the “Data Services” section, and unselect the Preserve this repository checkbox. Please visit https://support.github.com/ for all other opt-out concerns.
Can I access my repository as part of each archive?
For the warm storage, you can access the archived copies of your repositories by going to the GitHub Archive Program site and following the content from each partner.
Due to the nature of the GitHub Arctic Code Vault and other cold storage, it is not possible for the public to access the code stored there. However, if you go to the GitHub Archive Program website once the data is stored, we will tell you how to figure out where your data is stored.
How do you handle repositories with dependencies, like libraries? What about build systems? Will you archive everything needed to build my repository?
We will archive the code at the state of HEAD on the default branch of your repository. If you include your dependencies within your repository, those will be included (with the exception of large binary files). The Tech Tree (see below) will also describe the importance of dependencies and how to locate dependencies within the various languages.
If your project’s dependencies are open source projects on GitHub, they will also be in the archive alongside your code; otherwise, you need to add them into your repository or create a mirror on GitHub.
What happens if I find a bug or vulnerability in my project after it has been archived?
For the hot and warm archives, your code will be updated by each archive to include any fixes on their next refresh cycle. However, due to the nature of cold archives, your repository is stored there in the state it was on February 2, 2020.
Are the repositories in the GitHub Arctic Code Vault at the Arctic World Archive ever updated?
We plan to evaluate the program, and the state of the art of archival technology, every five years. Depending on the results of each evaluation, we may then decide to take another snapshot of GitHub’s public code and archive it in cold storage.
If I delete my repository from GitHub, will it eventually be deleted from all warm storage partners?
Keeping a historic view is an important part of each archive. If you have a concern about your repository continuing to be a part of the archive, please contact the archives. For the GitHub Arctic Code Vault, we are unable to remove data that has already been stored.
Can you explain how GDPR works for repositories in the cold or warm storage, for example for personal data in the git commit message (author name and email)? Have you reviewed this with your legal team and cleared all open source projects of any liabilities?
The cold storage contains only a snapshot of the code, so individual commit messages are not captured, although each repository also has a list of contributors appended to it. Warm storage contains more thorough information, but archives have a special legal status under GDPR which protects them. GitHub’s Legal Team has approved the Archive Program.
Can I visit GitHub Arctic Code Vault at the Arctic World Archive?
Unfortunately, in order to maintain the security of the archive, the Arctic World Archive is unable to provide tours, and it is likewise not possible to visit the Global Seed Vault. We invite you to watch our video announcing the GitHub Arctic Code Vault for more information.
It is, however, possible to tour Mine 3 in Longyearbyen, Svalbard, the coal mine in which the AWA is set; tours can be booked by visiting visitsvalbard.com. When production was stopped in Mine 3 in November 1996, all the machinery and equipment were left behind in the facility. You can expect an authentic tour, walking 250 m into one of the main tunnels of the mine and seeing the origins of the first seed bank in one of the side tunnels.
How is the Arctic World Archive protected against climate change, natural disasters, or outside access?
The Arctic World Archive is located within a decommissioned mine 104 meters above sea level. It is 350 meters deep inside permafrost, and the mine tunnel is built with an incline and decline such that water will naturally flow out of the mine. The archive chamber has been reinforced with steel rebar and does not require electricity to operate. The mine is still being maintained and secured by Store Norske in cooperation with the Norwegian Government and the Arctic World Archive. The mine itself has multiple levels of physical access control before reaching the archive, which has additional physical protections.
Why does GitHub need an archive program? Doesn’t GitHub already have multiple data centers and protection against downtime and server crashes?
GitHub’s backup strategy is meant to protect your day-to-day work and code within GitHub. This includes multiple levels of data centers, redundancy, and backup systems to keep your data safe and secure. The goal of the GitHub Archive Program is to provide access for academic research and humanitarian interests in the state of public software, both in the near term via the warm layer and in the long term via the cold layer.
What is the “Tech Tree”? Why is the “Tech Tree” necessary?
The Tech Tree gives an overview of the archive, describes the structure of the archive, how to use it and extract projects, directories, files, and data formats, and how to use dependencies. It is essentially a quick start into software development and computing bundled with a user guide for the archive. It will also include elements of the Long Now Foundation’s “Manual for Civilization.”
To build the Tech Tree, we are working with open source community maintainers, archivists, archeologists, anthropologists, sociologists, and more from professional and academic backgrounds.