From "Make a local backup mirror of a third-party website" to security awareness

xavi · 19 September 2025 09:19

In 2025, we started out a trifle naively with "Make a local backup mirror of a third-party website with wget and create a library that can be used outside the Internet".

Since several people have started practicing this “offline first” approach, questions have arisen, as well as issues related to the security of people copying websites and then physically “carrying” this information with them.

So I reached out to some savvy people and NGOs to start putting together a specific safety guide for this activity.

Here’s the very first contribution, by Rysiek. I’ll follow up day after day in the comments below this primer post. Feel free to jump into this topic too.

You need to think of three things: Confidentiality, Integrity, Availability.
This is the “information security triad”. These are separate concerns, but
they are tied and in tension. For example, something that needs to be strictly
confidential will necessarily be less available than something that needs to
be public.

These three concerns will have different weights, different importance
depending on the dataset, on who you are archiving for, the source of the
data, and so on.

In the case of archiving publicly available sites/data, my guess is that
Integrity will often be the most important concern.

Is the dataset complete?

Has it not been tampered with? Has it not been damaged in some way (say, by read/write errors on the disk, or in transmission)?

How many independent locations is it backed up at?

How is the integrity of each of these copies verified?

A lot of this is basically being a good librarian. Cataloguing data, checking
and re-checking that data is where it should be and in a state it needs to be
in.
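
To make the "checking and re-checking" part concrete, here is a minimal sketch of a checksum manifest: record a SHA-256 hash for every file in a mirror once, then compare any copy against that record later. The function names (`sha256_of`, `build_manifest`, `compare`) are made up for illustration and are not taken from any existing tool; the choice of SHA-256 is also an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(expected, root):
    """Compare a previously recorded manifest against the files now under root."""
    actual = build_manifest(root)
    return {
        # Recorded earlier but no longer present: the copy is incomplete.
        "missing": sorted(set(expected) - set(actual)),
        # Present now but never recorded: something was added.
        "unexpected": sorted(set(actual) - set(expected)),
        # Present in both but with a different hash: damaged or modified.
        "changed": sorted(
            name
            for name in expected.keys() & actual.keys()
            if expected[name] != actual[name]
        ),
    }
```

One way to use it: call `build_manifest("mirror/")` right after the download, keep the result somewhere other than the mirror itself, and run `compare(saved_manifest, "mirror/")` on each backup location from time to time. A manifest stored next to the data only catches accidental damage; to notice deliberate tampering it has to live (or be signed) somewhere the person tampering cannot also change.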

Other, non-integrity-related concerns might include:

Are you okay with the original website operators knowing they are being scraped at all?

Or that they are being scraped by you specifically?

Are we not overloading or inconveniencing them in some way? That could lead to them blocking us just for that reason.

How do we want to make the data available, and to whom?

Do we want to consider publicly mirroring the data through systems like BitTorrent or IPFS? And so on, and so forth…
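
On the question of not overloading or inconveniencing the original site, one possible starting point is to read its `robots.txt` before mirroring and to wait between requests. The sketch below only plans the fetches and does not download anything; the helper names, the `offline-mirror-sketch` user agent and the 2-second default are all assumptions chosen for illustration.

```python
from urllib import robotparser


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was fetched beforehand."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, default_delay=2.0):
    """Seconds to wait between requests: the site's Crawl-delay if it
    declares one that is longer than our own default, else the default."""
    declared = parser.crawl_delay(user_agent)
    if declared is None:
        return default_delay
    return max(default_delay, float(declared))


def plan_fetches(parser, user_agent, urls, default_delay=2.0):
    """Keep only the URLs robots.txt allows, each paired with the delay
    to sleep before requesting it."""
    delay = polite_delay(parser, user_agent, default_delay)
    return [(url, delay) for url in urls if parser.can_fetch(user_agent, url)]
```

When mirroring with wget itself, its `--wait` and `--limit-rate` options serve a similar purpose. Whether to follow `robots.txt` at all is a judgment call that depends on why the site is being archived; the sketch only shows how the rules could be read.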