French version: here
The threats we face are increasingly numerous and violent, and the information we can access is increasingly compromised. All the more so when we depend on companies or public institutions for critical services. There is also the growing intensity of hazards linked to climate disruption.
One way of mitigating these risks is to make back-ups, which can then be used to restore websites, and to put these back-ups where few people and organisations have access to them; in other words, in our own local storage space. Among a number of strategies and technical approaches, we can opt for mirroring.
Mirror sites or mirrors are replicas of other websites. The concept of mirroring applies to network services accessible through any protocol, such as HTTP or FTP. Such sites have different URLs than the original site, but host identical or near-identical content.
— Wikipedia, 'Mirror site'
A mirror copy is an identical duplicate of the original made at a point in time T. We can add changes made to the original to our mirror copy at a later date, or make a new mirror copy to compare two versions. Mirror copies can help restore deleted or defaced websites; a mirror copy made on a device can be used to access the information from that device without an internet connection.
An example of when `wget` isn't needed for mirroring: the website you want to back up has open source code and is free to access. In this case, it makes more sense to clone the source code repository and update the clone afterwards, as sketched below.
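For instance, a minimal sketch with a hypothetical repository URL:

$ git clone https://codeberg.org/example/website.git
$ cd website && git pull   # run the pull later, to update the clone with upstream changes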
There is a plethora of documentation, blog posts, tutorials and videos on mirroring a website. Let’s just say that this blog post is part of a ‘learning by repetition’ approach.
`wget` can run on Windows, macOS, Linux, Android and other operating systems.
If this approach is too technically difficult for you, send me a discreet message by whatever means suits you and I’ll be happy to chat, help you, and maybe even mirror your web resources.
Why do this?
Non-exhaustive list
- To access information published on the web without having access to the Internet
- To safeguard the work of individuals or organisations
- To collect evidence
- To help individuals and organisations in the face of potential or existing problems
- To participate in a network of mutual aid and solidarity
As far as I’m concerned, the technique is only incidental. What really counts and contributes to structuring collective capacities are the exchanges, discussions and encounters that can be generated by this type of operation.
- For an example of how mirroring can be used, the article 'Supporting trans literature without the GAFAMs', published by Bethany Karsten, may inspire you to save https://thetransfemininereview.com/ locally.
thetransfemininereview.com mirror: 556.1 MiB on March 23rd, 2025
Very few people and organisations have their own web hosting and storage infrastructure. As a result, many resources are rendered inaccessible (Internet connections interrupted or censored, files deleted, servers seized, hacking, etc.). Our efforts disappear, our assets are destroyed and our resources are stolen.
What's more, the lack of viable long-term storage poses problems for organisations that may be required by law to retain data, as well as for individuals who wish to reduce the risks linked to how long they keep their data.
As individuals and organisations generate ever-increasing amounts of information, in the form of web pages that need to be stored for the long term, storage costs continue to rise and reliable long-term digital storage is in short supply.
There is also the complementary option of cold storage: storage for archived data that must be retained for the long term but does not require continuous access and/or easy accessibility of the storage device.
See also Backup strategy
This could be an external hard drive that is only connected to our machine after the `wget` action (which I will describe below): we copy our mirror onto it, then store it in a secure location, away from the internet and any power supply.
Encrypting stored data is essential!
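As a minimal sketch, assuming GnuPG is installed and the mirror lives in a folder named thetransfemininereview.com (adapt the names to your own case), we can pack and encrypt it symmetrically before copying it to the external drive:

$ tar -czf mirror.tar.gz thetransfemininereview.com
$ gpg --symmetric --cipher-algo AES256 mirror.tar.gz
$ shred -u mirror.tar.gz

`gpg --symmetric` asks for a passphrase and produces `mirror.tar.gz.gpg`; `shred -u` then removes the unencrypted archive.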
Why is cold storage important?
- The data is backed up to a device whose system is shut down when it is not in use.
- This cold backup is included in the '3-2-1-1-0' backup strategy:
- Keep at least 3 copies of your data
- Use 2 different storage media (the one with the mirror and the one with the copy of the mirror)
- Have at least 1 copy physically stored elsewhere
- Keep at least 1 copy offline, on a non-networked device
- Verify your backups so that they contain 0 errors (see the checksum sketch after this list)
- Your computer, and other everyday household devices, can be destroyed by a disaster, or lost, or stolen, or legally seized.
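For the '0 errors' point, a minimal sketch using checksums (the file names are assumptions): generate a checksum when the backup is made, then re-check it on each copy:

$ sha256sum mirror.tar.gz.gpg > mirror.sha256
$ sha256sum -c mirror.sha256

`sha256sum -c` prints 'OK' when the copy matches the recorded checksum.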
Wget and the mirror
Important! Always ask the people in charge of the website you want to mirror before taking any action! The server you send requests to may be set up with sabotage measures against 'AI' and/or scraping robots, and may literally fill you up with hundreds of gigabytes of rubbish. See, for example, this documentation on the use of the iocaine software, and this one. Also, a website may contain pages with deliberately malicious and/or rogue resources.
`wget` allows you to make requests from one machine to a remote machine, for example from your computer to a server hosting the https://thetransfemininereview.com/ website. One of the consequences of these requests is that you transmit information such as your IP address, the date and time of the requests, and other data.
We’ll look at some measures to mitigate these consequences below.
GNU Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols.
Reader, I'll let you install `wget` as you like, then choose which folder you want to work from to download the mirrors of the websites you care about.
With `wget` we're going to generate a folder named after the copied website. You can open this folder (right click) in your web browser and view and browse it whenever you want, even without being connected to the internet.
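For example, once the mirror of the gnu.org example below exists, its entry page can be opened straight from the terminal (assuming a Linux desktop where xdg-open is available; macOS users would use open instead):

$ xdg-open www.gnu.org/index.html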
According to the `wget` documentation, under what is called 'very advanced usage', if you want to use wget in a terminal user interface (TUI) to mirror a web page or an entire website on your computer:
$ wget -m -k -K -E https://www.gnu.org/
With:
- `-m` (Mirror): creates a local mirror on your machine
- `-k` (Convert Links): converts absolute hyperlinks to relative hyperlinks for offline viewing
- `-K` (`--backup-converted`): saves the original file X as X.orig before conversion; the .orig extension stands for 'original file' and designates a file that has not been modified or altered in any way
- `-E` (Adjust Extension): changes the file extensions of web pages to .html for offline viewing
We can therefore write the same thing as:
$ wget -mkKE https://www.gnu.org/
We may want more functionality, because we want a precise and useful mirror, using:
- `-p` (Page Requisites): includes page dependencies such as images, stylesheets, etc. in the download
- `-c` (Continue): resumes a download of a website that was partially downloaded
This now looks like:
$ wget --mirror -p -c -k -K -E <url>
Or
$ wget -mpckKE <url>
More advanced: making requests with gloves on
As I said, we have to worry about the traces we leave when we make requests that send data in order to receive other data.
Then, in order to make requests in a less brutal way and avoid being banned by the server, we can add a few arguments.
$ wget -mpckKE --user-agent="" -e robots=off <url>
With:
- `--user-agent=""`: `wget` does not provide a user agent.
- `-e robots=off`: websites specify crawling rules in the robots.txt file to guide crawlers, and by default `wget` respects these restrictions. However, some testing or archiving scenarios require `robots.txt` to be ignored in order to access restricted paths. By passing `-e robots=off` (or by setting `robots = off` in your wgetrc configuration, as sketched below), `wget` will ignore these directives. Although practical for certain tasks, this must be done ethically and with respect for server resources.
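As a minimal sketch, the same setting can live in your `~/.wgetrc` configuration file, so that it no longer needs to be passed on the command line; the file would contain the single line:

robots = off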
Warning! As I said, always ask before mirroring a website that is not your own. And not respecting the rules in the server's `robots.txt` can seriously annoy the person doing the system administration, or even result in your `wget` requests being dropped.
Taking a more cautious approach to the information we transmit
We can also continue to reduce our footprint and mitigate privacy issues by using the Tor network. The server we send our requests to will then see the IP address of a Tor exit node instead of the IP address of our actual connection.
$ torsocks wget -mpckKE --user-agent="" -e robots=off <url>
We can also add another level of mitigation with a waiting time between each request, so as not to be too brutal on the target server and to avoid being rejected or banned.
$ torsocks wget -mpckKE --user-agent="" -e robots=off --wait 1 <url>
However, this option can increase the mirroring time from a few seconds (for a static site) to 3 or 4 minutes, and up to several hours for a heavier and/or more complex site.
Running requests for multiple websites
Create a `list.txt` file containing all the URLs of the websites you want to mirror. Then simply run (note the `-i` option, which tells `wget` to read the URLs from a file):

$ torsocks wget -mpckKE --user-agent="" -e robots=off --wait 1 -i list.txt
Some people may want to make an alias called `mirror`, to have a command like:
$ mirror <url>
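A minimal sketch of such an alias, to add to your `~/.bashrc` or `~/.zshrc` (the options are simply the ones used above, to adapt to taste):

alias mirror='torsocks wget -mpckKE --user-agent="" -e robots=off --wait 1'

After reloading the shell configuration, `mirror <url>` runs the full command.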
Others may want to torify a script (bash, zsh, other), which would give:
$ torify ./mirror.sh
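As a sketch (the script name and its contents are assumptions, not a canonical implementation), such a `mirror.sh` could simply loop over the URLs it receives as arguments:

#!/bin/sh
# mirror.sh: mirror every URL passed as an argument
for url in "$@"; do
    wget -mpckKE --user-agent="" -e robots=off --wait 1 "$url"
done

Since `torify` wraps the whole script, the `wget` traffic goes through Tor without needing `torsocks` inside the script itself.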
We can also plan and automate such requests in advance, for example with cron, to repeat our actions with regularity.
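For example, a sketch of a crontab entry (edit with `crontab -e`; the paths are assumptions to adapt) that refreshes the mirrors from our `list.txt` every Sunday at 03:00:

0 3 * * 0 cd /home/user/mirrors && torsocks wget -mpckKE --user-agent="" -e robots=off --wait 1 -i /home/user/list.txt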