British Ordnance Collectors Network


Batch download pictures from archives.gov

sgdbdr

Well-Known Member
Hi,

In the example above, there are 1000 files and no way to download them all at once.

I found a small free utility that records macros from mouse moves and clicks and replays them as many times as you need. You can also add a delay between repeats so that each file finishes downloading before the next one starts.

https://www.pymacrorecord.com/download
(I use the portable version)

But you have to leave the computer alone during the download.

If someone has a better solution, I am all ears...

Cheers,

S.
 
Hi,

the first and last images are:

Code:
https://s3.amazonaws.com/NARAprodstorage/lz/dc-metro/rg-243/561742/M1652/M1652-0060/M1652-0060-00001.tif
...
https://s3.amazonaws.com/NARAprodstorage/lz/dc-metro/rg-243/561742/M1652/M1652-0060/M1652-0060-01000.tif

So you can easily create a download list with Excel, LibreOffice Calc, etc. and then load the list into a download manager of your choice.
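If you would rather not build the list in a spreadsheet, the same list can be generated with a few lines of Python. This is only a minimal sketch: the function names and the output file name urls.txt are my own choices, and it assumes the five-digit zero-padded numbering seen in the first and last URLs above.

```python
from pathlib import Path


def build_url_list(base_url, first, last, width=5, extension="tif"):
    """Return one URL per image number, zero-padded to `width` digits."""
    return [f"{base_url}{n:0{width}d}.{extension}" for n in range(first, last + 1)]


def write_url_list(urls, path):
    """Write the URLs to a text file, one per line."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Base URL taken from the example above, cut just before the image number.
    base = ("https://s3.amazonaws.com/NARAprodstorage/lz/dc-metro/rg-243/"
            "561742/M1652/M1652-0060/M1652-0060-")
    urls = build_url_list(base, 1, 1000)
    write_url_list(urls, "urls.txt")  # hypothetical output file name
    print(f"Wrote {len(urls)} URLs to urls.txt")
```

The resulting text file has one URL per line, which is the format most download managers accept for a batch import.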

If you like coding, you can write your own simple downloader that creates a download list and downloads the files automatically. If you use C#, the WebClient class (for easy downloads) and HttpWebRequest (for more complicated downloads) are your friends. I think it takes fewer than 100 lines of code to do this.

For my own purposes I wrote a small tool, which I called "sequential downloader", with a user-friendly interface where I can enter the first and last file numbers (1 and 1000 in this case). The download link is entered as:

Code:
https://s3.amazonaws.com/NARAprodstorage/lz/dc-metro/rg-243/561742/M1652/M1652-0060/M1652-0060-0[###].tif

During the download, the tool replaces [###] with the actual image number for each file, adding leading zeros.
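The placeholder idea is easy to reproduce. Since the tool itself is not shared, the following Python is only a guess at the behaviour, not its actual code: the function name is hypothetical, and I assume the number is padded to as many digits as the last file number has, which yields 00001 to 01000 for the link above.

```python
import re

# Matches a placeholder such as [#], [###] or [#####].
PLACEHOLDER = re.compile(r"\[#+\]")


def expand_template(template, first, last):
    """Replace the [###] placeholder with each number from first to last."""
    if not PLACEHOLDER.search(template):
        raise ValueError("template contains no [###] placeholder")
    # Assumption: pad every number to the digit count of the last number.
    width = len(str(last))
    return [
        PLACEHOLDER.sub(f"{n:0{width}d}", template, count=1)
        for n in range(first, last + 1)
    ]


if __name__ == "__main__":
    link = ("https://s3.amazonaws.com/NARAprodstorage/lz/dc-metro/rg-243/"
            "561742/M1652/M1652-0060/M1652-0060-0[###].tif")
    urls = expand_template(link, 1, 1000)
    print(urls[0])
    print(urls[-1])
```

With first = 1 and last = 1000, the literal 0 in front of the placeholder plus the four padded digits give the five-digit numbers used in the file names.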

But sorry, I can't share this tool. It is in a pre-alpha state with more bugs than features, and it contains a lot more code for libraries where downloading is much more complicated (some libraries offer no download option, display the images as small tiles in a viewer, and only authorize a download when valid cookies are delivered). Some also use random GUIDs as filenames instead of a linear numbering system. That means the code isn't universal enough and usually needs to be changed to run properly.

//Edit:
You can also use the tool "wget" to download the images. Simply put "wget " in front of each URL in your Excel-generated list, then save the list as a Windows .bat file and run it.
 
Thanks! You gave me an idea. I asked ChatGPT for a Python script. I tried it and it seems to work fine!
----------------------------------------------

import os
import requests

def download_images(base_url, save_dir, start, end, file_extension):
    """
    Download images with sequential names from a base URL.

    :param base_url: Base URL without the sequence number (e.g., "http://example.com/images/image_").
    :param save_dir: Directory where the downloaded images are saved.
    :param start: First number in the file name sequence.
    :param end: Last number in the file name sequence.
    :param file_extension: File extension (e.g., "jpg", "png").
    """
    # Create the directory if necessary
    os.makedirs(save_dir, exist_ok=True)

    for i in range(start, end + 1):
        # Zero-pad the number to five digits, e.g. 00001.tif
        file_name = f"{i:05}.{file_extension}"
        url = f"{base_url}{file_name}"
        save_path = os.path.join(save_dir, file_name)

        try:
            print(f"Downloading {url}...")
            # Stream the response so large files are not held in memory
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()
            with open(save_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            print(f" -> Saved as {save_path}")
        except (requests.RequestException, OSError) as e:
            # Report the error and continue with the next file
            print(f"Error while downloading {url}: {e}")

if __name__ == "__main__":
    # Usage example
    base_url = "https://s3.amazonaws.com/NARAprodstorage/lz/dc-metro/rg-243/561742/M1652/M1652-0001/M1652-0001-"  # Replace with your base URL
    save_dir = r"E:\Downloads"  # Folder where the images are saved (raw string keeps the backslash literal)
    start = 1  # First number
    end = 10  # Last number
    file_extension = "tif"  # File extension

    download_images(base_url, save_dir, start, end, file_extension)

---------------------------

I will also try the Excel/wget tip.

Cheers,

S.
 
Excel/wget tip:

Create a text file (urls.txt) with each URL on its own line, generated with Excel or similar, as you said.

Then run this command in a terminal: wget -i (complete path if needed)\urls.txt

Cheers,

s.
 
--------------------------------------------------------------------------------------------------
Another problem, outlined by Alpini, is when a photo has a name like this: 2g7K2-23h-789-S234k.tif. The next photo then has a different name, but with the same layout of character blocks: 5-3-3-5, followed by .tif.
Each subsequent photo is labelled with what look like random characters, in this case a combination of digits 0-9, lowercase a-z and uppercase A-Z. So it is always 5 characters, a dash, 3 characters, a dash, 3 characters, a dash, and 5 characters (characters can be repeated).
The whole name is a unique designation that is not used for any other file.
Idea: generate every possible combination of characters that fits these parameters.
Then import this "Source.txt" into an automatic download manager and systematically probe the storage. We don't know which names are used and which are not. That's why we use combinatorics.
There is a sub-problem that, if solved, would rapidly reduce the number of combinations: how to guess the blocks that are actually used for file naming.
Ideas are welcome:
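Before generating such a list, it may help to estimate its size. The rough calculation below assumes every position can independently be any of the 62 characters (0-9, a-z, A-Z) and that the layout is always 5-3-3-5, as described above. The request rate of 1000 per second is a made-up figure for illustration only.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase letters = 62 symbols
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BLOCKS = (5, 3, 3, 5)  # block lengths from the example file name


def candidate_count(alphabet_size=len(ALPHABET), blocks=BLOCKS):
    """Number of possible names if every position is independent."""
    return alphabet_size ** sum(blocks)


def years_to_try_all(count, requests_per_second):
    """Time needed to request every candidate once, in years."""
    seconds_per_year = 365 * 24 * 3600
    return count / (requests_per_second * seconds_per_year)


if __name__ == "__main__":
    total = candidate_count()
    print(f"{total:.3e} candidate names")
    # 1000 requests per second is a hypothetical rate, not a measured one.
    print(f"{years_to_try_all(total, 1000):.3e} years at 1000 requests per second")
```

With 16 free positions the count is 62 to the power of 16, which is on the order of 10^28, so the complete list could be neither stored nor downloaded in practice. Any brute-force approach therefore depends on first narrowing down the blocks.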

Akon .
 