
How to Download 100M Images in 20 Hours: The Ultimate Guide to Building Massive AI Training Datasets


┌─────────────────────────────────────────────────────────────┐
│  ⚡ 100M IMAGES IN 20 HOURS: THE COMPLETE BREAKDOWN ⚡      │
├─────────────────────────────────────────────────────────────┤
│  🎯 TOOL: img2dataset by LAION                              │
│  🔥 SPEED: 1,350 images/second (4.8M/hour)                  │
│  💾 OUTPUT: ~2.9TB (256x256 JPG, quality 95)                │
│  🖥️  HARDWARE: 1Gbps, 16 cores, 32GB RAM                   │
│  ⏱️  TIME TO 100M: 20.8 hours                               │
├─────────────────────────────────────────────────────────────┤
│  3 SIMPLE STEPS:                                            │
│  1️⃣  Setup: pip install img2dataset + configure DNS        │
│  2️⃣  Prepare: URL list (txt/json/parquet)                  │
│  3️⃣  Download: Run with optimized threads                  │
├─────────────────────────────────────────────────────────────┤
│  🔒 SAFETY FIRST:                                           │
│  ✓ Respect robots.txt & X-Robots-Tags                       │
│  ✓ Use --disallowed_header_directives filter                │
│  ✓ Enable incremental mode to resume fails                  │
│  ✓ Set user_agent_token for transparency                    │
├─────────────────────────────────────────────────────────────┤
│  💡 PRO TIP: WebDataset format >1M images avoids filesystem │
│     overload                                                │
└─────────────────────────────────────────────────────────────┘

The Data Gold Rush

In the AI revolution, data is the new oil. Training state-of-the-art computer vision models requires massive image datasets: we're talking hundreds of millions of images. But here's the challenge: how do you ethically and efficiently download 100 million images without breaking the internet (or your infrastructure)?

Enter img2dataset, the powerhouse tool developed by LAION that transforms a simple list of URLs into a battle-ready training dataset at blistering speeds. This open-source marvel achieves what was once impossible: downloading, resizing, and packaging 100 million images in just 20 hours on a single machine.

Whether you're training the next CLIP model, building a generative AI system, or conducting large-scale research, this guide will show you exactly how to harness this capability safely and effectively.


🚀 Real-World Case Studies: Speed in Action

Case Study #1: LAION-400M - The Foundation of Open-Source AI

Dataset: 400 million image-text pairs
Time to Download: 3.5 days (continuous)
Average Speed: ~1,320 images/second
Infrastructure: Single node with 1Gbps connection, 32GB RAM, 16-core CPU
Result: The dataset that powered Stable Diffusion v1 and countless other models

Key Insight: The LAION team used img2dataset with CLIP similarity filtering (threshold >0.3) to create a high-quality subset from Common Crawl data. They processed petabytes of WAT files to extract image URLs with alt-text, demonstrating the importance of a robust preprocessing pipeline.
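
If you want to reproduce this style of filtering on your own data, here is a minimal sketch using Hugging Face's CLIP implementation; the checkpoint and the 0.3 threshold mirror LAION's published setup, but this is an illustration, not LAION's exact pipeline:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between an image and its alt-text."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float(image_emb @ text_emb.T)

# Keep only pairs above the LAION-style threshold
keep = clip_similarity(Image.open("sample.jpg"), "a photo of a dog") > 0.3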

Case Study #2: COYO-700M - Pushing the Boundaries

Dataset: 747 million image-text pairs + metadata
Time to Download: ~6.5 days on optimized infrastructure
Unique Features: Preserved additional metadata (aesthetic scores, watermark detection, safety ratings)
Storage: ~15TB in WebDataset format
Use Case: Training multimodal models with rich attribute understanding

Key Insight: COYO leveraged img2dataset's --save_additional_columns feature to retain comprehensive metadata, enabling researchers to filter by aesthetic quality, safety, and watermarks post-download.
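
A minimal sketch of the same idea via img2dataset's Python API; the input file and column names below are hypothetical placeholders for whatever metadata your parquet actually carries:

from img2dataset import download

download(
    url_list="coyo_style_urls.parquet",  # hypothetical input file
    input_format="parquet",
    url_col="url",
    caption_col="text",
    # carry extra metadata columns through to the output shards
    save_additional_columns=["aesthetic_score", "watermark_score", "nsfw_score"],
    output_format="webdataset",
    output_folder="dataset_with_metadata",
)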

Case Study #3: LAION-5B - Distributed Scaling Mastery

Dataset: 5.85 billion image-text pairs
Time to Download: 7 days
Infrastructure: 10 nodes processing in parallel (PySpark distributor)
Average Speed: 9,500 images/second (cluster-wide)
Total Storage: 240TB

Key Insight: At this scale, single-machine processing becomes impractical. The team used img2dataset's PySpark integration to distribute shards across 10 machines, achieving near-linear scaling. Each node handled ~585M URLs independently.
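
A minimal sketch of this distributed setup, assuming an existing Spark cluster; the master URL, memory setting, and storage path are placeholders (see img2dataset's distributed tutorial for real-world tuning):

from pyspark.sql import SparkSession
from img2dataset import download

# img2dataset reuses the active Spark session when distributor="pyspark"
spark = (
    SparkSession.builder
    .master("spark://master-node:7077")      # hypothetical cluster URL
    .config("spark.executor.memory", "16G")  # placeholder sizing
    .appName("img2dataset-distributed")
    .getOrCreate()
)

download(
    url_list="urls.parquet",
    input_format="parquet",
    output_format="webdataset",
    output_folder="hdfs://cluster/dataset",  # distributed storage target
    distributor="pyspark",
)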


🛠️ Essential Tools & Alternatives

Primary Tool: img2dataset (⭐ Recommended)

Alternative Tools Comparison

| Tool | Speed | Scale | Resizing | Ethics Features | Best Use Case |
|---|---|---|---|---|---|
| img2dataset | 1,350 img/s | 100M-5B+ | ✅ Yes | ✅ X-Robots-Tag | AI training, research |
| wget/curl scripts | 50-200 img/s | <1M | ❌ No | ❌ Minimal | Quick prototypes |
| Scrapy | 300-500 img/s | <10M | ❌ Manual | ⚠️ Partial | Custom crawling logic |
| FiftyOne | Varies | <1M | ✅ Yes | ❌ Limited | Computer vision workflows |
| TensorFlow Datasets | Varies | Pre-built only | ✅ Yes | ✅ Yes | Standard academic datasets |

Infrastructure Requirements

Minimum Setup (10M images):

  • 4 CPU cores
  • 16GB RAM
  • 100 Mbps connection
  • 500GB SSD storage

Recommended Setup (100M+ images):

  • 16+ CPU cores (i7/Ryzen 9)
  • 32GB+ RAM
  • 1Gbps dedicated connection
  • 2TB+ NVMe SSD
  • Local DNS resolver (bind9/knot)

Enterprise Setup (1B+ images):

  • 10+ node cluster (32 cores/node)
  • 10Gbps aggregate bandwidth
  • 100TB+ distributed storage (HDFS/S3)
  • PySpark/Slurm orchestration

📋 Step-by-Step Safety Guide: Download Ethically & Efficiently

Phase 1: Pre-Download Preparation (1-2 hours)

Step 1: Legal & Ethical Compliance

# Check robots.txt for target domains
curl https://example.com/robots.txt

# Review Terms of Service
# Look for "noai", "noimageai" directives

# Configure img2dataset to respect opt-outs
img2dataset --disallowed_header_directives '["noai","noimageai","noindex"]' \
            --user_agent_token "AcmeResearchDownloader"

Step 2: Optimize DNS Resolution

# Install bind9 (Ubuntu/Debian)
sudo apt install bind9

# Configure for high performance
sudo nano /etc/bind/named.conf.options
# Add:
recursive-clients 10000;
resolver-query-timeout 30000;
max-clients-per-query 10000;
max-cache-size 2000m;

# Restart and set as default resolver
sudo systemctl restart bind9
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf

Step 3: Prepare Your URL List

# Example: Filter URLs before download
import re

import pandas as pd

# Load parquet with URLs and captions
df = pd.read_parquet('image_links.parquet')

# Remove duplicates
df = df.drop_duplicates(subset=['url'])

# Filter by domain if needed (escape dots so they aren't regex wildcards)
allowed_domains = ['flickr.com', 'wikimedia.org']
pattern = '|'.join(re.escape(d) for d in allowed_domains)
df = df[df['url'].str.contains(pattern)]

# Save in optimal format
df.to_parquet('filtered_urls.parquet', compression='snappy')

Phase 2: Configure & Launch (30 minutes)

Step 4: Test with Small Sample

# Download 1,000 images first (parquet is binary, so sample with pandas rather than head)
python -c "import pandas as pd; pd.read_parquet('filtered_urls.parquet').head(1000).to_parquet('test_sample.parquet')"

img2dataset --url_list=test_sample.parquet \
            --input_format=parquet \
            --output_format=webdataset \
            --image_size=256 \
            --thread_count=32 \
            --number_sample_per_shard=10000 \
            --output_folder=test_output

Step 5: Monitor Performance

# Enable Weights & Biases logging
pip install wandb
wandb login

img2dataset --enable_wandb=True \
            --wandb_project="my-dataset-download"

Step 6: Full-Scale Launch

# OPTIMIZED COMMAND FOR 100M IMAGES
img2dataset --url_list=filtered_urls.parquet \
            --input_format=parquet \
            --output_folder=/mnt/nvme/dataset \
            --output_format=webdataset \
            --image_size=256 \
            --encode_quality=95 \
            --processes_count=16 \
            --thread_count=256 \
            --number_sample_per_shard=10000 \
            --timeout=10 \
            --retries=2 \
            --incremental_mode=incremental \
            --compute_hash=sha256 \
            --disallowed_header_directives '["noai","noimageai","noindex","noimageindex"]' \
            --user_agent_token="MyResearchProject_v1.0" \
            --max_shard_retry=3 \
            --enable_wandb=True

Phase 3: Safety & Verification (Ongoing)

Step 7: Verify Downloads

# Check successful downloads (each shard writes a metadata parquet)
find /mnt/nvme/dataset -name "*.parquet" | xargs -n 1 \
  parquet-tools head -n 5

# Verify hashes if available
img2dataset --verify_hash '["md5","md5"]' \
            --compute_hash="md5"

Step 8: Handle Failures Gracefully

# Python script to resume failed shards
from img2dataset import download

download(
    url_list="filtered_urls.parquet",
    input_format="parquet",
    output_folder="/mnt/nvme/dataset",
    output_format="webdataset",
    incremental_mode="incremental",  # only download missing shards
    max_shard_retry=3,
)

Step 9: Respect Rate Limits

# Add delays for specific domains (if needed)
# Use --timeout and ensure thread_count isn't overwhelming small servers
# Monitor with: netstat -an | grep ESTABLISHED | wc -l
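
Before launching at full thread count, it helps to know whether a few domains dominate your URL list, since those hosts will absorb most of your concurrent connections. A minimal pandas sketch (assuming the filtered_urls.parquet file and url column from Step 3):

from urllib.parse import urlparse
import pandas as pd

df = pd.read_parquet("filtered_urls.parquet")
df["domain"] = df["url"].map(lambda u: urlparse(u).netloc)

# Heavily-represented domains deserve lower effective concurrency
# (or a separate, throttled run)
counts = df["domain"].value_counts()
print(counts.head(20))
print("Top-10 domain share:", counts.head(10).sum() / len(df))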

⚠️ Critical Safety Rules

RULE #1: Never Bypass Robots.txt or X-Robots-Tags

# ❌ WRONG: Ignoring all directives
img2dataset --disallowed_header_directives '[]'

# ✅ RIGHT: Respect creator opt-outs
img2dataset --disallowed_header_directives '["noai","noimageai"]'

RULE #2: Use Transparent User-Agent

# ✅ Include contact/project info
--user_agent_token="StanfordCVLab_ProjectName_Contact@stanford.edu"

RULE #3: Implement Incremental Mode

# ✅ Allows resuming without re-downloading
--incremental_mode=incremental

RULE #4: Monitor Bandwidth Impact

# ✅ Limit concurrent connections if needed
--thread_count=128  # Reduce from default 256 if causing issues

RULE #5: Validate Content Post-Download

# ✅ Check for corrupted images
from PIL import Image

def validate_image(filepath):
    """Return True if the file is a readable, uncorrupted image."""
    try:
        with Image.open(filepath) as img:
            img.verify()
        return True
    except Exception:
        return False
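
For example, sweeping a folder of extracted images (the directory path is illustrative; with WebDataset output you would extract or stream the tar shards first):

from pathlib import Path

corrupt = [p for p in Path("test_output_extracted").rglob("*.jpg")
           if not validate_image(p)]
print(f"{len(corrupt)} corrupted images found")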

💡 Real-World Use Cases

Use Case 1: Training Multimodal AI Models

Scenario: Building a CLIP-like vision-language model
Dataset: 400M image-text pairs (LAION-400M)
Pipeline: Download → CLIP filtering → Resizing → WebDataset → Training
Result: State-of-the-art zero-shot classification

Use Case 2: Generative AI Development

Scenario: Training Stable Diffusion on specific domains
Dataset: 50M high-resolution art images (filtered from LAION-Aesthetic)
Pipeline: Download → Aesthetic scoring → Watermark filtering → Augmentation
Result: Domain-specific image generation model

Use Case 3: Academic Research

Scenario: Studying visual representation learning
Dataset: 10M curated subset of ImageNet + LAION
Pipeline: Download → ResNet preprocessing → TFRecord format
Result: Reproducible computer vision experiments

Use Case 4: E-commerce Analysis

Scenario: Product image classification at scale
Dataset: 20M product images from public listings
Pipeline: Download → Standardize format → Metadata extraction
Result: Automated product tagging system

Use Case 5: Medical Imaging Research

Scenario: Collecting public medical images for AI diagnostics
Dataset: 2M de-identified radiology images
Pipeline: Download → Privacy checking → DICOM conversion
Result: Disease detection model (with IRB approval)


🎯 Performance Optimization Tips

For Maximum Speed:

  1. DNS is King: Use a local bind9 or knot resolver (run several instances, e.g. 4)
  2. SSD is Essential: The pipeline sustains 30-130MB/s of writes; NVMe handles this easily
  3. Thread Tuning: Start with 256 threads, adjust based on CPU usage
  4. Format Matters: WebDataset format reduces filesystem overhead by 10x (see the reading sketch after this list)
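
A minimal sketch of streaming img2dataset's WebDataset output for training, using the webdataset library (pip install webdataset); the shard range and metadata keys assume img2dataset's default naming:

import webdataset as wds

dataset = (
    wds.WebDataset("dataset/{00000..00009}.tar")
    .decode("pil")             # decode images to PIL
    .to_tuple("jpg", "json")   # img2dataset stores image + json metadata
)

for image, meta in dataset:
    print(meta.get("caption"), image.size)
    break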

For Cost Efficiency:

  1. Use Spot Instances: Save 70% on cloud costs
  2. Compress Smart: JPG quality 95 offers the best size/quality trade-off (2.9TB vs 9.8TB for PNG at 100M images)
  3. Filter Early: Remove duplicates and low-quality URLs before download
  4. Incremental Saves: Resume failed downloads instead of starting over

For Quality Control:

  1. Hash Verification: Use --compute_hash=sha256 for integrity
  2. Size Filtering: --min_image_size=100 removes thumbnails
  3. Aspect Ratio: --max_aspect_ratio=3.0 filters banners
  4. Metadata Preservation: Always keep the .parquet files for analysis (see the audit sketch after this list)
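
A minimal sketch of auditing a finished run from those metadata files; the status and error_message columns follow img2dataset's documented output schema, but verify against your version:

from pathlib import Path
import pandas as pd

stats = pd.concat(
    pd.read_parquet(p) for p in Path("/mnt/nvme/dataset").glob("*.parquet")
)
success_rate = (stats["status"] == "success").mean()
print(f"Success rate: {success_rate:.1%}")
# Most common failure reasons (timeouts, robots directives, 404s, ...)
print(stats.loc[stats["status"] != "success", "error_message"]
      .value_counts().head(10))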

🔬 Technical Deep Dive: How It Achieves 1,350 img/s

The secret lies in a shard-based multiprocessing architecture (a toy sketch follows the list):

  1. URL Sharding: URLs split into 10K-sample shards → 10,000 shards for 100M
  2. Process Pool: Each CPU core gets its own shard and output tar file
  3. Thread Explosion: 256+ threads per process handle async I/O
  4. CPU Efficiency: Only 1 resize thread per core prevents overload
  5. Bandwidth Saturation: Thousands of parallel connections maximize pipe
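
To make the model concrete, here is a toy sketch of the shard/process/thread pattern; it is not img2dataset's actual implementation, just an illustration of the concurrency layout:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List, Optional
import urllib.request

def fetch(url: str) -> Optional[bytes]:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except Exception:
        return None  # failures are recorded, not retried inline

def process_shard(shard_urls: List[str]) -> int:
    # Each process saturates the network with many I/O threads;
    # CPU-heavy resizing would stay at ~1 worker per core
    with ThreadPoolExecutor(max_workers=256) as pool:
        return sum(r is not None for r in pool.map(fetch, shard_urls))

def download_all(urls: List[str], shard_size: int = 10_000, processes: int = 16) -> int:
    shards = [urls[i:i + shard_size] for i in range(0, len(urls), shard_size)]
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return sum(pool.map(process_shard, shards))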

Benchmark Results:

  • 18M images: 3.7 hours (1350 img/s)
  • 36M images: 7.4 hours (1345 img/s)
  • 190M images: 41 hours (1280 img/s)
  • 100M images: ~20.8 hours (theoretical optimal)

📊 Expected Resource Consumption

For 100M images at 256x256 resolution:

| Resource | Usage | Notes |
|---|---|---|
| Bandwidth | ~5TB total | 50KB average per image |
| Disk Write | 50-100MB/s sustained | NVMe recommended |
| CPU | 80-95% across 16 cores | Mostly resizing operations |
| Memory | 8-12GB peak | Per-process overhead |
| DNS Queries | 100M+ | Requires local resolver |
| Storage | 2.9TB (JPG Q95) | or 9.8TB (PNG) |

🎓 Best Practices for Ethical AI Training

1. Transparency Documentation

Create a DATASET_CARD.md:

# Dataset Card
- **Source:** LAION-400M subset
- **Filtering:** CLIP similarity >0.3, aesthetic >7
- **Opt-out respected:** ✓ X-Robots-Tags honored
- **Contact:** research-team@university.edu
- **License:** CC-BY-4.0 (where applicable)

2. Attribution

When publishing models, cite:

@misc{beaumont-2021-img2dataset,
  author = {Romain Beaumont},
  title = {img2dataset: Easily turn large sets of image urls to an image dataset},
  year = {2021},
  howpublished = {\url{https://github.com/rom1504/img2dataset}}
}

3. Data Minimization

Only download what you'll actually use:

# Filter by relevance BEFORE download
--max_image_area 1048576  # Max 1024x1024
--min_image_size 100      # Remove too-small images

4. Regular Audits

# Monthly check: Are we respecting new opt-outs?
grep -r "noai" /var/log/img2dataset/

⚡ Quick Start Commands

For Beginners (10K images):

pip install img2dataset
echo "https://picsum.photos/200/300" > urls.txt
img2dataset --url_list=urls.txt --output_folder=images

For Researchers (1M images):

img2dataset --url_list=urls.parquet \
            --input_format=parquet \
            --output_format=webdataset \
            --processes_count=8 \
            --image_size=224

For Production (100M+ images):

# See full command in Step 6 above
# Run in screen/tmux for persistence
screen -S download
# ... command ...
# Detach with Ctrl+A then D

🚨 Troubleshooting Common Issues

| Problem | Cause | Solution |
|---|---|---|
| Slow downloads (<500 img/s) | Poor DNS resolution | Set up a local bind9 resolver |
| Corrupted images | Network timeouts | Increase --timeout=15, add retries |
| Disk full | Underestimated storage | Use WebDataset, lower quality to 90 |
| CPU bottleneck | Too many resize threads | Reduce --processes_count |
| Memory errors | Shard size too large | Lower --number_sample_per_shard |
| Legal concerns | Ignoring opt-outs | Enable --disallowed_header_directives |

🏆 Final Checklist: Before You Hit Enter

  • Reviewed robots.txt for target domains
  • Configured local DNS resolver (bind9/knot)
  • Tested on 1K sample first
  • Enabled incremental mode
  • Set transparent user_agent_token
  • Respecting X-Robots-Tags with --disallowed_header_directives
  • Have 3x storage space available (9TB for 100M)
  • Running in tmux/screen for persistence
  • Enabled W&B for monitoring
  • Documented dataset sources and filtering

📣 Share This Guide

Found this useful? Share the infographic above with your team! Tag #img2dataset and #MLCommunity to help others build ethical AI datasets.


Created with insights from LAION's official benchmarks and ethical AI practices. For the latest updates, follow the project at https://github.com/rom1504/img2dataset/
