How to Collect Images for AI Projects

Gather and Label Pictures for Machine Learning

Introduction

AI programs that can “see” are used in many exciting ways, like unlocking phones with your face or helping doctors detect diseases in X-rays. But to make these programs smart, they need lots of images to learn from.

The better and more organized the images, the smarter the AI will be. But where can you get all these pictures? One great way is by using web scraping.

I remember when I first tried to make an AI that could recognize different birds. At first, I was excited, imagining my model effortlessly identifying every species.

But that excitement quickly turned to frustration when I realized that many of the images were mislabeled or completely unrelated. It felt like I had spent hours collecting a treasure chest only to open it and find it full of junk!

That’s when I learned that collecting pictures is just the start—you also have to clean and label them properly.

Web scraping is like using a giant fishing net to catch fish. Instead of grabbing one fish at a time, you pull in a whole lot at once.

But just like fishing, some of what you collect isn’t useful—you might catch seaweed, old boots, or the wrong type of fish. That’s why sorting through the data is just as important as collecting it.

In this guide, I’ll show you how to gather, organize, and label images so your AI project gets the best data possible.


1. What is Web Scraping?

Web scraping is a way to automatically collect pictures and other information from websites. Here’s how it works:

  • Making Requests: A program asks a website for its content (just like when you visit a page in your browser).
  • Reading the Page: The program looks through the website’s code to find images.
  • Saving the Images: The program downloads the images onto your computer.
  • Cleaning Up: You check the images to make sure they’re useful and organized.

Is Web Scraping Allowed?

Before scraping a site, always check its robots.txt file (found at the site's root, e.g. example.com/robots.txt) to see which pages it allows automated programs to access. Ignoring these rules could violate the website's terms of service or even lead to legal trouble.
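Python's standard library can check robots.txt rules for you. Here's a minimal sketch using urllib.robotparser; the rules and the "MyScraper/1.0" user agent below are hypothetical examples — in practice you would point the parser at the real file with rp.set_url(...) and rp.read():

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt policy for illustration.
rules = """\
User-agent: *
Disallow: /private/
Allow: /images/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Ask whether a given URL may be fetched before scraping it.
print(rp.can_fetch("MyScraper/1.0", "https://example.com/images/cat.jpg"))     # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/data.html"))  # False
```

Calling can_fetch() before each download is a simple way to keep your scraper polite.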

Be mindful of copyright laws and ethical considerations, especially when using scraped data for public or commercial projects.


2. Setting Up a Web Scraper

Tools You’ll Need

If you’re using Python, here are some great tools:

  • Requests – Helps fetch web pages.
  • BeautifulSoup – Helps find and extract images from a webpage.
  • Selenium – Helps when images load with JavaScript.
  • Scrapy – A more advanced tool for large scraping projects.

Simple Web Scraper in Python

Here’s a basic script to collect images from a website:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

os.makedirs("images", exist_ok=True)

response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for img in soup.find_all("img"):
    img_url = img.get("src")
    if not img_url:
        continue
    # Resolve relative URLs (e.g. "/static/cat.jpg") against the page URL
    img_url = urljoin(url, img_url)
    img_data = requests.get(img_url, headers=headers).content
    img_name = os.path.join("images", img_url.split("/")[-1])
    with open(img_name, "wb") as img_file:
        img_file.write(img_data)
    print(f"Downloaded: {img_name}")

This script fetches the page, finds every image tag, resolves each image's URL (many sites use relative paths like /static/cat.jpg, which would fail without urljoin), downloads the images, and saves them in an images folder.


3. Scraping Websites That Use JavaScript

Some websites use JavaScript to load images lazily: pictures only appear once you scroll down or interact with the page. This keeps initial load times fast and saves bandwidth, but it also means a plain HTTP request won't see those images at all. Selenium, which drives a real browser, can help with that:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
time.sleep(3)  # Wait for the initial page load

# Scroll to the bottom so lazily loaded images are triggered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # Give the newly loaded images time to appear

img_elements = driver.find_elements(By.TAG_NAME, "img")
for img in img_elements:
    print(img.get_attribute("src"))

driver.quit()

4. Organizing and Labeling Your Images

After collecting pictures, you need to organize and label them.

How to Organize Your Images

  • Create folders for different categories (e.g., cats/, dogs/).
  • Delete duplicate or low-quality images.
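Exact duplicates are easy to find automatically: two files with identical bytes have identical hashes. Here's a minimal sketch (the folder layout and function names are my own, not from a particular library) that keeps the first copy of each image and deletes the rest:

```python
import hashlib
import os

def file_hash(path):
    # Hash the raw bytes; byte-identical files produce identical digests.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def remove_duplicates(folder):
    seen = {}      # digest -> first filename seen with that digest
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        digest = file_hash(path)
        if digest in seen:
            os.remove(path)   # a byte-for-byte copy of an earlier file
            removed.append(name)
        else:
            seen[digest] = name
    return removed
```

Note that this only catches exact copies; near-duplicates (resized or re-compressed versions of the same photo) need perceptual hashing, which dedicated tools provide.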

How to Label Your Images

Labeling tells your AI what’s in each picture. You can do this with tools like:

  • LabelImg – Lets you draw bounding boxes around objects and saves them in common formats like PASCAL VOC or YOLO.
  • VIA (VGG Image Annotator) – A browser-based tool for marking regions (boxes, polygons, points) inside images.
  • Roboflow – A platform that helps with labeling, organizing, and converting datasets between formats.

Example of a labeled image:

{
    "image": "dog1.jpg",
    "annotations": [
        {"label": "dog", "bbox": [50, 100, 200, 250]}
    ]
}

5. Making Labeling Faster

Labeling thousands of pictures by hand is slow. Instead, you can use AI to label images for you:

  • Pre-trained Models – AI models like MobileNet can guess what’s in an image.
  • Active Learning – Let an AI label images, then correct its mistakes to improve accuracy.

Example using AI to label images:

import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights='imagenet')

img = tf.keras.preprocessing.image.load_img("image.jpg", target_size=(224, 224))
x = tf.keras.preprocessing.image.img_to_array(img)
x = tf.expand_dims(x, axis=0)
# MobileNetV2 expects pixel values scaled to [-1, 1]
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)

preds = model.predict(x)
# Turn the raw probabilities into human-readable (label, score) guesses
print(tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=3)[0])

6. Making a Balanced Dataset

If you have too many pictures of one category and not enough of another, your AI might become biased. You can fix this by:

  • Augmenting Data – Flip, rotate, or change the brightness of images to create more variety.
  • Adding More Images – Find more pictures to balance the dataset.
  • Removing Extra Images – Cut down images from overrepresented categories.

7. Conclusion

Web scraping is a great way to collect images for AI projects, but it’s only the first step. After scraping, you need to clean, organize, and label your images so they work well for training AI models.

What’s Next?

  • Try scraping images for your own dataset.
  • Use labeling tools to mark your images.
  • Train an AI model with your dataset.

By following these steps, you’ll be on your way to building a great computer vision project!


If you’re ready to streamline your web scraping workflow, Octoparse makes it easy to extract and organize data, including images, text, links, and product details, without coding.

With its user-friendly interface and powerful automation features, such as scheduled scraping and data extraction templates, you can quickly gather data from websites and structure it effortlessly.

Whether you’re a beginner or an experienced scraper, Octoparse simplifies the process and saves you valuable time.

Try Octoparse today to automate your web scraping tasks, save time, and get the data you need with ease!
