Chainsaw-OCR – My Very First Blog Post!

Hello world! This is Oreng, coming to you live from the vast expanse of cyberspace. Sorry, still figuring out how to kick off these posts, but here we are. For my very first blog post, I’m diving into a cool little project I’ve been working on: an app called Chainsaw-OCR. If you’re a fan of Chainsaw Man, a Python coder, or even just my mom (hi mom!), stick around because things are about to get interesting.

So What is Chainsaw-OCR?

Chainsaw-OCR is an app I’m building that uses the MangaDex API to scan manga pages, extract text from them, and allow users to search for specific phrases. If you’re asking yourself “What’s OCR? API? Who or what is MangaDex? Is that like some new Pokémon?” — don’t worry, I got you covered. Here’s a quick breakdown for the non-coders out there:

  • OCR stands for Optical Character Recognition, which is just a fancy way of saying “a program that can look at an image and return any text within it.”
  • MangaDex is a site where fans upload unofficial translations of manga.
  • API stands for Application Programming Interface, but think of it like this: It’s a way for my app to “talk” to MangaDex’s server and fetch the data I need—like which manga chapters are available and where the images are stored.
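
To make that last one concrete, here's a minimal sketch of what that "talking" looks like in Python, using the same MangaDex search endpoint my app calls later in this post:

import requests

# Ask the MangaDex API for manga matching a title
response = requests.get(
    "https://api.mangadex.org/manga",
    params={"title": "Chainsaw Man"}
)

# The server replies with structured data (JSON) that we can pick apart
first_result = response.json()["data"][0]
print(first_result["id"])  # the manga's unique ID on MangaDex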

Why Did I Build It?

I’m a huge Chainsaw Man fan (if you haven’t read it, stop everything and dive in, especially if you’re a manga nerd!). A couple of months back, I came across a theory on Reddit about Chainsaw Man. Now, I won’t bore you with the details, but in order to confirm this theory, I needed to search through the manga for a specific phrase.

So, you’d think a logical person would just reread the manga, right? Well, where’s the fun in that when you can overcomplicate the solution? Instead, I figured, “Hey, I’ve been learning Python, why not build something cool?” That’s how Chainsaw-OCR was born. It started off messy—two weeks of work tossed out and rebuilt from the ground up—but you’ve gotta start somewhere, right?

Tech Stack and Structure

The project itself combines a bunch of moving parts that somehow work together, kinda like techy play doh. Here’s the breakdown of what I used:

  • Libraries: requests, sqlite3, os, pytesseract, ratelimit, tqdm, and pillow
    • requests: for making HTTP requests to the MangaDex API.
    • sqlite3: for database management.
    • os: to handle file paths.
    • pytesseract: does the heavy lifting for OCR, extracting text from manga pages.
    • ratelimit: limits the number of API calls (to avoid getting banned!).
    • tqdm: adds progress bars for long processes.
    • pillow: helps with image processing.
import requests
import sqlite3
import os
import pytesseract

from ratelimit import limits, sleep_and_retry
from tqdm import tqdm
from PIL import Image, ImageOps

# Root URL for all MangaDex API endpoints
BASE_URL = "https://api.mangadex.org"

Key Classes in Chainsaw-OCR

1. MangaDexRequests Class

This class is all about handling communication with the MangaDex API.

  • get_manga_data: Takes a manga title (like “Chainsaw Man”) and the language you’re interested in. It fetches the manga ID and pulls all the chapter data, filtering out chapters hosted on external sites (we just want the ones on MangaDex).
  • get_page_metadata: Grabs image URLs for a specific chapter, but there’s a catch—MangaDex limits API calls to 40 requests per 60 seconds. So, I added a 5-second buffer on top of that window, with sleep-and-retry logic, to stay safely under the limit.
  • download_url: Downloads the manga pages as .png files and saves them locally.
class MangaDexRequests:
    """Handles all interactions with the MangaDex API"""

    def __init__(self):
        self.chapter_data = {}
        self.page_links = []

    def get_manga_data(self, title, languages):
        try:
            # Retrieve all manga ids that correspond to title and save first
            # result
            manga_response = requests.get(
                f"{BASE_URL}/manga",
                params={"title": title}
            )
            manga_id = [manga["id"] for manga in
                        manga_response.json()["data"]][0]

            # Retrieve all chapters in given language in ascending order
            chapter_response = requests.get(
                f"{BASE_URL}/manga/{manga_id}/feed",
                params={"translatedLanguage[]": languages,
                        "order[chapter]": "asc"}
            )

            # Only save if chapter is hosted natively on MangaDex site
            chapter_ids = [_chapter["id"] for _chapter in
                           chapter_response.json()["data"] if
                           _chapter["attributes"]["externalUrl"] is None]
            attributes = [_chapter["attributes"] for _chapter in
                          chapter_response.json()["data"] if
                          _chapter["attributes"]["externalUrl"] is None]

            # Save result lists to dictionary
            self.chapter_data = {"id": chapter_ids, "attributes": attributes}
            return self.chapter_data
        except requests.exceptions.ConnectionError as e:
            print(e)

    # Add rate limit with extra 5-second buffer and retry after rest period
    @sleep_and_retry
    @limits(calls=40, period=65)
    def get_page_metadata(self, _chapter_id):
        try:
            metadata = requests.get(
                f"{BASE_URL}/at-home/server/{_chapter_id}"
            ).json()

            # Retrieve required fields to build image url
            base_url = metadata["baseUrl"]
            chapter_hash = metadata["chapter"]["hash"]
            chapter_data = metadata["chapter"]["data"]

            # Save results to list
            self.page_links = [f"{base_url}/data/{chapter_hash}/{page}" for
                               page in chapter_data]
            return self.page_links
        except requests.exceptions.ConnectionError as e:
            print(e)

    @staticmethod
    def download_url(image_url, filepath, filename):
        # Create filepath if non-existent
        if not os.path.exists(filepath):
            os.makedirs(filepath)

        image_response = requests.get(image_url)

        # If request is successful write image data to png file
        if image_response.status_code != 200:
            print(f"Failed to download image, status code: "
                  f"{image_response.status_code}")
        else:
            with open(f"{filepath}/{filename}.png", "wb") as file:
                file.write(image_response.content)

2. Database Class

The Database class manages all the chapter and page data.

  • create_table: Sets up tables in an SQLite database for storing chapter details.
  • insert_data: Inserts chapter info like numbers, titles, and links.
  • retrieve_data: Fetches that data when needed later on.
class Database:
    """Handles all database interactions"""

    def __init__(self, database_path):
        # Establish connection and cursor using given path
        self.connection = sqlite3.connect(f"{database_path}.db")
        self.cursor = self.connection.cursor()

    # Create table using given info and commit changes
    def create_table(self, name, columns):
        try:
            self.cursor.execute(f"CREATE TABLE {name} ({columns})")
            self.connection.commit()
            print(f"table {name} successfully created")
        except sqlite3.OperationalError as e:
            print(e)

    # Insert given info to table of choice and commit changes
    def insert_data(self, table, columns, data):
        try:
            # Add values according to number of given data
            data_count = len(data)
            values = ("?," * data_count).strip(",")

            self.cursor.execute(f"INSERT INTO {table} ({columns}) VALUES "
                                f"({values})", data)
            self.connection.commit()
        except sqlite3.OperationalError as e:
            print(e)

    # Retrieve relevant data from chosen table and columns
    def retrieve_data(self, table, columns):
        try:
            table_data = self.cursor.execute(f"SELECT {columns} FROM {table}")
            return table_data.fetchall()
        except sqlite3.OperationalError as e:
            print(e)

3. ImageReader Class

Welcome to the ImageReader class, where the magic happens! This class uses OCR to extract text from the manga pages.

  • scan_folder: Looks for .png files and stores the paths in a list.
  • extract_text: Opens each image, converts it to grayscale for better readability, and extracts the text using pytesseract.
  • store_text: Saves all this extracted text as .txt files organized by chapter and page number.
class ImageReader:
    """Handles Tesseract-OCR and file interactions"""

    def __init__(self):
        self.png_list = []

    def scan_folder(self, parent):
        # Iterate over files in parent directory for png files
        for file in os.listdir(parent):
            if file.endswith(".png"):
                self.png_list.append(f"{parent}/{file}")
            else:
                # Build the full path and recurse into subdirectories
                current_path = f"{parent}/{file}"
                if os.path.isdir(current_path):
                    self.scan_folder(parent=current_path)
        return self.png_list

    @staticmethod
    def extract_text(image_path, scale_factor=3):
        # Turn image greyscale to improve readability
        with Image.open(image_path, mode="r") as image:
            grey_image = ImageOps.grayscale(image)

            # Resize image to improve readability
            resized_image = grey_image.resize(
                (grey_image.width * scale_factor,
                 grey_image.height * scale_factor),
                resample=Image.Resampling.LANCZOS
            )
            # Extract text and store as string
            _extracted_text = pytesseract.image_to_string(resized_image)
        return _extracted_text

    @staticmethod
    def store_text(results, filepath):
        # Strip the .png extension and the download prefix from the path
        storage_path = ("text_results/"
                        + filepath.replace(".png", "")
                                  .replace("test_download/", ""))
        # Separate directories from filename
        head, sep, tail = storage_path.partition("page_")

        # Create subdirectories (head) if non-existent
        os.makedirs(head, exist_ok=True)

        # Rejoin storage_path elements and write results as txt file
        with open(f"{head}/{sep}{tail}.txt", "w") as file:
            file.write(results)

What’s Next?

At this stage, the next step is searching through the text files with regular expressions to find the specific phrase I’m after (there’s a rough sketch of that right after the main script below), and I’ll be sure to post a follow-up on the project in due time. But for now, here’s the main script that runs the whole operation, starting by fetching the manga data for Chainsaw Man in English. If the database doesn’t exist yet, it creates tables for storing the chapter data and image URLs. Then, it downloads each page, scans it, and extracts the text.

if __name__ == "__main__":

    mdx = MangaDexRequests()
    mdx.get_manga_data(title="chainsaw man", languages="en")

    test_directory = "test"
    db = Database(test_directory)
    if not os.path.exists(f"{test_directory}.db"):
        db.create_table(
            name="chapters",
            columns="volume_number INTEGER,"
                    "chapter_number INTEGER,"
                    "title TEXT,"
                    "chapter_id TEXT,"
                    "chapter_link TEXT"
        )

        # Iterate over chapter attributes and insert relevant data
        for index, chapter in enumerate(mdx.chapter_data["attributes"]):
            db.insert_data(
                table="chapters",
                columns="volume_number,"
                        "chapter_number,"
                        "title,"
                        "chapter_id,"
                        "chapter_link",
                data=(chapter["volume"],
                      chapter["chapter"],
                      chapter["title"],
                      mdx.chapter_data["id"][index],
                      f"https://mangadex.org/chapter/"
                      f"{mdx.chapter_data["id"][index]}")
            )

        chapters_db = db.retrieve_data(table="chapters", columns="*")

        db.create_table(
            name="page_links",
            columns="volume_number INTEGER,"
                    "chapter_number INTEGER,"
                    "title TEXT,"
                    "page_number INTEGER,"
                    "link TEXT"
        )

        for chapter in tqdm(chapters_db):
            volume_number = chapter[0]
            chapter_number = chapter[1]
            chapter_title = chapter[2]
            chapter_id = chapter[3]

            # Iterate over every page for each chapter and insert data
            for index, url in enumerate(mdx.get_page_metadata(chapter_id)):
                db.insert_data(
                    table="page_links",
                    columns="volume_number,"
                            "chapter_number,"
                            "title,"
                            "page_number,"
                            "link",
                    data=(volume_number, chapter_number, chapter_title,
                          index + 1, url)
                )

    # Retrieve page_links data and establish parent folder for downloads
    page_links_data = db.retrieve_data(table="page_links", columns="*")
    image_download_dir = "test_download"

    if not os.path.exists(image_download_dir):
        for page_link in tqdm(page_links_data):
            volume_number = page_link[0]
            chapter_number = page_link[1]
            chapter_title = page_link[2].replace(" ", "_").replace("/", "_")
            page_number = page_link[3]
            url = page_link[4]

            # Use retrieved values to create directories within download folder
            download_directory = (
                f"{image_download_dir}/volume_{volume_number}/"
                f"chapter_{chapter_number}-{chapter_title}"
            )
            # Download each page and save to respective directory
            mdx.download_url(
                image_url=url,
                filepath=f"{download_directory}",
                filename=f"page_{page_number}"
            )
    else:
        print(f"parent directory {image_download_dir} already exists")

    image_directory = "test_download"
    img = ImageReader()
    img.scan_folder(image_directory)
    image_list = img.png_list

    for png in tqdm(sorted(image_list)):
        png_text = img.extract_text(png)
        img.store_text(results=png_text, filepath=png)
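
Since I teased the regex search, here's a rough sketch of what that next step might look like. This isn't part of the project yet, so the helper name and the matching logic are placeholders; it's just the general shape of searching the text_results folder:

import os
import re

def search_text_files(root, phrase):
    """Hypothetical helper: find every .txt file under root containing phrase"""
    pattern = re.compile(phrase, re.IGNORECASE)
    matches = []
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".txt"):
                path = os.path.join(dirpath, filename)
                with open(path) as file:
                    if pattern.search(file.read()):
                        matches.append(path)
    return matches

# Example: list pages whose OCR text mentions a phrase
print(search_text_files("text_results", "chainsaw"))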

Wrapping Up

And there you have it! That’s what I’ve got so far on my Chainsaw-OCR project. I hope this blog post has been at least a little entertaining and informative. If you have any feedback or suggestions for optimizing the code—or even just random questions about the project—feel free to reach out! You can find the project’s GitHub page here, and my socials on the homepage of this site.

For now, I’m signing off! I’ve got some upcoming content, including details on my next coding project, so stay tuned. And hey, if you liked what you read, feel free to follow along on this journey—but no pressure, you’re your own person with free will and all.

Thanks for reading,
Oreng
