Update: You can benchmark yourself against the model’s exact predictions here: https://ai.conradkay.com/grade
Attempting this project as a web developer without a math or data science background was a long shot, but I ended up getting better results than I thought was possible.
Background
This is a comic graded and encapsulated by CGC, which was the first comic book grading service and remains the most popular.
There are some companies (AGS) which use AI to grade trading cards, but nothing has been used for comic books except asking chatbots.
After trying a few popular models I began to suspect they were basically spitting out the “average” grade, which gets lower as the comic’s age increases. To test that I gave them some very high-grade old comics, and a couple lower-grade modern comics.
Prompt: “As a professional CGC grader, use the provided scans of the front and back cover to assign a grade from 0.5 to 10.” Prompting the model to describe any defects before assigning a grade didn’t seem to help.
They barely did better than random guessing, and the explanations were very well-written but ultimately BS. For many tasks they have remarkable vision capabilities (see https://www.astralcodexten.com/p/testing-ais-geoguessr-genius), but this one doesn’t gel with their training.
Gathering Data
Getting (good) data is the hardest part of most real-world ML projects, though there are plenty of excellent datasets of all sizes available for free. The process of obtaining data depends so much on the task that I’d get too far in the weeds here. Essentially it was a lot of scraping sales listings.
For just CGC graded comics, I got ~2.1m pairs of images, or ~7TB of .jpg files which is rather annoying.
eBay listings have 0-12 images, each at most 1600x1600. Anyone can post on eBay, so the images are very diverse and unstructured. I’m scraping that data for now, but to be usable for training the images would need additional filtering and processing.
Exploratory Data Analysis (EDA)
Some things to note about grading:
- The front and back cover are equally important, but it’s rare for the interior pages to impact the grade
- Manufacturing defects are mostly ignored except for grades above 9.8, which are extremely rare
- Cheaper comics aren’t worth grading unless a high grade is expected (often 9.8 or bust)
- Older comics are much more scarce, but therefore expensive and more likely to be graded
I made a bunch of visualizations and reports to understand the data better. This heatmap is probably the most succinct.
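As a sketch of how the counts behind such a heatmap can be computed (the column names and values here are invented, not my actual data), pandas does it in a couple of lines:

```python
import pandas as pd

# Hypothetical listing data: one row per graded comic
df = pd.DataFrame({
    "year":  [1965, 1967, 1994, 1995, 2019, 2020, 2021],
    "grade": [4.0, 6.5, 9.2, 9.4, 9.8, 9.8, 9.6],
})

# Bucket years into decades, then count comics per (grade, decade) cell
df["decade"] = (df["year"] // 10) * 10
heat = pd.crosstab(df["grade"], df["decade"])
print(heat)
```

A single call like `seaborn.heatmap(heat)` then renders the grid.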
Designing a Dataset
There are a few unique and interesting constraints here.
A scoring-based approach seemed natural.
Each year grouping is treated as a separate dataset, and an initial pass just determines the size of each.
I made each year (range) a heap ordered by score, so the highest-scoring examples are selected first.
This maximizes image diversity, smooths the total distribution of grades, and limits the effectiveness of grading based on year or the specific comic.
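The per-year selection can be sketched with Python’s `heapq` (the field names and scores are invented); `heapq` is a min-heap, so scores are negated to make the best examples pop first:

```python
import heapq
from collections import defaultdict

# Hypothetical examples: (year_range, score, image_id)
examples = [
    ("1960s", 0.9, "a"), ("1960s", 0.4, "b"), ("1960s", 0.7, "c"),
    ("2010s", 0.8, "d"), ("2010s", 0.2, "e"),
]

# One heap per year range; negate scores so the highest pops first
heaps = defaultdict(list)
for year_range, score, image_id in examples:
    heapq.heappush(heaps[year_range], (-score, image_id))

def take(year_range, n):
    """Pop the n best-scored examples for one year range."""
    heap = heaps[year_range]
    return [heapq.heappop(heap)[1] for _ in range(min(n, len(heap)))]

print(take("1960s", 2))  # -> ['a', 'c']
```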
The validation set is created first from a random 10% of the data, so it can’t steal all the highest score examples. A better way than random splitting would be to make every data source (what site/seller) exclusive to either the training or validation set. That way it’s clear if the model doesn’t generalize to things like different camera usage, backgrounds, or scanner settings.
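A hash-based sketch of that source-exclusive split (the split fraction and field name are assumptions): hashing makes the assignment deterministic across runs, so a given seller can never leak images into both sets.

```python
import hashlib

def split_for(source_id: str, val_fraction: float = 0.1) -> str:
    """Assign an entire data source (site/seller) to train or val."""
    h = int(hashlib.sha256(source_id.encode()).hexdigest(), 16)
    return "val" if (h % 1000) / 1000 < val_fraction else "train"

# Every listing from the same seller lands in the same split
assert split_for("seller_123") == split_for("seller_123")
```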
I didn’t bother creating a test set since new data comes in fast enough I could just create one on the fly.
Image Pre-processing
224x224 is the most common resolution for computer vision models. Most of the combined front and back cover scans are roughly 4000x3000, or 239 times as many pixels.
I don’t have VC money to burn, so if I don’t want to be financially ruined the name of the game is to keep as much information as possible while using the least number of pixels.
If we give a model uncropped images of CGC comics, it’ll just cheat by looking at the big grade number, so it’s necessary to remove the label, and ideally the other sides, to remove meaningless pixels. Here’s what I tried:
- Just removing a fixed percent from each side works well enough as a baseline if images are cropped to the case already
- Non-ML methods using OpenCV often failed with several boundaries close together, or light being wonky passing through several layers of transparent plastic
- A custom algorithm to generate candidate edges from all 4 directions independently. Worked well when tuned for specific cases but didn’t generalize well to outliers in perspective, lighting, rotation, etc.
- A DETR object detection model. Handles slabbed (CGC) and ungraded comics. Fast, but requires manually labeling some images with bounding boxes. I found running label-studio locally the easiest out of the popular annotation tools, but I ended up building my own simple web app to do it. That was honestly less difficult since I had easier control over the input and output, and LLMs are good at trivial projects like that.
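The fixed-percentage baseline from the first bullet can be sketched in a few lines of NumPy (the 12% default is an arbitrary guess, not the value I used):

```python
import numpy as np

def crop_percent(img: np.ndarray, pct: float = 0.12) -> np.ndarray:
    """Trim a fixed fraction from each side of an H x W x C image.

    Crude, but a workable baseline for stripping the CGC label and
    case edges when images are already cropped to the slab.
    """
    h, w = img.shape[:2]
    dy, dx = int(h * pct), int(w * pct)
    return img[dy:h - dy, dx:w - dx]

img = np.zeros((1000, 800, 3), dtype=np.uint8)
print(crop_percent(img).shape)  # -> (760, 608, 3)
```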
Most of the grade impact is in the pixels at the edges of each cover, so the simplest approach is to only use the edges. Unfortunately there’s no easy way to remove the center of an image the way cropping removes the sides.
Doing this for both covers, there’s 2 sets of 4 images. Images are oriented to be vertical, scaled to the same height, and mirrored so that the edges are on the left.
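A NumPy sketch of that edge-strip extraction; the strip width and the exact orientation conventions are my guesses at the described layout:

```python
import numpy as np

def edge_strips(cover: np.ndarray, strip_frac: float = 0.15):
    """Cut the four edge strips of an H x W cover image and orient
    each one vertically with the outer edge as the leftmost column."""
    h, w = cover.shape[:2]
    sh, sw = int(h * strip_frac), int(w * strip_frac)
    left   = cover[:, :sw]                   # edge already on the left
    right  = cover[:, w - sw:][:, ::-1]      # mirror: edge to the left
    top    = np.rot90(cover[:sh, :], k=1)    # rotate: edge column lands left
    bottom = np.rot90(cover[h - sh:, :], k=-1)
    return [left, right, top, bottom]

cover = np.arange(20 * 10).reshape(20, 10)   # toy 20x10 grayscale "cover"
strips = edge_strips(cover, strip_frac=0.2)
print([s.shape for s in strips])  # -> [(20, 2), (20, 2), (10, 4), (10, 4)]
```

Scaling all four strips to a common height (and doing the same for the back cover) then yields the 2 sets of 4 images.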
Later I experimented with adding the centers, but proportionally smaller. Metrics didn’t improve as much as I would have liked, so maybe differing scales within an image is a bad idea.
Training
All of the models I used had already been pre-trained on millions of images. Even though most of what they learned won’t be useful, anything is better than starting from scratch.
timm is a fantastic library with implementations and weights for many different models.
The DINO training method (https://arxiv.org/abs/2104.14294) is really interesting, and blew basically all of the 20 or so architectures I tested out of the water.
dinov2-base (87m parameters) got 0.717 MAE keeping the model frozen and just training the final classifiers. That’s shockingly good.
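The freeze-the-backbone setup can be sketched like this; a tiny stand-in module replaces the real DINOv2 weights so the example runs without downloading anything:

```python
import torch
import torch.nn as nn

# Tiny stand-in backbone; in practice this would be a timm model, e.g.
# timm.create_model("vit_base_patch14_dinov2", pretrained=True, num_classes=0)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
head = nn.Linear(64, 1)  # head predicting the 0.5-10 grade

# Freeze the backbone: only the head's parameters get gradients
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
x = torch.randn(4, 3, 32, 32)
target = torch.tensor([9.8, 9.4, 6.5, 2.0])

with torch.no_grad():              # frozen features need no graph
    feats = backbone(x)
loss = nn.functional.l1_loss(head(feats).squeeze(1), target)
loss.backward()
optimizer.step()
```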
Benchmark
The simplest metric I’m using is mean absolute error (MAE), basically how far off a guess is on average on the 0.5-10 scale.
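Concretely, MAE is just:

```python
def mae(preds, grades):
    """Mean absolute error on the 0.5-10 grading scale."""
    return sum(abs(p - g) for p, g in zip(preds, grades)) / len(preds)

print(mae([9.8, 8.0, 4.5], [9.4, 8.0, 5.0]))  # ≈ 0.3
```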
To measure human performance, I used posts from https://boards.cgccomics.com/forum/42-hey-buddy-can-you-spare-a-grade/ where they say the official grade once it comes back from CGC. The images tend to be very high quality with a lot of different angles. While there might be a bias towards people posting harder to grade comics, the validation set is very biased towards difficult/outlier examples. Across 34 posts there were 151 predictions, with an MAE of 0.80. Using the average guess for each post results in a 0.74 MAE.
On the validation set, I graded 50 examples myself using full resolution scans, and got an MAE of 0.55.
Even dinov2_small with its 22 million parameters beat me.
It didn’t really feel like there was a clear weakness. Analyzing the largest differences between the model’s prediction and the official grade, I found myself usually siding with the model, meaning there were just defects not visible from the 2 images.
Funnily enough, a lot of them were just due to egregious errors in the official grades.
Grad-CAM
Generating these visualizations is fairly simple, in both code and computation:
- We force PyTorch to do some calculus measuring how much each region impacts the probability it gives for a specific output
- That gets normalized and reshaped into a grayscale image matching our input image’s dimensions
- OpenCV makes it easy to use a colormap (Turbo in this case) so that instead of displaying from black to white, it shows blue to yellow to red
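The three steps above can be sketched in PyTorch on a toy CNN (a real ViT needs its blocks hooked differently; all the shapes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy CNN standing in for the real model
conv = nn.Conv2d(3, 8, 3, padding=1)
classifier = nn.Linear(8, 11)           # e.g. one logit per grade bucket

x = torch.randn(1, 3, 64, 64)
acts = conv(x)                          # feature maps: the "regions"
acts.retain_grad()                      # keep gradients on this non-leaf tensor
logits = classifier(acts.mean(dim=(2, 3)))

# 1. Differentiate one class probability w.r.t. the feature maps
logits.softmax(dim=1)[0, 5].backward()

# 2. Weight each channel by its mean gradient, keep positive evidence
weights = acts.grad.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * acts).sum(dim=1))[0].detach()

# 3. Normalize to [0, 1]: a grayscale map matching the input size here
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
# cv2.applyColorMap(np.uint8(255 * cam.numpy()), cv2.COLORMAP_TURBO) colorizes it
```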
These give a way to basically check the model’s work. If it misses anything that seems significant, the grade is probably lower than it predicts. The model paying attention to things which aren’t defects is more nuanced. For flawless comics it might focus the most on sharp corners.
Multi-task Learning
Right now the model doesn’t use any language data, but it would be nice if it could describe the defects like multimodal models do. Luckily there’s a number on each comic which can be used to visit a web page that includes information about the comic and grade, and has “grading notes” a bit less than half the time.
Usually there’s 1-4 defects listed, which follow a rough format:
- light spine stress lines to cover
- multiple moderate crease back cover
- spine stress lines breaks color
In my sample there are ~20k unique descriptions, with 80% of them occurring fewer than 5 times.
For now I’m ignoring positional information (like “top right of back cover”) since that would increase the size by >50x, and Grad-CAM can supplement that information in the results.
I extracted 36 unique and common defects, and converted words measuring the impact (severity, size, frequency) to a number (0-1).
I did the same thing with the restoration info, while page color and year were more straightforward.
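A toy sketch of that conversion; the severity weights and defect vocabulary here are invented, and the real mapping covers all 36 defects:

```python
# Hypothetical vocabularies (the real ones are larger)
SEVERITY = {"light": 0.25, "moderate": 0.5, "heavy": 0.75, "multiple": 0.75}
DEFECTS = ["spine stress", "crease", "tear", "color break"]

def parse_note(note: str) -> dict:
    """Turn one grading note into {defect: severity in [0, 1]}."""
    note = note.lower()
    # Default to a middle severity when no qualifier word is present
    severity = max((v for k, v in SEVERITY.items() if k in note), default=0.5)
    return {d: severity for d in DEFECTS if d in note}

print(parse_note("light spine stress lines to cover"))
# -> {'spine stress': 0.25}
```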
I wouldn’t say it’s intuitive to add these tasks to the existing model, which focuses on grading. With most physical systems (a car, for example), they’re either specialized and good at just a few things, or general and decent at many things. Deep learning models are closer to monsters than machines, consuming as much data as possible, and more outputs means more data.
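Structurally, multi-task learning is just extra heads on a shared backbone with the per-task losses summed; every dimension and weight in this sketch is illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskGrader(nn.Module):
    """Shared features feeding one small head per task (sketch)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim))
        self.grade_head = nn.Linear(feat_dim, 1)     # regression: 0.5-10
        self.defect_head = nn.Linear(feat_dim, 36)   # multi-label severities
        self.year_head = nn.Linear(feat_dim, 1)      # auxiliary target

    def forward(self, x):
        f = self.backbone(x)
        return self.grade_head(f), self.defect_head(f), self.year_head(f)

model = MultiTaskGrader()
grade, defects, year = model(torch.randn(2, 3, 32, 32))

# Weighted sum of per-task losses; the 0.5 weight is arbitrary
loss = (nn.functional.l1_loss(grade.squeeze(1), torch.tensor([9.4, 6.5]))
        + 0.5 * nn.functional.binary_cross_entropy_with_logits(
            defects, torch.rand(2, 36)))
loss.backward()
```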
Entire Process
```mermaid
flowchart TD
    SCR[Scrape Marketplaces] --> DB[Database]
    DB --> DL[Download Images]
    DL --> ANCR[Manually Crop]
    ANCR --> TRCR[Train DETR Crop Model]
    TRCR --> TSFM
    DL --> OCR("Certificate # from OCR Model")
    OCR --> SCR2("Scrape Grader's Registry (Restoration, Defects)")
    SCR2 --> DB
    DB --> SORT["Canonization, Filtering, and Scoring"]
    %% Splitting
    SORT --> DS["Create Validation Set (10k) then Training Sets (10k-1m)"]
    DS --> EDA["EDA (Exploratory Data Analysis)"]
    DS --> TSFM[Generate Transformed Images at Multiple Resolutions]
    TSFM --> TR[Train Model]
    TR --> VIS["Biggest Losses<br/>Bias / Skew<br/>Grad-CAM Visualizations"]
```
v2.0 Ideas
Takeaways
For everything I eventually figured out, there was a lot of trial and error: looking at data, analyzing failures, and getting stuck or confused.
The potentially huge advantage you can have over someone much more experienced in ML is specific knowledge of some domain, or unique access to data. Maybe nothing immediately comes to mind, but perhaps you know someone, or having the proverbial “hammer” will make some “nail” stick out in the future.
Resources I found useful
I made this blog with Quarto, which was easy to set up and seems very capable
Python for Data Analysis by Wes McKinney (the creator of pandas)
Technical Details
Probably skip this part if you don’t have PyTorch or ML experience
I found the FastAI defaults (which haven’t changed much in 7 years) really difficult to beat. There’s quite a bit of randomness in training, so even if I could get a slight improvement it’s hard to say whether it was just luck.