I think it’s safe to say that AI is pretty hot right now, what with things like ChatGPT taking the world by storm and the myriad of AI art generators pissing artists off. It’s an area I’ve dabbled in before, but usually with some wacky concept like getting a bunch of colored circles to learn how to evolve and survive in some simulated environment. I’ve never really taken the time to explore the types of AI that have a real-world use to them, and it’s starting to feel more and more urgent that I actually do.
So that’s what I’ve decided to do with my next little personal project.
One of my favorite nerdy math YouTube channels, 3Blue1Brown, recently posted a video titled “But what is a convolution?” that really inspired me. I had seen the term convolution before when I was playing around with AI, but I never quite understood exactly what it was or how to use it. After watching the video, it clicked, and I was instantly curious to start exploring all that convolutional neural nets (CNNs) have to offer.
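To give a quick sense of the operation itself (this is just my own bare-bones sketch, nothing from the video): a convolution slides a small grid of numbers, the kernel, across the image, and at every position multiplies the overlapping values together and sums them up. Different kernels pick out different features, like edges.

```typescript
// Minimal 2D convolution sketch: slide a small kernel over a greyscale
// image (a 2D array of brightness values) and sum the element-wise
// products at every position. "Valid" padding, so the output shrinks.
// (CNN libraries technically skip flipping the kernel, but the idea is the same.)
function convolve2d(image: number[][], kernel: number[][]): number[][] {
  const kh = kernel.length;
  const kw = kernel[0].length;
  const outH = image.length - kh + 1;
  const outW = image[0].length - kw + 1;
  const out: number[][] = [];

  for (let y = 0; y < outH; y++) {
    const row: number[] = [];
    for (let x = 0; x < outW; x++) {
      let sum = 0;
      for (let ky = 0; ky < kh; ky++) {
        for (let kx = 0; kx < kw; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// A classic vertical-edge-detection kernel: bright-to-dark transitions
// from left to right produce large responses.
const edgeKernel = [
  [1, 0, -1],
  [1, 0, -1],
  [1, 0, -1],
];
// convolve2d(pixels, edgeKernel) would then highlight vertical edges in `pixels`.
```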
After doing some additional research into how CNNs actually work and are built, I finally had a decent little project idea. One of the things I found particularly fascinating in that research was how the convolutional layers process images and, over the course of training, learn to identify the important features they need to make a classification. I thought it would be cool to build something that would let me get visual snapshots of what’s going on inside the network, to “see” through the eyes of the AI if you will, as it goes about its learning. As for the model itself, it would just be a fairly simple classification model (simple being a very relative term here), which seems to be the “Hello World!” of learning CNNs.
The Dataset
Feeling pretty good about the achievability of the project I had set out for myself, it was time to go searching for a data set. Luckily, there is a whole wide world of free and open data sets out there to choose from; all I had to do was narrow down what I wanted. I spent some time browsing around a site called Kaggle, and relatively quickly settled on a particular data set that was calling out to me: a cache of 30k images of cats and dogs, 150×150, greyscale.
I liked this data set for a variety of reasons, the most personal being that it’s cats and dogs, and who wouldn’t enjoy looking at those? But it also seemed quite good from a practical standpoint. I liked that all the images were uniform in size, and even better, that they were square. From my initial research, I knew that CNNs can be built to handle images of varying sizes, but I thought it best to keep things consistent for my initial learning approach. Whenever I’m learning something new, I prefer to keep the variables I have to take into consideration to a minimum. The other thing I particularly liked about this data set is that all the images are in greyscale. While it wouldn’t add too much complexity to handle color images, greyscale lets me keep the number of input nodes to the network at a third of what color would need (color needs 3 values per pixel, one each for R, G, and B, while greyscale needs only 1). For these 150×150 images, that works out to 22,500 inputs instead of 67,500. This was important because I’m planning to run this model in a browser, not famously known for being the most efficient environment to run code in, so fewer nodes in the network seemed like a good idea.
With a freshly downloaded data set in hand, I was up against my first challenge. I didn’t even want to entertain the idea of loading 30k separate images into a browser, not to mention there was absolutely no naming convention to the files provided. I needed a way to get these images into a more workable format. For that, I turned to a tried-and-true solution I was familiar with from my various forays into game development: the sprite sheet. Basically, you stitch a bunch of different images together into one larger image, in such a way that you can predictably sample from it to get the individual images back out. Sprite sheets also aren’t uncommon in front-end development for efficiency reasons, but my usual line of work doesn’t typically deal with many images.
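The nice part about uniformly sized images is that “predictably sample” boils down to a bit of index arithmetic. Roughly speaking (the exact grid width depends on how the sheet ends up being generated):

```typescript
// Given a sprite index, work out where that image lives on the sheet.
// Assumes every cell is the same size and the sheet is a simple grid.
const CELL_SIZE = 150; // each image is 150x150
const GRID_COLS = 70;  // images per row on the sheet (illustrative)

function spriteRect(index: number) {
  const col = index % GRID_COLS;
  const row = Math.floor(index / GRID_COLS);
  return { x: col * CELL_SIZE, y: row * CELL_SIZE, w: CELL_SIZE, h: CELL_SIZE };
}

// Drawing a single sprite onto a canvas is then one drawImage call:
function drawSprite(ctx: CanvasRenderingContext2D, sheet: HTMLImageElement, index: number) {
  const { x, y, w, h } = spriteRect(index);
  ctx.drawImage(sheet, x, y, w, h, 0, 0, w, h);
}
```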
Here’s where I really hit my first roadblock. You see, there are quite a few generators already out there that can read in the images you feed them and give you back a sprite sheet. The problem is, none of them were built for the kind of image load I was throwing at them. The initial ones I tried, all browser-based, would just give up and spit back a completely blank image. I found a few CLI tools to try, but they were badly outdated, and due to dependency issues I couldn’t even get them to run.
All of this was quite disappointing. I had been excited to get a start on this project, and here I was thinking I would probably have to spend a good portion of my time writing my own sprite sheet generator. While I don’t believe it would have been too difficult, it always stings when you feel like you have to reinvent the wheel to make progress, and it would delay my ability to actually get to the code I was excited to write. I decided to give it one more search to see if I had missed anything, and I’m glad I did.
Tucked away on something like the 3rd or 4th page of Google results was a little tool called Spritesmith. I wasn’t very confident it would fare any better than the other tools I had tried, but I figured it was better to try than not. Lo and behold, it installed without any dependency issues and it actually worked! It still struggled to handle the metric fuck-ton of images I was trying to feed it, but it was doing better than anything else so far.
Here I made a decision that, honestly, I wonder if I should have made before going through all this trouble. If I couldn’t have a single sprite sheet (or rather 2, one for the cats and one for the dogs), I could at least try to break it down into a few. With ~15k images for each animal, it seemed natural to break that down into 3 groups of 5,000 images each. I gave that a try, and it worked! But it left some straggling images off to one side that I didn’t want to have to write code around, so I figured it would be better if the sprite sheet itself was also a square. Dropping to 4,900 images per sheet (4,900 being a perfect square, unlike 5,000) gave me a tidy 70×70 grid of images to work with. I wouldn’t be leveraging the full data set, but I can live with that.
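I won’t reproduce my exact script here, but the general shape of a Spritesmith run looks something like this (paths and filenames are just illustrative; the important part is chunking the image list into groups of 4,900 per sheet):

```typescript
// Rough sketch of generating the sheets with Spritesmith's Node API.
// Paths and filenames are illustrative; the key idea is chunking the
// image list into groups of 4,900 (a 70x70 grid's worth) per sheet.
import fs from 'fs';
import path from 'path';
import Spritesmith from 'spritesmith';

const IMAGES_DIR = './data/cats'; // folder of individual 150x150 images
const CHUNK_SIZE = 4900;          // 70 x 70 images per sheet

const files = fs
  .readdirSync(IMAGES_DIR)
  .map((name) => path.join(IMAGES_DIR, name));

for (let i = 0; i * CHUNK_SIZE < files.length; i++) {
  const chunk = files.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
  if (chunk.length < CHUNK_SIZE) break; // skip the stragglers at the end

  Spritesmith.run({ src: chunk }, (err, result) => {
    if (err) throw err;
    // result.image is a binary buffer of the combined sheet
    fs.writeFileSync(`cats-${i}.png`, result.image);
  });
}
```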
I did end up making an error in this process that would prove quite important later, when I was trying to actually push things up to GitHub: I was exporting the sprite sheets as PNGs. That’s my typical go-to format whenever I’m working with images, so I didn’t put much thought into it at the time. Turns out, the JPG format’s lossy compression produces SIGNIFICANTLY smaller files, which, when you’re working in the realm of tens of thousands of images, REALLY matters. Later, when I realized this mistake and went back to fix it, I was able to reduce the size of the final sprite images from ~200MB each down to ~16MB.
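If you ever need to do the same, the fix is a quick batch re-encode. Here’s a rough sketch of the idea using the sharp library (just one of many tools that could do it, and the quality setting is whatever you can tolerate):

```typescript
// Re-encode the PNG sheets as JPGs to shrink them dramatically.
// Sketch uses the sharp library; the filenames are illustrative.
import sharp from 'sharp';

const sheets = ['cats-0', 'cats-1', 'cats-2', 'dogs-0', 'dogs-1', 'dogs-2'];

for (const name of sheets) {
  sharp(`${name}.png`)
    .jpeg({ quality: 80 }) // lossy, but fine for training images
    .toFile(`${name}.jpg`)
    .then((info) => console.log(`${name}.jpg: ${info.size} bytes`));
}
```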
The Code
I was quite a few hours into this project by now, and I had yet to write a single line of actual code. The process gave me a huge amount of respect and appreciation for the people who do this at much larger scales than what I was working with. I can only imagine the kinds of challenges they have to deal with.
But here I was, finally with my data set in a format I could work with. Eager to write some code, I spun up an initial project using the create-react-app tool. I always like working in React for personal projects because it’s very good at getting you up and running and writing the code you actually care about, rather than spending a bunch of time on setup and boilerplate.
I installed the TensorFlow.js library. I’ve used this library on past AI projects and feel pretty comfortable with it. In fact, it was in this library that I first encountered the concept of convolution, so I already knew it had what I needed for this project built in. That was a plus.
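I haven’t actually designed the model yet, but to give a sense of what TensorFlow.js provides out of the box, a bare-bones cats-vs-dogs classifier might look something like this (the layer counts and sizes here are placeholders, not the architecture I’ll end up with):

```typescript
// Bare-bones CNN classifier sketch in TensorFlow.js. Layer counts and
// sizes are placeholders, not the model I'll actually build.
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();

// Convolutional layers learn small filters that get slid across the image.
model.add(tf.layers.conv2d({
  inputShape: [150, 150, 1], // 150x150 greyscale, single channel
  filters: 8,
  kernelSize: 3,
  activation: 'relu',
}));
model.add(tf.layers.maxPooling2d({ poolSize: 2 }));

model.add(tf.layers.conv2d({ filters: 16, kernelSize: 3, activation: 'relu' }));
model.add(tf.layers.maxPooling2d({ poolSize: 2 }));

// Flatten the feature maps and classify: a single output, cat vs. dog.
model.add(tf.layers.flatten());
model.add(tf.layers.dense({ units: 1, activation: 'sigmoid' }));

model.compile({
  optimizer: 'adam',
  loss: 'binaryCrossentropy',
  metrics: ['accuracy'],
});
```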
Needing to brush up a bit on my React (it’s been a minute since I’ve worked with it), I figured a good first step would be to get something in place that would let me browse through this huge data set I had. Doing this would help me figure out how I was going to slice up my sprite images in order to feed them into my eventual model, while also giving me a fun little tool I could use to just look at all the various images I had downloaded. Again, who doesn’t enjoy looking at pictures of cats and dogs?
In really no time at all, I was feeling comfortable with React again. I was already seeing some payoffs from the earlier decisions I made around choosing the data set I wanted to work with, and setting my sprite sheets up as perfect squares. It really simplified a lot of the logic I had to write around slicing those images up.
Here was my end result:
The code as it stands can be found on my GitHub, and thanks to GitHub Pages, you can see it in action here. (Disclaimer: All this is subject to change as I continue this project, so if you’re coming by this post later, things might look very different.)
What’s Next?
With all this initial work done, I’m feeling ready to begin designing and building the actual model that will ultimately be the key piece of this project. I’m expecting it to be both challenging and fun, especially as I try to figure out how I can achieve my goal of being able to see through the model’s eyes at its various layers. I already have some ideas on how to do this, and I’m quite excited to have those torn asunder by reality.
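One of those ideas, very roughly sketched and completely untested: TensorFlow.js lets you build a second model that reuses the trained model’s layers but outputs an intermediate layer’s activations, which could then be normalized and drawn to a canvas.

```typescript
// Sketch of peeking at an intermediate conv layer's activations in
// TensorFlow.js so they can be rendered as images. Untested idea so far.
import * as tf from '@tensorflow/tfjs';

function activationModel(model: tf.LayersModel, layerName: string): tf.LayersModel {
  // A new model that reuses the trained weights but stops at the chosen layer.
  const layer = model.getLayer(layerName);
  return tf.model({ inputs: model.inputs, outputs: layer.output });
}

async function renderFirstFilter(
  model: tf.LayersModel,
  layerName: string,
  input: tf.Tensor4D, // [1, 150, 150, 1] greyscale image
  canvas: HTMLCanvasElement
) {
  const peek = activationModel(model, layerName);
  const activations = peek.predict(input) as tf.Tensor4D; // [1, h, w, filters]

  // Pull out one filter's feature map and normalize it to 0..1 for display.
  const featureMap = activations
    .slice([0, 0, 0, 0], [1, -1, -1, 1])
    .squeeze<tf.Tensor2D>([0, 3]);
  const min = featureMap.min();
  const max = featureMap.max();
  const normalized = featureMap.sub(min).div(max.sub(min).add(1e-8)) as tf.Tensor2D;

  await tf.browser.toPixels(normalized, canvas);
}
```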
More to come soon.