Wondering How Nam Do San (Start-Up) Created NoonGils? — Basic Image Recognition
I am currently watching the ongoing Korean drama Start-Up, starring Bae Suzy as Seo Dal Mi, Nam Joo Hyuk as Nam Do San, and Kim Seon Ho as Han Ji Pyeong. Like, who isn't?! The drama is about business and technology (and a love triangle). Long story short, Dal Mi is recruited by Do San's team, Samsan Tech, to build a startup that works on image recognition with artificial intelligence. They make a mobile app named NoonGils to help blind people in their daily lives. The basic idea is that although blind people can't see, they can hear. First things first: as we see in the drama, NoonGils uses the phone camera to see objects. The camera stands in for the user's eyes when they point it in the direction they want to see. The NoonGils system acts like our brain: it detects and recognizes the object and turns the result into sound, so blind people know what they are pointing at. This is actually video processing, but a video is just a sequence of images (frames) played one after another to create the impression of movement. The recognition process is almost the same; the video only needs to be split into individual images first.
To understand more about what's inside NoonGils, let me explain a little about image processing.
What is an image?
According to dictionary.com, an image is a physical likeness or representation of a person, animal, or thing, photographed, painted, sculptured, or otherwise made visible. A digital image, on the other hand, is a collection of pixels. Ever seen a camera specification like 12 megapixels? That means an image taken with that camera contains about 12 million pixels. In general, more pixels mean a higher resolution, so the image can hold more detail.
Each pixel in the image contains RGB values. In the drama, Samsan Tech quite often mentions RGB, which stands for the color values red, green, and blue. We process these RGB values to retrieve information from the whole image. That information can be the name of the object in the image, the number of objects, the separation between object and background, and so on.
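To make the "collection of pixels" idea concrete, here is a minimal sketch in plain Python. The tiny 2x3 image and the simple channel-average grayscale are my own illustration, not something shown in the drama, and the 4000 x 3000 resolution is just one common shape for a 12-megapixel photo.

```python
# A tiny 2x3 "image": a list of rows, each row a list of (R, G, B) pixels.
# Every channel value is an integer from 0 to 255.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 255), (0, 0, 0), (128, 128, 128)],
]

height = len(image)
width = len(image[0])
pixel_count = width * height


def to_grayscale(img):
    """Average the three channels of every pixel (simple, unweighted gray)."""
    return [[sum(px) // 3 for px in row] for row in img]


gray = to_grayscale(image)

# Assumed example resolution: 4000 x 3000 pixels is 12 million pixels.
megapixels = 4000 * 3000 / 1_000_000
```

Many of the later steps work on a single gray value per pixel like this, because it is simpler than juggling three channels.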
Main Process
The main process is divided into two parts: object detection and object recognition. Before recognizing an object, the system needs to locate where the object is; this is called object detection. Object recognition is the process of working out what the object is, for example its name.
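As a very rough picture of what "locating the object" means, the sketch below finds a bounding box around the object pixels. It assumes we already have a binary mask where 1 marks the object and 0 marks the background, and that there is only one object. Real detectors are far more sophisticated; the function name and the mask are made up for illustration.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero pixels, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


# Hypothetical mask: 1 = object pixel, 0 = background pixel.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = bounding_box(mask)
```

The recognition part would then only need to look at the pixels inside that box.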
Why do objects in the image need to be classified?
We know that Halmeoni needs to read the numbers on the automatic door lock so she can enter the passcode, but her eye disease prevents her. With NoonGils, when Halmeoni points her camera at the keypad, she knows which number she is pointing at. The system needs to tell number 1 from number 2, number 3, and so on by classifying them.
Let's make it easy. To classify an image, several steps are required. We start by handling the training data used for the learning process.
Step 1 — Datasets
In the early episodes, we see Do San and the team building a desktop-based object recognition app. It misclassifies when Do San points the camera at his dad and the system says he is a “toilet”. Why does this happen? Do San says it is because of the low light, so the image contains a lot of noise. Misclassification can also happen when there is not enough training data. The face images Do San used for training may not be varied enough, or maybe he used his dad's photos and labeled them as ‘toilet’, huft. I would not take this too seriously because the scene is really just for fun.
The dataset plays an important role in image classification. It consists, of course, of images of the objects. If we want to tell a mango from a watermelon, we need images of mangoes and watermelons. Do San needs a really huge amount of data so the machine learning model can learn from it by identifying the common features related to each object. NoonGils does not just distinguish cats from dogs; amazingly, it can recognize almost everything, even distinguish medicine pills. Wondering how much data Do San uses? It would be a huuuuuugggggeeeeee dataset.
More data, more accuracy? Based on my research surfing the internet, not always, but usually yes, as long as the additional data is representative of what the system will see in real use.
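A labeled dataset can be pictured as a list of (image, label) pairs. The sketch below uses made-up file names for the mango and watermelon example, counts how many images each class has, and sets some images aside for testing later. The function names and the 80/20 split are my own assumptions, not anything the drama specifies.

```python
import random
from collections import Counter

# Hypothetical dataset: (file name, label) pairs. The file names are made up.
dataset = [(f"mango_{i}.jpg", "mango") for i in range(6)] + [
    (f"watermelon_{i}.jpg", "watermelon") for i in range(4)
]


def class_counts(samples):
    """Count how many samples each label has, to spot an unbalanced dataset."""
    return Counter(label for _, label in samples)


def train_test_split(samples, test_ratio, seed=0):
    """Shuffle a copy of the samples and return (train, test) lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


counts = class_counts(dataset)
train, test = train_test_split(dataset, test_ratio=0.2)
```

Checking the counts per class is a cheap way to notice that one object has far fewer examples than the others.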
Step 2 — Pre-processing Images
The goal of this step is to improve the data so the model can reach the best accuracy. Nam Do San once said that misclassification can be caused by noise. Noise must be reduced to enhance image quality. Beyond that, the pre-processing step includes:
Image Resize
Images may come in various sizes. The training images should be resized so that all of them have a uniform size. Resizing can also shrink the images to reduce loading and processing time.
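Here is a minimal sketch of resizing with nearest-neighbor sampling, which is probably the simplest resize method there is. It works on a grayscale image stored as a list of rows. In practice you would use an image library, which also offers smoother interpolation methods; this version is only meant to show the idea.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a grayscale image by copying the nearest source pixel."""
    old_h, old_w = len(img), len(img[0])
    return [
        [img[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


img = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
small = resize_nearest(img, 2, 2)  # shrink 4x4 to 2x2
big = resize_nearest([[1, 2], [3, 4]], 4, 4)  # enlarge 2x2 to 4x4
```

Running every training image through the same function with the same target size gives the uniform size the model needs.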
Image Enhancement
This step is especially important if the data contains a lot of noise. Noise can be reduced with smoothing filters such as Gaussian blur, median blur, and many others. Besides handling noise, we pick a method that fits the image's quality problem: contrast stretching if the image has low contrast, brightening if it is way too dark, and transformation, slicing, and other methods for other problems. The challenge here is that not every image has the same problem; one may be too bright while another is too dark.
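Two of the methods above can be sketched in a few lines of plain Python: a 3x3 median blur for noise and a contrast stretch for low-contrast images. Both work on grayscale values from 0 to 255. Leaving the border pixels untouched in the median filter is a simplification I chose to keep the code short.

```python
from statistics import median


def median_filter(img):
    """3x3 median blur; border pixels are left unchanged for simplicity."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = [img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = int(median(window))
    return out


def contrast_stretch(img):
    """Linearly rescale gray values so the darkest is 0 and the brightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image, nothing to stretch
        return [row[:] for row in img]
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in img]


# One bright "salt" noise pixel in the middle of a flat gray area.
noisy = [
    [100, 100, 100],
    [100, 255, 100],
    [100, 100, 100],
]
denoised = median_filter(noisy)

low_contrast = [[100, 110], [120, 150]]
stretched = contrast_stretch(low_contrast)
```

The median filter removes the single noisy pixel because the median of its neighborhood ignores extreme values, which is why it is popular for this kind of noise.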
Segmentation
To classify the object, we need to know where it is. When Halmeoni uses NoonGils and points the camera at Dal Mi, NoonGils detects that there is someone in front of her. Segmentation separates the object from its background. Once they are separated, the object is easier to detect. One of the most popular segmentation methods is Otsu thresholding. It picks a threshold automatically and converts a grayscale image into a binary (black and white) image, so the object and the background end up with different colors, one black and one white.
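Otsu thresholding can be sketched as follows: try every possible threshold and keep the one that best separates the gray values into two groups (it maximizes the between-class variance). This is a plain-Python sketch for a small grayscale image with values from 0 to 255. The sample image is made up, and treating the brighter group as the object is my own assumption.

```python
def otsu_threshold(gray):
    """Return the threshold t that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(v * n for v, n in enumerate(hist))

    best_t, best_score = 0, -1.0
    count_bg, sum_bg = 0, 0
    for t in range(256):
        count_bg += hist[t]  # pixels with value <= t
        sum_bg += t * hist[t]
        count_fg = total - count_bg  # pixels with value > t
        if count_bg == 0 or count_fg == 0:
            continue
        mean_bg = sum_bg / count_bg
        mean_fg = (sum_all - sum_bg) / count_fg
        # Proportional to the between-class variance (counts instead of
        # probabilities), so the best threshold is the same.
        score = count_bg * count_fg * (mean_bg - mean_fg) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(gray, t):
    """1 for pixels brighter than t (assumed object), 0 for the rest."""
    return [[1 if v > t else 0 for v in row] for row in gray]


gray = [
    [20, 22, 200],
    [21, 210, 205],
    [19, 23, 198],
]
t = otsu_threshold(gray)
mask = binarize(gray, t)
```

Whether the object ends up white or black depends on whether it is brighter or darker than the background, so the mask may need to be inverted for some images.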
Step 3 — Feature Extraction
In this step, we find the features of the images we want to recognize. A feature is a distinctive property used to differentiate objects. The features of an image can be its color, texture, or shape. We can tell a cat from a dog by looking at their shape. Or we can look at the surfaces of two objects and distinguish them by texture. We can also judge the doneness of bread by its color.
How do we get the features? By computing them from the pixel values of the images in the dataset. The calculation can use various methods depending on which features we want. We can also combine features, for example when recognizing fruits. Each kind of fruit has a different color and shape, so we calculate the color and shape features and use them as the input of the classification model. The drama once mentions LBP and RCNN. LBP (Local Binary Pattern) is a feature extraction method used to describe texture. RCNN (Region-based Convolutional Neural Network) is a method usually used for object detection. It proposes candidate regions in the image and uses CNN features to classify each region. The CNN features come from passing each region through convolutional layers, which slide small filters over the pixels.
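To show how a texture feature can be computed from pixel values, here is a minimal sketch of the basic 3x3 LBP. Each pixel is compared with its 8 neighbors, every comparison gives one bit, and the histogram of the resulting codes is the texture feature. The clockwise bit order starting at the top-left neighbor is my own choice; implementations differ on that detail.

```python
def lbp_code(gray, r, c):
    """8-neighbor LBP code (0..255) for the pixel at row r, column c."""
    center = gray[r][c]
    # Neighbors clockwise, starting from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if gray[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(gray):
    """Histogram of LBP codes over all non-border pixels: the texture feature."""
    hist = [0] * 256
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            hist[lbp_code(gray, r, c)] += 1
    return hist


patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 8, 3],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
```

Two images with a similar texture should produce similar histograms, and that histogram is what gets handed to the classifier.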
Step 4 — Classification
Classification refers to the task of assigning an image to one of a set of classes. Using the features, the system picks the label that best matches them. In this stage, Nam Do San defines the input and the output. The input is the features of the test image. The output is the label, that is, the name of the object we want to recognize. We also need to choose the classification method. Several methods can be tried and compared to find the one that yields the best accuracy. The choice of model and hyperparameters can greatly affect the classification result. After Samsan Tech's system failed to detect the forged handwriting in Injae's dataset, Do San tried other hyperparameter values (hyperparameter tuning) to fix the system. In this step, we can use deep learning neural networks such as CNNs to classify the images, or simpler algorithms such as Decision Tree and K-Nearest Neighbor.
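As an example of one of the simpler algorithms, here is a minimal K-Nearest Neighbor sketch. The input is a feature vector and the output is a label, exactly as described above. The three features (mean red, mean green, roundness) and all the numbers are made up for illustration, and k is an example of a hyperparameter you could tune.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: a feature_vector."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Take the k training samples closest to the query and let them vote.
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (mean red, mean green, roundness), each scaled 0..1.
train = [
    ((0.90, 0.60, 0.50), "mango"),
    ((0.80, 0.70, 0.40), "mango"),
    ((0.85, 0.65, 0.45), "mango"),
    ((0.20, 0.70, 0.90), "watermelon"),
    ((0.30, 0.80, 0.95), "watermelon"),
    ((0.25, 0.75, 0.85), "watermelon"),
]

pred_a = knn_predict(train, (0.82, 0.66, 0.50), k=3)
pred_b = knn_predict(train, (0.22, 0.72, 0.90), k=3)
```

Trying different values of k and comparing the accuracy is a tiny version of the hyperparameter tuning Do San does in the drama.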
Tarzan and Jane Analogy
Do San explains machine learning to Dal Mi with the Tarzan and Jane analogy. Tarzan lives on an island and meets Jane. He falls in love with her and decides to give her a present, but he doesn't know what she likes. When Tarzan gives her a stone, she doesn't like it. When he gives her a flower, she likes it. When he gives her a snake, she doesn't like it. Over time, Tarzan learns what Jane likes and what she doesn't, so he can give her the best present.
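The analogy can be turned into a toy program. In the sketch below, Tarzan is a perceptron (a very simple learning algorithm that I picked for illustration; the drama does not name one), each gift is described by three made-up features, and Jane's reaction is the label. Every time Tarzan guesses wrong, he adjusts his weights a little.

```python
def predict(weights, bias, x):
    """Return 1 (Jane likes it) if the weighted score is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, epochs=10):
    """Learn weights from (features, liked) pairs by correcting mistakes."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for x, liked in samples:
            error = liked - predict(weights, bias, x)  # 0 when the guess is right
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias


# Hypothetical features per gift: [pretty, hard, dangerous]
gifts = [
    ([0, 1, 0], 0),  # stone: Jane didn't like it
    ([1, 0, 0], 1),  # flower: Jane liked it
    ([0, 0, 1], 0),  # snake: Jane didn't like it
]
weights, bias = train_perceptron(gifts)

butterfly = predict(weights, bias, [1, 0, 0])  # pretty, not hard, not dangerous
scorpion = predict(weights, bias, [0, 1, 1])  # hard and dangerous
```

After training, the weight for "pretty" is positive and the weight for "dangerous" is negative, which is Tarzan's learned idea of what Jane likes.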
Machine Learning
Machine learning lets a system learn automatically and improve from experience without being explicitly programmed. Machine learning algorithms are commonly divided into three types: supervised learning, unsupervised learning, and reinforcement learning. If you are interested you can learn more about machine learning here.
Step 5 — Testing
The testing phase is used to measure the performance of our classification model. Do San tests the system by pointing the camera at the objects around him. The basic process of the testing phase can be seen in the diagram below:
Do San points the camera at an object -> the system pre-processes the image -> feature extraction -> the system detects the objects -> the system recognizes the objects using the features -> Do San knows the accuracy of the system.
That's a simple way to understand the image recognition process as Do San did it in the Start-Up K-drama. Anyway, saying Do San's name multiple times in this post doesn't mean I am #teamDoSan; I actually choose to be #teamJiPyeong. But I adore Do San as a really intelligent programmer. I'll insert Ji Pyeong photos as a bonus. He is sooooo stunning.
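The last arrow, knowing the accuracy, comes down to comparing the system's answers with the true labels. Here is a minimal sketch with made-up test results (including one dad-as-toilet mistake). The function names are my own.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def confusion_counts(predictions, labels):
    """Count each (true label, predicted label) pair to see which mistakes occur."""
    return Counter(zip(labels, predictions))


# Hypothetical test run: true labels vs. what the system said.
labels = ["person", "person", "toilet", "door", "door"]
predictions = ["toilet", "person", "toilet", "door", "door"]

acc = accuracy(predictions, labels)
confusion = confusion_counts(predictions, labels)
```

The accuracy alone says how often the system is right, while the confusion counts show which objects get mixed up, so you know where more or better training data is needed.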