Tips for a Successful Proof of Concept with Deep Learning

Victoria Mazo
8 min read · Dec 2, 2019

Do you have a fascinating idea for a startup in the field of AI? That’s great! Now you need to prove to yourself and others (mainly, investors) that it is not only a great idea, but that it can also work. If you are an expert in AI, you know what to do: gather enough data, label it, find an open-source model or build your own, train or fine-tune it, find the best hyperparameters and deploy it. Sounds complicated? Then you should probably get to know the main terms, so that you can find a common language with an expert who can do a Proof of Concept (POC) for you. But first you need to understand whether your idea is feasible from the AI perspective at all. Below I go through the main flow of AI algorithm development, focusing on Computer Vision (CV) and Deep Learning (DL).

Yes, AI, but weak

Recently there has been huge progress in the field of AI, and you have probably heard about self-driving cars, cashier-free stores, deep fakes, Deep Dream, automatic machine translation, automatic speech recognition, automatic face recognition, etc.

The mass media frighten us with headlines like “AI will take our jobs” and “AI will take over the world”. AI algorithms might indeed take over some jobs in 15 years (cashier, driver, translator, etc.), but they will certainly not rule the world, because what we call AI today is not full AI at the human level or above, but weak AI, which can reach human-level or better performance only in very narrow fields, e.g. face recognition. Scientists keep working on full AI, but they still have a very long way to go, so don’t be afraid!

On the other hand, there are many examples of successful applications of weak AI, mainly in the Computer Vision field, but also in Machine Translation, Speech Recognition and Time Series Analysis. Unfortunately, there has been much less progress in Robotics: the Reinforcement Learning used to teach agents to act in a complex environment is tough to make work in reality, and affordable robots are not really available on the market, only very expensive industrial ones.

If your startup idea can be mapped onto one of the following CV subfields, your life is pretty easy, since there are many open-source models with good performance available:

  • Classification of an image, e.g. classification of cats and dogs
  • Object detection — localization (with a bounding box) and classification of objects, e.g. cats and dogs
  • Semantic Segmentation — defining which class every pixel of an image belongs to, e.g. cat, grass, trees and sky classes
  • Instance Segmentation — marking all the pixels of an object belonging to a certain class, e.g. cats, dogs and ducks
  • Face Re-identification — identifying the same person in all images in which he/she appears
  • Human Pose Estimation — defining a “skeleton” of every person in an image
  • Superresolution — improving the quality of an image
  • Neural style transfer — making a masterpiece out of any picture by converting it into a painting with a specific style
  • Image-to-image translation — converting an image from one domain into another, e.g. from a sketch to a full image, inpainting or image colorization

There are many other CV subfields, such as Object Counting, Person Tracking, Action Detection, Camera Pose Estimation, Depth Estimation, Scene Understanding, 3D Object Reconstruction, etc., in which research is very active, but for which there is still no universal model, or the models’ performance is not good enough to adopt them easily.

DL models are usually not written from scratch, but with high-level libraries, which simplify and speed up the process. The most popular DL frameworks are TensorFlow (backed by Google), PyTorch (backed by Facebook), MXNet (backed by Amazon) and Caffe (originally from UC Berkeley). They all have a Python API and can run on both CPU and Nvidia GPU (training a model on a GPU speeds up the process roughly 50x).
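To give a feel for how little code a standard task takes, here is a minimal sketch (assuming PyTorch and torchvision, with a hypothetical image file `cat.jpg`) that loads a pretrained ImageNet classifier and runs it on a GPU if one is available:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Use a GPU if there is one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained ImageNet classifier from the torchvision model zoo
model = models.resnet50(pretrained=True).to(device).eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("cat.jpg")                # hypothetical input image
x = preprocess(img).unsqueeze(0).to(device)

with torch.no_grad():
    probs = model(x).softmax(dim=1)
print(probs.argmax(dim=1))                 # predicted ImageNet class index
```

Object detection and segmentation models from the same model zoos follow the same pattern: load, preprocess, predict.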

Data is everything

Before you even start to think about which AI model can fit your application, stop and take a look at your data. Got any data?

If you don’t, I won’t tell you that 50% of your POC is having enough labeled data; I will probably just recommend you forget the current idea and go do something else. There are so many other things to do in the world, like ant-watching in Uganda, counting shells on a seashore or catching your neighbor’s cat who stole your food.

If you do have data, or at least an idea of how to get it (e.g. a public dataset), it is important to understand how much data you need. The golden rule is: you can never have too much data. In other words, there are no exact rules for how much data is needed, but the rules of thumb are:

  • Classification: ideal — 1000 images per class, minimum — 200 images per class
  • Object Detection: ideal — 1000 bounding boxes per class, minimum — 200 bounding boxes per class
  • Segmentation: ideal — 500 images per class, minimum — 100 images per class

If you take a model pretrained on data similar to yours and fine-tune it, you need less data, but if you train from scratch, you need more data.

Don’t forget about tagging your data! Labeling several thousand images is time-consuming, but there are many services available, like Mechanical Turk, or you can do it yourself using a tagging tool like this, when you have some time free from ant-watching.

Like a child

A Neural Network (NN) is a simplified model of our brain: one layer of neurons is connected to the next, and the connections’ weights are the NN’s memory. A network is considered deep if it contains at least 4 layers.

Similar to a child who learns about the world by looking at many examples of objects and their relations, a NN learns by looking at images of objects and predicting a required outcome, e.g. the class of the object. Since the data is labeled, the NN can see where it makes errors and improve itself by “adjusting” its memory, i.e. its weights. The procedure is fully automated with Gradient Descent. All you need to do is define the network’s architecture, that is, the number of layers and the connections between them, and the hyperparameters, such as the learning rate (the speed of learning). There are no clear rules on how to choose the best hyperparameters and, in practice, they are found by trial and error.
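In code, that learning loop looks roughly like this (a sketch assuming PyTorch; the toy two-layer network and the random “dataset” are placeholders for your real model and labeled data):

```python
import torch
import torch.nn as nn

# Dummy labeled data, standing in for your real dataset
inputs = torch.randn(100, 64)              # 100 "images" of 64 features each
labels = torch.randint(0, 10, (100,))      # 100 labels from 10 classes
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(inputs, labels), batch_size=16)

# Toy network: two fully connected layers, purely for illustration
model = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
loss_fn = nn.CrossEntropyLoss()
# The learning rate (lr) is a hyperparameter: the "speed of learning"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for batch, targets in train_loader:
    optimizer.zero_grad()
    predictions = model(batch)             # the network makes its guess
    loss = loss_fn(predictions, targets)   # how wrong was it?
    loss.backward()                        # compute gradients of the error
    optimizer.step()                       # Gradient Descent adjusts the weights ("memory")
```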

If a network is already trained on a dataset similar to yours (though not exactly the one you need), you can still profit from it through transfer learning, also called fine-tuning: retraining only the last few layers on your data. In this case you need much less data than when training a network from scratch, and it is much faster.
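A sketch of what fine-tuning looks like in practice (again assuming PyTorch/torchvision; the three-class cats/dogs/ducks setup is just an example):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet
model = models.resnet50(pretrained=True)

# Freeze the pretrained layers so their weights stay untouched
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with a new one for your own classes
num_my_classes = 3                           # e.g. cats, dogs and ducks
model.fc = nn.Linear(model.fc.in_features, num_my_classes)

# Only the new layer's parameters are trained, so you need less data and time
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001)
```

The same training loop as above then runs over your (much smaller) labeled dataset.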

Real world is tough

Congratulations! You have data and found an open source model. Is that it?

Not exactly! Training or fine-tuning a Deep Learning model requires powerful Nvidia GPUs (other GPUs don’t support CUDA, Nvidia’s C/C++-based platform for computing on GPUs). You might want to do it on a cloud with Nvidia GPUs (the best known are AWS and Google Cloud), but it’s pretty costly ($1.5–8 per hour, depending on the size of your model). In the active training phase, if the model is trained from scratch, you might need several days and up to weeks of GPU time — I leave the calculation of the cost to you as a home exercise.
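If you want a head start on that exercise, a back-of-envelope estimate with assumed numbers looks like this:

```python
# All numbers below are assumptions; plug in your own cloud pricing.
price_per_gpu_hour = 3.0        # $/hour, a mid-range GPU instance
hours_per_day = 24
days_of_training = 7            # assume one week of training from scratch

cost = price_per_gpu_hour * hours_per_day * days_of_training
print(f"Estimated training cost: ${cost:.0f}")   # -> Estimated training cost: $504
```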

You need to decide on which platform you want to deploy your model: a server/cloud with CPUs, a server/cloud with Nvidia GPUs, or an edge device. Each has its pros and cons. From the price perspective, the cheapest is an edge device and the most expensive is a GPU cloud (~$1.5–3/hour). Besides the price, it is important to take into account processing power and RAM, which, roughly speaking, grow with the price. More complex models, like object detection, require more processing power and RAM. If you want to build a real-time application, you should consider either a computer/server with a GPU or an edge device (or mobile device) with a heavily optimized model.

Edge devices are usually small, so that they can be part of a product (e.g. a smart camera); they are cheap, but weak. Edge devices for DL are still in development, and the samples on the market, such as Intel’s Movidius, Google’s Edge TPU and Nvidia’s Jetson Nano, have very limited capabilities. If you decide to use an edge device, you will need to optimize your Deep Learning model to adapt it to the device, probably losing part of its performance.

Another option for a deployment platform is a mobile device with a neural processing unit, such as Qualcomm’s Snapdragon 835 and higher (about half of Android phones) or Apple’s A11 Bionic and later (iPhone 8/X onward). But here, too, you need to adapt your model to the neural engine’s API and optimize it. Even on a mobile device without a specialized neural processing unit, if you know how to optimize your model well, you can reach 10 frames per second for a classification task with pretty high accuracy. You can find an AI benchmark comparison of most existing phones here.
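What “optimizing the model” typically means in practice is shrinking and compressing it. Here is a minimal sketch of one common technique, post-training quantization (assuming PyTorch; exporting to the device’s own format, e.g. TFLite or Core ML, is a separate step):

```python
import torch
from torchvision import models

# A trained model (here a pretrained ResNet-18 stands in for yours)
model = models.resnet18(pretrained=True).eval()

# Post-training dynamic quantization: store the weights of the selected
# layer types as 8-bit integers instead of 32-bit floats, trading a bit
# of accuracy for a smaller, faster model.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},     # layer types to quantize
    dtype=torch.qint8,
)

# For a real edge or mobile device, the next step is exporting through the
# vendor's toolchain (e.g. TFLite, Core ML, OpenVINO, TensorRT).
```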

Tips for a successful POC

  • Make an effort to gather as much data as possible, with accurate tagging; this is 50% of your POC’s success
  • Define the AI subfield for your POC (Classification, Object Detection, Segmentation, etc.) and understand whether your POC is feasible from the algorithmic perspective
  • Find a corresponding open source model, if there is any, or develop a specialized AI model for your POC
  • Find the best model by searching for the best training hyperparameters
  • Define the platform for the AI model’s deployment (CPU, GPU, cloud, edge device, mobile device), while understanding the corresponding constraints
  • If necessary, perform optimization and adaptation of your model to the deployment platform
