
Computer Vision vs. Deep Learning: How Are They Different, How Are They Related?

What is Computer Vision?

Computer vision is a multidisciplinary field focused on designing computer systems that capture and interpret image and video data and translate it into useful insights. The ultimate goal of computer vision is to develop methods that reproduce the capabilities of human vision from image data.


What is Deep Learning?

Deep learning is a subset of machine learning, within artificial intelligence, that uses deep neural networks (networks with many layers) to learn from data. Deep learning loosely imitates the way the human brain processes data and forms patterns for use in decision making. The learning process can follow three different approaches:


Unsupervised learning:

In unsupervised learning, the training dataset is a collection of examples without a specific desired outcome or correct answer. A deep learning model is handed the dataset without explicit instructions on what to do with it. The neural network attempts to find structure in the data on its own by extracting useful features and analyzing how the examples are organized.
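
As a rough illustration, here is a minimal sketch of unsupervised learning: a small autoencoder, written with Keras, that learns to compress and reconstruct images without using any labels. The dataset (MNIST) and the layer sizes are purely illustrative choices.

```python
# A minimal unsupervised-learning sketch: an autoencoder trained to
# reconstruct its own inputs, so no labels are needed.
# The MNIST dataset and layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

autoencoder = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(64, activation="relu"),      # encoder: compress to 64 features
    layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct the image
])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)  # inputs are also the targets
```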


Supervised Learning:

In supervised learning, the training dataset is composed of labeled data. Fully labeled means that each example in the training dataset is tagged with the answer the algorithm should eventually produce on its own. When a new image is given as input, the trained model uses what it learned from the labeled examples to predict the correct label. There are two main areas where supervised learning is useful: classification problems and regression problems.
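
As a concrete example, the following minimal scikit-learn sketch trains a classifier on labeled images; the built-in digits dataset stands in for a real labeled image collection, and the choice of logistic regression is just an illustrative assumption.

```python
# A minimal supervised-learning sketch: learn from labeled images,
# then predict labels for images the model has never seen.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)          # flattened images and their labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                    # learn from the labeled examples
print("test accuracy:", clf.score(X_test, y_test))  # evaluate on unseen images
```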


Semi-supervised Learning:

In semi-supervised learning, the training dataset contains both labeled and unlabeled data. This method combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning is particularly useful when extracting relevant features from the data is difficult and labeling examples is a time-intensive task for experts.
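
As a minimal sketch of this idea, scikit-learn's SelfTrainingClassifier can be trained on a dataset in which most labels have been hidden; the digits dataset, the 90% unlabeled split, and the SVM base model are all illustrative assumptions.

```python
# A minimal semi-supervised sketch: unlabeled examples are marked with -1,
# and the self-training wrapper pseudo-labels them during training.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1        # hide about 90% of the labels

base = SVC(probability=True, gamma=0.001)     # base model must expose predict_proba
model = SelfTrainingClassifier(base)
model.fit(X, y_partial)
print("labeled examples available at the start:", int((y_partial != -1).sum()))
```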


The Difference Between Computer Vision and Deep Learning

Computer vision is a field of study that develops techniques to help computers ‘see’ and understand the content of digital images such as photographs and videos, while deep learning is a subset of machine learning techniques that can be used to tackle computer vision problems, among many others.


How Are Computer Vision and Deep Learning Related?

Computer vision has traditionally been based on image processing algorithms, where the first step of any task was to extract features from the image by detecting colors, edges, corners, and objects. Deep learning techniques entered the computer vision field more recently and have shown significant gains in performance and accuracy for computer vision applications. With deep learning, feature definition is performed automatically by the deep neural network rather than being engineered by hand.
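
As a sketch of the traditional, hand-engineered side of that pipeline, the OpenCV snippet below detects edges and corners before any learning takes place; the file name and the thresholds are illustrative assumptions.

```python
# A minimal sketch of classical feature extraction with OpenCV:
# edges via Canny and corners via Shi-Tomasi, no learning involved.
import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file name

edges = cv2.Canny(gray, threshold1=100, threshold2=200)               # edge map
corners = cv2.goodFeaturesToTrack(gray, maxCorners=100,
                                  qualityLevel=0.01, minDistance=10)  # corner points

print("corners found:", 0 if corners is None else len(corners))
cv2.imwrite("edges.jpg", edges)
```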


The clearest expression of this relationship is the use of convolutional neural networks (CNNs), and architectures derived from them, for computer vision tasks. Convolutional neural networks are a class of deep neural networks (DNNs) most commonly applied to analyzing visual imagery because of their strong capability for recognizing patterns in images.
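
To make this concrete, here is a minimal Keras sketch of a small CNN for image classification; the layer sizes, the 28x28 grayscale input, and the 10-class output are illustrative assumptions rather than a reference architecture.

```python
# A minimal CNN sketch: convolution and pooling layers extract image
# features automatically, and a dense layer maps them to class scores.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # learn local image filters
    layers.MaxPooling2D(pool_size=2),                      # downsample feature maps
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```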


Is deep learning the only tool for computer vision?

Although deep learning is currently the main approach in computer vision, it is not the only path available. Deep learning techniques typically require huge amounts of training data and large computational power, resources that can be expensive or unavailable in some circumstances. The choice should weigh the requirements of the computer vision task against the resources available to perform it; in some cases, traditional machine learning techniques are the more convenient option. So deep learning is not the only tool to be used in computer vision.


Other tools for computer vision


Feature descriptors:

Feature descriptors rely on extracting interest points from images. An interest point is a pixel that has a well-defined position and can be robustly detected. Interest points carry high local information content and should ideally be repeatable across different images. Interest point detection has applications in image matching, object recognition, object tracking, and more. The most popular feature descriptor algorithms are:

  • Scale-Invariant Feature Transform (SIFT): SIFT is a complex algorithm that consists of five steps:

    • Scale-space peak selection: defining potential location for finding features.

    • Keypoint localization: accurately locating the feature keypoints.

    • Orientation assignment: assigning orientation to keypoints.

    • Keypoint descriptor: describing the keypoints as high-dimensional vectors.

    • Keypoint matching: matching keypoints between images by comparing their descriptors.

The problem with SIFT is speed: the algorithm approximates the Laplacian of Gaussian with a difference of Gaussians to handle scale, and even this approximation scheme is relatively slow.


  • Speeded-Up Robust Features (SURF): A speeded-up version of SIFT. SURF works by finding a quick and dirty approximation of SIFT's Gaussian-based filtering using a technique called box blur. A box blur is the average of all the image values inside a given rectangle, and it can be computed very efficiently.

  • Binary Robust Independent Elementary Features (BRIEF): BRIEF takes a shortcut by avoiding the costly descriptor computations that both SIFT and SURF rely on. Instead, it selects n pairs of pixels around each keypoint and compares their intensities, and these n binary comparisons form a compact bit string that represents the patch.

  • Features from Accelerated Segment Test (FAST): FAST is a corner detection method that can be used to extract feature points which are later used to track and map objects in many computer vision tasks. Its most important advantage is computational efficiency: FAST is faster than other feature extraction methods such as SIFT, which makes it suitable for real-time video processing applications.

  • Oriented FAST and Rotated BRIEF (ORB): ORB is an efficient and viable alternative to SIFT and SURF. It was proposed mainly because SIFT and SURF are patented algorithms, while ORB is free to use. ORB builds on the well-known FAST keypoint detector and the BRIEF descriptor, performing about as well as SIFT on feature detection while being almost two orders of magnitude faster (a short OpenCV sketch of ORB matching follows this list).

  • Binary Robust Invariant Scalable Keypoints (BRISK): Unlike BRIEF or ORB, the BRISK descriptor uses a predefined sampling pattern: pixels are sampled over concentric rings, and a small patch around each sampling point is smoothed with Gaussian smoothing before the algorithm starts. Two types of sampling pairs are used, short and long: short pairs are those whose distance is below a set threshold, while long pairs have a distance above that threshold. Long pairs are used for orientation, and short pairs are used to calculate the descriptor by comparing intensities.
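
As referenced above, here is a minimal OpenCV sketch of ORB keypoint detection and matching between two images; the file names and the number of features are placeholder assumptions.

```python
# A minimal ORB sketch: detect keypoints, compute binary descriptors,
# and match them between two images with Hamming distance.
import cv2

img1 = cv2.imread("scene_a.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("scene_b.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)           # FAST keypoints + rotated BRIEF descriptors
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

result = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)
cv2.imwrite("orb_matches.jpg", result)
```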

Deep learning vs Machine Learning

Machine learning is a set of techniques that use statistics to find patterns, typically in massive amounts of data, where data means anything that can be stored digitally. Machine learning can also be considered an approach to achieving artificial intelligence.


How machine learning differs from deep learning

Deep learning is one technique for implementing machine learning: every deep learning method is a machine learning method, but not all machine learning techniques belong to the deep learning group.


Applications of machine learning for computer vision

The most common machine learning approaches used in computer vision applications are neural networks, k-means clustering, and support vector machines (SVM). Machine learning is currently used in computer vision to perform object detection, object classification, and extraction of relevant information from images, graphic documents, and videos. Those tasks involve segmentation, feature extraction, refining visual models, pattern matching, shape representation, surface reconstruction, and modelling.
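
For example, here is a minimal sketch of one of those classical approaches, k-means clustering, applied to a simple vision task: reducing an image to its k dominant colors. The file name and the choice of k = 8 are illustrative assumptions.

```python
# A minimal k-means sketch for computer vision: color quantization,
# i.e. replacing every pixel with its nearest cluster center.
import cv2
import numpy as np
from sklearn.cluster import KMeans

img = cv2.imread("photo.jpg")                 # placeholder file name, H x W x 3
pixels = img.reshape(-1, 3).astype(np.float32)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_]   # each pixel -> its cluster center
cv2.imwrite("quantized.jpg", quantized.reshape(img.shape).astype(np.uint8))
```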


Machine learning in computer vision is used to extract graphical and textual information from document images, to perform gesture and face recognition, to recognize handwritten characters and digits, to interpret remote sensing data for geographical information systems, to implement advanced driver assistance systems, to detect cars and pedestrians in road scenes, and for many more applications.

