Using deep learning for invoice data extraction

Listen to this article

Extracting information from images/pdfs is an age-old problem of the AI world. Although the latest achievements in the field of deep learning have seen tremendous success, data extraction from these invoices in the form of images or pdfs remains a challenge. Historically, we have relied on paper invoices to process payments or support accounts. However, this requires manual interference and remains a time-consuming process.

Typically, large organizations have several vendors, and manually processing an influx of invoices is a tedious process. It is also prone to errors and thus consumes a lot of time and resources leading to outstanding payments and reworking on the erroneous invoices. To combat these issues, deep learning along with OCR is used for invoice data extraction to automate the business processes.

This process can be broken into 3 steps:

  1. Digitize the invoices– Invoices are in the form of pdfs that need to be digitized. Depending on the quality of the input, we need to add an image preprocessing pipeline for best results.
  2. Extract data– Data extraction is done using AI algorithms. We can process this extracted information using Optical Character Recognition. Here, it is important to identify which piece of text corresponds to which field.
  3. Create database– After the data has been extracted, we need to create a database based on a unique identifier.


Benefits of Invoice Data Extraction

Extracting data from these invoices can offer a lot of benefits, some of which are discussed below:

  1. Automation

Automating a system for data extraction leaves less scope for manual intervention. With automation, every step- right from ingesting data to giving desired output in a required format, can be done automatically. Using deep learning and OCR, we can ingest images and extract text from them. The only time we may have to intervene is to crosscheck whether the process is running as expected.

  1. Accurate data extraction

Since there is no manual intervention, the number of errors is drastically reduced. Additionally, the deep learning model tends to get better with time, which facilitates document processing accurately in reduced time.

  1. Cost Reduction

While extracting data from invoices using traditional methods, we need to develop rules-based engines and keep changing them as the data variability increases. This adds to implementation costs as well as other operational costs while processing the invoices. A deep learning data extraction process increases the efficiency of the system and reduces errors which helps in achieving significant returns in a short time.

  1. Efficient Process Management

An automated data extraction system reduces the rework required as well as tracks overpayments, leakages or late payments. An efficiently managed system will improve the relationship with the vendors and help in becoming a result-driven and optimally functioning organization.

The concept behind invoice data extraction-Object Detection

The image below illustrates how an object detection algorithm works. Each object in the image has been located and identified with a certain level of precision.

Object detection is breaking into a wide range of industries, with use cases ranging from personal security to productivity in the workplace. It is applied in many areas of computer vision, including image retrieval, security, surveillance, automated vehicle systems, machine inspection, etc. Of course, the possibilities are endless when it comes to future use cases for object detection.

So, how does it work? Simply put, object recognition software detects patterns. Similar to the chair or person in the image above, with invoices, we try to detect our area of interest such as invoice date, amount, etc. The primary objective of the algorithm is to understand what the object is and its location regardless of its position, and that is the beauty of these algorithms.

So, let’s delve into the concept of object detection. To understand detection, the first thing we should keep in mind is the difference between object detection and object recognition. Object detection is used to locate an object for example, to show you where an object is in a given image, while object recognition is used to identify/classify an object.

Image classification is the task of assigning an image to one from a fixed set of categories, essentially answering the question “What is in this picture?”. One image has only one category assigned to it. This is one of the core problems in Computer Vision but despite its simplicity, it has numerous practical applications.

Object localization then allows us to locate our object in the image, so our question changes to “What is it and where it is?”. Object localization is the name of the task of “classification with localization”. This means given an image, classify the object that appears in it, and find its location in the image, usually using a bounding-box.

Object detection entails detecting objects of a certain class within an image.  It tries to find all the objects in an image and draws the so-called bounding boxes around them. The state-of-the-art methods in object detection can be categorized into two main types: one-stage methods and two stage-methods. One-stage methods prioritize inference speed, and example models include YOLO, SSD and RetinaNet. Two-stage methods prioritize detection accuracy, and example models include Faster R-CNN, Mask R-CNN and Cascade R-CNN.

Let’s discuss YOLO “You Only Look Once ”, one of the most popular algorithms in AI. It has always been the go-to for real-time object detection. Research shows that the unified architecture of YOLO is extremely fast in manner. The base of YOLO model processes images in real-time at 45 frames per second, while the smaller version of the network, Fast YOLO processes at 155 frames per second while still achieving double the accuracy of other real-time detectors. This algorithm outperforms other detection methods, including Deformable part models and R-CNN family of models, while generalizing from natural images to other domains like artwork.


YOLO algorithm utilizes bounding box regression heads and classification methods. The YOLO architecture in simple terms consists of an S×S(typically s=19) grid cells of classifiers and regressors. Essentially, it tries to predict a class of an object and the bounding box specifying object location attached to every grid cell. Each bounding box can be described using four descriptors:

  1. bx,by center of a bounding box
  2. bw width
  3. bh height
  4. C as probability corresponding to a class of an object (cat,dog,person etc.).

It also predicts a pc value, which is the probability that there is an object in the bounding box. For example- We can see that there is an object in the image. So, for a grid anywhere on the car pc will be equal to 1. bx, by, bh, bw will be calculated relative to the particular grid cell that we are dealing with. Let’s say the car is of second class, c2 = 1 & c1 and c3 = 0. So, for each of the 19 grids, we will have an eight-dimensional output vector. This output will have a shape of 19 X 19 X 8. So, now we have an input image and its corresponding target vector. Using the above example (input image – 224 X 224X 3, output – 3 X 3 X 8).


Most of these cells don’t contain the bounding boxes. Therefore, we predict the value pc, which helps us remove boxes with low object class probability. Pc also helps in removing overlapping boxes using a method called non-max suppression. In this, we take the boxes with maximum probability and suppress the close-by boxes with non-max probabilities.

Let’s see how we can code YOLO. Training a model from scratch consists of a few steps such as preparing a dataset, setting up the environment, configuring the files, training the model, and model inference.


We can get our dataset labelled by any object detection annotation tool. Some tools that we can use are LabelImg, LabelMe, Vgg image annotator, etc. After annotating, we can download them in YOLO format or Pascal VOC format. We must ensure that our annotations and images are kept in the same directory. After this, we generate train, test, and validate files. It’s a good practice to keep 70% data in the training set, 20% in the validation set, and 10 % in the testing set.

Next, we have to ensure that our dependencies are compatible with YOLO. The minimum requirements are PyTorch version ≥ 1.5, Python version 3.7, and CUDA version 10.2. The dependencies can easily be installed using pip or the requirement.txt file.


After installing the dependencies, we have to install the repository YOLOv5. We can clone from the official repo. and then modify the YAML file to describe our dataset parameters.

# here we specify the files train, test and validation txt
train: /self/dataset/inv_train1.txt
val: /self/dataset/inv_val1.txt
test: /self/dataset/inv_test1.txt
# number of classes in our dataset
nc: 4
# class names
names: [’InvoiceNumber’,’InvoiceDate’,’InvoiceAmount’,’VendorName’]

While training, we can pass the YAML file to select any of these models based on our requirements. Now that everything is configured we are all set to train our YOLO model.

!python --img 512 --batch 4 --epochs 300 --data
'/data/inv_files/data.yaml' --cfg ./models/custom_yolov5s.yaml --
weights '' --name yolov5l_results --cache

  1. img: define input image size
  2. batch: determine batch size
  3. epochs: define the number of training epochs.
  4. data: set the path to our yaml file
  5. cfg: specify our model configuration
  6. weights: specify a custom path to weights
  7. name: result names
  8. nosave: only save the final checkpoint
  9. cache: cache images for faster training

These are the parameters that we need to pass while training. Once that is completed, the model will be saved in your “weights” directory and the resulting matrix graph will be generated as follows.

Now that the model has been trained, let’s see how we can test the performance on future images. To run the model inference we use the following command.

!python --weights runs/train/yolov5s_results/weights/ --img 512 --conf 0.4 --source ../data/inv_files/test/images


  1. source: input images directory or single image path or video path
  2. weights: trained model path
  3. conf: confidence threshold

And there you have it! This will process the input and store the output in your inference directory.

As you can see, the idea of automated invoice processing is not just limited to invoices. We can expand this idea to finance, banking or any other domain where we have plenty of paperwork. We can always automate a manual task and generalize a solution no matter how complicated it looks. By adding continuous learning within our core idea, we can always move towards perfection.

Reach out to us to learn more about our AI and cognitive engineering capabilities and take a look at my previous blog to dust off on the basics of improving OCR accuracy.

About Sambid Pradhan

Senior Software Engineer

  • Artificial Intelligence
  • Machine Learning
  • NLP
Sambid is an AI enthusiast and is currently working with Nitor Infotech as a Senior Software engineer in the AI/ML team. He has extensive experience in working with Machine Learning, Computer Vision, and NLP projects. In his free time, he loves to connect and is always curious to understand the bridge between the real world and world of data science.