Extracting information from images and PDFs is a long-standing problem in the AI world. Although recent advances in deep learning have seen tremendous success, extracting data from invoices in the form of images or PDFs remains a challenge. Historically, we have relied on paper invoices to process payments and maintain accounts. However, this requires manual intervention and remains a time-consuming process.
Typically, large organizations have several vendors, and manually processing an influx of invoices is tedious. It is also prone to errors and thus consumes a lot of time and resources, leading to outstanding payments and rework on erroneous invoices. To combat these issues, deep learning combined with OCR is used for invoice data extraction to automate these business processes.
This process can be broken into 3 steps:
- Digitize the invoices – Invoices arrive as PDFs or images that need to be digitized. Depending on the quality of the input, we may need to add an image preprocessing pipeline for best results.
- Extract the data – Text is extracted using Optical Character Recognition, and AI algorithms then process the extracted information. Here, it is important to identify which piece of text corresponds to which field.
- Create a database – After the data has been extracted, we store each record in a database keyed on a unique identifier.
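The three steps above can be sketched as a minimal pipeline. This is an illustrative stand-in, not a production system: a real implementation would call an OCR engine such as Tesseract inside the extraction step, while here simple regular expressions play that role so the flow is runnable end to end. The field names and invoice text are made up for the example.

```python
import re

def digitize(invoice_text: str) -> str:
    """Step 1: in practice, convert the PDF/image to text via OCR.
    Here we just normalize whitespace as a stand-in for preprocessing."""
    return " ".join(invoice_text.split())

def extract_fields(text: str) -> dict:
    """Step 2: identify which piece of text corresponds to which field.
    Regexes stand in for a trained extraction model."""
    fields = {}
    date = re.search(r"\d{4}-\d{2}-\d{2}", text)
    amount = re.search(r"\$[\d,]+\.\d{2}", text)
    if date:
        fields["invoice_date"] = date.group()
    if amount:
        fields["amount"] = amount.group()
    return fields

def store(db: dict, invoice_id: str, fields: dict) -> None:
    """Step 3: key the extracted record on a unique identifier."""
    db[invoice_id] = fields

db = {}
raw = "Invoice  INV-001   Date: 2023-05-01  Total: $1,250.00"
store(db, "INV-001", extract_fields(digitize(raw)))
print(db["INV-001"])  # {'invoice_date': '2023-05-01', 'amount': '$1,250.00'}
```

In the rest of this article, the regex stand-in is replaced by a trained object detection model that locates each field on the page before OCR reads it.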
Benefits of Invoice Data Extraction
Extracting data from these invoices can offer a lot of benefits, some of which are discussed below:
- Reduced manual intervention
Automating data extraction leaves little scope for manual intervention. With automation, every step, right from ingesting data to producing output in the required format, can be performed automatically. Using deep learning and OCR, we can ingest images and extract text from them. The only time we may have to intervene is to cross-check whether the process is running as expected.
- Accurate data extraction
Since there is no manual intervention, the number of errors is drastically reduced. Additionally, the deep learning model tends to improve over time, which allows documents to be processed more accurately and in less time.
- Cost Reduction
While extracting data from invoices using traditional methods, we need to develop rule-based engines and keep changing them as data variability increases. This adds to implementation costs as well as the other operational costs of processing invoices. A deep learning data extraction process increases the efficiency of the system and reduces errors, which helps in achieving significant returns in a short time.
- Efficient Process Management
An automated data extraction system reduces the rework required and tracks overpayments, leakages, and late payments. An efficiently managed system will improve relationships with vendors and help the organization become result-driven and optimally functioning.
The concept behind invoice data extraction – Object Detection
The image below illustrates how an object detection algorithm works. Each object in the image has been located and identified with a certain level of precision.
So, how does it work? Simply put, object detection software detects patterns. Just as it finds the chair or person in the image above, with invoices we try to detect our areas of interest, such as the invoice date, amount, etc. The primary objective of the algorithm is to determine what an object is and where it is located, regardless of its position, and that is the beauty of these algorithms.
So, let’s delve into the concept of object detection. The first thing we should keep in mind is the difference between object detection and object recognition. Object detection is used to locate an object, for example, to show where an object is in a given image, while object recognition is used to identify or classify an object.
Image classification is the task of assigning an image to one of a fixed set of categories, essentially answering the question “What is in this picture?”. Each image is assigned exactly one category. This is one of the core problems in Computer Vision, and despite its simplicity, it has numerous practical applications.
Object localization then allows us to locate our object in the image, so our question changes to “What is it, and where is it?”. Object localization names the task of “classification with localization”: given an image, classify the object that appears in it and find its location, usually with a bounding box.
Object detection entails detecting all objects of certain classes within an image and drawing so-called bounding boxes around them. The state-of-the-art methods in object detection fall into two main types: one-stage methods and two-stage methods. One-stage methods prioritize inference speed; example models include YOLO, SSD, and RetinaNet. Two-stage methods prioritize detection accuracy; example models include Faster R-CNN, Mask R-CNN, and Cascade R-CNN.
The YOLO algorithm combines bounding-box regression with classification. In simple terms, the YOLO architecture consists of an S×S (typically S = 19) grid of cells, each with its own classifier and regressor. Essentially, every grid cell tries to predict the class of an object and the bounding box specifying that object's location. Each bounding box can be described using four descriptors:
- bx, by – center of the bounding box
- bw – width
- bh – height
- C – probability corresponding to the class of the object (cat, dog, person, etc.)
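As a sketch of how these descriptors arise, the snippet below decodes one grid cell's raw network outputs into (bx, by, bw, bh). It follows the YOLOv2-style parameterization (sigmoid offsets within the cell, exponential scaling of an anchor prior); exact details vary between YOLO versions, and the anchor sizes here are made-up examples.

```python
import math

def decode_cell(tx, ty, tw, th, cx, cy, pw, ph, S):
    """Decode one grid cell's raw outputs into bounding-box descriptors.
    (cx, cy) is the cell index, (pw, ph) an illustrative anchor prior,
    S the grid size. Coordinates come out relative to the image."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = (cx + sigmoid(tx)) / S   # box center x: cell index plus offset
    by = (cy + sigmoid(ty)) / S   # box center y
    bw = pw * math.exp(tw)        # box width, scaled from the anchor
    bh = ph * math.exp(th)        # box height
    return bx, by, bw, bh

# Zero raw outputs in the center cell of a 19x19 grid give a box
# centered in the image with exactly the anchor's size.
bx, by, bw, bh = decode_cell(0.0, 0.0, 0.0, 0.0, cx=9, cy=9, pw=0.1, ph=0.2, S=19)
print(round(bx, 3), round(by, 3), bw, bh)  # 0.5 0.5 0.1 0.2
```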
Most of these cells don’t contain any bounding box. Therefore, we also predict a value pc, which helps us remove boxes with a low probability of containing an object. pc also helps in removing overlapping boxes using a method called non-max suppression: we take the box with the maximum probability and suppress nearby boxes with non-maximum probabilities.
We can get our dataset labelled using any object detection annotation tool, such as LabelImg, LabelMe, or the VGG Image Annotator. After annotating, we can export the labels in YOLO format or Pascal VOC format. We must ensure that our annotations and images are kept in the same directory. After this, we generate the train, test, and validation files. It’s good practice to keep 70% of the data in the training set, 20% in the validation set, and 10% in the testing set.
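The 70/20/10 split can be generated with a short helper like the one below. The image paths are illustrative; a fixed seed keeps the split reproducible across runs.

```python
import random

def split_dataset(paths, train=0.7, val=0.2, seed=42):
    """Shuffle image paths and split them 70/20/10 into
    train/val/test lists, matching the ratios suggested above."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * train)
    n_val = int(len(paths) * val)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

images = [f"invoices/img_{i:03d}.jpg" for i in range(100)]  # hypothetical paths
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))  # 70 20 10
```

Each list can then be written out as the train, validation, and test text files that the dataset YAML references.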
Next, we have to ensure that our dependencies are compatible with YOLO. The minimum requirements are PyTorch ≥ 1.5, Python ≥ 3.7, and CUDA 10.2. The dependencies can easily be installed using pip or the requirements.txt file.
After installing the dependencies, we have to set up YOLOv5. We can clone the official repository and then modify the dataset YAML file to describe our dataset parameters.
# dataset YAML (the paths and class names below are illustrative examples)
# here we specify the train, test, and validation txt files
train: /data/inv_files/train.txt
val: /data/inv_files/val.txt
test: /data/inv_files/test.txt
# number of classes in our dataset
nc: 2
# class names
names: ['invoice_date', 'amount']
While training, we can pass a model YAML file to select any of the YOLOv5 model variants based on our requirements. Now that everything is configured, we are all set to train our YOLO model.
!python train.py --img 512 --batch 4 --epochs 300 --data '/data/inv_files/data.yaml' --cfg ./models/custom_yolov5s.yaml --weights '' --name yolov5l_results --cache
- img: input image size
- batch: batch size
- epochs: number of training epochs
- data: path to our dataset YAML file
- cfg: model configuration
- weights: custom path to initial weights
- name: name for the results directory
- nosave: only save the final checkpoint
- cache: cache images for faster training
These are the parameters we need to pass while training. Once training is complete, the model is saved in the “weights” directory, and result plots (such as the loss curves and confusion matrix) are generated. To run inference on new invoices, we pass the trained weights to detect.py:
!python detect.py --weights runs/train/yolov5s_results/weights/best.pt --img 512 --conf 0.4 --source ../data/inv_files/test/images
- source: input images directory or single image path or video path
- weights: trained model path
- conf: confidence threshold
And there you have it! This will process the input and store the output in your inference directory.
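To make the role of the --conf flag concrete, the sketch below filters detections by confidence the way the threshold of 0.4 does at inference time. The detection records (labels, scores, boxes) are invented for illustration, not actual model output.

```python
def filter_by_confidence(detections, conf_thresh=0.4):
    """Drop detections whose confidence falls below the threshold,
    mirroring what the --conf 0.4 flag does at inference time."""
    return [d for d in detections if d["conf"] >= conf_thresh]

# Hypothetical detections on one invoice image: (label, confidence, box)
detections = [
    {"label": "invoice_date", "conf": 0.92, "box": (34, 50, 210, 80)},
    {"label": "amount",       "conf": 0.88, "box": (400, 700, 520, 730)},
    {"label": "amount",       "conf": 0.21, "box": (10, 10, 60, 40)},  # dropped
]
kept = filter_by_confidence(detections)
print([d["label"] for d in kept])  # ['invoice_date', 'amount']
```

Raising the threshold trades recall for precision: fewer spurious fields survive, at the risk of missing faint or low-quality text regions.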
As you can see, this idea is not limited to invoices. We can extend it to finance, banking, or any other domain that deals with plenty of paperwork. A manual task can almost always be automated, and the solution generalized, no matter how complicated it looks. By building continuous learning into the core of the system, we can keep moving toward perfection.