Content-Aware Parsing – The Next Generation of PDF Data Extraction

PDF documents exist in the trillions and support all types of personal and business activities. A large percentage of these documents were “born digital”, meaning they were created from electronic files such as Microsoft Word or Excel documents and converted to PDF. The rest are simply scanned images stored inside a PDF “container”.

Many businesses share PDF documents within their organization and with trading partners in many forms: contracts, purchase orders, invoices, correspondence, and more.

While simply sharing a common format has value in and of itself, many knowledge workers, and organizations in general, want to take the data stored within these documents and put it to other uses.

These purposes range from simply copying a paragraph into another document to more complex uses, such as working with a financial table inside a PDF, or locating and extracting metadata for use in a content management system.

Layered on top of that need is a desire for automation: having data pulled from PDFs automatically instead of a person doing it manually.

For all of PDF’s benefits as a compact, sharable, secure, and faithfully rendered format, there are many difficulties in using the data within a PDF document. Certainly copy-pasting text from a PDF into Microsoft Word is fairly easy and straightforward; anyone can do it. Locating and extracting specific data in a readily usable format is much more complex and out of reach for many organizations.

Let’s consider the simplest form of data use: copy and paste. Most people can copy-paste data from a PDF document into an email or text editor, but often the formatting comes out completely wrong. Take, for instance, this text copied from a PDF article using Adobe Acrobat:

This text is usable, but it carries the two-column layout of the original article with it, complete with hard carriage returns. Anyone who wanted to reuse this text in another document or email with normal formatting would first need to undo all of that formatting. The reason for this limitation lies in the original purpose of PDF: to faithfully reproduce content exactly as it was “printed”. Information that a Word document stores explicitly, such as the location of paragraphs, sentences, and words, is not present in a PDF, at least not in the same form.
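As an illustration of the cleanup a reader would otherwise do by hand, here is a minimal sketch in Python (an assumption of the simplest possible approach, not any particular product’s method) that rejoins hard-wrapped lines into flowing paragraphs:

```python
import re

def reflow(text):
    """Rejoin hard-wrapped lines copied from a PDF into flowing paragraphs.

    Assumptions (deliberately naive): a blank line marks a paragraph
    break, and a trailing hyphen marks a word split across lines. Real
    documents contain legitimate trailing hyphens, so this is a sketch,
    not a robust solution.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    reflowed = []
    for para in paragraphs:
        joined = ""
        for line in para.splitlines():
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                joined = joined[:-1] + line  # undo end-of-line hyphenation
            elif joined:
                joined += " " + line         # replace hard return with a space
            else:
                joined = line
        reflowed.append(joined)
    return "\n\n".join(reflowed)
```

Even this toy version shows why the problem is harder than it looks: deciding whether a hyphen or line break is meaningful requires knowledge the PDF itself does not record.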

Locating and extracting text with precision requires something akin to a computer vision problem: the entire document must be analyzed, noting its overall shape and the presence of columns, sentences, words, and letters. The flow of the document must also be analyzed so that right-to-left languages are detected and distinguished from left-to-right ones. When this more complex analysis is applied to PDF-based text, the example above can be extracted as the following:

We can see that the text is captured perfectly, with none of the irrelevant formatting originally stored in the PDF.
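At its simplest, the column-aware analysis described above can be sketched as a geometric sort over word bounding boxes. The function below is a hypothetical illustration, assuming a left-to-right language, exactly two columns, and words given as (x, y, text) tuples; a real engine would also have to detect the number of columns, the text direction, and much more:

```python
def reading_order(words, page_mid):
    """Reorder (x, y, text) word boxes from a two-column page into reading order.

    Assumed model: words left of x == page_mid belong to the left column,
    the rest to the right; each column is then read top-to-bottom,
    left-to-right. Both assumptions are illustrative simplifications.
    """
    left = [w for w in words if w[0] < page_mid]
    right = [w for w in words if w[0] >= page_mid]
    ordered = []
    for column in (left, right):
        # sort by vertical position first, then horizontal within a line
        ordered.extend(sorted(column, key=lambda w: (w[1], w[0])))
    return " ".join(w[2] for w in ordered)
```

Sorting purely by y coordinate, without the column split, is exactly what produces the interleaved, out-of-order text seen in the naive copy-paste example.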

Once the text can be interpreted correctly, other processes can be employed to locate the specific data a work process requires, such as metadata tagging or financial data analysis.
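Once the text reads in the right order, locating a specific field can be as simple as pattern matching. As a hypothetical example (the “Total” label and US currency format are assumptions for illustration, not taken from any real document), this sketch pulls an invoice total out of reflowed text:

```python
import re

def extract_total(text):
    """Find a 'Total: $1,234.56'-style amount in reflowed invoice text.

    The label 'Total' and the US currency format are illustrative
    assumptions; real documents need patterns tuned to each layout.
    """
    match = re.search(r"\bTotal:?\s*\$?([\d,]+\.\d{2})", text, re.IGNORECASE)
    return float(match.group(1).replace(",", "")) if match else None
```

Note that none of this works unless the text has first been reconstructed in reading order; a regex run against column-scrambled text would match the wrong fragment or nothing at all.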

Our next article will cover more complex data scenarios, such as locating and extracting data formatted as a table. Since PDF stores neither table attributes such as rows and columns nor the coordinates of the data within them, additional analysis is required.
