NZ Physioboard APC Reader

By Jasper M-W | Published at 2022-01-16

openhealthnz-credentials/physioboard-apc-reader physioboard-apc-reader Provida

Demo

Svelte Demo Server Source Client Source Sample PDFs

Automatically extracting info from certificates

At it’s core, it’s a typescript library that can read selected fields from the Annual Practising Certificate (PDF document) issued by the Physiotherapy Board. This is a tiny library, but this parser is a proof of concept for a future library that would have wide support for many different medical certificates.

Physiotherapy Board APC Analysis

The certificate is a common PDF v1.7, produced off a template with text boxes pasted in at the right places using the official Adobe SDK. Interestingly, the template only has the dynamic text as “text”, while the rest seems to just be SVG’s. This makes it much simpler to extract text using the PDF standard, as this is what a PDF Text Extractor library sees:

'Text' areas highlighted in red

With OCR, all the text would be detected, and might require cropping to just the areas wanted.

Another interesting part is that it uses the Adobe protection, with AES-256. This means that in its PDF form, it would need a relatively new version of PDF.js to work.

OCR vs PDF Text Extraction

There are two choices when wanting to extract info from a Certificate, OCR or PDF Text extraction.

OCR

Pros

  • Works for both an scan or the original PDF

Cons

  • Less reliable for character recognition, especially with non-ascii characters
  • Includes all text on the PDF, not just the relevant text

PDF Text Extraction

Pros

  • No chance of non-ascii characters preventing scanning
  • Includes all text on the PDF, not just the relevant text

Cons

  • Only works with the original PDF

Implementation

I went with the direction PDF Text Extraction, specifically using the pdf-parse library. This earlier example looks like this when processed:

2021/2022
This certificate is for the year commencing 1 April 2021. It should be available for inspection by 
your employer and clients. It must be surrendered to the Board upon request.
70-69420
69NEIN
Jane Doe
1 April 1847
31 March 2202
Witchcraft
Only on Tuesdays

From here it’s pretty simple to use regex to parse and validate the data, as done in this function.