Promoting and Expanding the Structured Reporting of Cancer (SRC)

Automatic population of colorectal structured reports

Page last updated: 24 June 2013

The stages in this project are:

  1. Acquire a corpus of 400 colorectal cancer reports.
  2. Design a tagset for annotating the reports.
  3. Build a mapping between the tagset and the structured report fields.
  4. Train staff in the use of the tagset.
  5. Annotate the corpus using the tagset.
  6. Build a machine learning model for computing the annotation tags.
  7. Run a series of evaluations on different models.
  8. Develop the programme code to extract the descriptive statistics.
  9. Write the project report.

An update on the stages of the project

Stage 1: A corpus of 312 of the planned 400 colorectal reports was delivered on March 13, 2012, which therefore marks the beginning of the contract for this project. The original corpus has been cleaned and converted into plain text files with UTF-8 encoding. Unsupported characters were stripped from the files, and images were not included in any form: they were not replaced with an image reference, so unless a file contains an independent textual reference to an image, the image's presence is not recorded.
A description of the corpus is provided at the bottom of this report. The tagset is designed so that tagged files automatically provide a measurable index of how structured a report is, along with details of its adherence to the guidelines and standards for colorectal reporting. The figures under the corpus description give an initial, general indication of the number of structured reports in the project and are also used for annotation purposes.
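
The cleaning step described for Stage 1 could be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the function names, the default source encoding (cp1252) and the definition of an "unsupported" character (any Unicode control, format, private-use or unassigned character other than newline and tab) are all assumptions made for the example.

```python
import unicodedata


def clean_report_text(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode raw report bytes and drop characters unsuitable for plain text.

    Assumption: bytes that cannot be decoded are dropped, not replaced.
    """
    text = raw.decode(source_encoding, errors="ignore")
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    kept = []
    for ch in text:
        # Unicode categories starting with "C" cover control, format,
        # surrogate, private-use and unassigned characters.
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C"):
            kept.append(ch)
    return "".join(kept)


def to_utf8_bytes(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Return the cleaned text encoded as UTF-8, ready to write to a file."""
    return clean_report_text(raw, source_encoding).encode("utf-8")
```

Text extracted from PDF, DOC or XLS files would be passed through `to_utf8_bytes` before being written out; the extraction itself is outside the scope of this sketch.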

Stage 2: Design of a tagset. Reaching a stable tagset took six draft versions, which are recorded in the project wiki and in the repository. The final tagset maps the Colorectal Cancer Structured Report Sample provided in the protocol to the extractable items listed in chapter 6 of the protocol.

Stage 3: Build a mapping between the tagset and the structured report fields. This has been completed and is located on the wiki. The tagset has attribute values that map onto the extractable fields listed in chapter 6 of the protocol. The following standards and guidelines have not been included because they are not readily available from the corpus provided (please see the corpus description below):

Clinical information and surgical handling

S1.01 Patient name; Date of birth; Sex; Identification and contact details of requesting doctor; Type of specimen; Date of surgical procedure; Clinical information relevant to the investigations requested
G1.01 Patient identifiers (e.g. MRN, UHI, NHI)
G1.02 Pathology accession number
S1.02 Principal clinician caring for the patient
S1.03 Operating surgeon: Contact address Phone (mobile) number
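
One way to picture the Stage 3 mapping and the exclusions listed above is a simple lookup from tag to field code, filtered by the excluded codes. The sketch below is illustrative only: the tag names and the `EXAMPLE-1` field code are hypothetical placeholders, not entries from the project's tagset or the protocol. Only the five excluded codes are taken from this report.

```python
# Field codes this report lists as not readily available from the corpus.
EXCLUDED_FIELDS = {"S1.01", "G1.01", "G1.02", "S1.02", "S1.03"}

# Hypothetical tag -> field-code mapping. The tag names and "EXAMPLE-1"
# are placeholders invented for illustration.
TAG_TO_FIELD = {
    "tumour_site": "EXAMPLE-1",
    "patient_name": "S1.01",
    "accession_number": "G1.02",
}


def extractable_fields(tag_to_field, excluded=EXCLUDED_FIELDS):
    """Keep only the tags whose target field is not on the excluded list."""
    return {
        tag: field
        for tag, field in tag_to_field.items()
        if field not in excluded
    }


print(extractable_fields(TAG_TO_FIELD))
```

Keeping the exclusions in a separate set means the mapping itself can stay complete, and fields can be re-enabled later if a corpus that contains them becomes available.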

Stage 4: All staff on the project have been trained in the use of the tagset and have read the protocol document provided as well as numerous support references relevant to the project. The tagset and support notes are available on the project wiki.

Stage 5: Annotation of the corpus, including the development of a gold standard, is approximately one-third complete and is expected to be finished by May 18, 2012.

Any issues encountered and proposed action to resolve

Stage 1: The files were delivered in various formats (PDF, XLS and DOC), so their contents had to be extracted and converted to the correct format and encoding. This delayed stages 2-5 by 4 weeks.

Proposed action: the tagset has been simplified to speed up annotation, and some changes to the workflow have been implemented.

Stages 2-5: The flow-on effect of the stage 1 delay and its resolution means that stages 2-5 have also been delayed.

Proposed action: given the anonymised nature of the reports and the low security risk, one offshore annotator could be brought on to complete the files in a shorter time frame, so that files are annotated day and night.

Any risks to the successful completion or impact on the timeline

Anticipated high risks to the completion of the remaining stages: None

Anticipated low-moderate risk elements: staff availability; competing project deadlines; technical problems.

A plan of action for the remaining stages

The annotated gold standard is to be passed to the computational team within the next two weeks (by May 18, 2012), leaving 6 weeks for the remaining 4 stages (including writing the project report). Many of these tasks will happen concurrently.

Corpus description

Structuredness          No. of files        %
Structured                       136    43.59
Partially Structured              62    19.87
Unstructured                     114    36.54
Total Corpus                     312

File Format             No. of files        %
PDF                               66    21.15
DOC                                7     2.24
XLS                              239    76.60
Total Corpus                     312

Country of Origin       No. of files        %
Australia                         17     5.45
Australia (?)                    275    88.14
Hong Kong                          4     1.28
Malaysia                           5     1.60
Namibia                            1     0.32
New Zealand                        3     0.96
South Africa                       2     0.64
UAE                                1     0.32
Unknown                            4     1.28
Total Corpus                     312

Patient Sex             No. of files        %
Unknown                          292    93.59
Female                            11     3.53
Male                               9     2.88
Total Corpus                     312

Patient Age             No. of files        %
Unknown                          295    94.55
Above 70                           6     1.92
50-70                              8     2.56
Below 50                           3     0.96
Total Corpus                     312

1. Please note that the Australia (?) category comes from the XLS files, which are assumed to be Australian in origin based on their file names.
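
The percentage column in the corpus description is each count divided by the corpus total of 312. As a minimal sketch (the function name is an assumption, and the two-decimal rounding is a presentation choice made here), the structuredness figures can be reproduced as follows; the other breakdowns work the same way.

```python
def percentage_table(counts: dict) -> dict:
    """Map each category to its share of the total, as a percentage."""
    total = sum(counts.values())
    # Rounded to two decimal places for readability.
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts taken from the structuredness breakdown in the corpus description.
structuredness = {
    "Structured": 136,
    "Partially Structured": 62,
    "Unstructured": 114,
}

shares = percentage_table(structuredness)
for label, pct in shares.items():
    print(f"{label:<22}{structuredness[label]:>5}{pct:>8.2f}")
```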