Exploiting Fast Classification of SNOMED CT for Query and Integration of Health Data

(1)

Exploiting Fast Classiﬁcation of SNOMED CT for Query and Integration of Health Data

Michael J. Lawley

Queensland University of Technology, Faculty of Information Technology, Brisbane, (Queensland), Australia E-Health Research Centre, CSIRO ICT Centre, (Queensland), Australia

Abstract

By constructing local extensions toSNOMEDwe aim to enrich existing medical and related data stores, simplify the expression of complex queries, and establish a foundation for semantic integra- tion of data from multiple sources.

Speciﬁcally, a local extension can be constructed from the controlled vocabulary(ies) used in the medical data. In combination with SNOMED, this local extension makes explicit the implicit se- mantics of the terms in the controlled vocabulary.

By using SNOMED as a base ontology we can exploit the existing knowledge encoded in it and simplify the task of reifying the implicit seman- tics of the controlled vocabulary. Queries can now be formulated using the relationships encoded in the extended SNOMED rather than embedding them ad-hoc into the query itself. Additionally, SNOMEDcan then act as a common point of in- tegration, providing a shared set of concepts for querying across multiple data sets.

Key to practical construction of a local extension toSNOMEDis appropriate tool support including the ability to compute subsumption relationships very quickly. Our implementation of the polyno- mial algorithm forEL+ in Java is able to classify SNOMEDin under 1 minute.

INTRODUCTION

Experience with integrating medical and related data [1] shows that the use of controlled vocabularies successfully modulates the amount of noise in the data. However, when querying the col- lected data, any semantic relationships between the terms that are relevant to the query (for example, specialisation/generalisation or part-of relationships) need to be explicitly encoded in the query and/or accounted for in the interpretation of the query results.

These kinds of implicit relationships are especially

common in the health domain where terms often involve an implicit context of usage (e.g., lobe in the context of lung cancer) or implicit references to anatomical structures (e.g., colorectal cancer) or related classes of diseases, injuries, or procedures.

Accurately and consistently encoding these relationships in queries relies on the person formulat- ing the queries to understand them, thus creating many opportunities for errors, omissions, and in- consistencies to occur. When multiple people are constructing queries these risks are further exac- erbated.

By constructing the vocabularies so as to explicitly represent the relationships between terms, queries can directly and consistently exploit the relationships. Using an ad-hoc explicit representation of these relationships helps, but may introduce new problems in terms of consistency of usage and how the relationships are interpreted (see, for example, the Radiological Electronic Atlas of Malformation Syndromes and Skeletal Dysplasias (REAMS) [2]).

Instead, using a well-understood formal mecha- nism for representing the relationships, such as Description Logic, can avoid these problems.

However we still have two problems to solve:

1. how do we deal with all the existing data sets that do not do this; and

2. how do we mitigate the, potentially quite high¹, cost of explicitly representing all the relationships?

We can deal with both these problems by ex- tending (as needed) an existing standard ontology, such as the Systematized Nomenclature of Medicine (SNOMED) [3], that already embodies

1Getting the modelling right, from scratch, requires not only an excellent understanding of the concepts in- volved as well as their relationships, but also an understanding of how best to represent them in a particular Description Logic formalism.

(2)

many of the relationships we need. However, one of the main diﬃculties with this approach is that building an extension toSNOMEDis not dissim- ilar to maintaining and developingSNOMEDit- self. That is, the sheer size of SNOMED has meant that, until recently, very few tools could compute all of its subsumption relationships, and even those that could would reportedly take several hours.

Fortunately, recent work by Baader et al. [4, 5]

on the tractable family of description logics EL has shown that polynomial time classification al- gorithms exist and are practical. Moreover despite their relatively low expressive power, theELfam- ily of description logics is suitable for representing such real-world ontologies asSNOMED and offer additional expressiveness suitable for properly representing partOf relationships and sufficient conditions.² Their implementation of this algorithm in Lisp is able to classifySNOMEDin 1,782 seconds [5] (approx. 30 minutes) which sug- gests an optimised implementation in a lower-level language may be fast enough for near real-time feedback in an editing tool.

Thus, our goal is to provide tool support for deﬁn- ing alocal extensionto an existing standard formal ontology; a mapping from an existing set of terms that characterise an informal ontology to concepts in the formal ontology. In doing so we eﬀectively realise latent semantics in the existing medical data via the standard ontology. This should facil- itate simpler and more robust queries and in turn aid data integration, a special-case application of querying where related medical data sets use se- mantically overlapping, but distinct term sets.

RELATED WORK

There is a great deal of published work on using ontologies for data integration (see Wache et al. [6]

for an overview), but it is mostly focussed on their use at the meta-data level; ontologies are used to describe, reason about and integrate database schemas. While related to our goals, we are ad- dressing the more specific problem of semantic data integration or semantic translation. Stucken- schmidt et al. [7] discuss an approach to this problem in the context of their Ontology Interchange Language (OIL) [8]. In particular they raise the question of whether it is feasible to find or cre- ate a sufficient shared terminology. In our domain of medical data we believe thatSNOMEDrepre- sents such a shared terminology. A possibly more

2See also http://webont.org/owl/1.1/

tractable.html#2

important problem, and one identiﬁed in our work with skeletal dysplasias [2], is how to cope with errors in the shared terminology.

Wade and Rosenbloom [9] report on the man- ual construction of what is almost a local extension toSNOMED (they conceived the task as a semi-formal mapping). In this work 2002 terms were mapped to combination of single and post- coordinated concepts of which about 75% were equivalencies (20% of these were to single concepts) and only 1% (26) were, in their words,“not mappable”. It is unclear why these terms were cat- egorized as such since they include, for example, presyncope which could reasonably be related to 3006004|disturbance of consciousness|, but it may be that the context of use of the terms was unavail- able in order to properly discern their meaning.

However, their work does demonstrate that the goal of producing a local extension toSNOMED is feasible.

PROBLEM DESCRIPTION

The problem of embedding domain semantics such as specialisation/generalisation or part-of relationships into queries is illustrated in the following.

For example, a query to ﬁnd all performed procedures involving a colectomy might enumerate all such procedures:

SELECT S.*

FROM Surgery S

WHERE S.procedure = ’32003-00’

OR S.procedure = ’32003-01’

OR S.procedure = ’32012-00’

...

which has the potential to accidently omit certain codes and will require updating if the terminology is updated with additional forms of colectomy.

Alternatively, some kind of heuristic query could be used:

SELECT S.*

FROM Surgery S, ProcedureCodes C WHERE S.procedure = C.code

AND C.text LIKE ’%colectomy%’;

which has the potential to miss a term that doesn’t follow the expected naming pattern (e.g., epiploec- tomy) or provide false matches where a compound or composite name does not reﬂect a valid specialisation.

If, however, the terms were encoded as concepts in an ontology, the query is simple³:

3We envisage that the complete set of subsumption relationships would be stored in a database table to support fast subsumption-based queries using only two

(3)

SELECT S.*

FROM Surgery S, Ontology O WHERE O.ancestor = 23968004

AND S.procedure = O.descendant;

Note also that SNOMED, unlike classiﬁcation schemes such as ICD-9 and ICD-10, support a multi-parented generalisation hierarchy.

CONSTRUCTING LOCAL EXTENSIONS

In order to construct an ontology from an existing terminology (or collection of terminologies) we take a multi-step approach:

1. Map each term from the controlled vocabulary to a concept, factoring out any synonyms, to produce P.

This is often a simple one-to-one mapping, but it may be necessary to extend the mapping to include disambiguating data values when the same term is used to mean diﬀerent things in diﬀerent contexts.

2. Make any simple implicit relationships explicit, adding them toP.

For example, generalisation,partOf, orhasLoca- tionrelationships. It may be necessary to introduce new concepts to act as the generalisation of two or more sibling concepts.

3. Specify relationships between these (local) concepts and those in the chosen standard ontology Q, adding them toP.

To be able to answer queries involving our new ontology we ﬁrst need to classifyQ ∪ P to identify all the subsumption relationships it entails.

Note that, we should be careful thatQ ∪ Prepre- sents a conservative extension [10] ofQ. That is, Q ∪ P produces the same consequences over the set of concepts inQasQdoes by itself. We also need to ensure various integrity constraints (such as disjointness) are preserved inQ ∪ P. Thus we would like to be able to interactively editP while exploiting the consequences ofQ ∪ P in live feedback through the mapping tool. These kinds of checks can be performed by classiﬁcation ofQ ∪ P but this may not be viable ifQ ∪ P is large, as is the case whenQisSNOMED.

Colorectal Cancer Example

In this section we consider a sample set of ICD-10- AM [11] terms for procedures relating to colorectal joins.

cancer, shown in Figure 1. We can map these, one- to-one, to a set of concepts for a local ontology.

Procedure Code (ICD-10-AM)

Meaning

32000-00 Sig colectomy with stoma formation

32003-00 Sig colectomy with anasto- mosis

32003-01 Right hemicolectomy 32005-00 Subtotal colectomy 32005-01 Ext right hemicolectomy 32006-00 Left hemicolectomy 32012-00 Total colectomy 32024-00 High anterior resection 32025-00 Low anterior resection ex-

traperitoneal

32026-00 Low anterior resection coloanal anastomosis 32028-00 Ultra low anterior resection 32030-00 Hartmann’s procedure 32039-00 Abdomino-perineal excision 32051-00 Total proctocolectomy with

ileo-anal anastomosis

Figure 1: A Term-Set of Colorectal Cancer Proce- dures

The next step is to make any simple relationship explicit. In our case there are none that can be ex- pressed using just the concepts we have currently identiﬁed.

Figure 2 describes the identiﬁed relationships between these terms and selected SNOMED concepts as per step 3. Note that several concepts (for example, 32028-00|ultra low anterior resection|), have no exact equivalent inSNOMED, and that one, 32051|total proctocolectomy with ileo-anal anastomosis|implies a composite of concepts.

Figure 3 shows a visualisation of the results of classifyingSNOMEDaugmented with the ontology from Figure 2. As can be seen, unifying generalisation concepts such as 84604002|sigmoid colectomy|have been identiﬁed, and thus provide a strong foundation for constructing queries that span the various procedures. Additionally, since SNOMEDincludes detailed anatomical concepts, queries can now be composed in terms of anatomical features even though they did not exist in the original terminology.

(4)

Procedure Relation SNOMED

32000-00 ≡ 315327002

32003-00 ≡ 315326006

32003-01 ≡ 235326000

32005-00 ≡ 43075005

32005-01 ≡ 174071004

32006-00 ≡ 82619000

32012-00 ≡ 26390003

32024-00 ≡ 400988008

32025-00 314592008

32026-00 314592008

32028-00 314592008

32030-00 ≡ 16564004

32039-00 ≡ 265414003

32051-00 17405900570172002

Figure 2: Identiﬁed Relationships withSNOMED Concepts

COMPLEX QUERIES AND CONTEXT

So far we have only considered simple query scenarios where a single database column represents the concept we wish to query (e.g., Surgery.procedure) and there already ex- ists a concept that characterises the bound of the query (e.g., 2396804).

Consider instead a table, as shown in Figure 4, that stores both scheduled and performed procedures while using another column to distinguish them, and which encodes laterality, if any, of the procedure in yet another column. Now imagine we wish to query for all patients who have had an amputation including the left hand.

Patient Date Status . . . .

Procedure Laterality . . . .

Figure 4: Table storing records with contextual information split across columns

Patient Date . . . Laterality Code

. . . .

Code Equivalent SNOMED Expression

. . . .

Figure 5: Augmented table for representing contextualised concepts

To support this kind of problem with reasonable generality and decent query speed, we need to generate a new column containing codes that are mapped to the set of compound concepts that correspond to the contextualised meaning of each database row. Hence, as shown in Figure 5, the table from Figure 4 would be extended with aCode foreign-key column, and an additional table containing theSNOMEDexpressions of the form⁴:

∃associatedProcedure.P

∃laterality.L

∃procedureContext.S

which gives us another ontology extension that we can add toSNOMED.

Finally, in order to be able to pose a subsumption- based complex query involving composite concepts and have it evaluated at database join speeds, we can employ the same strategy: extend the ontology with a new fully-deﬁned concept correspond- ing to our query expression, re-classify, and per- form a join-based query using the new concept.

The need to construct compound expressions that explicitly represent the context associated with a record in a database occurs any time the data needs to be queried outside its original context.

This may happen in as trivial a case as when one table in a database is joined with another, but the more general scenario occurs when integrating data from multiple data sources.

RESULTS

Classifying SNOMED

The practicality of creating local extensions of SNOMEDis dependent on sufficient tool support and, as mentioned previously, a cornerstone of this is fast classification. Indeed we believe that near real-time feedback in an editing environment, be it an IDE for programming or a 3D architectural modelling tool, can have a transformational effect on the authoring and editing process.

To this end, we have implementedsnorocket, using a slightly altered form of the algorithm in [5] writ- ten in Java. We use several optimised Map and Set data-structures tailored for ontologies with roughly the same number of concepts and roles as SNOMED. This implementation is able to classify

4Note that considerable experience withSNOMED and all its documentation may be required to construct suitable valid post-coordinated expressions like those above. Tool support for this is clearly an important issue and recent work in the IHTSDO Concept Model SIG on producing a Machine Readable Concept Model will be valuable for this.

(5)

Figure 3: Visualisation of part of an extendedSNOMEDontology

SNOMEDin 54 seconds on a modern 2.4GHz In- tel Core 2 Duo running Windows XP and Sun’s Java 1.6.0 03.

For a fairer comparison with CEL, which only runs under Linux, we ran both snorocket and CEL on an older four-CPU Xeon 3.6GHz machine running RedHat Linux 2.6.9 and Sun’s Java 1.6.0 04. The results, for several of the ontologies available fromhttp://lat.inf.tu-dresden.de/

~meng/toyont.html, are in Table 1.

Clearly, being able to classifySNOMEDin close to a minute is a substantial improvement over roughly 23 minutes and brings us much closer to the near real-time feedback we are seeking.

Incremental Classiﬁcation

In our mapping scenario we observe that SNOMED(Q) is unchanging while the local extension (P) is modiﬁed. If we can classifyQonce and record the resultC(Q) then, due to the mono- tonicity of the description logic, the classiﬁcation ofQ ∪ P, C(Q ∪ P), is a superset ofC(Q). The goal is then to deriveC(Q ∪ P) givenC(Q) (and, of course,Qand P) which should be much faster than derivingC(Q ∪ P) from scratch.

Suntisrivaraporn [12] calls this Duo-Ontology Classiﬁcation and presents a variation of the algorithm in [5] to do just this. We have indepen- dently derived our own variant of this algorithm along similar lines; the queue-processing core is essentially unchanged but the initialisation of the queues is diﬀerent to account for the work that has already been done.

Currently this work is in a preliminary state and the correspondence with the variant described in [12] is unknown. However the performance of this incremental algorithm is very promising.

WithP consisting of the 14 new concepts as de- ﬁned as in Figure 2, incremental classiﬁcation takes around 0.9s using our un-optimised implementation.

DISCUSSION

Ideally, as a term set is developed, it would be explicitly constructed as an ontology and, to avoid re-invention and promote interoperability, could be developed as an extension of an existing standard ontology such as SNOMED. These extension ontologies could then be shared and evolved within their specialist community while still being useful and usable in more general communities.

One such example is an ontology for skeletal dysplasias extracted from REAMS [13].

It is thus useful to be able to represent these ontologies in a standard format such as OWL so that they can be shared or manipulated using existing toolsets. Currently we use the OWL 1.1 proposal [14] rather than OWL 1.0 since it sup- ports the expression of the role axioms (to describe role transitivity and right-identity). The particular subset we use is characterised by the description logic EL+^⊥. OWL 1.1 is supported by, for example, the latest development-release of Prot´eg´e (4.0 alpha).

Unfortunately, OWL is not practical for representing large ontologies likeSNOMEDwhere an OWL

(6)

SNOMED FULL-GALEN NOT-GALEN NCI

CEL 1391.9 368.9 5.4 1.8

snorocket 72.8 15.1 0.4 0.4

Table 1: Comparison of classiﬁcation time for snorocket and CEL running on the same hardware.

XML representation is approximately 240MB [15], about eight times the size of the equivalent KRSS representation. Moreover, due to the complexities inherent in parsing XML, it is much slower to load and parse than a simpler format such as KRSS.

One work-around for this, and something that would greatly beneﬁt the e-health community, would be for the International Health Terminology Standards Development Organisation, the newly formed governing body ofSNOMED, to formally publish URIs for the concepts inSNOMED. This would allow tool vendors to “bake in”SNOMED to their tools, while still allowing other OWL- based ontologies to referenceSNOMEDconcepts in a consistent and interoperable manner in order to describe extensions toSNOMED.

CONCLUSION

Our preliminary work on producing local extensions to SNOMED for semantic data integration is promising as is the performance of our classiﬁer. The current implementation is single- threaded and we anticipate a further speed in- crease from a multi-threaded implementation running on a multi-core CPU.

We are currently integrating snorocket with a 3rd- partySNOMEDediting tool which requires specific support forSNOMED’s use of role grouping and the ability to distinguish between stated and inferred relationships in the output of the clas- sifier, although this adds little overhead to the classification time. In addition, we are prototyp- ing mapping tools specifically targeting the task of constructing local extensions of SNOMED from existing data.

Finally, we are continuing work on our incremental form of the algorithm but have not yet tuned or veriﬁed the implementation. Preliminary results indicate that this approach should be very useable when integrated with our mapping tool.

Acknowledgements

The work described in this paper was carried out while on secondment to the CSIRO’s E-Health Research Centre and the author would like to gratefully ac- knowledge the support of David Hansen and the other members of the Health Data Integration team.

Address for Correspondence

Michael J. Lawley, Faculty of Information Technology, University of Queensland, 126 Margaret Street Bris- bane Qld 4000, Australia

m.lawley@qut.edu.au

References

[1] D. Hansen, C. Daly, K. Harrop, M. O’Dwyer, C. Pang, and J. Ryan-Brown. HDI: Research Software To Commercial Product.ASWEC 2005 Industry Experience Papers, 2005.

[2] I. Jakobsen, M.J. Lawley, A. Zankl, and D. Hansen. Ontologies for Skeletal Dysplasias.

MedInfo 2007 Workshop: MedSemWeb 2007, 2007.

[3] Snomed Clinical Terms. College of American Pathologists, 2006. http://www.snomed.org.

[4] F. Baader, C. Lutz, and B. Suntisrivaraporn.

CEL—a polynomial-time reasoner for life science ontologies. In U. Furbach and N. Shankar, editors, Proceedings of the 3rd International Joint Conference on Automated Reasoning (IJ- CAR’06), volume 4130 ofLecture Notes in Artiﬁ- cial Intelligence, pages 287–291. Springer-Verlag, 2006.

[5] F. Baader, C. Lutz, and B. Suntisrivara- porn. Eﬃcient Reasoning in EL⁺. In Proceedings of the 2006 International Work- shop on Description Logics (DL2006), CEUR- WS, 2006. http://lat.inf.tu-dresden.de/

research/papers/2006/BaaLutSun-DL-06.pdf.

[6] H. Wache, T. V¨ogele, U. Visser, H. Stuck- enschmidt, G. Schuster, H. Neumann, and S. H¨ubner. Ontology-based integration of information-a survey of existing approaches.

IJCAI-01 Workshop: Ontologies and Information Sharing, 2001:108–117, 2001.

[7] H. Stuckenschmidt. Catalogue Integration: A Case Study in Ontology-based Semantic Trans- lation. Vrije Universiteit, Faculteit der Exacte Wetenschappen, Divisie Wiskunde & Informat- ica, 2000.

[8] Dieter Fensel, Ian Horrocks, Frank van Harme- len, Deborah L. McGuinness, and Peter F. Patel- Schneider. OIL: An Ontology Infrastructure for the Semantic Web. IEEE Intelligent Systems, 16(2), 2001.

[9] G. Wade and S.T. Rosenbloom. Experiences Mapping a Legacy Interface Terminology to SNOMED CT. In Proceedings of the SMCS 2006 - Semantic Mining Conference on SNOMED CT, 2006. http://www.hiww.org/smcs2006/

proceedings/9WadeSMCS2006final.pdf.

[10] S. Ghilardi, C. Lutz, and F. Wolter. Did I Damage my Ontology? A Case for

(7)

Conservative Extensions in Description Log- ics. In Patrick Doherty, John Mylopoulos, and Christopher Welty, editors, Proceedings of the Tenth International Conference on Prin- ciples of Knowledge Representation and Rea- soning (KR’06), pages 187–197. AAAI Press, 2006. http://lat.inf.tu-dresden.de/~clu/

papers/archive/kr06a.pdf.

[11] International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Australian Modification (ICD-10-AM). National Centre for Classification in Health, 5th edition, 2006. http://www3.fhs.usyd.edu.au/ncch/4.

2.1.1.htm.

[12] Boontawee Suntisrivaraporn. Module extraction and incremental classiﬁcation: A pragmatic approach for EL⁺ ontologies. In Sean Bechhofer, Manfred Hauswirth, Joerg Hoﬀmann, and Mano- lis Koubarakis, editors, Proceedings of the 5th European Semantic Web Conference (ESWC’08), Lecture Notes in Computer Science. Springer- Verlag, 2008. To appear.

[13] C. Hall and J. Washbrook. Radiological Atlas of Malformation Syndromes and Skeletal Dysplasias (REAMS) [software]. Oxford University Press, CD-ROM, 1999.

[14] B. Motik, P.F. Patel-Schneider, and I. Horrocks.

OWL 1.1 Web Ontology Language. World Wide Web Consortium, W3C Member Submission, 2006. http://www.w3.org/Submission/2006/

SUBM-owl11-owl_specification-20061219/.

[15] K. Spackman. An Examination of OWL and the Requirements of a Large Health Care Ter- minology. In Proceedings of the OWL: Ex- periences and Directions Third International Workshop (OWLED 2007), CEURWS, June 2007. http://owled2007.iut-velizy.uvsq.fr/

PapersPDF/submission_26.pdf.