Chemistry International
Vol. 24, No. 4
July 2002
XML in Chemistry
by Antony N. Davies
Extensible
Mark-up Language (XML) is a powerful alternative to conventional binary
file storage and information exchange. As many scientific organizations
and companies delivering scientific products have implemented or are
looking at the use of XML, IUPAC decided to review and evaluate what
could and should be its role in advancing the use of XML in chemistry.
In January this year, the IUPAC Committee on Printed and Electronic
Publications (CPEP) organized a two-day Strategic Meeting to assess
the Union's position and options. Hosted by the Unilever Cambridge Centre
for Molecular Informatics in the University of Cambridge Department
of Chemistry, delegates from all interested IUPAC Divisions gathered
together with key players in the field.
XML can
be regarded as an extension to the well-known HTML, or Hyper Text Mark-up
Language, which is the language most frequently encountered when viewing
web pages. XML is considered to be the universal format for structured
documents and data on the Web [1].
As with
a conventional Web page, it isn't the use of XML itself that is interesting
or even particularly novel, but the content stored within the XML files.
In chemistry and associated technical fields, various groups (commercial
organizations, academic institutions, and government bodies) have
been developing XML formats independently of one another. These formats
have similar content but differing data dictionaries and conventions.
This means
they are not compatible with each other and, what is far worse, resources
are being deployed to address problems already solved by other groups.
In order to support standardization in this field for the benefit of
the community, IUPAC has decided to actively explore ways in which it
can help to unify the various dictionaries and publicize their availability.
- IUPAC's Role and Timeline
- Meeting Overview
- A Project for IUPAC
- The Future
- Acknowledgements
- References
IUPAC's Role and Timeline
During
the 2001 IUPAC General Assembly in Brisbane, an ad hoc group outlined
the dos and don'ts of a possible IUPAC role in advancing
the use of XML in chemistry and developed a timeline for further action.
The strategic importance of these decisions was reflected in the presentation
by Wendy Warr, CPEP chairman, to the IUPAC Council [2]
and in the subsequent comments by IUPAC's secretary general, Ted Becker,
in his article in CI [3].
Dos and Don'ts
IUPAC should not:
- Commence activities better left to the computer scientists.
- Re-invent the wheel: the current activities at various locations
  should be invited to contribute to a standardization process through
  IUPAC as long as their efforts remain in the public domain.
- Become a formal member of the World Wide Web Consortium (W3C), the
  Object Management Group (OMG), or other similar organizations;
  however, they should be informed of IUPAC activities in this area
  and we should continue to monitor their work.
IUPAC should:
- Establish "ownership" of the definition of standard terms in
  chemistry to be used in digital communications through formal IUPAC
  recommendations.
- Generate a glossary of standard terms in chemistry for use in
  applications involved in digital communications such as scientific
  data exchange or electronic publishing.
- Locate potential interested parties within IUPAC who "own"
  glossaries of terms or who are in the process of creating them.
- Establish a method to identify and resolve problems in overlap of
  definitions (within IUPAC as well as with other scientific standards
  and other organizations).
It was
very clear from the Brisbane meeting that there was an urgent need to
address the issues that were raised there. Hence, by the end of December
2001 the issues of identifying glossaries, project team members, and
contacts between divisions and standing committees had been addressed.
By then, Professor Bobby Glen of the new Unilever Centre for Molecular
Informatics at the University of Cambridge, United Kingdom, had agreed to
host a follow-up meeting on 24-25 January 2002, as this type of initiative
is of great interest to the fledgling center. Those invited to attend
included IUPAC division and standing committee representatives and delegates
from outside IUPAC who are active in establishing guidelines for the handling
of chemical objects within their organizations. The IUPAC Analytical
Chemistry Division was represented by its president, David Moore; the
Physical and Biophysical Chemistry Division by Jeremy Frey;
and the new Chemical Nomenclature and Structure Representation Division
by its president, Alan McNaught. In addition, I represented
the IUPAC JCAMP-DX Working Party.
Meeting Overview
The meeting
started with a welcoming address by Bobby Glen, who briefly explained
the background of the Unilever Centre and provided a useful overview
of the type of projects underway at the center. Alan McNaught, Robert
Lancashire, and I discussed IUPAC's intentions, current activities involving
IUPAC glossaries, and the status of the JCAMP-DX file formats. Currently,
within the eight IUPAC divisions there exist seven glossaries that are
supervised by the Interdivisional Committee on Terminology, Nomenclature,
and Symbols, which is responsible for ensuring conformity with existing
IUPAC recommendations and consistency within and between each volume.
These compendia, known as the IUPAC color books, cover chemical terminology;
quantities, units, and symbols in physical chemistry; inorganic, organic,
macromolecular, and analytical nomenclature; and the terminology
and nomenclature of clinical laboratory sciences [4].
Jeremy
Frey pointed out that one difficulty encountered during the revision
of the "green book" (which covers quantities, units, and symbols in
physical chemistry) was the accommodation of different definitions,
which originated from different fields of chemistry, for single entries
in the data dictionary. Steve Heller offered an even broader example
of the problem: although nm is widely recognized as nanometers in the
scientific community, there is a significant body of opinion that feels
that the letters obviously refer to nautical miles!
The International
Union of Crystallography (IUCr), represented at the meeting by Brian
McMahon, has a very special interest in mark-up language because it
has developed a standard format, the Crystallographic Information
File (CIF),
for the deposition, storage, and distribution of crystallographic data
with the publication of peer-reviewed papers. As McMahon explained,
CIF was commissioned by IUCr following long-standing interest in the
need for an open standard for data and information exchange. CIFs are
divided into blocks, with each block consisting of individual labels
or tags whose definition is stored elsewhere. Key points are that the
semantic content is kept separate from the syntax of data representation,
and that different dictionaries are used for different topic areas.
McMahon concluded that one thing was abundantly clear from experience
with CIF: "The design of a file format is an essential step, but it
is only one component (and in many ways the least difficult) in the
process of devising a feature-rich exchange mechanism. Far more difficult
is the detailed definition of the tags that will be used within the
file to ensure that applications attribute exactly the same meaning
to the same item of information. The experience of the expert committees
who undertake this work to extend CIF is that years of painstaking effort
and discussion may be needed to define a few dozen tags, which are accepted
across the community." As a contribution toward the establishment of
content-rich XML applications in related areas of chemistry, the IUCr
will make available its CIF-based definitions to the IUPAC groups working
to establish XML-based applications. The scientific community, said McMahon,
is looking forward to the day when effective chemical information exchange
standards, widely accepted by the community, will complement and interoperate
with CIF or its successors.
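The separation McMahon describes, with data blocks made of tags whose definitions are kept elsewhere, can be illustrated with a minimal Python sketch. It assumes a heavily simplified, CIF-like syntax (one tag and one value per line, with no loops, quoted strings, or multi-line values), and the small CELL_DICTIONARY is a hypothetical stand-in for a real CIF dictionary rather than an excerpt from one.

```python
# Hypothetical miniature dictionary: the meaning of each tag lives here,
# apart from the data file that merely uses the tags.
CELL_DICTIONARY = {
    "_cell_length_a": {"description": "Unit-cell length a", "units": "angstroms"},
    "_cell_length_b": {"description": "Unit-cell length b", "units": "angstroms"},
    "_cell_angle_alpha": {"description": "Unit-cell angle alpha", "units": "degrees"},
}

# Simplified CIF-like text: one data block, one tag/value pair per line.
SAMPLE = """\
data_example
_cell_length_a 5.431
_cell_length_b 5.431
_cell_angle_alpha 90.0
_local_note untested
"""


def parse_blocks(text):
    """Split simplified CIF-style text into {block_name: {tag: value}}."""
    blocks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            current = blocks.setdefault(line[len("data_"):], {})
        elif line.startswith("_") and current is not None:
            tag, _, value = line.partition(" ")
            current[tag] = value.strip()
    return blocks


def interpret(block, dictionary):
    """Attach units to tags the dictionary defines; report the rest."""
    known, unknown = {}, []
    for tag, value in block.items():
        definition = dictionary.get(tag)
        if definition is None:
            unknown.append(tag)
        else:
            known[tag] = (value, definition["units"])
    return known, unknown


blocks = parse_blocks(SAMPLE)
known, unknown = interpret(blocks["example"], CELL_DICTIONARY)
```

Here the tag "_local_note" parses without difficulty but ends up in the unknown list, because nothing in the dictionary says what it means. This is a small-scale version of the point that the file syntax is the easy part and agreed tag definitions are the hard part.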
Peter
Murray-Rust summarized other global activities surrounding the use of
XML in science; see "Markup Languages-How
to Structure Chemistry-Related Documents" for a review of his
work, co-authored with Henry Rzepa. At the meeting, Murray-Rust explained
some of the benefits of using XML-based documents, including the ability
to "validate" documents for correct or complete content, to create better
electronically linked publications, and to significantly simplify information
harvesting from such documents. According to Murray-Rust, for XML to
function effectively for the sciences there needs to be agreement on
the vocabularies or "ontologies" in use. He noted that the W3C expects
that "domains" will create domain-specific tools and protocols for different
subject areas such as chemistry. He also explained how the XML files
differentiate between content, which has often been specified at different
locations. Individual XML files may contain content from different ontologies
such as a structure as defined by Chemical Markup Language (CML), a
spectrum as defined by JCAMP-DX or SPECTROML, and a mathematical relationship
as defined by MathML. This can be regarded as a powerful bonus, but
it again raises the question of how reliably each piece of content is
linked to the context in which it needs to be put. This is currently
leading to situations where "<element> carbon" might need to be handled
differently, such as "<cml:element> carbon". The key is the explanation
in the data dictionary associated with the defined namespace "cml."
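As a rough illustration of how a namespace prefix keeps same-named tags apart, the following Python sketch uses the standard xml.etree.ElementTree module. The namespace URIs are placeholders chosen for this example, and the "element" tag simply mirrors the illustrative tag in the text; it is not taken from the actual CML schema.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed purely for illustration.
CML_NS = "http://example.org/cml"
LAB_NS = "http://example.org/in-house"

# Three tags that all have the local name "element" but different owners.
DOC = f"""\
<report xmlns:cml="{CML_NS}" xmlns:lab="{LAB_NS}">
  <cml:element>carbon</cml:element>
  <lab:element>heating coil</lab:element>
  <element>unqualified</element>
</report>
"""

root = ET.fromstring(DOC)

# ElementTree expands each prefix into "{namespace-uri}localname",
# so the three tags stay distinct after parsing.
by_namespace = {}
for node in root:
    if node.tag.startswith("{"):
        uri, local = node.tag[1:].split("}", 1)
    else:
        uri, local = "", node.tag
    by_namespace.setdefault(uri, []).append((local, node.text))

# Select only the elements that belong to the "cml" namespace.
cml_elements = root.findall("cml:element", {"cml": CML_NS})
```

The parser can tell the three tags apart, but it cannot say what any of them means. That still depends on a published data dictionary for each namespace.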
Namespaces
do not have to be registered and so it is simple for any group or company
to define their own version of "element." For example, although they
could quite correctly claim to be using XML for data storage and transfer,
the files generated would be as limited to their own internal applications
as if they were using 17-bit binary encoded files. One way in which
IUPAC could play a significant role in furthering XML for chemistry
explained Murray-Rust is by ensuring that dictionaries are future safe
and don't vanish from the Internet when a particular professor retires
or a software or publishing house is bought out or goes bankrupt.
Jonathan
Goodman, of the Unilever Centre, presented an amusing view from an academic
and educational standpoint; see "How Well
Are We Using XML in Chemistry?" His group has developed several
databases that could lend themselves to being made available in an XML
format. But, Goodman asked, what would be the immediate benefit? Quite
simply, there would be none, he stated. Should IUPAC take a clear lead
in laying down guidelines on the presentation of chemical information
in XML, then it would be worthwhile to take this additional step, as
other chemists and projects would then be able to access and use the
information more easily.
To conclude,
Goodman said "there is a long way to go before XML is used routinely
to improve and enhance chemical communication. However, XML friendly
structures are already in place, and this should mean that a lot of
data can easily be moved to this marked-up language. If an XML-based
standard is accepted, then this process could be very rapid and data
could be shared and reused much more easily than is now possible."
This supported
the views of McMahon, who had commented that to generate an XML file
from CIF would be a simple enough task, but questioned whether this
would be "good" XML and "fit for purpose." Goodman and McMahon agreed
that IUPAC needed to identify the customers who would benefit from XML
projects. This includes clearly identifying stakeholders who will make
the effort to implement whatever is developed.
Other
presentations dealt with XML from various information providers' standpoints.
Bill Town from ChemWeb and Sandy Lawson from MDL Information Systems
pointed out the difficulties in achieving the uptake of technical developments
in large organizations. Efforts have been made across the publishing
industry to establish electronic submission and presentation of published
papers, but authors still are unhappy about changing their habits. A
general discussion was also held on the lack of decent authoring tools.
Kirk Schwall
summarized the views of the Chemical Abstracts Service (CAS). According
to Schwall, CAS has a collection of highly integrated data that have
been organized using SGML since 1994. Since 1997, XML has been used
for some data that have required frequent updating and interchangeability.
Both the document and authority data collection concepts at CAS have
XML as an element of their design. The vast complexity of their operation
meant that they were forced to handle just about every possible mode of
information delivery, with only a small minority of their information
suppliers delivering content in an XML format. Even when XML is available
it is not used, as the tags are stripped before being regenerated at
the end of the document handling process. CAS does have an extensive
thesaurus, but this is not publicly available. It was agreed that there
is a need for CAS and IUPAC to discuss common ontologies.
Gary Mallard
from the U.S. National Institute of Standards and Technology (NIST)
summarized XML activities within that organization. According to Mallard,
NIST uses XML for standardizing the delivery of the following types of
scientific information: numerical data, exchange of instrument/reference data,
materials property, and reactions design. The wide range of experience
gained by NIST in different fields of scientific information delivery
has placed it in a unique position to advise on the strengths and weaknesses
of XML in chemistry. Quite often difficulties have arisen over rather
banal problems such as unit names not being standardized internationally
(e.g., meter vs metre vs mètre), symbols requiring special fonts
and characters (e.g., unit °C, prefix µ,
and quantity Vemf) or cases in which symbols are not available (or are
not standardized internationally) for all units or quantities. Mallard
was, however, quick to point out some of the drawbacks of XML. He highlighted
the problems associated with files that are essentially uninterpretable
if the explanations of the individual labels used are not open and freely
available. Mallard recounted that he had prepared a nice presentation
of the various XML efforts underway, only to find that several of the
reference Web sites essential for the understanding
of the ontologies no longer existed.
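The unit-name problem Mallard mentions (meter vs metre vs mètre) is the kind of thing an agreed dictionary settles once for everyone. The Python sketch below shows one way a lookup of this sort might work. The table UNIT_SPELLINGS and the function canonical_unit are hypothetical names invented for this example, not part of any NIST or IUPAC resource.

```python
import unicodedata

# Hypothetical agreed dictionary mapping spelling variants to one symbol.
# "m\u00e8tre" is the French spelling with a precomposed e-grave.
UNIT_SPELLINGS = {
    "meter": "m",
    "metre": "m",
    "m\u00e8tre": "m",
    "liter": "L",
    "litre": "L",
}


def canonical_unit(name):
    """Return the agreed symbol for a unit name, or None if it is unknown."""
    # NFC normalization makes "e" + combining grave accent compare equal
    # to the precomposed character; casefold ignores capitalization.
    key = unicodedata.normalize("NFC", name).strip().casefold()
    return UNIT_SPELLINGS.get(key)


symbols = [canonical_unit(n) for n in ("Meter", "metre", "me\u0300tre")]
```

All three spellings resolve to the same symbol, while a name missing from the table returns None instead of a guess. Such a table is only useful if it stays published and maintained, which is the role the meeting discussed for IUPAC.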
Some
of the attendees at the IUPAC Strategic Meeting on XML in Chemistry (from
left to right): Robert Lancashire, Bill Town, Jonathan Goodman,
Sandy Lawson, Peter Murray-Rust, Kirk Schwall, Brian McMahon,
Alan McNaught, Gary Mallard, Steve Stein, David Moore, Steve Heller,
Bobby Glen, Kirill Degtyarenko, Richard Cammack, Peter Lampen,
and Tony Davies.
A Project for IUPAC
At the
conclusion of this very successful meeting, Steve Stein of NIST was
appointed to draft a project proposal to IUPAC on "Standard XML Data
Dictionaries for Chemistry." In addition, a group of volunteers was
established for a task group to support this project. The group plans
to give a presentation at the coming CAS/IUPAC Conference on Chemical
Identifiers and XML for Chemistry to be held in Columbus, Ohio, on 1 July
2002 [5].
The Future
The future
is always difficult to predict and those who are brave or foolish enough
to attempt it are usually proved wrong, often before their predictions
go into print. However, I would like to put one point at the end of
this summary: IUPAC is in an excellent position to provide a vital service
to the scientific community by assisting in the development of information
technology in chemistry and associated sciences. This is probably a
unique situation in the history of IUPAC because those championing this
work clearly understand the need to work fast, but also the inherent
limitations of working within an IUPAC framework, as shown by the dos
and don'ts list from the Brisbane meeting. I wish them all the best
and hope to see all of you at the IUPAC/CAS conference in July.
Acknowledgements
I would
like to thank Ian Michael for permission to use my original column
published in Spectroscopy Europe [6] as the basis for this extended
report, and Henry Rzepa, Peter Murray-Rust, Jonathan Goodman, Brian
McMahon, Gary Mallard, and Kirk Schwall for their contributions. Also,
I would like to thank Bobby Glen for hosting the conference and all
those who attended the meeting, whether it was just to learn and report
back to their IUPAC bodies or whether it was to assist with the drive
for standardization of scientific IT. It is a hard road we tread and
one with few rewards. After all, no one ever won a Nobel Prize for enabling
communication among scientists!
References
1. www.w3.org/XML/
2. www.iupac.org/news/archives/2001/41_council_minutes.pdf
3. www.iupac.org/publications/ci/2001/september/CI0109.pdf
4. www.iupac.org/publications/books/seriestitles/nomenclature.html
5. www.iupac.org/symposia/conferences/CIandXML_jul02/
6. A.N. Davies, XML in Chemistry, Spectroscopy Europe, 14(1), 2002,
22-24 <www.spectroscopyeurope.com/td_col.html>
Antony
N. Davies <[email protected]>
is secretary of the IUPAC Committee on Printed and Electronic Publications,
chairman of the IUPAC Working Party on Spectroscopic Data Standards,
and JCAMP-DX external professor, University of Glamorgan, Wales, United
Kingdom.
<www.iupac.org/standing/cpep.html>