Key-Sun Choi
Korea Advanced Institute of Science and Technology
1 BUSINESS ENVIRONMENT OF ISO/TC 37/SC 4
1.1 Description of the Business Environment.
Language resources consist of linguistic data, together
with their formats, covering all aspects of human language
(e.g., speech data, written (full) text corpora, general
language lexical corpora). Text corpora, lexicons, grammars,
and terminologies are typical types of language resources
used in language and knowledge engineering. Wherever
and whenever information and knowledge content is being
- prepared (e.g., in research and development),
- used (e.g., in texts or in data fields),
- recorded and processed (e.g., in databases),
- represented (in the form of language texts or data),
- passed on (e.g., via training and teaching),
- transformed (e.g., in the course of re-use),
- implemented and transferred (e.g., in knowledge and
  technology transfer), and
- translated and interpreted (e.g., in localization for
  the global market)
in both mono-lingual and multi-lingual environments,
language resources play a crucial role in the preparation,
processing and management of information and knowledge
by humans and computers. Therefore, one can rightfully
say, “There is no information, communication and knowledge
processing without language resources”. But in
order to be prepared, recorded, processed, distributed
and applied efficiently and effectively, language
resources need methodology (and methodology standards)
and software tools (and the respective standards for
mark-up, interchange, evaluation, etc.).
Relevant research areas include computational linguistics,
computational lexicography and language engineering.
These fields have produced industrial or de-facto standards
that await adoption as official standards, which in turn
would help develop the language industries at large.
Language engineering and computational linguistics
provide the methodology for the preparation, recording,
processing and re-use of language resources.
Computerized lexicography supplies the tools for the
efficient preparation and processing of dictionary data.
Language engineering and natural language processing
provide the tools to represent, manage and access
knowledge represented by linguistic data of different
degrees of complexity. Language resource management
cannot be efficient without a strong language
engineering component (comprising language data, methods
and tools).
In the constantly accelerating development of the global
multilingual information society, characterized by the
all-pervading influence of information and communication
technology (ICT), language resource management, including
the respective data, methods and tools, is becoming more
and more important not only in linguistics and language
engineering themselves, but even more so in many
fields of application, whether or not integrated into
larger systems. The emerging knowledge and content
industries will rely strongly on language resources,
methods and tools.
In addition, more and more experts are becoming active in
language engineering (or human language technology).
Every year new language communities become
interested or involved in language resource management
activities. New types of language (in terms of
special-purpose languages, language register, kinds of
texts, etc.) are increasingly needed on the market to
provide the ‘raw material’ for all kinds of
consultancy services, training, and enhanced language
products (e.g., word processors, speech recognition,
machine translation, internet information retrieval,
knowledge management, etc.). All information and
knowledge management applications need both
terminological data and general language resources with
their methods and tools.
1.2 Quantitative Indicators of the Business Environment.
The following list of quantitative indicators describes the
business environment in order to provide adequate
information to support actions of ISO/TC 37/SC 4:
All over the world, linguistic infrastructures are being
established or reinforced as part of the rapid
evolution of the information and communication society.
Globalization and the other side of the coin,
localization (not to mention personalization,
customization, etc.), require multilingual communication.
The ubiquity of the Internet requires computational
standards for language processing. Computer-assisted
language learning at all educational levels and the
increased demand for access to language resources of all
kinds require standards for language resources that are
accepted both by commercial software companies and by open
source developers. International standards are the
prerequisite for meeting these new requirements concerning
the reuse, interoperability and usability of data and the
respective systems (or system components).
Activities by experts related to language resource
sharing and standardization are increasing in
- intergovernmental organizations (IGOs) and international
  non-governmental organizations (NGOs, e.g., Universal
  Networking Language of UN University, ELRA/ELDA),
- regional associations and their international federations
  (e.g., EAGLES for EU, ISLE for EU and USA, Asian
  Federation of Natural Language Processing),
- national NGOs and non-profit organizations (NPOs),
- public institutions and organizations,
- standards bodies,
- educational and training institutions/organizations,
- international activities for web documents (e.g., SemWeb
  is a forum for semantic linking of chunks inside web
  documents at the European Conference on Digital
  Libraries, and NKOS stands for Networked Knowledge
  Organization Systems),
- commercial enterprises, etc.
New language and knowledge engineering tools assist all
products of knowledge management and information
management. A variety of value-added information
products and services are conceived on the basis of
language resources, as well as the respective methods
and tools.
Each regional language engineering experts’ group or
association has introduced language resources for
distribution to users, institutions and companies
without standardization of language resource formats.
There is an increasing need for new standardization as
well as for rapid recognition of existing de-facto
standards and their transformation into International
Standards. ISO/TC 37/SC 4, therefore, has a broad range
of potential standardizing activities, which are pivotal
to the further development of the language, content and
knowledge industries.
2 BENEFITS EXPECTED FROM THE WORK OF ISO/TC 37/SC 4.
ISO/TC 37/SC4 sees to it that new developments in
language engineering (or human language technology),
knowledge management and information engineering are
followed in international standardization work.
3 REPRESENTATION AND PARTICIPATION IN ISO/TC 37/SC 4
P-members and O-members of ISO/TC 37 are herewith called upon to
register their interest in the new ISO/TC 37/SC 4. This
indication of interest will be forwarded together with
the decision of ISO/TC 37 to establish SC 4 to ISO/CS
for circulation to all ISO members in order to call for
the nomination of experts to participate in the
standardizing activities of ISO/TC 37/SC 4.
The national standardization body of the Republic of Korea,
KATS (Korean Agency for Technology and Standards), will
support the secretariat operation for ISO/TC 37/SC 4.
Many high-ranking international organizations (e.g.,
EAGLES, ISLE, Korean KIBS, Japanese GSK, etc.) will be
in liaison with ISO/TC 37/SC4, because most of them have
extensive language resource activities (for operational
purposes, as they all are operating on a multilingual
basis and for their very mission, which necessitates
language engineering work). Most of the international
and regional organizations focussing on language
resources proper will be in liaison with ISO/TC 37/SC4.
4 OBJECTIVES OF ISO/TC 37/SC 4 AND THE STRATEGIES FOR THEIR ACHIEVEMENT.
4.1 Defined objectives of ISO/TC 37/SC 4.
The objective of ISO/TC 37/SC4 is to prepare standards
specifying principles and methods for creating, coding,
processing and managing language resources, such as
written corpora, lexical corpora, speech corpora, etc.
Standards produced by ISO/TC 37/SC4 particularly address
the needs of industry, international trade and global
economy regarding cross-lingual information retrieval,
multi-lingual knowledge management and human language
communication. Its technical work results in
International Standards (and Technical Reports) covering
language resource management principles and methods, as
well as various aspects of computer-assisted
lexicography and language engineering, not to mention
their use in a broad array of applications.
The objective of ISO/TC 37/SC 4 would be to develop
standards containing specifications for
computer-assisted language resource management, focusing
on data modeling, mark-up, data exchange, and evaluation
of language resources (other than terminologies).
4.2 Identified strategies to achieve ISO/TC 37/SC 4’s defined objectives.
The standardization of principles and methods for the
collection, processing and presentation of language
resources is a distinct type of standardization
activity. Its results are basic standards that have a
wide-ranging application.
The points of reference for ISO/TC 37/SC4 standards include
the EAGLES documents EAG-CSG/IR-T1.1 “Recommendations on
Corpus Typology”, EAG-TCWG-TTYP/P “Recommendations
on Text Typology”, and EAG-TCWG-CES/R-F “Corpus
Encoding”. ISO 1087 may be extended by a part 4 to
cover the terminology related to SC 4.
VISION:
ISO/TC 37/SC4 shall prepare the International Standards to
support language resource management (knowledge
management, translation management, language resource
aspects within global content management, etc.) in the
multilingual information society.
World-wide use of ISO/TC 37/SC 4 standards will help to:
- enhance the overall quality of language resources in
  all aspects of human language communication;
- improve information management within various industrial,
  technical and scientific environments and reduce its
  costs;
- increase efficiency in computer-supported language
  communication.
ORGANIZATION OF WORKING GROUPS:
WG 1 Terminology (of SC 4)
Project: either a part 4 of ISO 1087 (with extended scope) or a
terminology standard with a new number containing the
key concepts of the work of SC 4
WG 2 Data modeling and mark-up methods
Projects: existing and future EAGLES and ISLE standard documents
WG 3 Data exchange
Projects: e.g., OLIF
WG 4 Evaluation of language resources and language
resource management systems
See EAGLES and ISLE documents, Speechdat, etc.
5 FACTORS AFFECTING IMPLEMENTATION OF THE ISO/TC 37/SC 4 WORK PROGRAMME.
Recent efforts to integrate European and US work
on language resources have been carried out through
ISLE (International Standards for Language Engineering),
which was preceded by the EAGLES project in Europe. ISO
standards for language resource management must be extended to
- include other languages (e.g., Asian languages) that are
  not covered by the regime of ISLE;
- prepare ISO standards from the de-facto standards that
  have already been established in EAGLES;
- promote the language resource standards to industries,
  research institutes and academic societies for efficient
  and effective sharing of very large and highly
  reliable language resources;
- provide the evaluation methods and tools for human
  language technologies based on language resource
  management.
The new ISO/TC 37/SC 4 will promote new policies, research and
projects to develop language resources on the basis of
its standards in all languages and in all countries
where there is a need for them.