Editor: bingyan8 | 2013-01-18
1 Purpose
This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format.
1.1 Encoding Scheme vs. Repertoire
Characters are represented in computer memory by codes. The methodology by which these codes are assigned and organized is called an encoding scheme. The list of all characters so encoded is referred to as the repertoire of that encoding scheme. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. A, B, and C are three characters in the repertoire of this encoding scheme. ASCII assigns these three characters the encodings 41, 42, and 43 (expressed here in hexadecimal).
1.2 MARC8
MARC8 is the term commonly used to refer both to the encoding scheme and to its repertoire as used in MARC records up to 1998. The '8' refers to the fact that, unlike Unicode, which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one-byte tables in which each character is encoded using a single 8-bit byte. (It also includes the EACC set, which uses a fixed length of 3 bytes per character.) For details on MARC8 and its specifications see: http://www.loc.gov/marc/.
MARC8 was introduced around
1968 and was initially limited essentially to Latin script. It was gradually expanded and today includes the following scripts: Arabic, Chinese, Cyrillic, Greek, Japanese, Korean, and Latin. The vast majority of bibliographic records in North America and in many other locations around the world are exchanged using MARC8 encoding.
Very little expansion has been made to the MARC8 repertoire in recent years, as a decision was made in the early 1990s to look toward Unicode for additional characters rather than continue the arduous task of expanding MARC8. (The term 'arduous' is deliberate and refers not so much to the difficulty of making additions and documenting them as to the labor that every computer systems vendor using MARC records must expend to modify their systems so that new characters are recognized and supported.) As users and systems have proliferated, change has become more costly to the community at large.
Although MARC8 is based on ASCII, the parts of its repertoire and encodings that lie outside ASCII are unique to the world of libraries and library records and have been little used outside this domain, which causes support challenges. The adoption of Unicode therefore also has the attraction of bringing the library world more into line with mainstream computer developments such as the Internet.
1.3 Unicode
The Unicode encoding scheme and its repertoire have been in development for a little over
10 years now. The intent of Unicode is to provide a single encoding scheme capable of handling all the world's languages. Although its adoption has not been as quick as was initially hoped or predicted, Unicode is now the underlying encoding used in many major software development efforts, including all recent Microsoft products and the Java programming language.
In 1998 it was agreed [MARBI Proposal 98-18] that MARC21 libraries participating in data interchange could begin using the Unicode encoding scheme as an alternative to the MARC8 encoding scheme. However, it was also agreed that the character set repertoire in current use was not to be expanded. The prohibition on repertoire expansion was considered necessary because of concerns over record exchange among systems, a vital element in the world of library information processing. This MARBI decision meant that MARC21 records could be encoded in either MARC8 or Unicode, but that only the MARC8 repertoire of characters was allowed in either case. The MARC8 repertoire comprises all of the characters in the MARC8 encoding scheme, but only a small fraction of the characters in the Unicode encoding scheme.
1.4 Issues
There is some urgency in reaching agreement on the issues raised in this report, since some local library systems are already running on Unicode and several more Unicode-based systems are in development. Although users of these systems can be encouraged to stay within the current MARC8 repertoire, the systems require specialized software filters designed to ensure that no characters outside the current MARC8 repertoire enter the system. If such filters are not provided, characters outside the current MARC8 repertoire will begin to appear on such systems, find their way into distribution channels, and subsequently appear in records destined to be loaded on non-Unicode systems.
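The kind of repertoire filter described above can be illustrated with a short Python sketch. It is not a real MARC8 filter: the names PLACEHOLDER_REPERTOIRE, find_outside_repertoire, and is_within_repertoire are hypothetical, and the placeholder repertoire contains only printable ASCII, chosen because MARC8 is based on ASCII. A working filter would instead load the full repertoire from the published MARC8 code tables.

```python
# ASSUMPTION: a placeholder stand-in for the MARC8 repertoire.
# The real repertoire is far larger (Arabic, Chinese, Cyrillic, Greek, ...);
# printable ASCII is used here only to keep the sketch self-contained.
PLACEHOLDER_REPERTOIRE = frozenset(chr(cp) for cp in range(0x20, 0x7F))


def find_outside_repertoire(text, repertoire=PLACEHOLDER_REPERTOIRE):
    """Return (position, character, code point) for each character
    in `text` that is not a member of `repertoire`."""
    return [
        (index, char, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if char not in repertoire
    ]


def is_within_repertoire(text, repertoire=PLACEHOLDER_REPERTOIRE):
    """True if every character of `text` belongs to `repertoire`."""
    return not find_outside_repertoire(text, repertoire)


if __name__ == "__main__":
    # Hypothetical field value containing a character outside the placeholder set.
    field_value = "Price: 10 \u20ac"
    for position, char, code_point in find_outside_repertoire(field_value):
        print(f"position {position}: {char!r} ({code_point}) is outside the repertoire")
```

A Unicode-based system could run a check of this kind when a record is entered or exported, and reject or flag the record rather than let the character reach a non-Unicode system. Reporting the position and code point, rather than silently dropping the character, leaves the decision to a cataloguer.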