Edited by: bingyan8 | 2013-01-18 |
Canadian Aboriginal Syllabics. MARBI proposal 2002-11 allowed the addition of Canadian Aboriginal Syllabics to the MARC21 repertoire, but only in Unicode encoding. This obviates the need to expand the current MARC8 repertoire, while recognizing that data is lost if a conversion to MARC8 is needed for an interchange situation.

Communication Format vs Internal Processing. In considering the various issues raised in this report, systems engineers need to decide what is a matter of internal handling within the system and what is a matter for inclusion within records being issued in MARC communication format. The stability of the repertoire and encodings in MARC8 has contributed greatly to record sharing, cooperative projects, and vendor system development, all of which have brought cost savings to libraries. Therefore the exchange environment is the primary issue, rather than the internal system, although the two can be more cost effective when compatible.

1.5 Basic Multilingual Plane (BMP)

The present study is limited to the Basic Multilingual Plane of Unicode.

The Basic Multilingual Plane (BMP, or Plane 0) contains all the common-use characters for all the modern scripts of the world, as well as many historical and rare characters. By far the majority of all Unicode characters for almost all textual data can be found in the BMP. (The Unicode Consortium. The Unicode standard, version 4.0. Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1, page 35)

It should be noted that access to characters beyond the BMP requires special techniques as specified in the Unicode documentation. System software must be aware of such techniques, and library automation systems must take them into account in their design, for characters beyond the BMP to be available to the applications running on them.
While it is very likely that most library systems would need to access characters beyond the BMP only very rarely, it should be noted that there are already a couple of Han (Chinese) characters in the current MARC8 repertoire which fall outside the BMP in Unicode. The Unicode standard, version 4.0, has this to say about Plane 2, where these Han characters have been placed: "…the vast majority of Plane 2 characters are extremely rare or of historical interest only."
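
One common instance of the "special techniques" needed beyond the BMP is the surrogate pair used by UTF-16. The following is a minimal sketch, assuming UTF-16 is the encoding form in use: a character outside the BMP occupies two 16-bit code units, so software that assumes one code unit per character will mishandle it. The helper names are hypothetical, and U+20000 is used only as an arbitrary Plane 2 code point, not as one of the MARC8 Han characters mentioned above.

```python
def plane_of(ch):
    """Return the Unicode plane number of a single character (0 = BMP)."""
    return ord(ch) >> 16


def surrogate_pair(cp):
    """Compute the UTF-16 surrogate pair for a code point beyond the BMP."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_code_units(ch):
    """Return the UTF-16 code units Python's own codec produces for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


han_bmp = "\u4e2d"          # U+4E2D, a Han character inside the BMP
han_plane2 = "\U00020000"   # U+20000, an illustrative Plane 2 code point

for ch in (han_bmp, han_plane2):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(ch))
    print(f"U+{ord(ch):04X} plane={plane_of(ch)} utf16=[{units}]")
```

The hand-computed pair from `surrogate_pair` can be compared against the codec output from `utf16_code_units` as a sanity check; a system that stores text as fixed 16-bit units would need equivalent logic wherever it counts, truncates, or indexes characters.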
2 New Scripts and New Characters in Existing Scripts

The move from the current MARC8 repertoire to a full Unicode repertoire represents an enormous increase in the number of characters to be handled. The current MARC8 repertoire includes about 17,000 characters. Unicode 4.0 includes 236,029 code points, of which 50,635 are for graphic characters in the Basic Multilingual Plane. (Version 4.0 of Unicode was released in the fall of 2003.)

There are two areas of concern. One is new characters not in the MARC8 repertoire. The other is characters that are in the repertoire but have alternative encodings in Unicode. This section focuses on the former.
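
The first concern, characters that are not in the target repertoire, can be illustrated with a small sketch. It assumes the repertoire is available as a set of characters; the set used here is a toy stand-in (printable ASCII only) and is not the MARC8 repertoire, and the function name is hypothetical. The sample string uses a few code points from the Unified Canadian Aboriginal Syllabics block (U+1400 to U+167F).

```python
def outside_repertoire(text, repertoire):
    """Return the distinct characters of `text` that are not in `repertoire`,
    in the order they are first seen."""
    missing = []
    for ch in text:
        if ch not in repertoire and ch not in missing:
            missing.append(ch)
    return missing


# Toy stand-in repertoire: printable ASCII only. NOT the real MARC8 repertoire.
TOY_REPERTOIRE = frozenset(chr(cp) for cp in range(0x20, 0x7F))

sample = "Title \u1403\u14c4\u1483"
missing = outside_repertoire(sample, TOY_REPERTOIRE)

for ch in missing:
    print(f"U+{ord(ch):04X} is not in the target repertoire")
```

A check of this kind could be run before converting a record to a smaller repertoire, so that characters with no equivalent are reported rather than silently dropped.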
The Unicode encoding 'code space' is divided into 'blocks', or ranges of code points. Appendix B: Unicode 4.0 Blocks and MARC8 Encoding provides a list of all Unicode blocks and shows which blocks have partial coverage in the MARC8 repertoire and which are entirely new. Appendix A: Unicode Scripts provides a similar listing but is ordered by script and shows the number of characters in Unicode that are in addition to those found in the MARC8 repertoire. A further set of tables giving specific code points in each Unicode block that are new to the MARC8 repertoire has been prepared as an ancillary part of this report.

2.1 Moving from Full Unicode Records to MARC8

The issue that raises the most concern in repertoire expansion is the complex of consequences encountered when records move from a system with a large repertoire to a system that has provision only for a much smaller repertoire. Character set is very fundamental and very unforgiving: if a system does not know about a character code or, worse, has a different character already assigned to that code, then data corrup........