Emulation of Adobe CID Resources by CJK TrueType Fonts

Academic year: 2022

Practical Mappings from Adobe CIDs to TrueType GlyphIDs

Suzuki Toshiya*, Taiji Yamada**, Masatake Yamato***, Hideyuki Suzuki****

* Information Media Center, Hiroshima Univ., Kagamiyama 1-4-2, Higashi-Hiroshima-shi, Hiroshima, 739–8511 Japan
** AIHARA Electrical Engineering Co., Ltd., Kanto regional office, Hama-mati 2-16-8, Funabashi-shi, Chiba, 273–0012 Japan
*** RedHat K.K., Shin-Roppongi-Build. 6F, Roppongi 7-15-7, Minato-ku, Tokyo, 106–0032 Japan
**** Institute of Industrial Science, Univ. of Tokyo, Komaba 4-6-1, Meguro-ku, Tokyo, 153–8505 Japan

ABSTRACT. Due to the totally different glyph management systems between Adobe CID and TrueType, there is an ambiguity when substituting given Adobe CIDs with TrueType glyphs. Our goal is to obtain a portable CJK PS font emulator by a TrueType font, and for that we will discuss several mapping algorithms as well as the statistics of successful/unsuccessful substitution for each of the scripts: Traditional Chinese (Adobe-CNS1), Simplified Chinese (Adobe-GB1), Japanese (Adobe-Japan1) and Korean (Adobe-Korea1). As a practical experiment, we implement the CJK PS font emulator running in the userspace of the PS rasterizer and evaluate the efficiency of our algorithm in comparison with the existing implementation in Ghostscript.

RÉSUMÉ. De par la différence flagrante entre Adobe CID et TrueType en tant que systèmes de gestion de glyphes, il y a une ambiguïté lorsqu’on substitue les CIDs par des glyphes TrueType. Notre but étant d’obtenir un émulateur CJK PS portable nous allons discuter un certain nombre d’algorithmes de correspondance ainsi que les statistiques de bonnes ou mauvaises substitutions pour chacune des écritures : chinois traditionnel, chinois simplifié, japonais et coréen. En guise d’essai pratique nous implémentons un émulateur de fonte CJK PS dans l’espace d’utilisateur du moteur de rendu PostScript et nous évaluons l’efficacité de notre algorithme dans le cadre de l’implémentation actuelle de GhostScript.

KEYWORDS: PostScript, TrueType, OpenType, CID, CJK, CJKV, Ghostscript

MOTS-CLÉS : PostScript, TrueType, OpenType, CID, CJK, CJKV, Ghostscript

1. Introduction

Adobe’s PostScript (PS) (Adobe Systems Inc., 1999) and its successor PDF (Adobe Systems Inc., 2004) have been the most important page description languages (PDLs) for vector output devices and have been adopted as the basis of several ISO standards (ISO/IEC JTC1/SC34, 1995; ISO TC130, 1993; ISO TC171/SC2, 2005), because of their seamless support from low- to high-quality printing devices. Although PS has its native font formats (Adobe Systems Inc., 1991; Adobe Systems Inc., 2000), today the competing font technology TrueType (Apple Computer, 1996; Weise et al., 1992) is widely used. In this paper, we discuss a practical algorithm for emulating PS font resources by TrueType fonts, in order to implement a portable emulator running in PS userspace. In commercial software, the use of PS fonts has been very limited.

In open source software (OSS), the utilization of PS fonts is almost negligible. The libraries for font and text handling in OSS desktop environments, Xft (Packerd, 2001), fontconfig (Packerd, 2002), and cairo (Worth et al., 2003), do not support native PS fonts, although their underlying library FreeType2 (Turner, 2000) supports them. Although these libraries are designed to classify characters by Unicode codepoint at a font-independent level, native PS fonts for CJK provide no Unicode interface.

1.1. Text Rendering and Font Files

A font is the mechanism to retrieve the concrete image of a character by an abstract specification of it. The codepoint in a character encoding system is the most abstract specification of the character.

First, we summarize the mechanisms used in Roman fonts in figure 1. As an example of a legacy font format, we take the MacOS NFNT bitmap font resource (the same mechanism is used in most legacy bitmap fonts). In such legacy fonts, there are only two levels of character specification: a codepoint in the character encoding and the graphic data for the codepoint. For most Roman scripts, the codepoint is already sufficient to identify the image of a character in a specific typeface. There are typographic ligatures that render a few characters as single connected images (like ‘fi’, ‘fl’), but rendering without ligatures is not regarded as an unrecoverable error. In TrueType fonts, Apple introduced a new abstraction layer called “glyph” to separate the character codepoint from the graphic data. Here, glyph is a term denoting the graphical shape of the character. To use a single font via multiple character encodings like the ISO 8859-* family, the glyph index is introduced to specify the concrete graphic data regardless of character encoding. The glyph index is just a sequential number of the graphic data as they are stored in the font file. To obtain a glyph index from a given character codepoint, the cmap (character-mapping) table of the font is used. In a single TrueType font, cmap tables are available for each character encoding (e.g. Roman, Symbol, Unicode, etc.). The text renderer selects the suitable cmap for its text encoding, and converts a character codepoint to the corresponding glyph index. The graphic data for all characters are stored contiguously in the glyf table. To access the graphic data for a given glyph index, the offset in the glyf table is obtained from the loca (location) table. Although NFNT and most legacy bitmap fonts can handle only fixed-width character encodings (8-bit only or 16-bit only), the cmap table can handle multibyte encodings mixing 8- and 16-bit codepoints. The mechanism returns a single graphic data set for each codepoint. Let us note that the public interface is simply a character codepoint used as a key of the cmap table, the glyph index and graphic data being used only internally. For example, the text rendering system cannot assume that glyph index #1 corresponds to the first US-ASCII character ‘!’ in all TrueType fonts. In a Roman PostScript font, the public interface is defined at a lower level than in TrueType fonts.

[Figure 1: diagram of three lookup chains. Very legacy bitmap font: codepoint → ‘NFNT’ table → bitmap data. TrueType font: 16- or 8-bit codepoint → ‘cmap’ table → 16-bit glyph index → ‘loca’ table → byte offset in the ‘glyf’ table → ‘glyf’ table → TrueType graphic instructions. Roman PS font: codepoint → /Encoding dict → PostScript character name → /CharStrings dict → PostScript graphic instructions. The diagram marks the non-public, font-dependent part of each chain.]

Figure 1. Procedure to retrieve graphic data for given character codepoints, in Roman script fonts.
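The TrueType lookup chain described above (codepoint, then cmap, then loca, then glyf) can be sketched with toy tables. This is only a minimal illustration, not a font parser: the table contents and the names CMAP_TABLES, LOCA, GLYF and glyph_data are assumptions made for the example, as is the choice to fall back to glyph index 0 for unmapped codepoints.

```python
# Toy model of the TrueType lookup chain:
# codepoint -> glyph index (cmap) -> byte offset (loca) -> graphic data (glyf).
# All table contents are made up for illustration.

# One cmap subtable per character encoding; keys are codepoints,
# values are font-internal glyph indices.
CMAP_TABLES = {
    "unicode": {0x0041: 2, 0x0042: 3},
    "mac_roman": {0x41: 2, 0x42: 3},
}

# loca: entry i is the start offset of glyph i in glyf,
# entry i + 1 is its end offset (glyph 1 is empty here).
LOCA = [0, 4, 4, 10, 15]

# glyf: graphic data of all glyphs stored back to back (placeholder bytes).
GLYF = b"NDEF" + b"AAAAAA" + b"BBBBB"


def glyph_data(encoding: str, codepoint: int) -> bytes:
    """Return the raw graphic data for a codepoint in the given encoding.

    Unmapped codepoints fall back to glyph index 0 (an assumption of
    this sketch, mirroring the usual "missing glyph" slot).
    """
    cmap = CMAP_TABLES[encoding]          # renderer picks the cmap for its encoding
    glyph_index = cmap.get(codepoint, 0)  # public codepoint -> internal glyph index
    start, end = LOCA[glyph_index], LOCA[glyph_index + 1]
    return GLYF[start:end]


if __name__ == "__main__":
    print(glyph_data("unicode", 0x0041))
    print(glyph_data("mac_roman", 0x42))
```

Only the codepoint and the encoding name appear in the call; the glyph index never leaves glyph_data, which mirrors the point that it is not part of the public interface.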

In contrast to TrueType fonts, Roman PostScript fonts use an abstraction layer called “PostScript character name” between character codepoints and graphic data. The remarkable difference from TrueType fonts is that the PostScript character name is a public interface. Indeed, the operation to render an encoded text is defined in the PostScript programming language, as well as the operation to render a glyph by its name. To provide compatibility among font files, the matching between a “PostScript character name” and its graphic data is quite important.

[Figure 2: diagram for the sample codepoint U+623F. Layers shown: Language (Simplified Chinese, Traditional Chinese, Japanese, Korean, Vietnamese); Layout/Glyph Shaping, split into Writing System (horizontal, vertical, proportional, full-width, half-width, quarter-width, subscript) and Ideograph Variant (JIS 78, JIS 83, JIS 90, JIS 00, JIS 04 and traditional forms); Family (Mincho (Serif), Gothic (Sans Serif)); and fonts (e.g. KozuMin-W4, HeiseiMin-W5, WadaMin-Bold). Annotations: “Unicode can identify”, “Language can specify”, “Printing system must identify”, “Family can specify”, “Font can specify”.]

Figure 2. The glyph identification for a given character codepoint, in the case of CJK scripts.

For non-Roman scripts, even if the character encoding is specified, the character codepoint is often insufficient to identify the graphical shape of the character. The method to obtain a glyph shape from a Unicode codepoint is shown in figure 2. In the case of Hanzi-derived CJK scripts, Unicode has adopted Han unification, which ignores the detailed shape of the characters (The Unicode Consortium, 2003, p. 299), and hence a Unicode codepoint cannot specify the precise glyph of a given character. As shown in figure 2, the graphical shape of the same character codepoint can vary among CJK scripts, thus at least the language must be specified. Is the language information sufficient to specify a glyph by a codepoint? The answer is in fact negative, since there are subordinate factors. We call the procedure to determine a glyph for a given codepoint in a given language glyph shaping.

There is a shaping group related to the writing system: CJK scripts were originally designed for vertical writing mode (top to bottom), and rotated shapes of Roman delimiters and punctuation have been developed for the vertical writing mode. We will show a detailed example in Japanese later on. In addition, to harmonize the characters imported from Roman scripts with existing CJK scripts, various typographic styles have been developed, like full-width (Roman characters are designed to fit in a square), half-width (widths of Roman characters are half of their heights), etc. Such writing-system dependent differences of glyphs are independent of font families and typefaces, but they are not regarded as the role of character encodings and should not be part of character-based information interchange.

Another group is related to ideographic variants: the glyph shape of a Hanzi-derived character is important to uniquely identify a noun like a personal name or a place name. Han unification was based on CJK national character encoding standards for information interchange, thus Unicode does not deal with exact glyph shapes of characters. But sometimes specific glyph shapes are required to print diplomas, gazettes and educational publications, and often these specify their own glyph sets, which are mutually incompatible. To support such glyph sets, the text rendering system is required to distinguish the various glyph shapes. From the viewpoint of CJK orthography, the ideographic variants are regarded as more abstract and generic differences than the typographic differences of font families and typefaces. We will show a detailed example in Japanese script later on. However, it is difficult to decide whether a difference between two glyphs is important enough to prevent unification with an existing codepoint. Usually it depends on the sociological situation, even if the scope is restricted to a single country. For example, the governments of Hong Kong, Japan and Taiwan regularly update their character sets to include “existing but unencoded ideographs”, and propose their inclusion into ISO 10646. Often their proposals include ideographs whose shapes are quite similar to those of existing coded ones. The reason why such ideographs are regularly “discovered” is that some glyphs considered as “variants of the same ideograph” (unification) can be reconsidered as “different ideographs” (disunification) at a later stage. However, if different glyphs are considered as ideographic variants of the same character, it is arguable whether a single typeface should support various ideographic variants for the character, or whether multiple typefaces should be produced for ideographic variant families. In any case, the page description language and its rendering system are expected to reflect such glyph shaping to avoid violating the text layout and using wrong glyph shapes.

Finally, it is possible to define a thin abstraction layer to categorize the typeface: Serif, Sans Serif, etc. In comparison with Roman character typography, such typographic styles are a rather recent feature in CJK scripts. Although some font formats or page description languages can include flags to indicate whether a given font is Serif or Sans Serif, the weight of glyphs, etc., most CJK fonts do not use these flags. For example, while TrueType fonts provide such flags in the OS/2 and PCLT tables, many CJK fonts do not set them or set them to wrong values. This kind of abstraction is not widely used in CJK scripts.

The mechanisms applied by OpenType and PS fonts for this purpose are summarized in figure 3. To obtain glyphs specific to vertical or horizontal typesetting, OpenType introduced a conversion table between glyph indices. However, the information about which conversion table should be used cannot be transmitted via coded text, since character encodings are basically defined without writing direction. A new public interface is added to OpenType to transmit the writing direction information to the text rendering system. Without this information, text rendering is done as for usual Roman TrueType fonts, namely always left-to-right. It should be noted that the information passed through this public interface is the character codepoint and abstract information about the writing system. The glyph index conversion table is used internally, and the text renderer cannot know which glyph indices will be converted before parsing the font file.

On the other hand, CJK PS fonts use a different mechanism. Just as Roman PS fonts include the “PostScript character name” in their public interface, CJK PS fonts define the “Adobe Character ID (Adobe CID)” as their public interface. The Adobe CID set is a collection of presentation forms for various characters. Here, we have to distinguish two terms: glyph shape and presentation form. In TrueType terminology, the character codepoint is font independent, but the glyph is completely font dependent. There is no font-independent layer under the character codepoint interface. For example, we cannot assume that the glyph index of some character is the same in Serif and Sans Serif fonts. In Adobe CID terminology, for a given character codepoint, the CID is defined as a font-independent layer. Although it is possible to call Adobe CID a glyph set, we call the abstract characteristics specified by an Adobe CID “the presentation form” to avoid confusion with “glyph” in TrueType and Adobe CID terminology.

[Figure 3: diagram of two lookup chains. CJK OpenType font: codepoint + feature → ‘cmap’ table → 16-bit glyph index → ‘GSUB’ table (glyph index substitution, selected by feature) → ‘loca’ table → byte offset → ‘glyf’ table → TrueType graphic instructions. CJK CID PS font: codepoint → /CMap dict → Adobe Character ID (CID) → /GlyphData dict → PostScript graphic instructions. The diagram marks the glyph shaping step and the non-public, font-dependent part of each chain.]

Figure 3. Procedure to retrieve graphic data for a given character codepoint, in the CJK script fonts.

To provide an interface to render CJK coded text with CJK PS fonts, the CMap file is provided to convert a character codepoint to a CID. This file is usually selected by the suffixes in the font name, and it provides information on the writing system as well. By selecting a suitable CID and CMap pair, the text renderer can deduce the writing system information needed to render the result. In PS and PDF, the document can specify the presentation form explicitly by using explicit CIDs.
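How a CMap carries the writing mode together with the codepoint-to-CID conversion can be sketched as follows. This is a toy model under stated assumptions: the CMap names ("Toy-UCS2-H", "Toy-UCS2-V"), every CID value, and the helper names are hypothetical. The sketch also assumes that a vertical CMap lists only the codepoints whose form changes and defers to its horizontal counterpart for the rest, and that CID 0 stands for an undefined character.

```python
from typing import Dict, Optional, Tuple

# name -> (codepoint-to-CID table, name of the CMap to fall back to).
# All names and CID values are hypothetical.
CMAPS: Dict[str, Tuple[Dict[int, int], Optional[str]]] = {
    "Toy-UCS2-H": ({0x3001: 100, 0x4E00: 200}, None),
    # The vertical CMap overrides only the ideographic comma.
    "Toy-UCS2-V": ({0x3001: 900}, "Toy-UCS2-H"),
}


def cmap_name(base: str, vertical: bool) -> str:
    """Build a CMap name whose suffix encodes the writing mode."""
    return f"{base}-{'V' if vertical else 'H'}"


def lookup_cid(name: str, codepoint: int) -> int:
    """Convert a codepoint to a CID, following the fallback chain."""
    current: Optional[str] = name
    while current is not None:
        table, parent = CMAPS[current]
        if codepoint in table:
            return table[codepoint]
        current = parent
    return 0  # assumption of this sketch: CID 0 means "undefined"


if __name__ == "__main__":
    for vertical in (False, True):
        name = cmap_name("Toy-UCS2", vertical)
        print(name, [lookup_cid(name, cp) for cp in (0x3001, 0x4E00)])
```

The same codepoint yields a different CID depending only on which CMap was selected, so the writing mode never has to be encoded in the text itself.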

These two approaches of OpenType and PS CID fonts show a remarkable contrast. In the approach of OpenType fonts, the text rendering system cannot guarantee whether the requested glyph shaping is supported or not before parsing a font for a given character codepoint and feature combination. In addition, there is no standardized portable PDL or document file format that can embed the layout feature flags of TrueType Open fonts explicitly. For example, Rich Text Format (Microsoft Corporation, 1999) and OpenDocument Format (OASIS, 2005) can store the font names but cannot store the layout features for each layout segment. This is because OpenType was originally designed for language-specific text layout like Arabic and Indic scripts. The text renderer is expected to choose suitable glyph shaping automatically without explicit specification of glyph shaping features. Unicode text layout engines like Uniscribe (Bishop et al., 2000), ICU (I.B.M. Corp., 1995) and Pango (Taylor, 1999) do not support ideographic variant selection (Adobe Systems Inc., 2002), although current OpenType has several registered tags to specify ideographic variants.

In the approach of PS CID fonts, the text renderer can guarantee the presentation form exactly, as long as Adobe-compatible CID fonts are used. The text renderer of PS CID works quite simply; it lines the glyphs up on a baseline. But the production of Adobe-compatible fonts involves additional costs for the detailed checking of presentation forms. Although Adobe announced the migration from PS CID fonts to OpenType fonts, there is no PDL or electronic document file format that guarantees the presentation form by OpenType layout features, and the mechanism which specifies the presentation form by CID remains the only practical framework.

1.2. Problems of Disrupted Font Management

In spite of many TrueType-centric implementations, common printing interfaces are still designed for PS or PDF. In the scope of OSS, the primary reason is that PS is the only portable graphic data format in the UNIX environment. As a result, today PS renderers (e.g. Ghostscript (GS) (Artifex Software, 1988)) are the only OSS applications that prefer PS fonts to TrueType ones. To avoid the problems caused by disrupted font management between TrueType-centric software and PS renderers, the latter are required to support TrueType fonts transparently. GS already supports various bitmap font types, but very few outline font formats. A remarkable extension is the VFlib-patched version of GS (Kakugawa, 1998; Katayama, 1994). VFlib-patched GS converts various outline font files on the fly into Type3 font objects in the PS system space. The substitution of the requested native PS font by an emulated Type3 font is visible from PS userspace, and the substituted Type3 fonts conform to the PS language specification. Recently, two vector-based devices have become important as output of PS renderers: PDF and SVG. These formats can include embedded fonts. In the case of PDF, native PS Type1/Type9 fonts (Adobe Systems Inc., 1996) or encapsulated TrueType Type11/Type42 fonts (Adobe Systems Inc., 1998) are supported. Because of the differences in the mathematical representation of Bézier curves and in hinting methods, the graphical conversion between native PS fonts and TrueType ones is not exact. The converted shapes are broken on low-resolution devices or at very small point sizes. In contrast, encapsulated TrueType fonts work better because they pass the original graphical data to the rasterizer. Therefore, generating PS Type9 fonts from the graphic data in CJK TrueType fonts is worse than encapsulating TrueType fonts in the PS Type11 font format.

2. Emulating Native PS Fonts by TrueType Ones

In this section, we describe a method to emulate native PS fonts for CJK scripts by TrueType fonts.

2.1. Encapsulation of TrueType fonts in PS Language System

There are two encapsulation formats for TrueType in PS: Type42 and Type11. For small character sets of fewer than 255 codepoints, like ISO 8859, the encapsulation is done in the Type42 font format. Glyphs in the Type42 font format are accessible via character codepoints. For large character sets such as CJK scripts, the Type11 font format is used. To support CJK scripts with correct presentation forms, Adobe defined a “Character Collection” for each script and assigned an index “Character ID” (CID) to each character presentation form. There are four “Adobe CID” collections which are widely used: Adobe-CNS1 (Traditional Chinese, for Taiwan), Adobe-GB1 (Simplified Chinese, for PRC), Adobe-Japan1 (Japanese, for Japan) and Adobe-Korea1 (Korean, for Korea). Another collection, called “Adobe-Japan2”, was developed to support JIS X 0212-1990; although distributed officially, it is now obsolete. Apart from these four official Adobe CID collections, Ken Lunde defined a few experimental CID sets: Adobe-CNS2 (for CNS 11643-1992 Planes 1-7 and 15), Adobe-Korea2 (for KS X 1002:1991), Adobe-HongKong1 (for Hong Kong GCCS, the supplementary character set of traditional Chinese for Hong Kong) and Adobe-Vietnam1 (for TCVN:6056-1995, the intersection between Vietnamese Chu-Han and traditional Chinese), but they are not distributed officially and have not gained popularity (Lunde, 1999, p. 328). Native PS fonts for CJK scripts (Type9) contain glyph collections ordered by Adobe CID.

Originally, the Type42/Type11 font formats were designed for transferring TrueType data to PS printer engines. The CID of an embedded Type11 font is, in most cases, exactly the same as its TrueType glyph ID, and hence incompatible with Adobe CID. To substitute native PS fonts for CJK scripts (Type9) by encapsulated TrueType (Type11), we need a mapping algorithm from Adobe CID to TrueType glyph ID.

2.2. Mapping Information Resources

In this subsection, we analyze the available resources providing a correlation between Adobe CIDs and TrueType glyph IDs. As mentioned above, the glyph selection mechanism of Adobe CID fonts differs from TrueType typography. In Adobe CID fonts, the presentation form is provided as part of the public interface, but in TrueType fonts it must be specified by character codepoints. We cannot specify the TrueType presentation form with a high degree of accuracy in Adobe CID. To minimize the ambiguity in mapping from Adobe CID to TrueType glyph ID, we refer only to the data published by Adobe, and not to the various character encoding conversion tables (The Open Group/Japanese Vendors Committee CDE/Motif Technical WG, 1996).

For the relationship between a character codepoint and the glyph specified by an Adobe CID, Adobe provides the resources listed below for its products, such as PS printers and Adobe Acrobat. It must be noted that all maps are unidirectional.

CMap file: unidirectional map from character codepoints to Adobe CIDs, for various character encodings. For example, B5pc-H is a map from Big5 codepoints to CIDs in the Adobe-CNS1 collection. Another example: UniCNS-UCS2-H is a map from UCS2 codepoints to CIDs in the Adobe-CNS1 collection, for horizontal typesetting. Here, the suffix H means horizontal writing mode and UCS2 stands for the source encoding. The prefix UniCNS means that the character set is limited to the intersection of two character sets, Unicode and Adobe-CNS1, and hence it is smaller than the complete Unicode character set. For example, the “CJK Radicals” (U+2E80–U+2EFF) block was primarily introduced for the PRC, so UniGB (Adobe-GB1) covers all UCS2 codepoints in this block, whereas UniCNS (Adobe-CNS1) and UniJIS (Adobe-Japan1) cover the block only partially (and the included codepoints are printed by similar CJK Unified Ideographs, not by “CJK Radicals”), and UniKS (Adobe-Korea1) does not include the block at all. The latest versions are available from ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/.

ToUnicode mapping file: unidirectional map from Adobe CIDs to UTF-16 strings. For example, the ToUnicode mapping file Adobe-CNS1-UCS2 is a map from the CIDs of Adobe-CNS1 to UTF-16 strings. Originally, the ToUnicode mapping was designed for UCS2 strings, but it was later extended to UTF-16 strings without changing its file name. Sometimes a single Adobe CID is mapped to a string of multiple UTF-16 characters. The UCS2 codepoint U+FFFD is used as an “undefined character”, which should not be included in information interchange. The latest versions are available from ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/.

ToCode mapping file: unidirectional map from Adobe CIDs to strings encoded in various character encodings. A remarkable difference from the ToUnicode mapping file is that the ToCode mapping file does not cover the complete set of Adobe CIDs. For example, the ToCode mapping file Adobe-CNS1-B5pc is a map from CIDs to Big5 codepoints, but the coverage is limited to the Big5 character set.

CMap is a map from single character codepoints to single Adobe CIDs (Adobe Systems Inc., 1999, p. 384–399), but the ToUnicode mapping defines a “spelling” of Adobe CIDs in UTF-16. The ToCode mapping has a similar feature for the character encodings used on personal computers before Unicode: Shift-JIS, GB-2312, Big5, Johab and Wansung. The reason why single CIDs are mapped to strings is that the ToUnicode and ToCode mapping tables were introduced for information interchange: text extraction, text search and copy-and-paste features for PDF. There are two basic algorithms to determine the suitable character codepoint for a given Adobe CID:

ToUnicodeForward: refers to the ToUnicode mapping directly and obtains UCS2 or UTF-16 codepoints for a given Adobe CID (or refers to the ToCode mapping and obtains legacy encoding codepoints for a given Adobe CID).

ReversalCMap: refers to the CMap table in reverse and obtains the character codepoints that are rendered by a given Adobe CID.

In the following two subsections, each algorithm is discussed by a detailed analysis of mapping tables.
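The two basic algorithms can be sketched over plain dictionaries. This is a minimal illustration, not the emulator's implementation: the table contents, glyph IDs and function names are hypothetical, and the policy in cid_to_gid (try the ToUnicode codepoint first, then the reversed CMap candidates) is an assumption chosen only to show how a codepoint obtained by either algorithm would then be fed to a TrueType cmap.

```python
from typing import Dict, List, Optional


def to_unicode_forward(cid: int, to_unicode: Dict[int, int]) -> Optional[int]:
    """ToUnicodeForward: read the codepoint for a CID directly (CID -> codepoint)."""
    return to_unicode.get(cid)


def reversal_cmap(cid: int, cmap: Dict[int, int]) -> List[int]:
    """ReversalCMap: scan a CMap (codepoint -> CID) backwards.

    Several codepoints may be rendered by the same CID, so a list is returned.
    """
    return sorted(cp for cp, mapped in cmap.items() if mapped == cid)


def cid_to_gid(cid: int, to_unicode: Dict[int, int],
               cmap: Dict[int, int], tt_cmap: Dict[int, int]) -> int:
    """Map a CID to a TrueType glyph ID via an intermediate codepoint.

    The candidate order is an assumption of this sketch. Returns 0 when
    no candidate codepoint is present in the TrueType cmap.
    """
    forward = to_unicode_forward(cid, to_unicode)
    candidates = ([forward] if forward is not None else []) + reversal_cmap(cid, cmap)
    for cp in candidates:
        gid = tt_cmap.get(cp)
        if gid is not None:
            return gid
    return 0


# Hypothetical toy tables.
TO_UNICODE = {100: 0x3001, 200: 0x4E00}
CMAP_H = {0x3001: 100, 0xFF64: 100, 0x4E00: 200, 0x2E80: 300}
TT_CMAP = {0x3001: 12, 0x4E00: 57, 0x2E80: 80}

if __name__ == "__main__":
    for cid in (100, 200, 300, 999):
        print(cid, cid_to_gid(cid, TO_UNICODE, CMAP_H, TT_CMAP))
```

In the toy data, CID 300 has no ToUnicode entry and is reached only through the reversed CMap, while CID 100 shows that the reversed CMap can return more than one codepoint.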

2.3. ToUnicodeForward

In this subsection, we discuss the algorithm using the ToUnicode and ToCode mapping tables. Using these tables we obtain the simplest algorithm to map a given Adobe CID to a character codepoint. First, to show the advantage of this algorithm, we compare the numbers of “mappable CIDs” and “printable codepoints”. Here, by “mappable CID” we mean a CID mapped to a single character codepoint by the ToUnicode or ToCode mapping tables, or a CID used to print something in the CMap tables. By “printable codepoint” we mean a codepoint mapped to a CID by a given CMap, or a codepoint which is used to spell a given CID in the ToUnicode or ToCode mapping tables.

The number of mappable CIDs shows the potential coverage of the ToUnicodeForward algorithm, and the number of printable codepoints shows the potential coverage of the ReversalCMap algorithm. The results for the latest Adobe CMap files are summarized in table 1. Here, we exclude U+FFFD and UTF-16 multi-codepoint entries from the ToUnicode mapping files, because they cannot be used in Type11 encapsulation directly. The comparisons are calculated for each group of tables using the same Adobe CID collection and the same encoding. For the Korean legacy encoding Johab, no ToCode mapping tables are provided. Johab was defined after the widely used Korean encoding EUC-KR (Lunde, 1999, p. 177–185) and there are only a few systems using Johab for information interchange. ToCode mappings were provided for information interchange with legacy systems, so there has been no ToCode mapping for Johab.

Adobe CID     Encoding                Map type   Map file                 N_CID   N_cp  ΔN_CID
Adobe-CNS1    Unicode                 ToUnicode  Adobe-CNS1-UCS2          18698  18392       -
                                      CMap       UniCNS-UCS2-H            18273  18285     425
                                      CMap       UniCNS-UTF16-H           18427  23418     271
              Big5 (MacOS)            ToCode     Adobe-CNS1-B5pc          13592  13589       -
                                      CMap       B5pc-H                   13589  13592       3
              Big5 (+ ETen)           ToCode     Adobe-CNS1-ETen-B5       14061  13958       -
                                      CMap       ETen-B5-H                13958  13961     103
Adobe-GB1     Unicode                 ToUnicode  Adobe-GB1-UCS2           29143  28668       -
                                      CMap       UniGB-UCS2-H             28625  28840     518
                                      CMap       UniGB-UTF16-H            28658  28883     485
              GBK                     ToCode     Adobe-GB1-GBK-EUC        22123  22019       -
                                      CMap       GBK-EUC-H                22019  22118     104
              GB2312 (for MacOS)      ToCode     Adobe-GB1-GBpc-EUC        7715   7706       -
                                      CMap       GBpc-EUC-H                7706   7706       9
Adobe-Japan1  Unicode                 ToUnicode  Adobe-Japan1-UCS2        23057  15797       -
                                      CMap       UniJIS-UCS2-H             9490   9772   13567
                                      CMap       UniJIS-UTF16-H           15381  15674    7676
              ShiftJIS (Win3.1J)      ToCode     Adobe-Japan1-90ms-RKSJ    7659   7489       -
                                      CMap       90ms-RKSJ-H               7489   7883     170
              ShiftJIS (KanjiTalk 7)  ToCode     Adobe-Japan1-90pv-RKSJ    7482   7302       -
                                      CMap       90pv-RKSJ-H               7355   7355     127
Adobe-Korea1  Unicode                 ToUnicode  Adobe-Korea1-UCS2        18075  17429       -
                                      CMap       UniKS-UCS2-H             17056  17326    1019
                                      CMap       UniKS-UTF16-H            17059  17484    1016
              UHC (Windows)           ToCode     Adobe-Korea1-KSCms-UHC   17007  16872       -
                                      CMap       KSCms-UHC-H              16872  17140     135
              EUC-KR (Hangul Talk)    ToCode     Adobe-Korea1-KSCpc-EUC    9234   9195       -
                                      CMap       KSCpc-EUC-H               9195   9463      39
              Johab                   CMap       KSC-Johab-H              16872  17156       -

Table 1. Comparison of quantities of mappable unique CIDs and printable character codepoints. N_CID is the number of unique CIDs in mapping tables, N_cp is the number of unique character codepoints in mapping tables, and ΔN_CID is the difference of N_CID between the ToUnicode (or ToCode) mapping and the CMap using the same encoding space.

From the results for ΔN_CID, we can draw the conclusion that the coverage of the ToUnicode and ToCode mapping tables is always wider than that of the reversed CMap tables.
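The ΔN_CID column of table 1 can be reproduced by subtracting the CMap count from the ToUnicode count of the same collection. The short script below does this for the Unicode rows only; the numbers are copied from table 1, while the variable and function names are ours.

```python
# N_CID of each ToUnicode mapping file, copied from table 1.
N_CID_TOUNICODE = {
    "Adobe-CNS1": 18698,
    "Adobe-GB1": 29143,
    "Adobe-Japan1": 23057,
    "Adobe-Korea1": 18075,
}

# N_CID of each Unicode CMap, with the collection it belongs to (table 1).
N_CID_CMAP = {
    "UniCNS-UCS2-H": ("Adobe-CNS1", 18273),
    "UniCNS-UTF16-H": ("Adobe-CNS1", 18427),
    "UniGB-UCS2-H": ("Adobe-GB1", 28625),
    "UniGB-UTF16-H": ("Adobe-GB1", 28658),
    "UniJIS-UCS2-H": ("Adobe-Japan1", 9490),
    "UniJIS-UTF16-H": ("Adobe-Japan1", 15381),
    "UniKS-UCS2-H": ("Adobe-Korea1", 17056),
    "UniKS-UTF16-H": ("Adobe-Korea1", 17059),
}


def delta_n_cid() -> dict:
    """Return N_CID(ToUnicode) - N_CID(CMap) for every CMap listed above."""
    return {
        name: N_CID_TOUNICODE[collection] - n_cid
        for name, (collection, n_cid) in N_CID_CMAP.items()
    }


DELTA = delta_n_cid()

if __name__ == "__main__":
    for name, delta in DELTA.items():
        print(f"{name:16s} {delta:6d}")
```

Every difference is positive, which is the sense in which the ToUnicode coverage is wider; the gap is by far the largest for Adobe-Japan1.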

Two remarks should be made about table 1. The first exception is UniCNS-UTF16-H: the number of unique UTF-16 codepoints is greater than that of unique CIDs. This is because Adobe-CNS1 includes HKgccs characters in CNS 11643 that had been assigned to the PUA of the BMP (U+E000–U+F8FF) and later became official in CJK Extension B (U+20000–U+2A6DF). For such characters, UniCNS-UTF16-H has two codepoints pointing to the same CID. Therefore, comparing the numbers of unique CIDs and UTF-16 codepoints makes it appear as if the number of characters were greater than that of glyphs.
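The effect of two codepoints pointing to the same CID on the unique counts can be seen in a toy example. The codepoints below are merely placeholders taken from the PUA and Extension B ranges, and the CIDs are hypothetical; none of these pairs is claimed to occur in UniCNS-UTF16-H.

```python
from typing import Dict, Tuple


def count_unique(cmap: Dict[int, int]) -> Tuple[int, int]:
    """Return (number of unique CIDs, number of unique codepoints) of a CMap.

    The CMap is modelled as a dict from codepoint to CID, so its keys
    are unique by construction.
    """
    n_cid = len(set(cmap.values()))
    n_cp = len(cmap)
    return n_cid, n_cp


# Hypothetical CMap: a PUA codepoint and an Extension B codepoint share CID 500.
TOY_CMAP = {
    0xE000: 500,   # placeholder PUA assignment
    0x20000: 500,  # placeholder Extension B assignment for the same character
    0x4E00: 501,
}

if __name__ == "__main__":
    print(count_unique(TOY_CMAP))
```

Here three codepoints reach only two CIDs, so the codepoint count exceeds the CID count although no glyph is missing.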

Adobe CID     CMap pair           N_vert
Adobe-CNS1    UniCNS-UCS2-H, -V       20
              ETen-B5-H, -V           22
Adobe-GB1     UniGB-UCS2-H, -V        35
              GBK-EUC-H, -V           37
Adobe-Japan1  UniJIS-UCS2-H, -V      247
              90ms-RKSJ-H, -V        110
Adobe-Korea1  UniKS-UCS2-H, -V        38
              KSC-EUC-H, -V           39
              KSCms-UHC-H, -V         39
              KSC-Johab-H, -V         39

Table 2. The number of codepoints that change presentation forms according to the writing mode (N_vert), for each encoding system.

The second exception is the Adobe-Japan1 character collection. The numbers of unique CIDs and unique UTF-16 codepoints are almost comparable for Adobe-CNS1, Adobe-GB1 and Adobe-Korea1. Comparing N_CID and N_cp for Adobe-Japan1-UCS2, the number of unique UTF-16 codepoints is remarkably smaller than that of unique CIDs in Adobe-Japan1. From this result, we can expect two scenarios for Adobe-Japan1: either it includes many presentation forms for single UTF-16 codepoints, or it includes many “precomposed characters” that cannot be mapped to single UTF-16 codepoints.

Next, we discuss the disadvantages of the ToUnicode mapping algorithm. As written above, the ToUnicode (and ToCode) mappings were introduced for information interchange. Therefore, there is a possibility that the ToUnicode and ToCode mappings ignore the detailed presentation forms.

The most significant effect is the form difference caused by the writing mode. Adobe classifies the writing direction as a subcategory of the character encoding system. There are two typical writing modes in CJK scripts: horizontal and vertical. In modern CJK writing, the horizontal writing mode is usually ordered from left to right (historically there was also a right-to-left writing mode, but PostScript does not support it). The vertical writing mode is always ordered from top to bottom. The comparison of horizontal and vertical presentation forms in Adobe-Japan1 is shown in figure 4. Typical characters whose presentation forms differ between writing modes are brackets, braces, punctuation and mora-subscripts, but arrows and box-drawing characters are also rotated. It should be noted that the horizontal glyph collection has no vertical brackets, but includes rotated arrows and box-drawing characters. By comparing the horizontal and vertical CMaps, we can estimate the number of vertical forms for each Adobe CID collection. The result is summarized in table 2.
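The comparison of a horizontal and a vertical CMap can be sketched as a count of the codepoints whose CID differs between the two. The tables and CID values below are hypothetical, and the sketch assumes that a codepoint absent from the vertical table keeps its horizontal CID.

```python
from typing import Dict


def count_vertical_forms(h_cmap: Dict[int, int], v_cmap: Dict[int, int]) -> int:
    """Count codepoints that map to a different CID in vertical writing mode.

    A codepoint missing from v_cmap is treated as unchanged
    (an assumption of this sketch).
    """
    return sum(
        1 for cp, h_cid in h_cmap.items()
        if v_cmap.get(cp, h_cid) != h_cid
    )


# Hypothetical toy CMaps (codepoint -> CID).
H_CMAP = {0x3001: 100, 0x300C: 101, 0x2192: 150, 0x4E00: 200}
V_CMAP = {0x3001: 900, 0x300C: 901, 0x2192: 150}

if __name__ == "__main__":
    print(count_vertical_forms(H_CMAP, V_CMAP))
```

In the toy data only the comma and the corner bracket change, while the arrow and the ideograph keep their horizontal CIDs.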

Figure 4. Sample comparison of horizontal and vertical presentation forms in Adobe-Japan1: left is for horizontal writing mode and right is for vertical writing mode. The shown codepoints are Unicode codepoints for horizontal writing mode. The font we used is HeiseiKakuGo-W3, designed for JIS90 (JIS X 0208-1990).

The handling of form difference in writing direction depends on the character encoding. Most Japanese and Korean standard character encodings ignore the writing direction: a codepoint is not defined for a specific writing direction. Taiwanese Big5 (and CNS 11643-1992) assigns different codepoints for each writing direction for most (but not all) characters. In the PRC, GB 2312-1980 (and its revised edition GB 6345.1-1986) originally had no codepoints for vertical glyphs, but the latest GB 18030-2000 does have such codepoints. For compatibility with legacy CJK encodings, Unicode has a block called “CJK Compatibility Forms” (U+FE30–U+FE4F) whose codepoints are allocated for vertical characters in Chinese encodings. However, the small number of vertical characters in this block (only 32!) is insufficient to cover all CJK characters that have different forms in vertical writing mode. The numbers of CIDs that ToUnicode maps to CJK Compatibility Forms, N_CF, are summarized in table 3. Comparing tables 2 and 3, we can conclude that the number of codepoints is sufficient for Adobe-CNS1 (N_CF = 27 > N_vert = 20), but not for Adobe-GB1 (N_CF = 27 < N_vert = 35). In addition, N_CF is much smaller than N_vert for the Adobe-Japan1 and Adobe-Korea1 ToUnicode mappings. Considering these results, we conclude that ToUnicode drops the information needed to choose suitable forms for each writing mode, at least for Adobe-Japan1 and Adobe-Korea1. Since we cannot preserve the writing mode information by using codepoints of CJK Compatibility Forms, the writing mode information must bypass the ToUnicode mapping tables and be sent to the text layout system.
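The counting of N_CF and its comparison with N_vert can be illustrated with a short Python sketch. It assumes a ToUnicode table already parsed into a `{cid: codepoint}` dictionary; the helper names and the toy CIDs are hypothetical, while the block range and the values 27, 20 and 35 are the ones quoted in the text.

```python
# Unicode block "CJK Compatibility Forms": U+FE30..U+FE4F.
CJK_COMPAT_FORMS = range(0xFE30, 0xFE50)


def count_compat_forms(to_unicode):
    """Count CIDs whose codepoint lies in CJK Compatibility Forms (N_CF)."""
    return sum(1 for cp in to_unicode.values() if cp in CJK_COMPAT_FORMS)


def enough_vertical_codepoints(n_cf, n_vert):
    """True when N_CF is at least as large as N_vert."""
    return n_cf >= n_vert


# Hypothetical toy table {cid: codepoint}; the CIDs are made up.
toy = {10: 0xFE35, 11: 0xFE36, 12: 0x0028, 13: 0x4E00}
print(len(CJK_COMPAT_FORMS), count_compat_forms(toy))

# Values quoted in the text for N_CF and N_vert.
print(enough_vertical_codepoints(27, 20))  # Adobe-CNS1
print(enough_vertical_codepoints(27, 35))  # Adobe-GB1
```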


Acrobat version   Adobe-CNS1   Adobe-GB1   Adobe-Japan1   Adobe-Korea1
4.0.5             27           27          2              0
7.0.0             27           27          8              0

Table 3. The number of Adobe CIDs mapped to UCS2 codepoints in the “CJK Compatibility Forms” block (N_CF).

Adobe CID      Encoding     Acrobat version
                            4.0.5    5.0.5    6.0.1    7.0.0
Adobe-CNS1     UCS2         107      341      460      460
               UTF-16       -        -        304      299
               B5pc         3        3        3        3
               ETen-B5      103      103      103      103
Adobe-GB1      UCS2         174      518      518      518
               UTF-16       -        -        485      485
               GBK-EUC      104      104      104      104
               GBpc-EUC     9        9        9        9
Adobe-Japan1   UCS2         791      5953     5953     13567
               UTF-16       -        -        5750     7676
               90ms-RKSJ    170      170      170      170
               90pv-RKSJ    127      127      127      127
Adobe-Korea1   UCS2         823      1020     1020     1020
               UTF-16       -        -        1017     1017
               KSCms-UHC    135      135      135      135
               KSCpc-UHC    39       39       39       39

Table 4. The number of “unmappable CIDs” that CMap pairs (XXX-H, XXX-V) do not use to print codepoints in the encoding space but to which ToUnicode and ToCode mapping files assign valid results (UTF-16 codepoints or UTF-16 strings, without any U+FFFD).

2.4. Reversal CMap

In this subsection, we discuss the Reversal CMap, which is a competitor to ToUnicodeForward. The advantage of Reversal CMap is the identification of writing mode. As shown in table 2, it is impossible for ToUnicodeForward to detect CIDs for vertical forms. In the case of Reversal CMap, we can easily collect the CIDs for vertical forms by collecting the CIDs in the vertical CMap (XXX-V) and subtracting from them the CIDs used in the horizontal CMap (XXX-H). On the other hand, the disadvantage of Reversal CMap is the smaller CID coverage, as shown in table 1. ToUnicodeForward defines a “UTF-16 spelling” for all CIDs, but Reversal CMap covers only the CIDs whose forms are suitable for the specified character encodings. The difference ∆N_CID is rather small (≈500/18000) for Adobe-CNS1 and Adobe-GB1, not negligible for Adobe-Korea1 (≈1000/18000), and problematic for Adobe-Japan1 (≈15000/29000 for the UCS2 CMap, or ≈7000/29000 for the UTF-16 CMap). Below, we analyze the “unmappable CIDs” that reversal CMaps cannot map to character codepoints, and we evaluate the effect of losing them.
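The three operations described in this subsection (reversing a CMap, collecting the CIDs for vertical forms by set subtraction, and finding the CIDs left unmappable) might look like the following Python sketch. It assumes CMaps and the ToUnicode table are available as plain dictionaries; the function names, codes and CIDs are hypothetical and do not reproduce the actual implementation.

```python
def reverse_cmap(cmap):
    """Invert a CMap {code: cid} into {cid: [codes, ...]}."""
    reverse = {}
    for code in sorted(cmap):
        reverse.setdefault(cmap[code], []).append(code)
    return reverse


def vertical_only_cids(cmap_h, cmap_v):
    """CIDs used by the vertical CMap but not by the horizontal one."""
    return set(cmap_v.values()) - set(cmap_h.values())


def unmappable_cids(to_unicode, cmap_h, cmap_v):
    """CIDs that ToUnicode covers but neither CMap can reach."""
    reachable = set(cmap_h.values()) | set(cmap_v.values())
    return set(to_unicode) - reachable


# Hypothetical toy data; codes and CIDs are made up.
cmap_h = {0x28: 9, 0x4E00: 100}
cmap_v = {0x28: 50, 0x4E00: 100}
to_unicode = {9: 0x28, 50: 0x28, 100: 0x4E00, 200: 0x28}

print(reverse_cmap(cmap_h))
print(vertical_only_cids(cmap_h, cmap_v))
print(unmappable_cids(to_unicode, cmap_h, cmap_v))
```

In the toy data, CID 200 stands for an additional presentation form that ToUnicode can spell but no CMap reaches, which is the situation counted by ∆N_CID.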


Adobe-CNS1 UCS2

Unicode block name          Acrobat version
                            4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                 95/190      285/380     285/380     285/380
Latin1 Supplement           1/11        2/32        2/32        2/32
Currency                    2/8         6/13        6/13        6/13
Letterlike Symbols          1/6         3/8         3/8         3/8
Mathematical Symbols        2/24        2/24        2/24        2/24
CJK Unified Ideographs      2/13100     39/15217    104/15282   104/15282
PUA                         0/0         0/1647      54/1701     54/1701
Half and Fullwidth Forms    4/97        4/97        4/97        4/97

Adobe-CNS1 UTF16

Unicode block name          Acrobat version
                            4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                 -           -           285/380     285/380
Latin1 Supplement           -           -           2/32        2/32
Currency                    -           -           6/13        5/13
Letterlike Symbols          -           -           3/8         3/8
Mathematical Symbols        -           -           2/24        2/24
CJK Unified Ideographs      -           -           2/15282     0/15282
Half and Fullwidth Forms    -           -           4/97        2/97

Table 5. The distribution of Adobe-CNS1 CIDs “allocated” by ToUnicode mapping but “unmappable” by Unicode CMaps.
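As a small cross-check of the tables, the per-block unmappable counts of table 5 (UCS2 part) can be summed and compared with the Adobe-CNS1 UCS2 row of table 4. The Python sketch below only re-adds numbers printed in the two tables; the variable names are chosen for illustration.

```python
VERSIONS = ["4.0.5", "5.0.5", "6.0.1", "7.0.0"]

# Unmappable counts per Unicode block (numerators of table 5, UCS2 part).
TABLE5_UCS2 = {
    "Basic Latin": [95, 285, 285, 285],
    "Latin1 Supplement": [1, 2, 2, 2],
    "Currency": [2, 6, 6, 6],
    "Letterlike Symbols": [1, 3, 3, 3],
    "Mathematical Symbols": [2, 2, 2, 2],
    "CJK Unified Ideographs": [2, 39, 104, 104],
    "PUA": [0, 0, 54, 54],
    "Half and Fullwidth Forms": [4, 4, 4, 4],
}

# Adobe-CNS1 UCS2 row of table 4.
TABLE4_CNS1_UCS2 = [107, 341, 460, 460]

# Sum each Acrobat-version column over all blocks.
totals = [sum(column) for column in zip(*TABLE5_UCS2.values())]
for version, total, expected in zip(VERSIONS, totals, TABLE4_CNS1_UCS2):
    print(version, total, total == expected)
```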

First, we analyze the history of unmappable CIDs. The number of unmappable CIDs for each Adobe CID in various Acrobat versions is summarized in table 4. The unmappable CIDs increase with updates of the Adobe character collections. In table 4, the entries of UTF-16 strings in ToUnicode mappings are included, but the results for the latest Adobe-Japan1 are close to ∆N_CID in table 1. In fact, the increase of unmappable CIDs is caused by new presentation forms of existing characters, as in the first scenario mentioned above. Next, we analyze the character properties of unmappable CIDs, to evaluate the effect of their potential loss. The classification of Reversal CMap’s unmappable CIDs by ToUnicode mapping tables is summarized in tables 5 (Adobe-CNS1), 6 (Adobe-GB1), 7 (Adobe-Korea1), 8, 9 and 10 (Adobe-Japan1). In these tables, we show the number of CIDs in the form N_CID^unmappable/N_CID^allocated. For Adobe-CNS1 (table 5) and Adobe-GB1 (table 6), most unmappable CIDs are found in the “Basic Latin” block. They are considered to be the various presentation forms of Latin alphabets: half-width for ISO 8859-1, proportional for 8-bit characters, full-width for 16-bit characters, proportional for 16-bit characters, etc. All CMaps for Unicode (XXX-UCS2-H, XXX-UTF16-H) use the proportional glyphs for 8-bit characters to display Latin codepoints, thus Reversal CMap cannot cover all CIDs for the various presentation forms of Basic Latin. For Adobe-Korea1 (table 7), the number of unmappable CIDs in Basic Latin is comparable to Adobe-CNS1 and Adobe-GB1, but most unmappable CIDs are distributed as UTF-16 strings. In the ToUnicode mapping, UTF-16 strings are used for composed characters (in fact, enclosed alphanumerics).
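The per-block classification in the form unmappable/allocated can be sketched in Python as below. This is only an illustration under stated assumptions: the ToUnicode table is given as `{cid: tuple of codepoints}`, the set of CIDs reachable by a CMap is already known, and only four Unicode block ranges are listed. The names `block_of` and `classify` and the toy data are hypothetical.

```python
from collections import defaultdict

# A few Unicode block ranges; a full analysis would need the complete list.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
]


def block_of(codepoints):
    """Name the block of one codepoint, or 'UTF16 String' for longer spellings."""
    if len(codepoints) != 1:
        return "UTF16 String"
    cp = codepoints[0]
    for name, first, last in BLOCKS:
        if first <= cp <= last:
            return name
    return "Other"


def classify(to_unicode, mappable_cids):
    """Return {block: [n_unmappable, n_allocated]}."""
    counts = defaultdict(lambda: [0, 0])
    for cid, codepoints in to_unicode.items():
        entry = counts[block_of(codepoints)]
        entry[1] += 1                      # allocated by ToUnicode
        if cid not in mappable_cids:
            entry[0] += 1                  # not reachable through the CMap
    return dict(counts)


# Hypothetical toy data: CIDs 1 and 2 are two forms of "A", CID 5 is a string.
to_unicode = {1: (0x41,), 2: (0x41,), 3: (0x3042,), 4: (0x4E00,),
              5: (0x28, 0x31, 0x29)}
mappable = {1, 3, 4}

result = classify(to_unicode, mappable)
for block, (unmappable, allocated) in sorted(result.items()):
    print(f"{block}: {unmappable}/{allocated}")
```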

As shown in table 8, the number of Adobe-Japan1’s unmappable CIDs in “Basic Latin”, “Latin Supplement” and “Extensions” is already larger than that of the other


Adobe-GB1 UCS2

Unicode block name              Acrobat version
                                4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                     95/190      287/382     287/382     287/382
Latin1 Supplement               14/34       29/50       29/50       29/50
Latin Extended-A                10/16       18/24       18/24       18/24
Latin Extended-B                8/16        16/24       16/24       16/24
IPA Extensions                  4/4         6/6         6/6         6/6
Latin Extended Additional       1/2         2/3         2/3         2/3
General Punctuation             1/10        1/10        1/10        1/10
Currency                        1/7         6/13        6/13        6/13
Letterlike Symbols              1/6         2/7         2/7         2/7
Mathematical Symbols            1/39        1/39        1/39        1/39
CJK Symbols and Punctuations    5/39        6/47        6/47        6/47
Hiragana                        0/87        16/104      16/104      16/104
Katakana                        0/89        16/109      16/109      16/109
CJK Unified Ideographs          0/20902     79/20981    79/20981    79/20981
PUA                             1/94        1/94        1/94        1/94
CJK Compatibility Forms         19/27       19/27       19/27       19/27
Half and Fullwidth Forms        11/111      11/111      11/111      11/111
UTF16 String                    2/2         2/3         2/3         2/3

Adobe-GB1 UTF16

Unicode block name              Acrobat version
                                4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                     -           -           287/382     287/382
Latin1 Supplement               -           -           29/50       29/50
Latin Extended-A                -           -           16/24       16/24
Latin Extended-B                -           -           16/24       16/24
IPA Extensions                  -           -           4/6         4/6
Latin Extended Additional       -           -           2/3         2/3
General Punctuation             -           -           1/10        1/10
Currency                        -           -           6/13        6/13
Letterlike Symbols              -           -           2/7         2/7
CJK Symbols and Punctuations    -           -           2/47        2/47
Hiragana                        -           -           16/104      16/104
Katakana                        -           -           16/109      16/109
CJK Unified Ideographs          -           -           79/20981    79/20981
PUA                             -           -           1/94        1/94
Half and Fullwidth Forms        -           -           6/111       6/111
UTF16 String                    -           -           2/3         2/3

Table 6. The distribution of Adobe-GB1 CIDs “allocated” by ToUnicode mapping but “unmappable” by Unicode CMaps.

Adobe CIDs. The counts for “Hiragana” and “Katakana” are also remarkably large (1074 for Hiragana, and 985 for Katakana) because of various presentation forms, as shown in table 9. But the largest group is “CJK Unified Ideographs”. From tables 9 and 10, we conclude that the number of CIDs for CJK Unified Ideographs has remarkably increased at every update. In the cases of Adobe-CNS1 and Adobe-GB1, new CIDs for CJK Unified Ideographs are introduced to print previously unprintable codepoints in Unicode (or previously unassigned codepoints, like “CJK Unified Ideographs Extension B” and “C1”). Therefore, both “mappable CIDs” and “printable codepoints” have increased synchronously, although the number of “unmappable CIDs” has not increased. On the contrary, most of the new CIDs in Adobe-Japan1 are assigned to provide new presentation forms of existing characters. In Japanese national standards,


Adobe-Korea1 UCS2

Unicode block name               Acrobat version
                                 4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                      120/221     311/412     311/412     311/412
Latin1 Supplement                8/38        9/39        9/39        9/39
Greek                            1/49        1/49        1/49        1/49
General Punctuation              2/11        3/12        3/12        3/12
Currency                         11/31       13/33       13/33       13/33
Letterlike Symbols               1/9         2/10        2/10        2/10
Arrow                            19/53       19/53       19/53       19/53
Mathematical Symbols             16/100      17/101      17/101      17/101
Miscellaneous Technical          4/8         4/8         4/8         4/8
Enclosed Alphanumerics           9/127       9/127       9/127       9/127
Geometric Shapes                 17/51       17/51       17/51       17/51
Miscellaneous Symbols            7/29        7/29        7/29        7/29
Dingbats                         8/30        8/30        8/30        8/30
CJK Symbols and Punctuations     42/70       42/70       42/70       42/70
CJK Unified Ideographs           2/4622      2/4622      2/4622      2/4622
Halfwidth and Fullwidth Forms    19/118      19/118      19/118      19/118
UTF16 String                     537/537     537/537     537/537     537/537

Adobe-Korea1 UTF16

Unicode block name               Acrobat version
                                 4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                      -           -           311/412     311/412
Latin1 Supplement                -           -           8/39        8/39
Greek                            -           -           1/49        1/49
General Punctuation              -           -           2/12        2/12
Currency                         -           -           12/33       12/33
Letterlike Symbols               -           -           2/10        2/10
Arrow                            -           -           19/53       19/53
Mathematical Symbols             -           -           17/101      17/101
Miscellaneous Technical          -           -           4/8         4/8
Enclosed Alphanumerics           -           -           9/127       9/127
Geometric Shapes                 -           -           17/51       17/51
Miscellaneous Symbols            -           -           7/29        7/29
Dingbats                         -           -           8/30        8/30
CJK Symbols and Punctuations     -           -           42/70       42/70
CJK Unified Ideographs           -           -           2/4622      2/4622
Halfwidth and Fullwidth Forms    -           -           19/118      19/118
UTF16 String                     -           -           537/537     537/537

Table 7. The distribution of Adobe-Korea1 CIDs “allocated” by ToUnicode mapping but “unmappable” by Unicode CMaps.

the presentation forms for each character are not defined in concrete form, and specification publications often change the presentation forms of some codepoints. The earliest multibyte character set JIS78 (JIS C 6226-1978) was a character collection defined primarily for information interchange, and the definition of detailed presentation forms is not given. In the update from JIS78 to JIS83 (JIS C 6226-1983, JIS X 0208-1983), 75 characters were added and the presentation forms of 294 characters were changed expressly because JIS83 was expected to be printable on 24-pixel dot-matrix printers (of the 294 form changes, 22 were stroke count re-orderings) (Lunde, 1999, p. 919). A sample of such differences is shown in figure 5. The change is remarkable; Adobe has defined CMap “H” for the iso-2022-jp encoding of the JIS83 character set, with JIS83 forms. For JIS78, Adobe defined CMap “78-H” for


Adobe-Japan1 UCS2

Unicode block name                      Acrobat version
                                        4.0.5       5.0.5       6.0.1       7.0.0
Basic Latin                             96/192      940/1037    940/1037    980/1077
Latin1 Supplement                       38/136      354/451     354/451     374/471
Latin Extended-A                        0/10        85/114      85/114      455/484
Latin Extended-B                        0/0         29/38       29/38       77/86
IPA Extensions                          0/0         33/42       33/42       145/154
Spacing Modifiers                       4/4         11/12       11/12       41/42
Combining Diacritics                    0/11        40/52       40/52       86/98
Greek                                   0/49        1/52        1/52        36/87
Cyrillic                                0/66        0/66        0/66        26/92
Lao                                     1/1         7/9         1/1         1/1
Hangul Jamo                             1/1         1/1         1/1         1/1
Latin Extended Additional               0/0         0/0         6/8         14/16
Greek Extended                          0/0         0/0         0/0         4/4
General Punctuation                     12/25       37/49       37/49       55/67
Currency                                7/21        101/137     101/137     110/146
Diacritic Symbols                       0/0         0/0         0/0         2/2
Letterlike Symbols                      1/9         13/26       13/26       33/46
Number Forms                            0/25        55/86       55/86       59/90
Arrow                                   0/13        15/36       15/36       18/39
Mathematical Symbols                    5/42        8/56        8/56        162/210
Miscellaneous Technical                 0/1         0/2         0/2         67/69
Control Pictures                        0/0         0/0         0/0         1/1
Enclosed Alphanumerics                  0/76        9/137       9/137       20/148
Box Drawing                             76/163      154/241     154/241     154/241
Geometric Shapes                        2/22        22/48       22/48       44/70
Miscellaneous Symbols                   0/24        4/31        4/31        22/49
Dingbats                                0/10        3/16        3/16        10/23
Supplemental Arrows B                   0/0         0/0         0/0         2/2
Miscellaneous Mathematical Symbols B    0/0         0/0         0/0         4/4
Miscellaneous Symbols and Arrows        0/0         0/0         0/0         3/3

Table 8. The distribution of Adobe-Japan1 CIDs “allocated” by ToUnicode mapping but “unmappable” by UniJIS-UCS2-H, non-CJK blocks.

the iso-2022-jp encoding of the JIS83 character set, but with JIS78 forms. There are 295 iso-2022-jp codepoints for which 78-H and H use different CIDs. When we classify them by Adobe-Japan1-UCS2, 107 CIDs are mapped to different UTF-16 codepoints.

Therefore, when JIS83 was published, these 272 (= 294−22) characters were recognized as the same characters with different forms. But 107 among them are recognized as different characters in the Unicode character set. Because of these differences in the Unicode codepoint expressions, we compare the characters sharing the same iso-2022-jp codepoints in figure 5.
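The comparison of the “78-H” and “H” CMaps, followed by the classification through a ToUnicode table, can be sketched in Python as follows. This is a minimal sketch, assuming both CMaps and the ToUnicode table are available as dictionaries; the function names are hypothetical, the two changed codes and their characters follow figure 5, and all CIDs are made up.

```python
def differing_codes(cmap_a, cmap_b):
    """Codes defined in both CMaps but assigned to different CIDs."""
    return sorted(c for c in cmap_a if c in cmap_b and cmap_a[c] != cmap_b[c])


def split_by_unicode(codes, cmap_a, cmap_b, to_unicode):
    """Split codes into (same_codepoint, different_codepoint) lists."""
    same, different = [], []
    for code in codes:
        if to_unicode[cmap_a[code]] == to_unicode[cmap_b[code]]:
            same.append(code)
        else:
            different.append(code)
    return same, different


# Hypothetical toy data: J-3022 and J-3029 change CID, J-3021 does not.
cmap_78h = {0x3021: 1125, 0x3022: 7001, 0x3029: 7002}
cmap_h = {0x3021: 1125, 0x3022: 1126, 0x3029: 1133}
to_unicode = {1125: "亜", 7001: "啞", 1126: "唖", 7002: "逢", 1133: "逢"}

changed = differing_codes(cmap_78h, cmap_h)
same, different = split_by_unicode(changed, cmap_78h, cmap_h, to_unicode)
print([hex(c) for c in changed])
print([hex(c) for c in same], [hex(c) for c in different])

# Arithmetic quoted in the text: form changes minus re-orderings.
print(294 - 22)
```

Here J-3029 represents a pair of forms that share one Unicode codepoint, while J-3022 represents a pair that Unicode encodes as two different characters.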

Sometimes the detailed definition of presentation forms is not given when the standard is published, but rather as a “difference from previous versions” when the next version is published. The typical example is the difference in presentation forms between JIS83 and JIS90 (JIS X 0208-1990) shown in figure 6. Before the publication of JIS90, most differences in figure 6 were considered not important enough to assign different CIDs to them, so that many JIS83 fonts include JIS90 shapes instead of JIS83 ones. In the case of Adobe CID, the CID indices for JIS83 shapes have been allocated


Adobe-Japan1 UCS2

Unicode block name           Acrobat version
                             4.0.5       5.0.5       6.0.1       7.0.0
CJK Radicals                 0/0         1/19        1/19        1/21
Kangxi Radicals              0/0         1/1         1/1         0/0
CJK Sym. & Punct.            8/46        71/113      71/113      102/144
Hiragana                     97/185      799/887     799/887     986/1074
Katakana                     43/137      658/752     658/752     891/985
Kanbun                       0/0         0/0         0/0         16/16
Katakana Phonetic Ext.       0/0         0/0         0/0         128/128
Encl. CJK Lett. & Mon.       0/48        0/132       0/132       11/143
CJK Compat.                  53/118      123/254     123/254     124/255
CJK Unified Ideog. Ext. A    0/0         1/26        1/26        178/203
CJK Unified Ideog.           261/6981    1256/9178   1256/9178   6166/14087
PUA                          0/0         0/0         0/0         5/5
CJK Compat. Ideog.           0/34        2/36        2/36        80/110
Alphabetic Present. Forms    0/2         15/20       15/20       15/20
CJK Compat. Forms            2/2         2/2         2/2         8/8
Halfwid. & Fullwid. Forms    10/174      83/247      83/247      93/257
Reserved                     54/54       142/142     142/142     9/9
CJK Unified Ideog. Ext. B    0/0         0/0         0/0         342/343
CJK Compat. Ideog. Sup.      0/0         0/0         0/0         43/45
UTF16 String                 20/20       877/877     877/877     1363/1363

Table 9. The distribution of Adobe-Japan1 CIDs “allocated” by ToUnicode mapping but “unmappable” by UniJIS-UCS2-H, CJK blocks.

[Figure 5: pairs of glyphs labelled by iso-2022-jp codepoints (J-3022, J-3029, J-3032, ...), for example J-3022 啞/唖, J-3033 鰺/鯵, J-316B 焰/焔, J-3229 鶯/鴬, J-322A 鷗/鴎.]

Figure 5. Comparison of JIS78 and JIS83: left is the JIS78 shape, right is the JIS83 shape. The shown codepoints are iso-2022-jp codepoints for each character. The font used is HiraginoMincho Pro W3, designed for Adobe-Japan1-5.

after the publication of JIS90. To disambiguate the shapes in existing CIDs, the CID indices for exact JIS90 shapes have been introduced again.

Additional presentation forms for existing characters are heavily needed by Japanese printing houses for document generation. Adobe CID includes various presentation forms that lack both a reference for the presentation form and a detailed definition of the shape. Some samples are shown in figure 7.

This article from Editions Lavoisier is freely available in open access at dn.revuesonline.com
