Linguistic issues and intelligent technological solutions in encoding Sanskrit

Peter Scharf

ABSTRACT. Contemporary uses of information technology demand higher standards of encoding than the inherited systems prevalent today. Guided by visual factors, current encoding systems reproduce deficiencies inherent in traditional writing systems. The contemporary use of computers for the manipulation of linguistic and textual data demands more relevant information-processing principles. The fundamental issue in encoding natural language texts concerns the relation that the information selected bears to natural language structure. Selecting appropriate information to encode demands addressing whether to encode written characters or speech sounds, whether to encode segments or features, and what criteria to use to contrast items selected for encoding. These issues are considered in relation to Sanskrit, the principal culture-bearing language of India, which is characterized by an extensive oral tradition, a highly phonetic orthography, and a copious literature.

RÉSUMÉ. Aujourd’hui, l’utilisation de technologies de l’information demande des normes de codage de plus haut niveau que les systèmes actuels. Guidés par des facteurs visuels, les systèmes de codage actuels reproduisent des lacunes inhérentes aux systèmes d’écriture traditionnelle. L’utilisation contemporaine de l’ordinateur pour la manipulation de données linguistiques et textuelles demande des principes de traitement de l’information plus pertinents. La question fondamentale du codage des textes en langue naturelle concerne la relation entre les informations choisies pour le codage et la structure du langage naturel. Choisir les informations appropriées au codage demande qu’on réponde à trois questions : coder des caractères écrits ou les sons de la parole ? Coder les segments ou les caractéristiques ? Quels critères utiliser pour bien distinguer les articles choisis pour le codage ? Ces questions sont examinées dans le cadre du sanscrit, la principale langue porteuse de la culture en Inde, qui se caractérise par une vaste tradition orale, une orthographe très phonétique et une abondante littérature.

KEYWORDS: Sanskrit, linguistics, encoding, phonetics, segments, features, contrast.

MOTS-CLÉS : sanscrit, linguistique, encodage, phonétique, segments, caractéristiques, contraste.

DOI: 10.3166/DN.16.3.15-29 © 2013 Lavoisier

Document numérique – n° 3/2013, 15-29

1. Introduction

Today people use computers to manipulate linguistic and textual data in sophisticated ways. An obvious example is that computer users expect to be able to find what they are looking for in a search interface within a few clicks. Fulfilling this expectation depends upon making the sought materials available in digital form. Scholars who work on ancient and medieval texts, and on texts in rare languages and scripts, are often frustrated when they attempt to represent their subject matter in the digital medium because they do not find appropriate means to encode the sounds and characters of these languages and scripts. The reason for the lack of appropriate means is obvious. Because the digital medium was developed in the environment of modern Western European languages, digital machines were designed on the basis of the sets of common characters of the writing systems of these modern languages. Because the visual and orthographic characteristics of modern Western European writing systems determined the design of the digital medium, this medium is less suited to represent other writing systems, particularly those that have a drastically different structure. As digital machine manufacturers seek to meet the needs of the growing global community of computer users, support for the world’s writing systems is growing, most notably in the establishment and expansion of the Unicode Standard. Version 6.3, just released on 30 September 2013, covers more than 110,000 characters in 100 scripts and offers advanced support for right-to-left scripts and Chinese characters. Many of the participants in the GIECA conference, at which this paper was originally presented, have directly contributed to Unicode proposals to encode the writing systems with which they work.

Despite the progress in diversifying the support for the world’s languages, much of the effort to encode rare writing systems is spent in getting Unicode to adopt unusual graphic signs as characters. While the presence of appropriate characters is necessary to enable the reproduction of ancient texts, making information encoded in ancient, medieval and rare languages accessible via a search interface involves far more than reproducing the visual appearance of the scripts in which they are written. Unfortunately, current encoding systems tend to limit themselves to copying writing systems and hence reproduce deficiencies inherent in the traditional orthographies themselves.

In order to meet the expectations of computer users today, encoding must take into account more relevant information-processing principles, rather than reflecting exclusively visual and orthographic design factors. It is essential to consider three questions when encoding a language:

1. Should written characters or speech sounds be encoded?

2. Should segments or features be encoded?

3. What criteria should be used to contrast items selected for encoding?

The present article considers these questions in relation to Sanskrit written in Devanāgarī script.

1.1. Sanskrit

Sanskrit is the primary culture-bearing language of India, with a continuous production of literature in all fields of human endeavor over the course of four millennia. In surveys to date, the Indian government’s National Mission for Manuscripts has already counted more than five million manuscripts in that country, and David Pingree, the renowned manuscriptologist and historian of mathematics, estimated that extant manuscripts in Sanskrit number over 30 million (more than one hundred times those in Greek and Latin combined), constituting the largest cultural heritage that any civilization has produced prior to the invention of the printing press. Sanskrit works include extensive epics, subtle and intricate philosophical, mathematical, and scientific treatises, and imaginative and rich literary, poetic, and dramatic texts. While the Sanskrit language is of preeminent importance to the intellectual and cultural heritage of India, the importance of the intellectual and cultural heritage of India to the rest of the world during the past few millennia and in the present era can hardly be overestimated. Indian culture has been a major factor in the development of the world’s religions, languages, literature, arts, sciences, and history. The language persists in the recitation of hymns in daily worship and ceremonies, as the medium of instruction in centers of traditional learning, as the medium of communication in selected academic and literary journals and academic fora, and as the primary language of a revivalist community near Bangalore.

Preceded by a strong oral tradition of knowledge transmission, records of written Sanskrit remain in the form of inscriptions dating back to the first century B.C.E. While the oldest Sanskrit inscriptions are in the Brāhmī script, texts are mostly written in the many Brāhmī-derived scripts used today in South and Southeast Asia, among which the most common is Devanāgarī. The oldest inscriptions in Brāhmī script, however, are edicts and proclamations in the Prākrit language; the emperor Aśoka had the script developed specifically to announce them, two centuries prior to the oldest Sanskrit inscriptions. Most of the languages of India, including the Dravidian languages of the South as well as the Indo-Aryan languages that dominate the rest of the South Asian sub-continent, use scripts derived from Brāhmī and share the characteristics of Devanāgarī described below. A semi-syllabic script, Brāhmī is related to Kharoṣṭhī script, which had previously been used in the north-west region of South Asia and which in turn was derived from Aramaic. Brāhmī diversified into regional varieties in response to the use of different inks, implements, and substrates in writing. Proto-Nāgarī appeared in Rajasthan by the end of the sixth century. By the eleventh century, Devanāgarī had become important for the transmission of Sanskrit literature. Today it is used for writing Hindi, Marathi, Nepali, and at least twenty-four other South Asian languages.

For details on the history and development of Indic scripts see Bühler (1896); Salomon (1998; 1995; 2003); Scharfe (2002); Voigt (2005); Dani (1963); Sharma (2002); Singh (1991).

The basic Sanskrit sound catalogue consists of nine vowels, twenty-five stops, four semivowels, four spirants, and two vowel-trailers (a breath-release and a nasal). Vowels include five simple vowels and four diphthongs and are distinguished by length, pitch and the presence or absence of nasalization. Stops are distinguished by place of articulation (velar, palatal, coronal, dental and labial), and by the presence or absence of voicing, aspiration, and nasalization. Thus the set of five stops produced at each place of articulation includes an unaspirated and aspirated unvoiced stop, an unaspirated and aspirated voiced stop, and a nasal. The other sounds are also distinguished by place of articulation. Table 1 shows the principal Sanskrit sounds arranged in rows by place of articulation and in columns by other features. Unusual is our placement of h with semivowels (Table 1, note 4), and the placement of anusvāra with the velars (Table 1, note 5). For further information about the phonetics and structure of the Sanskrit language see Allen (1953); Cardona (2003); Renou (1952); Whitney (1889).

1.2. Devanāgarī script

As in all Brāhmī-derived scripts, in Devanāgarī the basic unit is the orthographic syllable. The orthographic syllable consists of one or more conjoined consonant characters, a vowel diacritic and possibly diacritics indicating a nasal, breath-release, and in Vedic and linguistic texts, tone. Each consonant graph, in the absence of a diacritic indicating another vowel, implies the vowel /a/, the absence of which is indicated by a short diagonal stroke called virāma. For example, क ka, क् k. Vowels are written as diacritics above (के ke, कै kai), below (कु ku, कू kū, कृ kr̥, कॄ kr̥̄, कॢ kl̥), to the left (कि ki), or to the right (का kā, की kī, को ko, कौ kau) of the consonant character. A second series of signs represents vowels not preceded by a consonant (अ a, आ ā, इ i, ई ī, उ u, ऊ ū, ऋ r̥, ॠ r̥̄, ऌ l̥, ए e, ऐ ai, ओ o, औ au). Sequences of multiple consonants are rendered in a single ligature in which the shape of constituent graphs can vary considerably. While in most consonant clusters constituents stack horizontally (त् t + प pa = त्प tpa) or vertically (ङ् ṅ + ग ga = ङ्ग ṅga), some ligatures have idiosyncratic forms opaque to constituent analysis (क् k + ष ṣa = क्ष kṣa). Traditional Sanskrit orthography requires glyphs for representing more than a thousand consonant clusters, and it is not uncommon for there to exist several styles for representing a single cluster.

Table 1. Sanskrit phonetics

Degrees of stricture: stops are contacted, semivowels slightly contacted, spirants slightly open, and vowels open (simply open, more open, most open). Semivowels and vowels are voiced; spirants are unvoiced.

             STOPS                                              SEMIVOWELS³  SPIRANTS  VOWELS¹,²
             unvoiced  unvoiced  voiced   voiced   voiced
             unasp.    asp.      unasp.   asp.     nasal
guttural                                                        ह् h⁴        ः ḥ       अ a, आ ā
velar        क् k      ख् kh     ग् g     घ् gh    ङ् ṅ         ं ṁ⁵         ẖ
palatal      च् c      छ् ch     ज् j     झ् jh    ञ् ñ         य् y         श् ś      इ i, ई ī, ए e, ऐ ai
retroflex⁶   ट् ṭ      ठ् ṭh     ड् ḍ     ढ् ḍh    ण् ṇ         र् r         ष् ṣ      ऋ r̥, ॠ r̥̄
dental       त् t      थ् th     द् d     ध् dh    न् n         ल् l         स् s      ऌ l̥, ॡ l̥̄
labial       प् p      फ् ph     ब् b     भ् bh    म् m         व् v         ḫ         उ u, ऊ ū, ओ o, औ au

Notes:

1. The diphthongs ai and au have, and the monophthongs e and o are considered to have, two places of pronunciation: (i) the glottis, (ii) the palate or lips.

2. Vowels include prolonged lengths called pluta; three pitches udātta, anudātta, svarita; and nasalized variants.

3. Semivowels y, l, v include nasal variants ỹ, l̃, ṽ.

4. With partial stricture and voicing, h shares features with buccal semivowels.

5. Anusvāra is a nasal glide with the velum as its primary articulator.

6. Unaspirated and aspirated retroflex lateral flaps written ळ् ḷ and ळ्ह् ḷh occur intervocalically in R̥gvedic dialect (and in the Nirukta), instead of ḍ and ḍh.


1.3. Romanization

As Sanskrit studies became important in the West, European scholars devised methods to represent Sanskrit in Roman script. These methods are thus independent encodings of Sanskrit phonetics rather than a transliteration of Devanāgarī script or another Indic script. They use sequences of Roman characters and characters with diacritics to represent distinctions in the Sanskrit sound catalogue not representable in the Roman alphabet. In 1894, the Geneva Oriental Congress proposed a standard Romanization of Sanskrit that with minor modifications remains in use today by Sanskrit scholars. The nearly identical U.S. Library of Congress standard Romanization preferred by librarians and Indic linguists has the following conventions:

1. Sanskrit sounds that correspond to normal values for Roman letters are represented by those letters (e.g. b = [b]).

2. The letter h, which by itself indicates a phoneme /ɦ/, is used also to indicate the aspirate series of stops in digraphs such as bh.

3. The coronal consonants are distinguished from the dental by an underdot (e.g. ṭ).

4. A macron indicates a long vowel (e.g. ā).

5. The palatal nasal is written ñ; the velar, ṅ.

6. The palatal sibilant is written ś (formerly, ç).

7. Vocalic r and l are written with an undercircle (r̥, l̥); by an underdot in the Geneva standard.

8. The visarga is written ḥ, the anusvāra, ṃ; ṁ in ISO 15919 and the United Nations Romanization Systems for Geographical Names (UNRSGN).

9. Acute and grave accent marks indicate the udātta and independent svarita accents respectively (yé, kvà); prosodically determined accents are left unmarked.

2. Contemporary encoding systems and their faults

Current encoding persists in being script-based even though contemporary encoding serves linguistic and archiving purposes that transcend mere display. Most contemporary encoding systems for Sanskrit designed for use with the computer are based either upon the Indic script structure represented by Devanāgarī or on the Sanskrit Romanization. The encodings therefore retain features inherent in the scripts themselves. The Kyoto-Harvard system, for example, like most ASCII encodings, retains the encoding of aspirated stops by h-final digraphs, and of the open diphthongs by the digraphs ai and au present in the Romanization itself.¹ As for encodings based upon the Indic-script model, the Indian Script Code for Information Interchange (ISCII) published by the Indian Department of Electronics (DOE) in 1983 adopted phonetic principles to a large extent. ISCII uses a single encoding for the ten major scripts of India while an ALT-character allows for script-selection. The enormous variety of consonant ligatures is produced programmatically by the insertion of a character called halant between the constituent consonants. Yet ISCII retains features of Indic scripts: each consonant graph includes the vowel a in the absence of another vowel diacritic; the virāma indicates the absence of a vowel; and a separate series of codes represents vowels not preceded by a consonant. Unicode copied and expanded upon the ISCII encoding in its representation of Indic scripts. Although Unicode provides separate code-pages for each script, these pages were initially isomorphic and diverged only with the subsequent addition of characters proper to each script. Unicode Indic scripts retain the above-mentioned features of the inherent a and dual series of vowels.

1. Velthuis offers alternatives to digraphs, and WX provides an encoding that is mostly a one-to-one mapping of Sanskrit phonetic segments.

The difficulties with the Devanāgarī and Roman encoding systems for Sanskrit are due to problems in the modes of graphic representation of Sanskrit sounds adopted in Devanāgarī and the standard Romanization themselves. These difficulties become evident by observing the discrepancies between the encoding of Sanskrit embodied in the Devanāgarī script and in standard Romanization. Consider especially the following three points:

1. In the Devanāgarī standards, there are separate characters for vowels when they appear post-consonantally versus when not preceded by a consonant. In the Roman standards, a single character is used in all contexts.

2. In the Devanāgarī standards, post-consonantal /a/ is implicitly indicated by the graph of the preceding consonant, while its absence is explicitly represented by a sign indicating the cessation of speech (virāma). In the Roman standards, the distribution of ⟨a⟩ corresponds exactly to the distribution of the vowel /a/.

3. In the Roman standards, certain single sounds are represented by digraphs: the aspirate stops (kh, gh, ch, jh, ṭh, ḍh, th, dh, ph, bh) and the open diphthongs (ai, au). In the Devanāgarī standards, single characters represent each of these segments.

The common feature of these discrepancies is a departure from the principle of representing a single Sanskrit sound by a single character. Both the Devanāgarī and the Roman standards concur in departing from this principle in one additional case:

4. In both the Devanāgarī and Roman standards the aspirate retroflex lateral flap /ɭ̆ʰ/ is represented by a digraph: ळ्ह् ḷh.

5. An additional discrepancy exists between the encoding of accent in Devanāgarī script and the encoding in standard Romanization. The Romanization encodes lexical or post-prosodic high pitch and independent circumflex, or deep accent. Devanāgarī encodes manifest pitch or surface accent. The failure of scholars to recognize the difference has led to confused explanations of Devanāgarī accentual systems and the obfuscation of genuinely different recitational traditions and dialects (Scharf, 2012).
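The script features noted in points 1 and 2 can be observed directly in Unicode code point sequences. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` and the choice of sample syllables are illustrative assumptions, not part of any standard.

```python
import unicodedata

def describe(text):
    """Return the Unicode character names of the code points in text."""
    return [unicodedata.name(ch) for ch in text]

# A bare consonant letter carries the inherent vowel: one code point for /ka/.
ka = "\u0915"
# Removing the vowel requires an extra sign (virama): two code points for /k/.
k = "\u0915\u094D"
# The same vowel has two encodings: an independent letter and a dependent sign.
initial_aa = "\u0906"        # long a not preceded by a consonant
kaa = "\u0915\u093E"         # k + long a, vowel written as a diacritic
# A ligature such as k + retroflex sibilant + a is encoded as
# consonant + virama + consonant, not as a single character.
ksha = "\u0915\u094D\u0937"

for label, s in [("ka", ka), ("k", k), ("initial long a", initial_aa),
                 ("k + long a", kaa), ("ksha", ksha)]:
    print(label, len(s), describe(s))
```

In this sketch the shorter sound sequence /k/ needs more code points than /ka/, which is the inverse representation described in point 2.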


3. Scripts as encodings of language

The enterprise of encoding is not new. It does not belong exclusively to the transition from the written and printed media to the digital medium. Encoding is a phenomenon inherent in the representation of knowledge in any medium and therefore is apparent in every media-transition. A thinker encodes his thoughts in verbal expression, visual art, or gesture and performance. These primary encodings of thought may be re-encoded in another medium. A student does so when he takes written notes on a lecture. An art or theatre critic does so when he describes visual or performative expressions in writing. In such cases the student or critic selects some information and discards other expressions deemed irrelevant. When an audio recording of a lecture is made, the visual and performative aspects of the lecture are lost. When a linguistic transcription of an audio recording is made, volume, rate and pitch are lost. Conversely, when an actor re-expresses a script in performance, he enriches the script in his interpretation of it by adding variation in voice volume and rate, and gestures. These last examples bring attention to the fact that some media of expression are richer than others. In the transition from a richer medium to a more constricted one, there is an inevitable selection that focuses attention on essential information. In the transition, care must be taken to avoid loss of knowledge while sifting out the extraneous.

Writing was developed initially for administrative purposes and was adapted for use in the transmission of knowledge. While some scripts are ideographic (representing ideas or visual properties), most are phonographic (representing sounds). Scripts designed originally to represent one language were borrowed and adapted to represent others. In India, the primary mode of transmission of knowledge was oral. Writing gradually assumed that function from the second half of the first millennium B.C.E. until movable-type printing was introduced in India at the beginning of the 19th century. Brāhmī script, from which all major Indic scripts are derived, was adapted from Kharoṣṭhī, which itself appears to have been adapted from Aramaic. Scripts evolve as technological innovations provide different tools and substrata for writing. In the meantime languages evolve and phonetic structures that once corresponded to particular written characters diverge.

The writing systems for most languages, therefore, are not ideally suited to the languages they represent. So it is not surprising that neither Devanāgarī nor the standard Romanization is ideally suited to the encoding of Sanskrit, as was shown above in section 2. Hence there is no good reason for traditional scripts to be used as the basis for encoding languages. This is particularly true in the case of Sanskrit, where a strong oral tradition dating back more than three millennia has persisted to the present day, and where a highly developed linguistic tradition provided precise descriptions of the phonetics of the language already two and a half millennia ago. Since we know what the phonetics of the language are, there is no reason to base an encoding scheme on the script, which is a faulty encoding of these phonetics in the first place. Where knowledge of the phonetics of a language is lacking and the written record preserves information unavailable elsewhere, as for example with Egyptian hieroglyphs, one might consider basing encoding on the graphics of the script rather than the phonetics of the language. The same is the case where written heritage is primary, spelling is standardized, and pronunciation varies regionally and deviates from spelling, as with English.

4. Ideal encoding

To prevent loss of knowledge requires determining what information is essential and devising a lossless method of encoding that information while ignoring the extraneous. Efficiency in representing essential information implies the fundamental principle of character encoding, namely, to avoid ambiguity and redundancy. To avoid ambiguity and redundancy requires that an encoding system be characterized by a one-to-one correspondence between characters and items to be encoded, and that all encoded items be of the same kind (e.g., phonemes or written characters). It is essential to consider, therefore, what is being encoded, segments or features.
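The two requirements can be stated as a simple check on a table of (code, item) pairs: a code that stands for more than one item is ambiguous, and an item that has more than one code is redundant. The sketch below is an illustration only; the function name `check_one_to_one` and the toy data, drawn from the script features listed in section 2, are assumptions made for the example.

```python
from collections import defaultdict

def check_one_to_one(pairs):
    """Given (code, item) pairs, report ambiguous codes and redundant items."""
    items_by_code = defaultdict(set)
    codes_by_item = defaultdict(set)
    for code, item in pairs:
        items_by_code[code].add(item)
        codes_by_item[item].add(code)
    # Ambiguity: one code represents several different items.
    ambiguous = {c: sorted(i) for c, i in items_by_code.items() if len(i) > 1}
    # Redundancy: one item is represented by several different codes.
    redundant = {i: sorted(c) for i, c in codes_by_item.items() if len(c) > 1}
    return ambiguous, redundant

# Toy data (illustrative): the Roman letter h used both as a segment and as
# a mark of aspiration, and the two Devanagari signs for the long vowel a.
pairs = [
    ("h", "segment h"),
    ("h", "aspiration"),
    ("\u0906", "long a"),   # independent vowel letter
    ("\u093E", "long a"),   # dependent vowel sign
    ("b", "segment b"),
]
ambiguous, redundant = check_one_to_one(pairs)
print(ambiguous)   # codes with more than one value
print(redundant)   # items with more than one code
```

An encoding that satisfies the principle would return two empty dictionaries.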

In items (1), (3), and (4) in section 2, a single sound is represented by more than one character, and in (2), a sound is inversely represented: that is, the presence of the sound is represented by the absence of a character, and the absence of the sound by the presence of a character. The departure from the principle of a one-to-one correspondence between what is to be represented and the representation signals confusion concerning the principles of encoding. In the digraphs that represent aspirated consonants in Sanskrit, the ⟨h⟩ represents aspiration, a phonetic feature rather than a segment, while by itself ⟨h⟩ represents a phonetic segment. The Devanāgarī representation of the aspirate retroflex lateral flap /ɭ̆ʰ/ suffers the same fault. Similarly in the digraphs ⟨ai⟩ and ⟨au⟩ that represent open diphthongs in Romanization, the graphs ⟨a⟩, ⟨i⟩, and ⟨u⟩ represent subsegments, while independently they represent independent segments.

While the dual use of ⟨h⟩ (and ⟨ह⟩ in Devanāgarī) does not lead to ambiguity because the sequence of consonant plus /h/ does not occur in Sanskrit, the dual use of ⟨i⟩ and ⟨u⟩ in the Romanization is ambiguous. The identical sequence ⟨ai⟩ in taiḥ and manaïcchā, for example, does not itself indicate that it represents a diphthong in the former and a sequence of two independent vowels in the latter. Similarly the sequence of characters ⟨au⟩ represents the sequence of two simple vowels in praüga but represents a diphthong in prauḍha. Although one could disambiguate the Romanization by introducing a diaeresis over the second character to show that the two characters represent two vowels in sequence rather than a diphthong (thus manaïcchā, praüga), such a procedure introduces redundancy. There would then be two ways of representing the phonetic segments /i/ and /u/, one with and one without diaeresis. In practical terms, someone searching for the occurrence of the word icchā would have to search a second time for ïcchā as well, or software developers would have to accommodate the anomalous encoding in their search program. Introduction of the diaeresis to remove the ambiguity is a visual patch that reveals deeper structural problems in the encoding.
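Both problems can be reproduced with a few lines of code. The sketch below assumes a naive segmenter that always prefers a digraph over two single letters; the names `nfc`, `DIGRAPHS`, and `segment` are hypothetical, and the digraph list is limited to the aspirates and open diphthongs of the standard Romanization.

```python
import unicodedata

def nfc(s):
    """Normalize to NFC so that letters with diacritics are single characters."""
    return unicodedata.normalize("NFC", s)

# Two-character units of the Romanization that stand for a single sound.
DIGRAPHS = {nfc(d) for d in
            ["kh", "gh", "ch", "jh", "ṭh", "ḍh", "th", "dh", "ph", "bh",
             "ai", "au"]}

def segment(word):
    """Greedy left-to-right split that prefers digraphs over single letters."""
    word = nfc(word)
    segments, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            segments.append(pair)
            i += 2
        else:
            segments.append(word[i])
            i += 1
    return segments

# Without the diaeresis, the two-vowel sequence is read as a diphthong.
print(segment("prauḍha"))   # ['p', 'r', 'au', 'ḍh', 'a']
print(segment("prauga"))    # ['p', 'r', 'au', 'g', 'a']
# With the diaeresis the vowels are separated, but u now has a second code.
print(segment("praüga"))    # ['p', 'r', 'a', 'ü', 'g', 'a']
# A plain substring search for icchā misses the form written with ï.
print(nfc("icchā") in nfc("manaïcchā"))   # False
```

The segmenter cannot recover the two-vowel reading of the undiacriticized string, and the diaeresis that repairs segmentation breaks the search, which is the trade-off between ambiguity and redundancy described above.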


5. Encoding distinctive elements

Section 3 argued that, for Sanskrit, to encode the phonetics would be more efficient than encoding an Indic script or Romanization. Section 4 pointed out the importance of avoiding ambiguity and redundancy, ideally by achieving a one-to-one correspondence between the encoding and encoded elements. The current section addresses the selection of significant elements to encode.

Information is lost when meaning-bearing distinctions fail to be copied, transmitted, or perceived. The design of a system for the purpose of transmitting or storing information requires first a consideration of what information is needed or desired. It is neither possible nor practical to transmit or store all information. If all information were transmitted or stored, relevant information would be swamped in irrelevant detail. Phoneticians recognize that speech sounds vary in numerous ways from one speaker to the next and even from one utterance (of the same speaker) to the next. Yet despite differences of absolute rate, pitch, volume, etc., speakers recognize classes of speech sounds to be of the same kind and recognize patterns of these kinds to convey linguistically relevant information. Speakers of a particular language normally do not distinguish phones that occur only in complementary distribution in their language. Similarly, the stains, smudges, spills, and creases on the page of a manuscript, as well as the wiggles, trailers, flourishes, and absolute (as opposed to relative) stroke dimensions are irrelevant to the scholar who is deciphering the linguistic content of a manuscript. The structure of an encoding should closely follow the structure of the linguistic units themselves. A character-encoding scheme ideally encodes only the minimally distinctive graphs of the language. A sound encoding scheme ideally encodes only the minimally distinctive phones of the language. In either case, codepoints are assigned only to contrastive units.

5.1. The concept of a phoneme

In determining the set of minimally distinctive Sanskrit sounds to encode, it will be instructive to discuss the concept of a phoneme because the concept is essentially based upon the idea of contrastive units and is considered to be the defining concept of linguistic relevance. Yet an examination of the concept’s limiting parameters reveals shortcomings to adopting the concept as the sole basis of phonetic encoding without some modification. Phonemes are the minimally contrastive segments of sound in a language, on the basis of the contrast between which lexical and grammatical distinctions can be made. The ⟨p⟩ in stop and the ⟨p⟩ in pot represent the same phoneme /p/ in English even though the former is phonetically [p] and the latter phonetically [pʰ]. They are allophones that occur in complementary distribution; the latter in word-initial context, the former elsewhere. In Sanskrit, however, these two sounds represent distinct phonemes and are the basis for distinguishing lexical items such as pala ‘straw, a small unit of weight, volume or time’ from phala ‘fruit, result’. The concept of a phoneme is yoked with two parameters that limit its utility as the sole basis for encoding. The first is that the sounds belong to the same language. The second is that for sounds to be considered contrastive they are required to differentiate semantic content. The issue with these parameters is the definition of a language and of semantic content. The usual definition of a phoneme differentiates languages by differences in style and dialect; hence different dialects are not included in the same phonemic system. Semantic content usually means lexical meaning of ordinary words.
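The contrast between the two languages can be expressed as a search for minimal pairs. The following sketch assumes that words are stored as tuples of phonetic segments; the function name `minimal_pairs` and the two toy lexicons (including the vowel symbol chosen for the English words) are illustrative assumptions.

```python
def minimal_pairs(lexicon, a, b):
    """Return pairs of words that differ only by segment a versus b."""
    words = set(lexicon)
    pairs = []
    for w in sorted(words):
        for i, seg in enumerate(w):
            if seg == a:
                # Substitute b for a in this one position.
                other = w[:i] + (b,) + w[i + 1:]
                if other in words:
                    pairs.append((w, other))
    return pairs

# Toy lexicons: each word is a tuple of phonetic segments.
sanskrit = [("p", "a", "l", "a"), ("pʰ", "a", "l", "a")]   # pala, phala
english = [("s", "t", "ɑ", "p"), ("pʰ", "ɑ", "t")]         # stop, pot

print(minimal_pairs(sanskrit, "p", "pʰ"))   # one pair: pala / phala
print(minimal_pairs(english, "p", "pʰ"))    # no pairs
```

In the toy Sanskrit lexicon the pair pala and phala shows that [p] and [pʰ] contrast; in the toy English lexicon no such pair exists, consistent with their being allophones of one phoneme.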

A number of the phonetic segments included in Table 1 are not phonemes because they occur only in complementary distribution with other sounds in parallel contexts and hence are allophones. The retroflex ळ् ḷ and ळ्ह् ḷh, mentioned in note 6 to Table 1, occur intervocalically in complementary distribution with ड् ḍ and ढ् ḍh in R̥gvedic dialect. Certain members of two subgroups of phonetic segments, sibilants and nasals, occur only non-contrastively in complementary distribution in specific dialects. Among the spirants, अः ḥ, ẖ, and ḫ are allophones of स् s and र् r. Among the nasals, the palatal nasal normally occurs before a palatal stop (sañ-caya), and anusvāra (अं ṁ) generally occurs before a fricative (saṁ-śaya). These nasals occur in complementary distribution with m and hence are allophones of m. Prolonged vowels occur in such pragmatic contexts as return salutation of an upper-caste man, calling from afar, specific ritual situations, and answering a question (Aṣṭādhyāyī 8.2.82-107). Because such paralinguistic content is not regarded as semantically contrastive, prolonged vowels are not considered to be separate phonemes.
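Complementary distribution of the nasals can be checked mechanically by collecting the contexts in which each phone occurs and testing whether those contexts overlap. The sketch below is a simplified illustration: it looks only at the immediately following segment, the function names are hypothetical, and the three-word sample (sañcaya, saṁśaya, sambhava) is far too small to establish the distribution for the language.

```python
from collections import defaultdict

def following_contexts(words, phones):
    """Map each phone of interest to the set of segments that follow it."""
    contexts = defaultdict(set)
    for word in words:
        for i, seg in enumerate(word):
            if seg in phones:
                # '#' marks the end of the word.
                nxt = word[i + 1] if i + 1 < len(word) else "#"
                contexts[seg].add(nxt)
    return contexts

def complementary(contexts, a, b):
    """True if a and b never occur before the same segment in the sample."""
    return contexts[a].isdisjoint(contexts[b])

PALATAL_NASAL = "\u00F1"   # ñ
ANUSVARA = "\u1E41"        # ṁ
PALATAL_SIBILANT = "\u015B"  # ś

# Toy sample: words as lists of Romanized segments.
words = [
    ["s", "a", PALATAL_NASAL, "c", "a", "y", "a"],
    ["s", "a", ANUSVARA, PALATAL_SIBILANT, "a", "y", "a"],
    ["s", "a", "m", "bh", "a", "v", "a"],
]
contexts = following_contexts(words, {PALATAL_NASAL, ANUSVARA, "m"})
print(dict(contexts))
print(complementary(contexts, PALATAL_NASAL, "m"))
print(complementary(contexts, ANUSVARA, "m"))
```

On this sample each nasal occurs before a different class of consonant, so no two of them share a context.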

Although lexical pitch is contrastive in Sanskrit, the differences in the surface pitch that result from different phonotactic rules in the phonetic treatises called prātiśākhya proper to various Vedic schools are not contrastive because they are variants proper to different speech communities (the reciters of various Vedic schools) and arguably to different dialects. Hence distinctions in surface pitch are not phonemic distinctions, because phonemic distinctions belong to the same language, same dialect, and same style.

5.2. Modifying the concept of a phoneme

Contrastive and complementary distribution is always with respect to a specific context. The provision in the definition of the phoneme that the sounds belong to the same language in the strictest sense and that differences in style and dialect are not included in the same phonemic system implies the necessity of specifying the boundaries of the language clearly. If dialects are included in the same language, they must be explained in the same phonological system. If they are considered part of some other language, they must be bracketed. The same clarity of scope is required in framing a phonetic encoding scheme.

One of the provisions in the definition of a phoneme was that for sounds in parallel distribution to be contrastive they must serve to differentiate semantic content in a narrow sense. Such a segregation of semantic content is somewhat arbitrary. Decisions to exclude paralinguistic information were based on the conventions of the Roman alphabet to represent Northwest European languages. Similar decisions had earlier excluded duration, stress, and pitch from the concept of the phoneme, but these were incorporated when phonologists realized the necessity of extending the idea of contrastive distribution to these linguistic attributes in order to accurately represent the minimally contrastive segments of languages such as tonal languages. A comprehensive phonological system of the language should be able to convey whatever information speech conveys. Hence phonetic segments that contrast in conveying paralinguistic information must be accepted as independent phonemes.

For the purposes of devising an encoding scheme for a language as a single whole, it is necessary to broaden the conception of a phoneme to tolerate linguistic variation across dialects as well. If one treated various dialects separately, one could employ higher-level markup to distinguish dialects and distinguish only phones that are contrastive within each dialect separately. It would not be necessary to distinguish retroflex lateral flaps from retroflex stops in an encoding scheme exclusively for the Ṛgvedic dialect, for instance, because the flaps occur uniformly in intervocalic context. Thus when Nirukta 3.11 cites Ṛgveda 2.23.9, which contains the word taḷito with an intervocalic retroflex lateral flap, one could wrap the Ṛgvedic passage in tags that identify the passage as of a distinct dialect from that of the Nirukta passage in which it was embedded. Within the tagged dialect portion, intervocalic /ɖ/ would always be realized as the retroflex lateral flap; outside such tags, it would always be realized as [ɖ]. However, the Nirukta itself uses the retroflex ḷ, ḷh even when not directly citing. In the same passage just cited, the Nirukta continues, sā hy avatāḷayati, using ḷ outside of the Ṛgvedic dialect. The text as received makes no mention/use distinction. With the lack of reliable information about the author’s dialect, one is forced to accept that [ɖ], [ɖʰ] and the unaspirated and aspirated retroflex lateral flaps occur in contrastive distribution. Similarly, an encoding of surface accent at the character level must embrace the range of pitches utilized across Vedic schools and dialects because it is not always practical to use higher-order text-encoding devices to bracket off excerpts from the texts of various Vedic schools and dialects.
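The dialect-tagging alternative described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element name `cit`, the attribute `dialect`, the dialect labels, and the function names are hypothetical, and the letters `q`, `Q` (retroflex stops) and `L`, `|` (retroflex lateral flaps) are assumed SLP1-style symbols. The underlying text stores only the stop, and the flap is derived inside passages tagged as Ṛgvedic.

```python
import re
import xml.etree.ElementTree as ET

# Assumed SLP1-style vowel letters (illustrative, not a complete inventory).
VOWELS = "aAiIuUfFxXeEoO"
# Assumed symbols: q, Q = retroflex stops; L, | = retroflex lateral flaps.
FLAP_FOR_STOP = {"q": "L", "Q": "|"}
INTERVOCALIC_STOP = re.compile(r"(?<=[%s])([qQ])(?=[%s])" % (VOWELS, VOWELS))


def realize(text, flapping):
    """Turn intervocalic stops into flaps when the dialect requires it."""
    if not flapping:
        return text
    return INTERVOCALIC_STOP.sub(lambda m: FLAP_FOR_STOP[m.group(1)], text)


def realize_passage(xml_source, flapping_dialects=("rv",)):
    """Realize a tagged passage; each element inherits its parent's dialect."""
    root = ET.fromstring(xml_source)
    pieces = []

    def walk(node, inherited):
        dialect = node.get("dialect", inherited)
        flapping = dialect in flapping_dialects
        pieces.append(realize(node.text or "", flapping))
        for child in node:
            walk(child, dialect)
            # Text after a child element belongs to the enclosing dialect.
            pieces.append(realize(child.tail or "", flapping))

    walk(root, None)
    return "".join(pieces)


# Hypothetical markup: a Nirukta phrase followed by a tagged Rgvedic word.
SAMPLE = '<p dialect="nirukta">sA hy avatAqayati <cit dialect="rv">taqito</cit></p>'
print(realize_passage(SAMPLE))
```

Under these assumptions the tagged word comes out with the flap, but the Nirukta's own word comes out with the stop, whereas the received text has the flap there as well. That mismatch is the reason the paragraph gives for distinguishing the sounds at the character level instead of relying on markup.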

The corpus of Sanskrit texts includes various dialects of Vedic and classical Sanskrit and various genres of specialized literature. If a consistent encoding scheme is to be devised for the whole range of Sanskrit, sounds that contrast over that whole range must be encoded distinctively. It is acceptable to adopt as the basis of determining contrastive elements in such an encoding scheme a broader conception of a phoneme in which by a language is meant a specified range of dialects, and in which semantic content includes paralinguistic content.

Prolonged vowels were not considered to be separate phonemes in the strict sense because paralinguistic content was not regarded as semantically contrastive. However, such a segregation of semantic content is somewhat arbitrary. The mere fact that representation of Northwest European languages in the Roman alphabet did not include signs to signify paralinguistic content such as return salutation, calling from afar, response to questions and ritual context is not sufficient justification for a principled exclusion of such paralinguistic content from semantics in the definition of a phoneme.

In Sanskrit linguistics and critical theory semantic content includes paralinguistic content, and linguistic texts include explicit discussion of the sounds used to designate it (Scharf, 2009).

Since the semantic content of Sanskrit includes paralinguistic content, trimoraic duration, which conveys some distinction in paralinguistic content, is contrastive and so phonemic. Since the language in question ranges over various dialects of Vedic and classical Sanskrit, these dialects merge within that range. Hence the retroflex ḷ and ḷh, and certain spirants, which are in complementary distribution only insofar as such dialects are distinguished, are now in contrastive distribution. Likewise, surface accentuation, which is non-contrastive within a particular recitational tradition, is contrastive when set side by side with differently accented text from another recitational tradition within a single language that encompasses both traditions. And since the palatal nasal ñ in technical terms in linguistic treatises (e.g. añ, ñit) occurs in contrastive distribution with n and m, it is phonemic in the broader sense.

6. Conclusions and applications

Although computers today manipulate linguistic and textual data in sophisticated ways, most current encoding systems reflect orthographic design factors to the exclusion of more relevant information-processing principles. Even the most recent standardized encoding systems reproduce deficiencies inherent in the traditional orthographies themselves. Yet display is only one of numerous functions that computers now perform. Computers engage in data transmission and perform linguistic processing, such as spell-checking, machine translation, content analysis, indexing, and morphological and syntactic analysis. Therefore display for a human reader should no longer be considered the primary determinant of an encoding scheme. Rather, language should be encoded in such a way as to facilitate automatic processing.

Writing systems are encodings of language. Yet few of the world’s writing systems were designed for the languages that they represent in extant texts. Most are inadequate adaptations, and none was designed to meet the sophisticated linguistic processing demands of today. Therefore, in general, it is fruitful to design encoding schemes based on the phonology of the language rather than its writing system. By using the concept of a phoneme in the broad sense, we designed encoding schemes capable of showing all the contrastive information in the corpus of Sanskrit texts. Unlike Devanāgarī script and the encoding schemes based upon it but like Romanizations, the Sanskrit Library basic encoding scheme (SLP1) represents vowels, including the vowel /a/, by unique characters, regardless of whether they follow a consonant or not. Unlike most Romanizations, SLP1 represents diphthongs and aspirated stops by a single character as well. Unlike both Devanāgarī and Romanizations, it represents the aspirated retroflex lateral flap by a single character too. Yet because SLP1 is limited to ASCII characters, accents, nasalization and other modifications of basic sound segments are represented by modifier characters that represent features rather than segments. SLP2 is a strictly segmental encoding that represents each segment by a single codepoint. SLP3 is a strictly featural encoding that represents only phonetic features; segments are represented by bundles of features rather than by single codepoints. These schemes and a more detailed exposition of fundamental issues in the coding of natural language texts can be found in Scharf and Hyman (2011). Table 2 shows the first line of the Bhagavadgītā in Devanāgarī, standard Romanization, and SLP1.

Table 2. Bhagavadgītā 1.1 in Devanāgarī, standard Romanization, and SLP1

Devanāgarī   धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।
Roman        dharmakṣetre kurukṣetre samavetā yuyutsavaḥ .
SLP1         Darmakzetre kurukzetre samavetA yuyutsavaH .
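The relation between the SLP1 and Roman rows of Table 2 can be illustrated with a small transcoder. The sketch below is not the Sanskrit Library's software: the mapping table is a partial, illustrative one covering only unaccented segments, and the function names `slp1_to_roman` and `segment_count` are hypothetical. It shows that each SLP1 character stands for one segment, whereas the Roman transliteration spends two characters on aspirated stops and diphthongs.

```python
import unicodedata

# Partial, illustrative SLP1 -> Roman table (accents and other modifiers omitted).
SLP1_TO_ROMAN = {
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au",
    "M": "ṃ", "H": "ḥ",
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}


def slp1_to_roman(text):
    """Transcode character by character; unknown characters pass through."""
    roman = "".join(SLP1_TO_ROMAN.get(ch, ch) for ch in text)
    # Normalize so that letters with diacritics are single precomposed codepoints.
    return unicodedata.normalize("NFC", roman)


def segment_count(word):
    """In this subset of SLP1, one character corresponds to one sound segment."""
    return sum(1 for ch in word if ch in SLP1_TO_ROMAN)


line = "Darmakzetre kurukzetre samavetA yuyutsavaH ."
print(slp1_to_roman(line))
print(segment_count("Darmakzetre"), len(slp1_to_roman("Darmakzetre")))
```

Because the table is one-to-one on the SLP1 side, counting or indexing segments needs no lookahead. Going the other way, from Roman to SLP1, would have to decide whether a sequence such as dh is one aspirated stop or two segments.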

The Sanskrit Library Phonetic encoding schemes simplify linguistic processing. Hence the Sanskrit Library utilizes these encodings for the storage of a corpus of Sanskrit texts and for linguistic processing in our digital Sanskrit library (sanskritlibrary.org). We segregate these functions from the functions of data-entry and display and employ transcoding software to interface between the functions. Besides simplifying linguistic processing, the segregation of functions allows for a high degree of flexibility in data-entry and display options, thereby permitting users to select the display and input method of their choice. Thus we transcode to a variety of Indic scripts and Romanization in Unicode for display purposes and employ various meta-transliterations, Indic Unicode, as well as clickable input keyboards for data input.
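A minimal sketch of transcoding from a phonetic storage encoding to a display script may make the segregation of functions concrete. The code below is an assumption-laden illustration, not the Sanskrit Library's transcoder: the function name `slp1_to_devanagari` and the tables are hypothetical, the tables cover only unaccented segments, and mapping `.` to the daṇḍa simply follows the SLP1 row of Table 2. It shows the work the display side must do because Devanāgarī writes /a/ implicitly after a consonant, uses dependent vowel signs, and needs a virāma for a consonant not followed by a vowel.

```python
# Independent vowel letters (used when no consonant precedes).
INDEPENDENT = {
    "a": "अ", "A": "आ", "i": "इ", "I": "ई", "u": "उ", "U": "ऊ",
    "f": "ऋ", "F": "ॠ", "x": "ऌ", "e": "ए", "E": "ऐ", "o": "ओ", "O": "औ",
}
# Dependent vowel signs (used after a consonant); /a/ is written as nothing.
VOWEL_SIGN = {
    "a": "", "A": "\u093e", "i": "\u093f", "I": "\u0940", "u": "\u0941",
    "U": "\u0942", "f": "\u0943", "F": "\u0944", "x": "\u0962",
    "e": "\u0947", "E": "\u0948", "o": "\u094b", "O": "\u094c",
}
CONSONANT = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "N": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "Y": "ञ",
    "w": "ट", "W": "ठ", "q": "ड", "Q": "ढ", "R": "ण",
    "t": "त", "T": "थ", "d": "द", "D": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "z": "ष", "s": "स", "h": "ह",
}
# Anusvara, visarga, and (by assumption, following Table 2) "." as danda.
OTHER = {"M": "\u0902", "H": "\u0903", ".": "\u0964"}
VIRAMA = "\u094d"


def slp1_to_devanagari(text):
    """Transcode unaccented SLP1 text to Devanagari codepoints."""
    out = []
    pending = False  # True while the last consonant awaits a vowel or virama
    for ch in text:
        if ch in CONSONANT:
            if pending:
                out.append(VIRAMA)  # consonant cluster: suppress inherent /a/
            out.append(CONSONANT[ch])
            pending = True
        elif ch in VOWEL_SIGN:
            out.append(VOWEL_SIGN[ch] if pending else INDEPENDENT[ch])
            pending = False
        else:
            if pending:
                out.append(VIRAMA)  # consonant with no following vowel
                pending = False
            out.append(OTHER.get(ch, ch))
    if pending:
        out.append(VIRAMA)
    return "".join(out)


print(slp1_to_devanagari("Darmakzetre kurukzetre samavetA yuyutsavaH ."))
```

The state variable is needed only on the display side. The stored SLP1 text has no inherent vowel and no context-dependent vowel shapes, which is one way to read the claim that the phonetic encoding simplifies processing.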

After we conducted a collaborative project to add 68 additional characters to the Unicode Standard for the display of Vedic texts, the most ancient literary heritage of India, and developed our encoding schemes, the Sanskrit Library developed inflectional morphology software to decline nominals and conjugate verbs. By applying that software to the 170,000 stems in Monier Williams’ A Sanskrit-English Dictionary, we generated a full-form lexicon of eleven million words which serves as the foundation of a complementary morphological analyzer. We then linked each word in the digital editions of texts in the Sanskrit Library (numbering more than 100 substantial Vedic and Classical texts) that had inter-word prosodic changes (sandhi) analyzed to the dictionary. We are now in the process of building a multi-dictionary interface that integrates all of the major bilingual and monolingual Sanskrit dictionaries, specialized dictionaries, and traditional thesauri so that one can look up a word in all of them by entering the stem in a search interface once. We link unanalyzed Sanskrit sentences to the Sanskrit Heritage site’s parser (sanskrit.inria.fr/DICO/reader.html). Through close collaboration with the Sanskrit Heritage site, we are developing a syntactically tagged database of Sanskrit texts and beginning to engage in digitized statistical linguistic analysis of Sanskrit. While this work just begins to approximate the sort of work that has been conducted on modern Western European languages for decades, we hope that it will inspire the application of similar principles to work on rare and historical languages.


References

Allen W. S. (1953). Phonetics in ancient India. London, Oxford University Press.

Bühler G. (1896). Indische Palaeographie: von circa 350 a. Chr.–circa 1300 p. Chr. (Vols. 1. Bd., 11. Heft). Strassburg, K. J. Trübner.

Cardona G. (2003). Sanskrit. In G. Cardona, D. Jain (Eds.), The Indo-Aryan languages, pp. 104–160. London, Routledge.

Cardona G., Jain D. (Eds.). (2003). The Indo-Aryan languages No. 2. London, Routledge.

Dani A. H. (1963). Indian palaeography. Oxford, Clarendon Press.

Renou L. (1952). Grammaire de la langue védique. Lyon, IAC.

Salomon R. (1995). On the origin of the early Indian scripts. Journal of the American Oriental Society, Vol. 115, No. 2, pp. 271–279.

Salomon R. (1998). Indian epigraphy: A guide to the study of inscriptions in Sanskrit, Prakrit, and the other Indo-Aryan languages. New York, Oxford University Press.

Salomon R. (2003). Writing systems of the Indo-Aryan languages. In G. Cardona, D. Jain (Eds.), The Indo-Aryan languages, pp. 67–103. London, Routledge.

Scharf P. M. (2009). Modeling Pāṇinian grammar. In G. Huet, A. Kulkarni, P. M. Scharf (Eds.), Sanskrit computational linguistics: First and second international symposia: Rocquencourt, France, October 2007; Providence, RI, USA, May 2008, Vol. 5402, pp. 95–126. Berlin, Springer.

Scharf P. M. (2012). Vedic accent: underlying versus surface. In F. Voegeli, V. Eltschinger, D. Feller, M. P. Candotti, B. Diaconescu, M. Kulkarni (Eds.), Devadattīyam: Johannes Bronkhorst felicitation volume, pp. 405–426. Bern, Peter Lang.

Scharf P. M., Hyman M. D. (2011). Linguistic issues in encoding Sanskrit. Delhi, Motilal Banarsidass.

Scharfe H. (2002). Kharoṣṭhī and Brāhmī. Journal of the American Oriental Society, Vol. 122, No. 2, pp. 391–393.

Sharma R. (2002). Brāhmī script: Development in north-western India and central Asia. Delhi, B. R. Publishing. (2 vols.)

Singh A. K. (1991). Development of Nagari script. Delhi, Parimal Publications.

Voigt R. (2005). Die Entwicklung der aramäischen zur Kharoṣṭhī- und Brāhmī-Schrift. Zeitschrift der Deutschen Morgenländischen Gesellschaft, Vol. 155, pp. 25–50.

Whitney W. D. (1889). Sanskrit grammar: Including both the classical language, and the older dialects, of Veda and Brahmana (2d ed.). Cambridge MA, Harvard University Press.
