UTF-8 Info For Ethiopic

Introduction

This page is provided to promote UTF-8 support for Ethiopic (Ge'ez/Fidel) electronic text. ``UTF-8'' is the ``UCS Transformation Format'', an 8 bit encoding system for 16 bit text (``UCS'' is the ``Universal Character Set''). But why is 16 bit text important? Previously, PC operating systems would recognize only 7 or 8 bit character code systems (which limits the number of letters the computer can work with to 128 or 256). In these limited environments developers had to make do with what was available from the operating system. This meant breaking the Fidel syllabary into a group of 2-9 font sets or even breaking the fidels themselves into pieces by creating diacritic marks. While this was extremely ingenious on the part of the developers and allowed for almost every kind of electronic publishing, the computer was still not yet conquered. The computer had merely been tricked into displaying Roman text with a Ge'ez typeface. The text was not yet Ge'ez.

Conquest:- 1996

Shortly after the 100th anniversary of the Adowa Victory, Fidel gained a universally recognized character code system and the computer had at last been conquered. Or was the conquest just beginning?

The computer operating systems of 1997 (and languages such as Java, Limbo, and Alef) no longer have the ``ASCII'' or ``ANSI'' limitations. They can support Fidel as Fidel. Now that we are no longer working with limited systems, it is upon us to start using them. How to use the new operating system resources is what this page is about. Contributions from people willing to spend a little time experimenting are most welcome!

Look here for a glossary of terms used in this document.

Linux Consoles & NLS

The Linux operating system supports UTF-8 text streams. This means it will recognize Ethiopic file and directory names in addition to document text. You may create files and directories and do anything you would normally do on a UNIX system, with 65,536 (2^16) letters available to work with. However, in Linux consoles only 512 characters may be used at a single time (this may be a limit of the CGA/VGA system in x86 architecture, or of terminals).

512 characters (supported after Linux v1.3.28) are fortunately enough for both English and Ethiopic scripts. GohaTibeb, Dashen Engineering, and the Ethiopian Science and Technology Commission have so far contributed fonts for Linux use; they can be found at the ftp archive:

ftp://ftp.ethiopic.org/pub/fonts/linux/

where the fonts may be downloaded individually. You may prefer to download the complete package, which gives more detail on Ethiopic under Linux.

Linux also uses National Language Support (NLS) whereby software and the operating system can adjust their dialog language for individual users. Linux's UTF-8 support and ease of use make it an ideal test bed for NLS with the Ge'ez script languages. The latest NLS package being developed with Addis Abeba University's Computer Science Department may be downloaded from:

ftp://ftp.ethiopic.org/pub/linux/nls/


Other UNIXes and GNU

The UNIX world is moving towards Unicode and UTF formats. IBM's AIX is known to use UTF-8 streams natively (an Ethiopic tester is needed!). While Solaris and SGI have made strides in multilingualism it is not known if the latest OSs support UTF yet (help needed!).

The evolving HURD also supports UTF-8 streams but to date has not been evaluated for Ethiopic. The never quite complete Plan 9 and its follow-up Brazil are UTF-8 native systems. UTF-8's place in the UNIX future is clearly evident.

GNU, which is crafting the HURD, is responsible for a large library of UNIX resources that are often preferred over the vendor supplied equivalents. GNU is rapidly internationalizing its software using the portable object approach to NLS. UTF-8 streams will be essential for supporting Eritrean and Ethiopian languages. The Addis Abeba University Computer Science department will shortly be working with GNU in this effort.


UXterm

The Unicode X-Term is an X11 resource that will interpret UTF-8 streams on a variety of UNIX operating systems (even those that do not use UTF-8 natively). UXterm comes with its own Unicode font, but Ethiopic is not included. However, you may download two Ethiopic fonts specially tailored for UXterm from:

ftp://ftp.ethiopic.org/pub/fonts/uxterm/
Once both UXterm and the fonts are installed you may invoke UXterm as in this example:

  uxterm -fn -admas-uxterm12-bold-r-normal--12-120-100-100-m-104-ethiopic-unicode -e tcsh &

UXterm is generally satisfactory for viewing Ethiopic UTF-8 text; however, line breaks may appear where they are not expected (usually not a problem in HTML). 9term is the Plan 9 terminal, also supporting UTF-8, that has been ported to X11. Unfortunately 9term is seen to die on Solaris and Linux systems when it tries to read the Ethiopic portion of UTF-8 text. These two examples highlight the need to evaluate UTF-8 applications in the Ethiopic (3-byte) text range.
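
Why Ethiopic falls in the 3-byte range can be seen by encoding a few characters by hand. The following is a minimal sketch in Java and is not taken from any of the applications mentioned above; the class name Utf8Probe and its methods are hypothetical, and surrogate pairs are deliberately not handled.

```java
// Hypothetical helper class; the name Utf8Probe is illustrative only.
class Utf8Probe {
    // Encode one 16-bit character into UTF-8 by hand.
    // Assumption: surrogate pairs are not treated specially in this sketch.
    static byte[] encode(char c) {
        if (c < 0x80) {                 // up to 7 bits: one byte, same as ASCII
            return new byte[] { (byte) c };
        } else if (c < 0x800) {         // up to 11 bits: two bytes
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F)) };
        } else {                        // up to 16 bits: three bytes (Ethiopic lands here)
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F)) };
        }
    }

    // Format bytes as space-separated, two-digit uppercase hexadecimal.
    static String hex(byte[] bytes) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (i > 0) sb.append(' ');
            if (b < 0x10) sb.append('0');
            sb.append(Integer.toHexString(b).toUpperCase());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An ASCII letter, a Latin-1 letter, and the first Ethiopic syllable.
        char[] samples = { 'A', '\u00E9', '\u1200' };
        for (int i = 0; i < samples.length; i++) {
            byte[] utf8 = encode(samples[i]);
            System.out.println(utf8.length + " byte(s): " + hex(utf8));
        }
    }
}
```

The Ethiopic letter U+1200 comes out as the three bytes E1 88 80, while 'A' stays a single byte. An application that has only been tried with one and two byte characters never exercises the third branch, which is where problems like the ones described above would show up.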


Web Browsing

A number of web browsers are beginning to speak UTF-8. Netscape running under Windows NT is able to display UTF-8 text. The Accent web browser, which also comes as a Netscape plugin, needs further study. The Tango browser can read Ethiopic in UTF-8 now; see the sample files and setup information at the Ethiopian News Headlines.

When Lynx 2.7 is released it will support UTF-8 text for web browsing. You may add on the extensions to version 2.6 by downloading them from

http://www.tezcat.com/~kweide/lynx-chartrans/
You will need to recompile Lynx with the ``Slang'' options and be sure that Slang itself is compiled with -DSLANG_MBCS_HACK. If you have Ethiopic fonts installed for either Linux or UXterm you may proceed:
  1. start Lynx at the command line:
    lynx http://www.cs.abyssiniacybergateway.net/fidel/let/yoHens-utf8.html

  2. set your "display (C)haracter Set" option to "UNICODE UTF 8".
  3. set the "Raw 8-bit or CJK m(O)de" option to "ON".
Alternatively, if you have compiled Lynx as described above but cannot install the fonts, Lynx is now capable of converting UTF-8 text into SERA transliteration. In that case:
  1. start Lynx as above, adding the option -assume_charset=unicode-1-1-utf-8
  2. display (C)haracter Set : ISO Latin 1
  3. Raw 8-bit or CJK m(O)de : OFF


Windows '97, NT, OS/2 Warp 4, & Mac OS8


Working With Java

Ge'ez output from Java interpreters can be viewed in Linux consoles and UXterms.

SelamAlem.java: Note that in JDK 1.0.2 System.out.println was unable to print Ethiopic UTF-8 text to stdout. Perhaps this will be corrected in JDK 1.1. After numerous permutations, calling writeUTF on a stream opened on a named file was found to work, as shown below:


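import java.io.* ;
class SelamAlem {
  public static void main(String args[]) throws IOException {
     // Open alem.out for writing through a DataOutputStream
     DataOutputStream out = new DataOutputStream( new FileOutputStream("alem.out") ) ;
     // writeUTF writes a 2-byte length, then the string encoded as UTF-8
     out.writeUTF("\u1220\u120B\u121D\u1361\u12D3\u1208\u121D!") ;
     out.close() ;  // flush the data and release the file
  }
}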
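
One thing to be aware of with writeUTF is that it puts a two-byte length in front of the encoded text, so alem.out holds 24 bytes (00 16 followed by 22 bytes of UTF-8) rather than the bare text. The following is a hedged sketch of an alternative, assuming JDK 1.1 or later where OutputStreamWriter is available; the class name SelamAlemWriter, the method writeUtf8, and the file name alem-utf8.out are hypothetical.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical companion to SelamAlem; class and file names are illustrative.
class SelamAlemWriter {
    static final String GREETING = "\u1220\u120B\u121D\u1361\u12D3\u1208\u121D!";

    // Write text to any stream as bare UTF-8, with no length prefix.
    static void writeUtf8(OutputStream target, String text) throws IOException {
        Writer out = new OutputStreamWriter(target, "UTF-8");
        out.write(text);
        out.flush();   // push the encoded bytes through to the target stream
    }

    public static void main(String[] args) throws IOException {
        FileOutputStream file = new FileOutputStream("alem-utf8.out");
        writeUtf8(file, GREETING);
        file.close();
    }
}
```

The resulting file holds only the 22 bytes of encoded text, so it can be sent directly to a UTF-8 console or UXterm without the extra two bytes at the front.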

The Unicode escape string above is not overly practical to type by hand. If the above code is written with a 1997 release of Multilingual Emacs and saved with a .java or .JAVA extension, the conversion to Unicode escapes is automatic. The ``sera2any'' resource also offers Java output options.
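
For readers without either tool, the conversion itself is simple enough to sketch. The following is only an illustration of the idea and not how Multilingual Emacs or sera2any do it; the class name UnicodeEscaper and the method escape are hypothetical.

```java
// Hypothetical helper; the class name UnicodeEscaper is illustrative only.
class UnicodeEscaper {
    // Replace every character outside 7-bit ASCII with a Java Unicode escape:
    // a backslash, the letter u, and four uppercase hexadecimal digits.
    static String escape(String text) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);               // ASCII passes through unchanged
            } else {
                String hex = Integer.toHexString(c).toUpperCase();
                sb.append("\\u");
                for (int pad = hex.length(); pad < 4; pad++) {
                    sb.append('0');         // left-pad to four digits
                }
                sb.append(hex);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the greeting from SelamAlem in its escaped source form.
        System.out.println(escape("\u1220\u120B\u121D\u1361\u12D3\u1208\u121D!"));
    }
}
```

Feeding it text typed in an Ethiopic-capable editor yields a string that can be pasted between the quotes of a Java string literal.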