A Scalable Prioritized Encoding System For Fidel

[[ still a work in progress... ]]

The problem of limited address space was introduced previously, along with the concept of a ``Business Fidel''. This follow-up article will discuss one principle for selecting the members of a business fidel.

To introduce the method of this paper, let's change our last analogy from egg cartons to buckets. We may review our problem again by considering Fidel to be a 2 gallon bucket of water. ASCII would be an empty 1/2 gallon bucket and Extended ASCII an empty 1 gallon bucket. Our problem is then summed up by the simple truth that ``you cannot pour 2 gallons of water into a 1 gallon bucket''. No matter what you do you are going to lose something; there is simply no way around it.

This is the age-old problem of the Fidel programmer in the ASCII world. The trick of the trade is to try to use 2 ASCII buckets to hold all of the water (Fidel). Depending on the text environment (DOS, Windows, Mac, etc.) and the software, this works with varying degrees of success, but it has never been a very satisfying solution for the programmer. There will always be some instance when only one ASCII bucket of Fidel can be used.

Now we are back to the other aspect of the age-old problem: what goes in that first bucket? Is it an ASCII or Extended ASCII bucket? How do you decide? How can you be prepared for both, or for another system that gives you 2 buckets, or a 1-1/2 gallon bucket? Will your competitors choose the same? It is a nightmare!

But it is not just the nightmare of Fidel. Even in DOS, file names can only be in UPPERCASE, so half of the alphabet is missing; still, it would have been a comparatively easier decision to choose to lose the lowercase letters. The fact that lowercase letters have higher addresses than UPPERCASE will become interesting very shortly. A theory and method are available to give us a way to prioritize Fidel, or any other script, so that the letters are chosen ahead of time for any size of bucket we are given.

In prioritizing Fidel we are accepting that not every drop of water is the same. Because we use some letters more than others, their value to us for writing can be considered greater. Perhaps then we should pretend it is fruit, or gems of different quality, in the bucket, and we want to make sure we will always pour out the best ones first. Or whatever; let's just say a bucket of Fidel.

If ``usage'' is to be our system of measure, then Huffman encoding is an algorithm that ensures the characters used most will have the first addresses. The Huffman method was devised for encoding variable width binary addressing. Since we cannot use variable width addressing, we will implement the Huffman method with left-padded zeros to fill out a 9 bit width sequence.
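As a minimal sketch of the idea (and not of any particular implementation): once every code is left-padded with zeros to the same 9 bit width, the only property of the Huffman tree that survives is the frequency ranking itself, so assigning fixed-width addresses reduces to sorting characters by occurrence count. The names and table size below are assumptions for illustration.

    #include <stdlib.h>

    #define NSYMBOLS 512              /* 9 bits of address: 2^9 codes */

    struct entry {
        int  symbol;                  /* index into the full character set */
        long count;                   /* occurrences observed in the sample */
    };

    /* qsort comparator: descending by occurrence count.  Ties are left
     * in whatever order qsort produces; as footnote (2) notes, ordering
     * within a frequency set is arbitrary. */
    static int by_count(const void *a, const void *b)
    {
        long ca = ((const struct entry *)a)->count;
        long cb = ((const struct entry *)b)->count;
        return (cb > ca) - (cb < ca);
    }

    /* Assign fixed-width 9 bit addresses in order of frequency:
     * address 0 goes to the most frequent character. */
    void assign_addresses(struct entry table[], int n, int address[])
    {
        qsort(table, n, sizeof table[0], by_count);
        for (int i = 0; i < n; i++)
            address[table[i].symbol] = i;
    }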

So we have a theory and a system; now we need statistics. Another age-old problem: where are they? Probably we are going to have to come up with our own. To do so we enter another field of study that would tell us how to select a representative sample. Unfortunately the present writer never took this class and will have to make something up. The field of ``making something up'' is called ``engineering'', and fortunately the present writer is an engineer!

OK, obviously books, magazines, newspapers and emails are written with words, which transport the letters. The words used depend heavily on the language of the writer, the subject material (consider an article on propulsion vs gardening), and also on the writing style and education of the writer. To serve the widest audience possible, no one language, book, author, or subject material should make the difference for any single character address.

For now, practicality governs otherwise and we will narrow the scope of our statistical base. First, though we are coding for Fidel, and there are many languages that use Fidel, let's reduce the scope to just Amharic. Immediately the consequence can be foreseen: series not used in Amharic (Qe, De, Ge, Xe) or used infrequently (Ke, Ve) are going to end up at the bottom of the Fidel bucket (the end of the address table). A person wanting to write in Oromiffa, Bilin, Tigrigna, etc. will have to make compromises when only 1 ASCII bucket is available.

When we narrow the scope further by not considering some subject material, say political or religious, a similar effect happens. This can be good or bad; it all depends on your purposes. If you know a priori that you will be writing only in Bilin, then the address system can be optimized for Bilin. Likewise again for subject and author(1).

Today, for this exercise, I will use only the SERA files in my personal account. Fortunately most of the SERA software will print statistics, so the job is simplified even further. The details of the Huffman algorithm we need not discuss; it is the results that are of value to us.


Samples

The samples available for this experiment come from Ethiopian Review and from Ullendorf's ``Chrestomathy'', specifically a section on William Shakespear. The sizes of the articles are similar; note however that the word count of the Shakespear sample is nearly double that of the Ethiopian Review article. This highlights right away the impact that subject content and vocabulary can have in writing. It should not surprise us then if the character counts are found to be very different.

Statistics(2) for the samples allow us to explore the issues to be contended with. Up to the first 80 - 90 characters we find nearly the same set of letters. As we look closer and closer to the beginning of the list, we find letters occurring at closer to the same relative frequencies (as normalized by the address system). This agreement is a property of the language. An exception comes from a difference in writing style: the Ethiopic wordspace is assigned to address 0 in the Shakespear document but has no use in the magazine article. The Ethiopic period occurs with only a moderately different frequency between the two; the Ethiopic comma, however, is used in the Ethiopian Review article but not in the Shakespear document. Fidel documents, we see, may then also have a sensitivity to the time period in which they were written.

Let's say we need to devise a single font table of 128 characters that we will use to print or view both of these documents. What does each lose? The loss for the ER article will be at least 22 characters, since it uses only 151 out of 358. The Shakespear article loses at least 31 of the 160 letters required to reproduce it faithfully. In a font for both, however, if we select only the top 128 letters, further letters from either document will be lost whenever a letter unique to the other document (such as the wordspace) has a higher occurrence rate than some other letter in the union set.

Upon careful examination of the statistics we find 19 additional characters of the Shakespear document lost within the 128 range, as they are displaced by higher frequency characters unique to the ER set or drawn from the union set. The ER article loses only 11 such characters within its 128 range for the same reason. An interesting region is at the border of the 128 limit. At the border (125->136) there are 12 characters with the same frequency, yet only 3 can make it into the final font set. The decision is made here to elect into the final set the 3 characters that are shared by both documents.
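A minimal sketch of this kind of union selection, under the same illustrative assumptions as the earlier listing: merge the two documents' counts, sort descending, and break ties at the border in favor of characters shared by both documents.

    #include <stdlib.h>

    #define FONT_SIZE 128

    struct merged {
        int  symbol;
        long total;      /* summed count from both documents       */
        int  shared;     /* 1 if the symbol occurs in both samples */
    };

    /* Descending by merged count; among equal counts, prefer the
     * characters shared by both documents, as elected at the
     * 125->136 border above. */
    static int by_total_then_shared(const void *a, const void *b)
    {
        const struct merged *x = a;
        const struct merged *y = b;
        if (x->total != y->total)
            return (y->total > x->total) - (y->total < x->total);
        return y->shared - x->shared;
    }

    /* After sorting, tab[0..FONT_SIZE-1] is the economized font. */
    void select_font(struct merged tab[], int n)
    {
        qsort(tab, n, sizeof tab[0], by_total_then_shared);
    }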


Failings of Blind Prioritization

The merit of prioritizing Fidel would be that documents can be reproduced with the most satisfaction possible when fewer and fewer letters are available.

Does this really work?

One important issue that arises is: what do you do when a letter is not available? Can economization be extended further? If what we really wish to serve is communication, and not faithful republishing, then yes, Fidel allows for further prioritization. Consider that Fidel has redundant phonic representatives: 2 he series (3 in Amharic), 2 vowel series, and 2 each of the se and Se series. If, for example, the occurrences of se and `se were both high enough to make it into the top 200, one could be placed at the end of the list and thus make room for another letter. This type of replacement hampers communication only aesthetically, a cost that (to most people, hopefully) does not outweigh the gains.
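As a sketch of such a folding, one might keep a small substitution table mapping each displaced redundant series to the representative that stays in the economized set. Apart from the se/`se pair named above, the SERA-style names below are assumptions for illustration; the actual pairs folded would be dictated by the statistics.

    /* Hypothetical fold table: each displaced redundant series maps
     * to its surviving phonic representative.  Only the se/`se pair
     * comes from the discussion above; the remaining names are
     * assumed SERA-style labels, shown only for illustration. */
    struct fold { const char *from; const char *to; };

    static const struct fold fold_table[] = {
        { "`se", "se" },    /* second se series -> first se series  */
        { "`Se", "Se" },    /* assumed second Se series -> first    */
        { "He",  "he" },    /* assumed second he series -> first    */
        { "`he", "he" },    /* assumed third he series (Amharic)    */
        { "`a",  "a"  },    /* assumed second vowel series -> first */
    };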

Let's assume this scaling were taken to the extreme and only 26 letters would be available; still very common in the present day (consider user names, pagers, DOS file names, cash registers, ATMs, airline CRSs, etc.). The first 26 letters in our prioritized table would no longer be the 26 most valuable to us. The best way to then serve communication would be to use Fidel in a Latin type system. In such a system we might choose one set of 7 vowels, and one each of the consonant series (still dropping redundant representations). This makes reading highly unnatural and unsatisfactory, if at least possible. A better choice than using all 7 vowels may be to use 1 vowel and then 6 diacritical marks in place of the others.


Conclusions

The thrust of this experiment was to demonstrate the sensitivity of character occurrences to text content. Our samples, being of nearly equivalent length but only 1 page long, truly permit us to conclude only that there is a strong sensitivity of character occurrences in samples of limited length.

So how much is enough? Before comparing character occurrences in different samples, the length of the content in a single sample should be sufficient such that extending the sample by 1,000 characters no longer significantly affects the occurrence outcome.
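A minimal sketch of such a sufficiency test, with an assumed threshold (EPS below is not from the text): after each 1,000 character extension, recompute the relative frequencies and stop growing the sample once none has moved significantly.

    #include <math.h>

    #define NSYM 512
    #define EPS  0.0005   /* assumed significance threshold */

    /* Call after each 1,000 character extension of the sample.
     * counts[] holds the running occurrence counts, total the running
     * character total, and prev[] the frequencies from the previous
     * call.  Returns 1 once no relative frequency has moved by more
     * than EPS, i.e. the sample is "sufficient" in the sense above. */
    int sample_is_sufficient(const long counts[NSYM], double prev[NSYM],
                             long total)
    {
        double worst = 0.0;
        for (int i = 0; i < NSYM; i++) {
            double f = (double)counts[i] / (double)total;
            double d = fabs(f - prev[i]);
            if (d > worst)
                worst = d;
            prev[i] = f;    /* remember for the next extension */
        }
        return worst < EPS;
    }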

The limited length of these samples serves then to exaggerate the point to be made, which is simply that character occurrences are sensitive to text content and writing styles, and that one sample should not govern character selection for limited sets. It would be expected that for samples of the ``sufficient length'' described, of different content and composed by different authors, the relative character occurrences for the same language become closer to the same. Sensitivity to content would then only be significant for low frequency letters. This supposed result may only be meaningful when choices for fonts need be made at and near the border regions of ASCII (128 characters) and Extended ASCII (256 characters).

Another conclusion that may come from the experiment is that it does not make practical sense for the end user to create a limited font set that contains all members of a given series. That is, if one of a particular consonant series is selected for the economized set, it is not necessary, or helpful, that all of its syllabic companions follow.

A universalized limited set for all users of the writing system may not be practical or in the best interest of the user. That is, a limited character set will be of greater value to a user if the limited set is optimized for his or her language. A break is made here from an implicit premise that the limited set be optimized for the writing system as a whole. Were such ever attempted, a weighted sample selection from the different languages would need to be considered.


Footnotes

(1) Effectively, this type of optimization is what has been done to devise
    the Haddis Encoding used by EthiO-Systems.

(2) The sorting of the character occurrences was executed by the C language's
    qsort routine.  The ordering of characters with like occurrences is
    arbitrary; each character in the same frequency set is equally eligible for
    the address of any other character within the set.  The ultimate ordering
    within a frequency set is simply a consequence of the onset ordering.  This
    is also the result in the Huffman method.