Mozilla Language Enabling Feature - Arabic/Hebrew (Bi-Di) Language Enabling

Editor: Franck Portaneri <franck@langbox.com> Last Update: Jan 18th, 1999


Developers:

[Not every item is filled in. The important thing is to know who covers what and to understand what is still uncovered. I have added the names of the people found in the Mozilla.org schedule. If I have missed someone or made a mistake, please let me know...]


 

Official Schedule from Mozilla.Org

Feature Owner:
Alexander Khalil  <iskandar@ee.tamu.edu>
Franck Portaneri <franck@langbox.com>
WinFE:
Barak Ori <barak@comfy.co.il>
XFE:
Franck Portaneri <franck@langbox.com>
Mark Leisher <mleisher@crl.nmsu.edu>
MacFE:
Adil Allawi <adil@diwan.com> has started an in-house project and might show a beta at the Gitex show. He is open to collaboration with the Mozilla team.
XP:
Arabic : Franck Portaneri <franck@langbox.com> and Mark Leisher <mleisher@crl.nmsu.edu>
Hebrew : Dotan Dimet <dotan@usa.net>
QA:
Alexander Khalil  <iskandar@ee.tamu.edu>
Anoosh Hosseini <anoosh@gpg.com>

JKL <jklnet@usa.net>
Doron Shikmoni  <doron@erez.cc.biu.ac.il>
Jonathan Rosenne  <rosenne@netvision.net.il>
Dov Grobgeld <dov@orbotech.co.il>
Ariel Magnum <amagnum@bigfoot.com>
Shay Elkin <antil_za@mailandnews.com>
 
Document:
Alexander Khalil  <iskandar@ee.tamu.edu>  (Alex, agree ???)

If you want to participate :

  1. Visit the mozilla.org site, especially http://www.mozilla.org/community.html
  2. Subscribe to the netscape.public.mozilla.i18n newsgroup (mailto:mozilla-i18n-request@mozilla.org?subject=subscribe)
  3. Have a look at http://www.mozilla.org/docs/refList/i18n/scripts.html and http://www.mozilla.org/docs/refList/i18n/schedule.html
  4. Download the source tree and build it on your system
  5. Contact the project owner by e-mail, cc mozilla-i18n@mozilla.org, to introduce yourself.

Specifications: Most of the support is common to Arabic and Hebrew, because both languages are bidirectional (Bi-Di).
Of course, the charsets are not the same, and neither is the final rendering step, which is more complex for Arabic because of the "glyph shaping determination". So this part of the document is split into two sections, Arabic and Hebrew :
 


Arabic specific :

Document Charset:

Several charsets are commonly used on the web for Arabic. We have decided to support the following:
 

Unicode : See http://www.unicode.org
It is the next-generation charset standard : the new layout engine, NGLayout, uses UCS-2 internally (in contrast to the current layout engine, which deals with multiple encodings internally).
Mark Leisher <mleisher@crl.nmsu.edu> is working on this specific issue.
ISO-8859-6 : See http://www.langbox.com/iso8859-6.html
It is an international standard adopted by the Arab community, as well as under the Unix X11 and Mac environments. It is commonly used on many web sites, such as :
ASMO 449+ : See http://www.langbox.com/asmo449.html
It is a national standard, fully compatible with ISO 8859-6.  All sites using ISO 8859-6 are directly readable in this format. However, some additional characters (Arabic digits, punctuation signs...) are added in the ASMO codeset.
cp1256 : See http://www.itsnet.com/~qamus/codepages/codepage_win95.htm
It is the code page Windows uses for fonts, and it is supported by many web sites, such as most sites developed or hosted on Arabic Windows machines...
Arabic-mac Code Page  (Is there a specific name?) See http://www.itsnet.com/~qamus/codepages/codepage_mac.htm
It is the script code the Macintosh uses; it is compatible with ISO 8859-6 and ASMO 449+.

ISIRI 3342 : (Anoosh, any URL in mind? ) It is a Farsi codeset, not yet adopted by ISO but adopted by the Iranian Group of Normalization. It is also used on the Web with the PMosaic browser. It is the current 8-bit standard for Farsi. The Farsi language cannot be handled by ISO 8859-6 alone.
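
As a rough illustration of how the same Arabic text differs across the charsets listed above, the sketch below uses Python's standard codecs (iso8859_6, cp1256, mac_arabic), which are assumed here to correspond to ISO-8859-6, cp1256 and the Macintosh Arabic script code. It is not Mozilla code, and the sample word is arbitrary. ASMO 449+ and ISIRI 3342 are left out because the Python standard library does not appear to ship codecs for them. Each byte string decodes to the same Unicode text, which can then be held as 16-bit units, roughly the way a UCS-2 based layout engine would hold it.

```python
# Hypothetical sample word; the codec names are Python's, not Mozilla's.
WORD = "\u0633\u0644\u0627\u0645"  # seen, lam, alef, meem

# Display name -> Python codec name (an assumed correspondence).
CODECS = {
    "ISO-8859-6": "iso8859_6",
    "cp1256": "cp1256",
    "Arabic Mac": "mac_arabic",
}


def encode_all(text):
    """Return the byte form of `text` in each legacy Arabic charset."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}


def to_ucs2_units(text):
    """Return the text as a list of 16-bit code units (BMP text only)."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


encoded = encode_all(WORD)
for name, data in encoded.items():
    # Byte values may differ per charset, but each decodes to the same text.
    assert data.decode(CODECS[name]) == WORD
    print(name, data.hex())

print([hex(unit) for unit in to_ucs2_units(WORD)])
```

The practical consequence is that the document charset must be known before the bytes mean anything; once decoded, the rest of the pipeline can work on one internal representation.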

Mail Charset:

We have decided to use ISO 8859-6 as the Mail Charset since it is the de-facto standard common to all platforms.

Front-End Font Encoding

For Arabic, there is not really a Font Encoding definition: even though the codeset has been defined and fixed, the font itself must include many more glyphs than there are characters in the codeset. This is due to the "glyph shaping" characteristic of the Arabic language. So, depending on the software implementation, we can find different font set definitions.  At LangBox, we used to have 2 levels of font encoding, according to the device font capabilities and the requested quality :
For example, to read text on the web, the second set is quite enough. For publishing or printing purposes, it is preferable to use the first one.  Some ISO-8859-6-8 fonts are supplied with the AraMosaic browser on Unix, and can be used with Mozilla.
So, we propose the following :
XFE
ISO-8859-6-8
WinFE
Arabic Windows fonts (used under Arabic Windows license) - Or any Free TTF fonts (any pointer here???)....
MacFE
Arabic Mac fonts
Printing
ISO-8859-6-16
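
To see why an Arabic font needs more glyphs than the codeset has characters, the sketch below builds a table of contextual glyphs from the Unicode "Arabic Presentation Forms-B" block, using Python's unicodedata module, and applies a deliberately naive shaping rule. This is an illustration only: it uses Unicode presentation forms rather than the ISO-8859-6-8 or ISO-8859-6-16 font sets, it ignores ligatures such as lam-alef and all diacritics, and the function names are hypothetical.

```python
import unicodedata

FORMS = ("isolated", "final", "initial", "medial")


def build_form_table():
    """Map each base Arabic letter to its contextual presentation glyphs."""
    table = {}
    for cp in range(0xFE70, 0xFF00):
        decomp = unicodedata.decomposition(chr(cp)).split()
        # Keep only entries such as "<initial> 0628": one tag, one base letter.
        if len(decomp) != 2:
            continue
        tag = decomp[0].strip("<>")
        if tag in FORMS:
            base = chr(int(decomp[1], 16))
            table.setdefault(base, {})[tag] = chr(cp)
    return table


def shape(text, table):
    """Pick a contextual glyph for each letter from its two neighbours."""
    out = []
    for i, ch in enumerate(text):
        forms = table.get(ch)
        if forms is None:
            out.append(ch)  # not a shapeable letter, pass through
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # A letter joins forward only if it has an initial form.
        joins_prev = "initial" in table.get(prev, {}) and "final" in forms
        joins_next = "initial" in forms and "final" in table.get(nxt, {})
        if joins_prev and joins_next:
            tag = "medial"
        elif joins_prev:
            tag = "final"
        elif joins_next:
            tag = "initial"
        else:
            tag = "isolated"
        out.append(forms.get(tag, ch))
    return "".join(out)


table = build_form_table()
beh, alef = "\u0628", "\u0627"
print(len(table[beh]), "glyphs for BEH;", len(table[alef]), "glyphs for ALEF")
shaped = shape("\u0633\u0644\u0627\u0645", table)
print([hex(ord(c)) for c in shaped])
```

One character code can therefore map to up to four glyphs, which is the reason the font set has to be larger than the codeset.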

Host Operating Systems Consideration:

There are two types of host operating systems :

The advantage of using an Arabic OS is that all GUI widgets and keyboard input will also work properly in Arabic. The system Arabic fonts can be used, or new fonts can be added, provided they follow the same fontset as the system's.

Detail Design : Existing Netscape's DaVinci Win32 Bi-Di implementation

Posted by Catalin Rotaru on Sep 28, 1998

It is strongly recommended to look at these diff files (diff between Communicator 4.04 and a Win32-only Bi-Di implementation). They are great for understanding the effort involved in implementing a Bi-Di Mozilla.

Detail Design : Introduce the new Arabic Charset :

 See Frank Tang's document, How To Add Additional Charset : http://www.mozilla.org/docs/refList/i18n/addcharset.html

Detail Design : Define the Bi-Di API to implement in order to cover both Hebrew and Arabic needs.

At this stage, I think that the best solution would be to define a Mozilla-specific API that could later be implemented using specific system libraries (UNIX CTL, Arabic Windows, Arabic MacOS...).
Here is a draft for an API definition proposal.
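
To make the idea of a platform-neutral API more concrete, here is one possible shape for it, written in Python for brevity. It is not the draft proposal mentioned above: every name in it (BidiBackend, reorder_line, PassThroughBackend, NaiveRtlBackend, select_backend) is hypothetical. The point is only that the layout code calls one interface and each platform supplies its own implementation behind it.

```python
from abc import ABC, abstractmethod
import unicodedata


class BidiBackend(ABC):
    """Hypothetical interface the layout code would call."""

    @abstractmethod
    def reorder_line(self, logical: str) -> str:
        """Return the line in the order it should be handed to the renderer."""


class PassThroughBackend(BidiBackend):
    """For hosts whose own text-drawing call already reorders the text."""

    def reorder_line(self, logical: str) -> str:
        return logical


class NaiveRtlBackend(BidiBackend):
    """Placeholder for a from-scratch implementation on a Latin-only host.

    It only reverses lines that contain no left-to-right letters or digits;
    mixed lines would need a real Bi-Di algorithm and are returned unchanged.
    """

    def reorder_line(self, logical: str) -> str:
        classes = {unicodedata.bidirectional(c) for c in logical}
        if "L" in classes or "EN" in classes:
            return logical
        if classes & {"R", "AL"}:
            return logical[::-1]
        return logical


def select_backend(host_has_bidi: bool) -> BidiBackend:
    """Choose an implementation once, at start-up."""
    return PassThroughBackend() if host_has_bidi else NaiveRtlBackend()


hebrew = "\u05e9\u05dc\u05d5\u05dd"
print(select_backend(True).reorder_line(hebrew) == hebrew)
print(select_backend(False).reorder_line(hebrew) == hebrew[::-1])
```

Keeping the choice of backend in one place also gives a natural spot to avoid reordering a line twice when the host system already does it.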

Detail Design : Find public source code or write new code from scratch for the Bi-di API

If we cannot use an existing Arabic system library (on a pure Latin operating system, for example), then the API must be implemented from scratch or from existing public code (if it exists and is usable).

New: 15-Jan-1999 : Dov Grobgeld <dov@imagic.weizmann.ac.il> announces the first alpha version of FriBidi, a Free BiDi library that adheres closely to the Unicode BiDi algorithm. See http://imagic.weizmann.ac.il/~dov/freesw/FriBidi for more info.

However, under such systems, the GUI side (dialog boxes, text input forms...) will behave only in Latin (no dual keyboard management)

Detail Design :  Use an HTML Explicit or Implicit description of the Right to Left management:

This part should determine if Mozilla Arabic support expects that all the RTL/LTR management is done as :

But this point should be in accordance with the HTML 4.0 definition. Please send your feedback here; this is really an open subject that needs more input and discussion...

Detail Design :  Extend the Mozilla libi18n and layout source code with the Bi-Di API

The API function calls must be embedded within the Mozilla source tree to get Bi-Di and Arabic support built in. This is a complex part where the following issues must be taken into account:
 

 


Hebrew specific :

This part has been created directly from Dotan Dimet's document, "A Proposal For Preliminary Hebrew Support In Mozilla" (URL??), to which I made some light modifications (Dotan, please send me your comments).

Document Charset:

Several charsets are commonly used on the web for Hebrew. We have decided to support the following:

ISO-8859-8 :
This is a VISUAL standard (according to RFC 1555) : apparently, 98% (??? to be verified) of Hebrew-language documents on the internet use the webfont or visual encoding to display Hebrew. This codeset is the same as ISO 8859-8-i, but the Bi-Di rendering process has already been applied to the data stored within the HTML document. Thus, the Bi-Di process must NOT be done a second time, and we just have to display the data as is, using an ISO 8859-8 font set. This support should be very easy to implement, and if so many sites really use it, it must be done first. However, the data cannot be used for editing purposes since the input sequence is lost.
It is commonly used on many web sites, such as :
  • (any URL here...)
ISO-8859-8-i :
It is an international standard adopted under the Unix X11, Windows and Mac environments. It is used on web sites, such as :
  • (any URL here, guys...)
This codeset is an IMPLICIT codeset, meaning that the rendering process has to follow the Bi-Di algorithm to reorder both Latin and Hebrew letters.
ISO-8859-8-e:
EXPLICIT encoding: apparently not used
CP-1255
Default under Hebrew Windows -
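
The distinction between these labels is less about byte values than about the order in which the text is stored. The sketch below, in Python, records that distinction in a small lookup table and reports whether a Bi-Di pass is still needed after decoding. The table and the function name are illustrative only; treating CP-1255 as logically ordered is an assumption, not something stated above.

```python
# Label -> (Python codec, ordering of the stored text). Illustrative table.
HEBREW_CHARSETS = {
    "iso-8859-8": ("iso8859_8", "visual"),
    "iso-8859-8-i": ("iso8859_8", "implicit"),
    "iso-8859-8-e": ("iso8859_8", "explicit"),
    "cp-1255": ("cp1255", "implicit"),  # assumption: stored in logical order
}


def decode_hebrew(raw, label):
    """Decode `raw` and say whether the Bi-Di algorithm must still be run."""
    codec, ordering = HEBREW_CHARSETS[label.lower()]
    text = raw.decode(codec)
    # Visual data is already in display order; reordering it again would
    # scramble it, so only the logical orderings request a Bi-Di pass.
    needs_bidi = ordering != "visual"
    return text, needs_bidi


sample = bytes([0xF9, 0xEC, 0xE5, 0xED])  # four Hebrew letters
for label in HEBREW_CHARSETS:
    text, needs_bidi = decode_hebrew(sample, label)
    print(label, [hex(ord(c)) for c in text], needs_bidi)
```

The same bytes decode to the same letters under every label; only the flag changes, which is why the label on the page matters so much.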

Mail Charset:

We have decided to use ISO 8859-8 as the Mail Charset since it is the standard on all platforms for data exchange (RFC 1555).

Front-End Font Encoding

XFE
ISO-8859-8
WinFE
ISO 8859-8
CP-1255
MacFE
ISO 8859-8

Detail Design

 By Dotan Dimet (Email: dotan@usa.net )  (Modified by Franck Portaneri <franck@langbox.com> - Dotan, any comments???):

1 - Support of Hebrew Visual : This means adding support for "visual" display of the iso-8859-8 charset.

Currently, most Hebrew-language documents on the internet use the webfont or visual encoding to display Hebrew. The Visual encoding method does not rely on the OS or windowing environment for Hebrew support. In fact, it actively ignores such support by requiring the user to install special fonts and the page creator to write his Hebrew text in reverse (if he's using an application with Hebrew support) and to use HTML tags such as PRE and NOBR to handle line-breaking. Despite the hassle, this lowest-common-denominator de-facto standard is in such wide use that it has been ratified officially, and Israeli standards bodies have determined that the following META tag should be used to label such pages:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-8">

Mozilla doesn't recognize this tag. Or rather, when it sees it, it sets the encoding to "Western (iso-8859-1)" and treats the Hebrew text as a standard (Western) 8-bit character set, without applying any Bi-Di algorithm. However, if the special "web fonts" are chosen for this encoding, the pages will be readable.
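
For readers who want to experiment with the labelling described above, here is a minimal sketch of reading the charset from such a META tag with Python's html.parser module. It is not how Mozilla parses pages; the class and function names are hypothetical, and the fallback to iso-8859-1 only mirrors the "Western" behaviour described in the previous paragraph.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the charset of the first Content-Type META tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        # CONTENT looks like "text/html; charset=iso-8859-8".
        for part in (attrs.get("content") or "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()


def sniff_charset(html, default="iso-8859-1"):
    """Return the declared charset, or `default` when none is declared."""
    parser = MetaCharsetParser()
    parser.feed(html)
    return parser.charset or default


page = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-8">'
print(sniff_charset(page))
print(sniff_charset("<p>no declaration</p>"))
```

A browser that recognises the label can then route the page to a dedicated Hebrew (visual) encoding entry instead of falling back to Western.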

Problems with this method include line-breaking (it must be controlled by HTML tags and must not be done automatically by the display), printing (on systems with Hebrew support the Bi-Di algorithm kicks in, reversing the text), and font choice (the limited selection of special web fonts is rather ugly).

The two big advantages of this method are that it should work on systems without any built-in Hebrew support, and that it is the de-facto standard.

The suggestion is to add support for this charset to the user interface. Instead of overriding the "Western" encoding, the user should have a separate entry for "iso-8859-8 (visual)" where he can install his web fonts. A good improvement on this would be to bypass the font/language association and let the user use any installed Hebrew fonts to view pages. This, in fact, is what the Hebrew version of Internet Explorer allows you to do. You'll still need to install fonts if your system has no Hebrew support (and you'll probably still see the page title and any form elements messed up), but if you have a Hebrew-aware system, you'll get more choice.

The second level of this "Visual" support should be to make it available on Hebrew operating systems, either by disabling the system Bi-Di rendering in the TextOut (or equivalent) function, or by performing a reverse transformation on the Visual line to get back the logical (Implicit) one and letting the OS render it correctly (though this is a little tricky and resource-consuming).
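
The reverse transformation mentioned above can be sketched as follows. This Python version is a simplification under stated assumptions: the line has a right-to-left base direction, only letters and European digits count as left-to-right, and neutrals, mirrored characters and explicit codes are not handled. The function name is hypothetical.

```python
import unicodedata

# Bi-Di classes treated as left-to-right runs in this simplified sketch.
LTR_CLASSES = {"L", "EN"}


def visual_to_logical(line):
    """Recover logical order from a visually stored RTL line."""
    out = []
    run = []
    # Step 1: reverse the whole line, which puts the Hebrew back in
    # reading order but leaves Latin words and numbers backwards.
    for ch in reversed(line):
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            run.append(ch)
        else:
            # Step 2: flush the pending LTR run, reversed back to normal.
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


logical = "\u05e9\u05dc\u05d5\u05dd abc"   # Hebrew word, space, "abc"
visual = "abc \u05dd\u05d5\u05dc\u05e9"    # same line as stored visually
print(visual_to_logical(visual) == logical)
```

Because the operation touches every character of every line, it illustrates why this route is described as resource-consuming compared with simply switching off the system's own reordering.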

2. - Support Hebrew Implicitly: This means adding support for the logical or "implicit" interpretation of iso-8859-8. Documents written with this method are not reordered when viewed with applications that DON'T have Hebrew support; they are shown in input order. The charset tag used should be "iso-8859-8-i", and the Bi-Di algorithm should be used to present this text. It consists of support for characters that implicitly set the text's direction (e.g. Latin characters are considered LTR ("Left-To-Right") characters and Hebrew characters RTL ("Right-To-Left") characters, while digits and punctuation marks carry weaker directionality that the algorithm resolves from context). In fact, the Implicit coding represents and stores the exact sequence of keys pressed by the user when he/she wrote the text. Support for this encoding is necessary for text editing.

On operating systems with Hebrew support, this implicit support is already there and the Hebrew text will be displayed correctly, but without Bi-Di support within Mozilla, text selection for cut/paste operations and mouse pointing will not work properly. Here we should also take care that the Bi-Di process is not performed twice on the same line (in Mozilla and in the OS TextOut (or equivalent) function).

On standard (English) operating systems, if you use a font that the system knows is Hebrew to look at some text in the browser, the text will be displayed in the order it was written (and so cannot be read correctly).
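
Implicit support works from a per-character directional class. As an illustration, the Python sketch below reads those classes from the Unicode character database through the unicodedata module and derives a paragraph's base direction from its first strong character, which is the rule the Unicode Bi-Di algorithm uses when no direction is given. Note that Unicode classes digits and most punctuation as weak or neutral rather than strongly left-to-right. The function names are hypothetical.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode Bi-Di class of every character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]


def base_direction(text):
    """Guess the paragraph direction from the first strong character."""
    for cls in bidi_classes(text):
        if cls == "L":
            return "LTR"
        if cls in ("R", "AL"):  # Hebrew letters are R, Arabic letters are AL
            return "RTL"
    return "LTR"  # no strong character at all: default to left-to-right


sample = "a1! \u05d0\u0628"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(repr(ch), cls)

print(base_direction("123 \u05e9\u05dc\u05d5\u05dd"))
```

Digits and punctuation do not decide the direction on their own, which is why a line that starts with a number followed by Hebrew is still treated as right-to-left here.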

3 - The Fiddly Bits: These include support for tricky directionality codes, HTML 4 stuff, CSS(?), Forms, and Javascript.

4- The support of Hebrew Explicit: This is really an optional case. Apparently, it is not really used for Web documents, unless someone can explain or give some input here. It consists of support for codes that explicitly set the text's direction (codes that exist in iso-8859-8 and Unicode, as well as those in HTML 4) and that should be included to force specific nested LTR or RTL sub-strings within a line. The Bi-Di algorithm should attempt to interpret these codes and bypass the implicit ordering of characters to render its output text. The charset tag used could be "iso-8859-8-e".
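
For reference, the explicit codes on the Unicode side are a handful of invisible control characters. The Python sketch below names them and shows how a sub-string could be wrapped to force its direction, or how the codes could be stripped. Only the Unicode controls are shown; the iso-8859-8-e equivalents are not covered, and the helper names are hypothetical. In HTML 4 the corresponding mechanisms are the dir attribute and the BDO element.

```python
import unicodedata

LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"  # embeddings and their terminator
LRO, RLO = "\u202d", "\u202e"                 # overrides
EXPLICIT_CODES = {LRE, RLE, PDF, LRO, RLO}


def embed(text, direction):
    """Wrap `text` in an explicit embedding of the given direction."""
    opener = RLE if direction == "rtl" else LRE
    return opener + text + PDF


def strip_explicit(text):
    """Remove explicit directional codes, leaving only implicit ordering."""
    return "".join(ch for ch in text if ch not in EXPLICIT_CODES)


part_number = embed("ABC-123", "ltr")
line = "\u05e9\u05dc\u05d5\u05dd " + part_number
print([unicodedata.name(ch) for ch in line if ch in EXPLICIT_CODES])
print(strip_explicit(line) == "\u05e9\u05dc\u05d5\u05dd ABC-123")
```

A renderer that does not support the explicit level could at least strip the codes this way and fall back to implicit ordering.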

 


Reference and Related Specification:

W3C Documents:

RFC:

Character Sets:

MIME Charset Name

Related Engineering Information:

Related Information and Resources:

 
 


Open Issues

Free Hebrew fonts for XFE
Any URL, pointer...??

Free Resources:

 XFE fonts:  http://www.langbox.com/AraMosaic/mozilla/fontXFE (See README file)
 

Schedule:

To be determined ...

