[windev] Converting from greek....

Serge Wautier serge at wautier.net
Thu Mar 26 14:23:18 GMT 2009


Interesting. A few ideas come to mind (although you have probably already
sorted out these issues, and btw you didn't ask for my opinion. But I rarely
wait for people to ask :-):

- accents and friends are usually called 'diacritics'.

- universita' is indeed different from università from a code point
standpoint. However, if you type the accent with a dead key (e.g. because
your keyboard doesn't have the letter à), Windows automatically combines it
with the letter into à. Which means you'll rarely find universita' in text
typed that way. Of course, it's not impossible, since your text may come
from a wide variety of sources.

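- In Unicode parlance, a followed by a combining grave accent (U+0300) is
called the decomposed form, while à as a single code point (U+00E0) is
called the precomposed form. From a sorting (aka collation) point of view,
Windows considers these two strings equivalent. But not necessarily
equivalent to universita: it depends on whether you ask it to ignore
diacritics (NORM_IGNORENONSPACE; the LINGUISTIC_IGNOREDIACRITIC flag is
Vista only IIRC). Note that universita' typed with a plain apostrophe
(U+0027) is neither form: the apostrophe is an ordinary character, not a
combining mark. See the CompareString() API for details. BTW,
CompareString() and friends also take care of thingies such as German ß vs
ss.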

- FoldString() and GetStringTypeEx() can help you get rid of diacritics:
turn your string into decomposed form with FoldString(MAP_COMPOSITE)
(à -> a followed by a combining accent). Then iterate over the characters
and skip the ones with the C3_DIACRITIC class. You may even use more
restrictive filtering, such as keeping only C1_ALPHA characters.

- Greek-to-Latin transliteration could use the same mechanism to reduce your
Greek string to a bunch of unaccented Greek letters, which are then very
easy to transliterate to Latin.

- Actually, LCMapString(LCMAP_SORTKEY) is probably your best friend: It
generates a 'weight' string that uniquely describes your original string in
terms of sorting. You can decide to strip out part of that weight (such as
the diacritics) and use it as your index key.

- As far as sorting/collation is concerned, Windows always requires a
language parameter. It picks the collation rules that apply to this
language.
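
The decompose-then-filter idea from the FoldString()/GetStringTypeEx() item
can be tried outside Windows too. What follows is not the Win32 API: it is a
minimal Python sketch that assumes Unicode NFD normalization can stand in for
the decomposition step and that the Unicode "combining" property can stand in
for the C3 diacritic class. The names strip_diacritics and index_key are made
up for illustration.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose the text, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_key(text):
    """Hypothetical search key: case-fold first (ß becomes ss), then strip diacritics."""
    return strip_diacritics(text.casefold())

print(strip_diacritics("università"))               # universita
print(index_key("Straße") == index_key("STRASSE"))  # True
```

A trailing apostrophe, as in universita', is a plain U+0027 and not a
combining mark, so it survives this filter and would need a rule of its own.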

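The Greek item can be sketched the same way: strip the accents first, then map
letter by letter. The table below is purely illustrative and is an assumption
of this sketch; it is not ELOT or any other standard, and it ignores digraphs
and other context-dependent rules. The name transliterate_greek is made up.

```python
import unicodedata

# Illustrative letter-by-letter table; NOT ELOT or any other standard.
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "τ": "t", "υ": "y", "φ": "f", "χ": "ch", "ψ": "ps", "ω": "o",
}

def transliterate_greek(text):
    """Reduce Greek text to unaccented lowercase letters, then map them to Latin."""
    # casefold() lowercases and also turns a final sigma into a regular sigma.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    bare = (ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters missing from the table (digits, Latin letters) pass through.
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in bare)

print(transliterate_greek("Πανεπιστήμιο"))  # panepistimio
```

Because several transliteration schemes exist, the table is the part to swap
out once you settle on one.
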
HTH,

Serge.
http://www.apptranslator.com



> -----Original Message-----
> From: Roberto Tirabassi [mailto:rtirabassi at 3di.it]
> Sent: Thursday, March 26, 2009 9:36
> To: Serge Wautier
> Cc: Windows Developers (Mailing List)
> Subject: Re: [windev] Converting from greek....
> 
> Hi Serge...
>     Well, we develop a native XML database with full-text retrieval
> in it. I know that I can use UTF-8 to encode the Greek language (we
> currently use UTF-8 in other scenarios) but... it makes term extension
> harder, that is, considering two terms equal or similar if a few chars
> change or if they are written with or without modifiers (accents and
> so on).
>     An example is better than 1000 words. Let's talk about my
> language.... In Italian the term "university" is written as
> "università" but can also be written as "universita'", using the quote
> char as the accent. This is just an example, but it makes us face the
> problem that the same word can be written uppercase, lowercase and so
> on... especially if data is coming from old databases and so on...
>     More... the German letter 'ß' is frequently written as 'ss' (a
> "modern" way to express the same char). More and more... Eastern
> European languages (ISO-8859-2 and so on) or northern ones have many
> chars that are Latin-1 letters modified by a symbol. Users of other
> languages often don't know how to type those chars...
>     That means that...
>     ...when our users want to look at the full index term list (that
> we call the vocabulary) we want to show each term in its exact "face",
> but when I run my searches, search extensions and so on... I have to
> "normalize" all terms. We believe that... Latin-1/7-bit ASCII should be
> the most significant normalization level.
>     That's the target.
>     That's why we tried to follow the Greeklish way, but we also found
> out that Greeklish isn't standard and there are at least 3 different
> ways to transliterate Greek to Greeklish.
>     We heard about the ELOT standard for Greek transliteration but I
> still haven't found anything about it...
> 
>     Here's my trouble...
> 
>                            Roberto Tirabassi.
> --
> "In periods of great change, the apprentices inherit the Earth
> while the specialists find themselves extremely well prepared to face
> a world that no longer exists". [Eric Hoffer]
> 
> 
> Skype: roberto.tirabassi
> 
> 3D Informatica, Via Speranza 35,
> 40068, S.Lazzaro di Savena - Bologna, Italy
> Voice: +39051450844, Fax: +39051451942
> WWW: http://www.3di.it
> Documentation: http://www.3di.it/manuali/ - http://wiki.3di.it
> FTP: ftp://ftp@ftp.3di.it, Download:/3di, Upload:/incoming
> --
> 
> 
> 



