Converting text to basic latin (aka removing accents) with JavaScript

I was recently working on a decision engine for a quiz site (KwizMi.com). The quizzes let you define a table of questions and answers, and players have to try to guess all the answers within a given period of time.

The problem I stumbled upon is best illustrated by the following example:

  • Question: Which "GP" plays at centre-back for Barcelona?
  • Answer: Gerard [Piqué]

The square brackets in KwizMi syntax mean that the text "Piqué" need only appear as a substring of the answer for it to be marked as correct. The problem is that an English user with an English keyboard will know the answer as "Pique" (and, without memorizing keyboard shortcuts, wouldn't even be able to type the é), and for the purpose of the quiz this is good enough. A Spanish user may be able to type the é correctly, and that should be marked as correct too.

The obvious solution is to build a regular expression that replaces accented characters with their unaccented counterparts, and that would work fine for most cases. On further inspection, however, the Unicode standard defines well over 1,000 characters with "LATIN" in their names.

The Unicode standard defines normalization rules for decomposing accented characters, but these don't decompose some ligatures (AE / OE). Instead, I've used the Unicode character names to generate this table of mappings with a Perl script (credit: David Chan):

var latin_map = {
  'Á': 'A', // LATIN CAPITAL LETTER A WITH ACUTE
  'Ă': 'A', // LATIN CAPITAL LETTER A WITH BREVE
...
  'ᵥ': 'v', // LATIN SUBSCRIPT SMALL LETTER V
  'ₓ': 'x', // LATIN SUBSCRIPT SMALL LETTER X
};

Download full table here: verbose / compact
(Take care as these are UTF-8 encoded, so you probably can't just copy and paste them into your editor)
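
For comparison, here is a rough sketch of the normalization route mentioned earlier, which shows why a hand-built table is still useful. It assumes an engine that supports String.prototype.normalize (ES2015 or later); the function name stripCombiningMarks is made up for this illustration and is not part of the code in this post.

```javascript
// Hypothetical helper (not from the original post): removes accents by
// canonical decomposition (NFD) instead of a lookup table.
function stripCombiningMarks(text) {
  // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
  // The regex then drops code points U+0300 to U+036F, the
  // Combining Diacritical Marks block.
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripCombiningMarks('Piqué')); // "Pique"
console.log(stripCombiningMarks('Æsop'));  // "Æsop" (the ligature is left alone)
```

The accent is removed, but the Æ ligature has no decomposition and passes through unchanged, which is the gap the mapping table fills.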

The table is then used by the following extensions to the String object:

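String.prototype.latinise = function() {
  // Look up every character outside A-Z, a-z and 0-9;
  // characters with no entry in latin_map are kept unchanged.
  return this.replace(/[^A-Za-z0-9]/g, function(x) { return latin_map[x] || x; });
};

// American English spelling :)
String.prototype.latinize = String.prototype.latinise;

String.prototype.isLatin = function() {
  // True when latinising changes nothing. Loose equality is deliberate:
  // in non-strict code `this` is a String object, not a primitive.
  return this == this.latinise();
};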

Here are some examples:
> "Piqué".latinise();
"Pique"
> "Piqué".isLatin();
false
> "Pique".isLatin();
true
> "Piqué".latinise().isLatin();
true
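
To tie this back to the quiz example, here is a minimal sketch of how a substring check might use latinise. KwizMi's actual engine is not shown in this post, so the function name matchesAnswer and the case-insensitive comparison are assumptions, and latin_map below is a two-entry excerpt standing in for the full table.

```javascript
// Two-entry excerpt of the mapping table, just enough for this example.
var latin_map = { 'É': 'E', 'é': 'e' };

String.prototype.latinise = function() {
  return this.replace(/[^A-Za-z0-9]/g, function(x) { return latin_map[x] || x; });
};

// Hypothetical helper: true if the required text appears anywhere in the
// guess once both are latinised and lower-cased (lower-casing is an assumption).
function matchesAnswer(guess, required) {
  var needle = required.latinise().toLowerCase();
  return guess.latinise().toLowerCase().indexOf(needle) !== -1;
}

console.log(matchesAnswer('Gerard Pique', 'Piqué')); // true
console.log(matchesAnswer('Piqué', 'Piqué'));        // true
console.log(matchesAnswer('Puyol', 'Piqué'));        // false
```

Latinising both sides means the English and Spanish spellings are treated the same way, so neither user is penalised.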

Comments

Thanks! Very good solution!
