bash: convert html entities to UTF-8, but keep existing UTF-8

Question

Just like this question, I need to convert HTML entities (e.g. &amp;amp;) to UTF-8 (&) while leaving existing UTF-8 characters untouched. The difference is that in my case, I need to do this from the bash command line.

I can use a tool like recode and run echo '&amp;amp;' | recode html..utf-8, which converts it to & just fine. However, with UTF-8 characters in the string, as in

echo 'Arabic &amp; ٱلْعَرَبِيَّة' | recode html..utf-8

I get:

Arabic & Ù±ÙÙØ¹ÙØ±ÙØ¨ÙÙÙÙØ©

which, naturally, is not what I need. It should look like this at the end:

Arabic & ٱلْعَرَبِيَّة

Is there a way to do this without a bunch of messy and seemingly endless regex? Thanks

Shawn · Accepted Answer · 2019-11-29T02:36:52.210

3

perl one-liner:

$ echo 'Arabic &amp; ٱلْعَرَبِيَّة' | perl -CS -MHTML::Entities -ne 'print decode_entities($_)' 
Arabic & ٱلْعَرَبِيَّة

Requires the HTML::Entities module, which is part of the larger HTML::Parser bundle. Install through your OS package manager or favorite CPAN client.
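
For repeated use, the one-liner can be wrapped in a shell function. This is a minimal sketch rather than part of the original answer: the name html_unescape is hypothetical, and it reads only from standard input, because -CS applies to the standard streams and not to files named on the command line.

```shell
# Hypothetical helper (name is an assumption); wraps the one-liner above.
html_unescape() {
  # Fail early with a readable message if the module is not installed.
  if ! perl -MHTML::Entities -e 1 2>/dev/null; then
    echo 'html_unescape: Perl module HTML::Entities not found' >&2
    return 1
  fi
  # -CS marks stdin/stdout/stderr as UTF-8, so existing UTF-8 text
  # is decoded and re-encoded unchanged while entities are expanded.
  perl -CS -MHTML::Entities -ne 'print decode_entities($_)'
}

# Usage (file names are placeholders):
#   html_unescape < input.txt > output.txt
```

Redirecting a file into the function, rather than passing its name to perl, keeps the input on stdin where the UTF-8 setting takes effect.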

edited Nov 29 '19 at 02:36

answered Nov 29 '19 at 02:31


Nice! Works out of the box on macOS. – zord Feb 03 '23 at 11:08

score 0 · Answer 2 · answered Apr 04 '23 at 18:22

I had a similar problem when recoding Portuguese text with recode. The problem occurs because recode assumes that the input text is encoded in ISO-8859-1 (Latin-1).
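
The garbled output in the question is consistent with that explanation. As a sketch (using iconv, which is not mentioned in the original and is assumed to be installed; the function name mojibake is hypothetical), reading UTF-8 bytes as ISO-8859-1 and re-encoding them as UTF-8 produces the same kind of damage:

```shell
# Hypothetical helper: misread the input as ISO-8859-1 and re-encode it
# as UTF-8, which is the effect a wrong input-charset assumption has.
mojibake() {
  iconv -f ISO-8859-1 -t UTF-8
}

# \303\251 is the two-byte UTF-8 encoding of "é".
# Each byte is treated as a separate Latin-1 character, so the
# single letter comes out as two characters: Ã©
printf '\303\251\n' | mojibake
```

Each multi-byte UTF-8 character is split into one Latin-1 character per byte, which is why the Arabic text in the question expands into a long run of accented Latin letters.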

To solve the problem, I ran recode twice in sequence.

See this example in Portuguese:

echo 'Isto é uma simulação.' | recode --diacritics UTF-8..HTML | recode HTML..UTF-8;
Isto é uma simulação.

Note that I use --diacritics so that characters like &, <, >, ' are left alone. This is essential to prevent the & character from being converted to &amp;amp;. The documentation isn't clear on this, but you can see it in the source code.

In the first recode command, the letters with diacritics are converted to their corresponding HTML entities:

echo 'Isto é uma simulação.' | recode --diacritics UTF-8..HTML;
Isto &eacute; uma simula&ccedil;&atilde;o.

Note that é was replaced with &amp;eacute; ('e' with acute accent).

The second recode command converts the HTML entities to UTF-8:

echo 'Isto &eacute; uma simula&ccedil;&atilde;o.' | recode HTML..UTF-8;
Isto é uma simulação.

Note that &amp;eacute; was replaced with é.

Your example would look like this:

echo 'Arabic &amp; ٱلْعَرَبِيَّة' | recode --diacritics UTF-8..HTML | recode HTML..UTF-8 
Arabic & ٱلْعَرَبِيَّة
