# What is the best way to remove accents (normalize) in a Python unicode string?

## The Question :

538 people think this question is useful

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
2. remove all the characters whose Unicode type is “diacritic”.

Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

509 people think this answer is useful

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'



300 people think this answer is useful

import unicodedata
def strip_accents(s):
return ''.join(c for c in unicodedata.normalize('NFD', s)
if unicodedata.category(c) != 'Mn')



This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>



The character category “Mn” stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark’s answer (I didn’t think of unicodedata.combining, but it is probably the better solution, because it’s more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not “decoration”.

162 people think this answer is useful

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
only_ascii = nfkd_form.encode('ASCII', 'ignore')
return only_ascii



It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
return u"".join()



unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it’s a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)



49 people think this answer is useful

Actually I work on project compatible python 2.6, 2.7 and 3.4 and I have to create IDs from free user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
"""
Strip accents from input String.

:param text: The input string.
:type text: String.

:returns: The processed String.
:rtype: String.
"""
try:
text = unicode(text, 'utf-8')
except (TypeError, NameError): # unicode is a default on python 3
pass
text = unicodedata.normalize('NFD', text)
text = text.encode('ascii', 'ignore')
text = text.decode("utf-8")
return str(text)

def text_to_id(text):
"""
Convert input text to id.

:param text: The input string.
:type text: String.

:returns: The processed String.
:rtype: String.
"""
text = strip_accents(text.lower())
text = re.sub('[ ]+', '_', text)
text = re.sub('[^0-9a-zA-Z_-]', '', text)
return text



result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'



27 people think this answer is useful

This handles not only accents, but also “strokes” (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
'''
Return the base character of char, by "removing" any
diacritics like accents or curls and strokes and the like.
'''
desc = ud.name(char)
cutoff = desc.find(' WITH ')
if cutoff != -1:
desc = desc[:cutoff]
try:
char = ud.lookup(desc)
except KeyError:
pass  # removing "WITH ..." produced an invalid name
return char



This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don’t think it is very elegant indeed. In fact, it’s more of a hack, as pointed out in comments, since Unicode names are – really just names, they give no guarantee to be consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain ‘WITH’. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

### EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).

15 people think this answer is useful

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba’s comment:

import sys
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join()

with open('test.txt') as f:
for element in row:
print remove_accents(element)



The result:

Montreal
uber
12.89
Mere
Francoise
noel
889



(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

14 people think this answer is useful
'Sef chomutovskych komunistu dostal postou bily prasek'



Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some character (e.g. it turns 'ł' into '', rather than into 'l').

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):