And the award for the platform using a sane default encoding goes to...
Linux
>>> h = open('test.txt')
>>> h.encoding
'UTF-8'

Robin |
08/12/08 - 3:58 pm | #
|
|
Well - the platform encoding is a user setting dependent on locale on Linux (I believe), so Linux is just as likely to have a default encoding of JIS...
Michael Foord |
Homepage |
08/12/08 - 4:26 pm | #
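A minimal sketch of the point above, assuming a Python 3 interpreter where open() without an encoding argument falls back to the locale's preferred encoding. The temporary file path and sample text are made up for illustration. Passing encoding= explicitly makes the result independent of the machine's locale.

```python
import locale
import os
import tempfile

# What open() is likely to fall back to when no encoding is given.
# This depends on the user's locale, so it differs between machines.
default_encoding = locale.getpreferredencoding(False)
print("locale default:", default_encoding)

# Illustrative sample text and a throwaway file path.
text = "na\u00efve caf\u00e9"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Writing and reading with an explicit encoding avoids relying on the locale.
with open(path, "w", encoding="utf-8") as h:
    h.write(text)

with open(path, encoding="utf-8") as h:
    file_encoding = h.encoding
    round_trip = h.read()

print(file_encoding, round_trip == text)
```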
|
|
Don't get me started on the Linux distributions' encoding policies and their "everything is now magically UTF-8, didn't the pixies visit you?" attitude. My filesystems are still ISO-8859-15 and will remain so for some time to come.
It's nice to see an article which doesn't pretend that Python 3.0 magically cures the world's Unicode ills with a wave of the wand - something you'd start to believe after reading various other, more verbose articles on the subject.
Paul Boddie |
08/12/08 - 4:59 pm | #
|
|
Filesystem encoding is *another* whole area of wonder and joy in Python 3. 
Michael Foord |
Homepage |
08/12/08 - 5:08 pm | #
|
|
I agree with PB; Great to see an informative article about py3k written without the aid of rose-tinted specs like many others (nor an air of misery and gloom like the rest of them).
So, what options are there for "guessing" the encoding of a file given an unknown origin? Did I hear that BeautifulSoup implements the heuristics used by Firefox?
Rob C |
08/12/08 - 7:24 pm | #
|
|
Yes, indeed; I was under the impression that Unicode encodings are supposed to be detected. How can we detect the encoding of a file, and why isn't that the default behavior?
rgz |
Homepage |
08/12/09 - 12:25 am | #
|
|
rgz: You *can't* detect the encoding of a file (I assume that's what you mean by Unicode settings). You *can* just assume that it conforms to the default of whatever computer you have in front of you, which is in fact what open('test.txt') does. You have to guess; there are many solutions to this. *It is easy to guess wrong.* And it's dangerous. For example, if you receive a file from someone that contains only 7-bit ASCII (Latin letters, digits, and English punctuation), you might guess that the file is ASCII, UTF-8, or Latin-1 (aka ISO-8859-1) or a few other things (Shift-JIS maybe?). And you'd be half right - if you guessed any of these things, and then tried to read the file, you'd be fine. The danger is, then you edit the file, paste in some curly quotes or an accented résumé, and send it back - and now your guess *matters* because what you've sent back has to match what your recipient is expecting. If you guessed wrong, they will either see corrupt crap or get a decoding error.
The only solution here is to know, or be told, ahead of time what encoding text is in. This will be the case for any foreign byte source, whether it's a network protocol or a disk file. This is always done either explicitly (example: HTTP, when it's done right, tells you in the headers) or implicitly (some protocols insist on a particular encoding for all text; some applications insist on a particular encoding for all the files they save).
And Python 3 will fix none of this. You must program your application to be encoding-aware.
Cory |
08/12/10 - 1:15 am | #
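The scenario above can be sketched in a few lines. This is only an illustration with made-up sample strings: pure-ASCII bytes decode identically under several encodings, but once curly quotes and accents are added, a wrong guess yields either silently corrupted text or a decoding error.

```python
def decode_all(data, encodings=("ascii", "utf-8", "latin-1")):
    """Decode the same bytes under each candidate encoding."""
    return {enc: data.decode(enc) for enc in encodings}

# Pure 7-bit ASCII: every guess gives the same text, so the guess does not matter yet.
plain = b"resume"
results = decode_all(plain)
all_same = len(set(results.values())) == 1

# After editing: curly quotes plus accented letters.
edited = "\u201cr\u00e9sum\u00e9\u201d"

# Sender saved as UTF-8, recipient assumes Latin-1: no error, but corrupt text.
utf8_bytes = edited.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")

# Sender saved as Windows-1252, recipient assumes UTF-8: a decoding error.
cp1252_bytes = edited.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(all_same, mojibake != edited, decode_failed)
```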
|
|
I agree that you *can't* determine the encoding of a file with 100% certainty, as I also agree that it is easy to guess wrong. Unfortunately, you sometimes have *no choice* but to guess.
Some time ago I worked on a wiki-like application that allowed users to edit content source files any which way they liked. On Windows this often ended up meaning buffalo-in-an-encoding-boutique type editors such as Notepad. The edited files still needed to be processed on other systems, typically a Linux server. Without some sort of anti-notepadding, the application would bomb regularly, e.g. every time a Windows user edited content in his favorite editor. And this was the original motivation to develop the decodeh encoding guessing algorithm... it tries to guess the encoding of a string or a file from its contents. Since a given string may decode seemingly correctly in any of a number of encodings, decodeh is necessarily weighted and *opinionated*, meaning it has an order of preference -- this though is configurable.
To bypass the inevitable py3 encoding "guess" when reading a file to a (unicode) str, decodeh always reads files in binary mode and then applies its guessing algorithm to the content to pick the best encoding to use. It can of course be told which "preferred" encoding to try first, if one is known.
The module code and description are at: http://gizmojo.org/code/decodeh/
The same code runs as is on Python 2.4, 2.5, 2.6 *and* 3.0. In particular, given its applicability in a py3 context, I would appreciate any comments you may have about the algorithm's logic.
Mario Ruggier |
Homepage |
09/01/12 - 12:37 am | #
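As a rough illustration of the general idea (read bytes, then try candidate encodings in an order of preference), here is a minimal sketch. It is not decodeh's actual algorithm or API: the names guess_decode and read_guessing and the candidate list are hypothetical, and a real guesser would weigh candidates far more carefully.

```python
# Hypothetical order of preference; "latin-1" comes last because it
# accepts any byte sequence and therefore never fails.
CANDIDATES = ("ascii", "utf-8", "cp1252", "latin-1")

def guess_decode(data, preferred=None, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    order = ([preferred] if preferred else []) + list(candidates)
    for enc in order:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")

def read_guessing(path, preferred=None):
    """Read a file in binary mode, then guess, bypassing open()'s default."""
    with open(path, "rb") as f:
        return guess_decode(f.read(), preferred=preferred)

print(guess_decode(b"abc"))
print(guess_decode("caf\u00e9".encode("utf-8")))
print(guess_decode(b"caf\xe9"))
```

Note that the order matters: a strict encoding such as UTF-8 is tried before permissive single-byte encodings, since the latter accept almost anything and would hide a correct UTF-8 match.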
|
|
I think it's either incredibly stupid or an oversight that different default encodings are used on different Python platforms.
As far as I'm concerned, this is a bug. UTF-8 should be the default encoding on all platforms.
Python 3 is supposed to be clean; that's why we're going through all this trouble, right? So let's fix this.
Seun |
09/06/06 - 8:23 pm | #
|
|