

Note: the title of this post was inspired by this question on Stack Overflow.

This section gives the basic facts and recommendations for importing files with arbitrary encodings on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows. If you are on a deadline and just need to get the job done, this section should be all you need. Additional background and discussion is presented in later sections.

To read a text file with a non-ASCII encoding into R you should a) determine the encoding, b) read the file in such a way that the information is re-encoded into UTF-8, and c) ignore the bug in the data.frame print method on Windows.

Hopefully the encoding is specified in the documentation that accompanied your data. If not, you can guess it using the stri_read_raw and stri_enc_detect functions in the stringi package. For example, I have two versions of a file containing numbers and Japanese characters: japanese_utf8.csv is encoded in UTF-8, and japanese_shiftjis.csv is encoded in SHIFT-JIS. You can ensure that the information is re-encoded to UTF-8 by using the readr package, and we can read these files on any platform (Windows, Linux, Mac) as shown in the sketches below.

On Windows there is a bug in the data.frame print method that causes data.frames with UTF-8 encoded columns to be displayed incorrectly in non-UTF-8 locales. Running the reading example below on Windows therefore produces output that looks terrible but does not actually indicate a problem: the information is encoded correctly, but due to a long-standing bug it is displayed incorrectly.
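Step (a) in practice: the sketch below, assuming the japanese_shiftjis.csv file from the example above, reads the file as raw bytes and asks stringi to guess the encoding. Treat the result as a guess; stri_enc_detect returns candidate encodings ranked by confidence, not a definitive answer.

```r
library(stringi)

# Read the file as raw bytes so that no encoding is assumed yet
raw_bytes <- stri_read_raw("japanese_shiftjis.csv")

# Guess the encoding: returns candidate encodings with confidence scores
stri_enc_detect(raw_bytes)
```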

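Step (b) in practice: a sketch with readr, again assuming the two file names above and that "SHIFT-JIS" is accepted as an encoding name on your system. The locale argument tells read_csv how the bytes on disk are encoded, and the text in the returned data frame is re-encoded to UTF-8 as it is read.

```r
library(readr)

# The UTF-8 file needs no special handling: readr's default locale assumes UTF-8
japanese_utf8 <- read_csv("japanese_utf8.csv")

# For the SHIFT-JIS file, declare the source encoding; the text is converted
# to UTF-8 while reading, so both data frames end up encoded the same way
japanese_shiftjis <- read_csv(
  "japanese_shiftjis.csv",
  locale = locale(encoding = "SHIFT-JIS")
)
```

On Windows, printing either data frame in a non-UTF-8 locale may show mangled characters because of the display bug described above; the underlying values are nevertheless encoded correctly in UTF-8.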