WRDS will be converting all data from Latin1 to UTF-8 on Monday, November 16
The English alphabet, numbers, and basic symbols can be represented in an encoding called Latin1. Many other languages use characters and symbols that are not part of Latin1, and writing systems such as Chinese do not use Latin1 at all. These characters and symbols are part of a much larger encoding system called UTF-8, which also covers every character in Latin1.
Since WRDS' inception, all of our data has been stored in Latin1 encoding. As WRDS becomes much more global in scope and much more text-heavy, the need to move to UTF-8 encoding is apparent. By changing our underlying encoding to UTF-8, we will be able to retain all non-Latin characters, so you will no longer see random, strange-looking characters in places where a legitimate non-Latin character should be present.
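To illustrate the difference between the two encodings, here is a minimal Python sketch using only the standard library. The helper names (fits_latin1, misread_as_latin1) and the sample strings are hypothetical and chosen for illustration; they are not part of any WRDS tool. The sketch shows that some characters cannot be stored in Latin1 at all, and how garbled characters appear when UTF-8 bytes are read as if they were Latin1.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be stored in Latin1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as if they were Latin1."""
    return text.encode("utf-8").decode("latin-1")


accented = "Société Générale"  # hypothetical sample value
cjk = "株式会社"                # hypothetical sample value

print(fits_latin1("WRDS"))          # True
print(fits_latin1(cjk))             # False
print(misread_as_latin1(accented))  # SociÃ©tÃ© GÃ©nÃ©rale

# Plain ASCII text produces identical bytes in both encodings.
print("WRDS".encode("latin-1") == "WRDS".encode("utf-8"))  # True
```

Reading the garbled string back with the matching encoding pair recovers the original text, which is why storing and reading data consistently as UTF-8 avoids the problem.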
What will I need to do?
In the vast majority of cases, you don't have to do anything. The change to UTF-8 encoding should be seamless and unnoticeable. All web queries will be unaffected and will continue to run as expected. In most cases you will not need to change any of your existing code; however, some edge cases may temporarily require an adjustment to your libname statements, as detailed further below.
If you use SAS/Connect, you must connect with the Unicode (UTF-8) version of the SAS client; the default client uses Latin1. On Unix-like systems, this is the sas_u8 command. On Windows, you must start the SAS 9.4 (Unicode Support) application. If you do not use the Unicode version, you will see a warning like this:
Warning: the client session code latin1 is not compatible with the server session encoding UTF-8. Data may not be transmitted correctly.
All data access methods (web query, SAS, Python, API access, running code on WRDS Cloud, etc.) will use UTF-8 encoded data after the migration.
Potential Performance Impacts / Encoding Errors
As mentioned, in most cases, existing code or saved datasets should not break due to the migration. However, there are some edge cases where you may notice a performance impact -- or even an error -- specifically:
- If you have a previously saved dataset in your home directory and you use custom code to join that data to WRDS data, there may be a mismatch between the dataset encodings after the migration to UTF-8, and SAS will not use any indexes to query the data. While the code should still work, you may notice slower performance and a message like this in your log:
NOTE: Data file MYLIB.MYFILE.DATA is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
- Custom libname statements in your code that reference WRDS data directly can be problematic in the same way. For example:
libname complib '/wrds/comp/sasdata/d_na/';
There are a few simple workarounds you can use to allow SAS to read the index and not use CEDA (Cross Environment Data Access), which degrades performance:
- Option 1: Add "inencoding=asciiany" to your custom libname statements. In your SAS code where you specify the library for your saved data, or if you have custom libname statements referencing WRDS data, simply add the option "inencoding=asciiany" to the end of the statements, for example:
libname mylib '~/' inencoding=asciiany;
libname complib '/wrds/comp/sasdata/d_na/' inencoding=asciiany;
- Option 2: Change the recorded encoding of your saved dataset to UTF-8 so it matches the WRDS data encoding. The correctencoding option updates the encoding recorded in the dataset's header; it does not convert the stored characters. You can apply it with the following code:
proc datasets nolist library=mylib;
modify mydata / correctencoding=utf8;
quit;
All new datasets created after the migration will be in UTF-8 encoding by default.
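Neither workaround above converts the characters already stored in a saved dataset, so both are safest when the text in that dataset is plain ASCII, whose bytes are identical in Latin1 and UTF-8. The following Python sketch shows one way to check a list of text values for non-ASCII characters. It assumes you have already exported the values of interest into Python; the function name and sample values are hypothetical.

```python
def non_ascii_values(values):
    """Return the values that contain at least one non-ASCII character."""
    return [v for v in values if not v.isascii()]


# Hypothetical sample of company names from a saved dataset.
names = ["APPLE INC", "NESTLÉ SA", "TOYOTA MOTOR CORP"]

flagged = non_ascii_values(names)
print(flagged)  # ['NESTLÉ SA']
```

If the check flags any values, review those rows after applying either workaround to confirm the characters still display correctly.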