Stata jobs on the WRDS Cloud can be run either interactively or in batch mode:
Interactive jobs are sessions where you input commands in serial within an interactive shell, receiving a response to each command after you enter it. Interactive sessions are the best choice for quick tests, for accessing and working with straightforward data segments, or when working on a program and trying out commands as you write them.
Batch jobs are longer programs that might need to perform several steps to generate a result. Batch jobs are the best choice when running multi-step jobs that may run over the course of several hours or days.
Jobs on the WRDS Cloud are managed by a process called Grid Engine, which accepts job submissions and distributes them to the least-loaded compute system available. Grid Engine is very powerful and offers many features for fine-tuning your jobs and their performance, and also allows for robust job management. This document will include just enough to get you started submitting your Stata jobs to the WRDS Cloud Grid Engine, but a greater degree of detail is available for the curious in the WRDS Cloud Manual.
Both interactive jobs and batch jobs are discussed in detail below.
Three Versions of Stata
Stata is provided in three versions at WRDS:
stata - the default version of Stata
stata-mp - for multiprocessor computers, ideal for running on the WRDS Cloud
stata-se - for large databases, such as the many large databases housed on the WRDS platform
For general practice, WRDS recommends the stata-mp version for use on the WRDS Cloud; however, there are occasions where the stata or stata-se version may serve you better. For more information on the different versions, please consult Stata's Which Stata is Right for Me document.
This document will use stata-mp in all of its examples.
WRDS Cloud: Preparing your Environment
The first step in establishing a connection to WRDS data on the WRDS Cloud is setting up two ODBC access files in your WRDS Cloud home directory. This step only needs to be done once.
These files contain the ODBC connection parameters -- such as server, username, and password -- so that you do not need to enter them each time you wish to connect to WRDS within Stata.
To create your ODBC access files, first SSH to the WRDS Cloud (if you're new to the WRDS Cloud or need a refresher, consult the WRDS Cloud Manual on how to do this).
Once connected, we first need to create the odbc directory in your home directory to house these ODBC access files:
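The directory can be created with a single command:

```shell
mkdir ~/odbc
```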
Next, in this new directory, we create the first ODBC access file: odbc.ini.
You can use any text editor -- such as nano or vi -- to create this file. The contents of this file should match the following:
[ODBC Data Sources]
wrds-postgres = PostgreSQL

[wrds-postgres]
Driver = PostgreSQL
Description = Connect to WRDS on the WRDS Cloud
Database = wrds
Username = my_wrds_username
Password = my_wrds_password
Servername = wrds-pgdata-h.wharton.private
Port = 9737
SSLmode = require
Where my_wrds_username is your WRDS username, and my_wrds_password is your WRDS password. Be sure to place this file within the odbc directory you just created. The file's location should therefore be: ~/odbc/odbc.ini.
As this file contains your WRDS username and password in cleartext, it is very important that you restrict permissions on this file using the following command:
chmod 600 ~/odbc/odbc.ini
This command sets the permissions on this file to be read/write for your user account only.
The second file we need to create will be named odbc.csh and should also reside in this same directory. The file's location should therefore be: ~/odbc/odbc.csh.
You can use any text editor -- such as nano or vi -- to create this file. The contents of this file should match the following:
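A minimal sketch of what odbc.csh might contain, in tcsh syntax: the ODBCINI path points to the odbc.ini file created above, while the ODBCINST path shown here is an assumption, as the system-wide odbcinst.ini may live elsewhere on the WRDS Cloud:

```shell
# Point unixODBC at the user DSN file created above
setenv ODBCINI ~/odbc/odbc.ini
# Assumed location of the system driver registry; verify on the WRDS Cloud
setenv ODBCINST /usr/local/etc/odbcinst.ini
```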
NOTE: The default shell on the WRDS Cloud is tcsh, which uses the above syntax. If you use the bash shell instead, create an odbc.sh file and use the export command to set the same values using proper bash syntax.
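For reference, a bash odbc.sh equivalent might look like the following sketch (same caveat: the ODBCINST path is an assumption to verify on the WRDS Cloud):

```shell
# bash equivalent of odbc.csh: export the same two ODBC variables
export ODBCINI=$HOME/odbc/odbc.ini
# Assumed odbcinst.ini location; verify on the WRDS Cloud
export ODBCINST=/usr/local/etc/odbcinst.ini
```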
With these ODBC access files in place, your environment is fully configured to be able to connect to WRDS data from within Stata, which is covered in the next section.
WRDS Cloud: Interactive Jobs
Interactive Stata sessions allow you to execute Stata code in serial within a shell, receiving a response to each command as you enter it. Many users choose to work in an interactive session to formulate the various data steps that accomplish a given task, before then building a batch job from these data steps and executing the job.
Interactive Stata sessions are also a good way to ensure that the ODBC files we just created are correct, so let's start with a Stata interactive session on the WRDS Cloud.
In order to run an interactive Stata session, you will need to schedule an interactive job with the WRDS Cloud Grid Engine software. The following example shows such a connection, submission, and interactive Stata session. Following the example, you'll find an explanation of the steps.
[my_user@my_laptop ~]$ ssh wrds_user@wrds-cloud.wharton.upenn.edu
[wrds_user@wrds-cloud-login1-h ~]$ qrsh
[wrds_user@wrds-sas5-h ~]$ source ~/odbc/odbc.csh
[wrds_user@wrds-sas5-h ~]$ stata-mp
. set odbcmgr unixodbc
. odbc list
. odbc load, exec("select distinct frname from wrds_lib_internal.friendly_schema_mapping;") dsn("wrds-postgres")
. list
. clear
. odbc load, exec("select * from djones.djdaily") dsn("wrds-postgres")
. list
. clear
. exit
[wrds_user@wrds-sas5-h ~]$ logout
A description of the above WRDS Cloud SSH connection and Stata interactive session line-by-line:
From my_laptop I make an SSH connection to wrds-cloud.wharton.upenn.edu with my WRDS username wrds_user.
I enter my WRDS password when prompted.
Once logged into the WRDS Cloud, I submit an interactive job to the WRDS Cloud Grid Engine with qrsh.
I am assigned to the interactive compute node wrds-sas5-h and given a prompt. I source the odbc.csh file I created above, which sets which ODBC files to use within Stata.
I start an interactive Stata session with stata-mp.
Once Stata starts up, I configure Stata to use the UnixODBC driver, which references the ODBC access files we've created.
I use the odbc list command to show that our ODBC DSN is found as expected. You should see an entry for wrds-postgres.
I query for all available WRDS data products by vendor with the odbc load command. Note that this will show you the friendly names for all WRDS datasets; you will still need a valid, active subscription to the data in order to access it.
I use the list command to show me the results of the above.
I use the clear command to reset the current active table in memory.
I issue another odbc load query, this time selecting everything in the djones.djdaily dataset.
I use the list command to show me the results of the above.
I use the clear command to reset the current active table in memory.
I use the exit command to terminate my Stata session.
I am returned to my prompt at the interactive compute node wrds-sas5-h. I issue the logout command.
I am returned to the WRDS Cloud head node wrds-cloud-login1-h, and am ready to issue another job. I could also have remained on wrds-sas5-h if I wanted to submit additional interactive jobs.
Instructions for accessing WRDS data once connected are given in the Accessing WRDS Data From Stata section of the document.
WRDS Cloud: Batch Jobs
Batch jobs on the WRDS Cloud allow for long-running, multi-step program execution, and are ideal for more in-depth computational analysis of data. Batch jobs are submitted on the WRDS Cloud to a central Grid Engine, which handles the scheduling, resource management, and execution of the submitted job. For more information on the WRDS Cloud and its Grid Engine, please consult the WRDS Cloud Manual.
A Stata batch job consists of two files:
The Stata program itself, with file extension .do, that contains the Stata code to be executed
A wrapper shell script, with file extension .sh, that submits the Stata program to the Grid Engine
A Stata batch job relies on the presence of two ODBC access files to connect to WRDS, just like an interactive job does. If you haven't already, be sure you've created these files as described in the WRDS Cloud: Preparing Your Environment section above.
Because we have already defined the connection to WRDS in our ODBC access files, we can simply refer to the connection by its established DSN, wrds-postgres, anywhere in our Stata program. To demonstrate, let's create a Stata program for this batch submission that returns the Dow Jones Index by date. In your WRDS Cloud home directory, create a program named myProgram.do with the following code:
set odbcmgr unixodbc
odbc load, exec("select * from djones.djdaily") dsn("wrds-postgres")
save my_output.dta, replace
Configures Stata to use the correct ODBC driver
Makes an ODBC connection to WRDS and issues a SQL query against djones.djdaily.
Saves the results as my_output.dta, replacing the file if it already exists.
Stata jobs cannot be submitted directly to Grid Engine on the WRDS Cloud, but must instead be submitted by a wrapper script: a simple shell script that tells Grid Engine how to call the Stata program, and allows the opportunity to provide additional job-control parameters if desired. Recall that we named our program above myProgram.do. The following example is the simplest possible wrapper script to call this program:
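A sketch of such a wrapper script, matching the line-by-line description that follows; the email address is a placeholder, and the ODBCINST path is an assumption that should be verified on the WRDS Cloud:

```shell
#!/bin/bash
#$ -cwd
#$ -m abe
#$ -M my_email@example.edu

# Timestamp just before the Stata program starts
echo "Starting job at $(date)"
# Assumed odbcinst.ini location; verify on the WRDS Cloud
export ODBCINST=/usr/local/etc/odbcinst.ini
export ODBCINI=$HOME/odbc/odbc.ini
# Run the program in Stata batch mode
stata-mp -b do myProgram.do
# Timestamp just after the Stata program finishes
echo "Finished job at $(date)"
```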
Tells Grid Engine to look in the current working directory (cwd) for any referenced files, and to store output files in that same directory as well.
Instructs Grid Engine to send a notification email when the job begins, when it ends, and if it aborts due to an error.
Specifies the address to send the above emails to.
Uses the command echo to send a bit of text and the current timestamp to an output file just before starting my Stata program.
Sets the ODBCINST environment variable for the job session, needed for Stata's odbcmgr.
Sets the ODBCINI environment variable for the job session, needed for Stata's odbcmgr.
Invokes stata-mp with the -b flag (batch mode) to run the myProgram.do program.
Uses the command echo to send a bit of text and the current timestamp to an output file just after finishing my Stata program.
WRDS Cloud: Submitting a Batch Job
Once you have both the Stata program and its wrapper script created, the job can be submitted via qsub. Let's say I named my wrapper script myProgram.sh. My qsub command would simply be:
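Assuming the wrapper script was saved as myProgram.sh in the current directory:

```shell
qsub myProgram.sh
```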
The WRDS Cloud Grid Engine will then verify your job submission and schedule it to run on the currently least-utilized compute node in the cluster. When your job completes, all output files will be written to your current working directory (as described in the next section).
To view the status of a currently-running Stata job on the WRDS Cloud, use the qstat command. If no results are given, you have no jobs running. Please consult the WRDS Cloud Manual for more information on working with the Grid Engine on the WRDS Cloud, including job monitoring, queue selection, and much more.
WRDS Cloud: Batch Job Output
When your batch job completes, several output files will be created containing the results of your job. Since we used the #$ -cwd parameter in our wrapper script, these files will be created in the same directory as the wrapper script itself. Assuming the job you ran was called 'myProgram.sh' which calls the Stata program 'myProgram.do', these output files are:
my_output.dta - The output file of your Stata program, containing the collection of observations that your code generated. These are your results.
myProgram.log - The Stata log generated from the run of your Stata program.
myProgram.sh.o####### - The Grid Engine file that contains all output from your myProgram.sh wrapper (such as the echo commands).
myProgram.sh.e####### - The Grid Engine file that contains all errors from your myProgram.sh wrapper. This will be empty if your program's execution did not experience any errors.
Accessing WRDS Data From Stata
Whether you are accessing data on the WRDS Cloud via an interactive or batch job, the method for accessing WRDS data is the same.
Since we already created the requisite ODBC access files to configure our connection parameters (and set the two ODBC environment variables either via odbc.csh interactively or directly in our wrapper script via batch mode), we merely need to issue the set odbcmgr command once at the beginning of our program:
set odbcmgr unixodbc
Having done this, we can then access WRDS data within our Stata program at any time with the odbc load command, using the following syntax:
odbc load, exec("select X from Y;") dsn("wrds-postgres")
Where select X from Y could be any manner of SQL query you need to run for your research.
Let's use this syntax to get an overview of the data available to us in the next section.
Querying Metadata - Examining the Structure of Data
When working with WRDS data, it is often useful to examine the structure of the dataset before focusing on the data itself. WRDS data is organized by vendor (e.g. crsp), and each vendor's collection is referred to as a library. Each library contains a number of component tables or datasets (e.g. dsf), which contain the actual data in tabular format, with column headers called variables.
We can analyze the structure of the data available to us through its metadata by querying the special information_schema table, as outlined in the following steps:
List all available libraries at WRDS
Select a library to work with, and list all available datasets within that library
Select a dataset, and list all available variables (column headers) within that dataset.
NOTE: When working with the PostgreSQL information_schema table, libraries and datasets that you provide are case-sensitive, and must be lowercase. This applies to both information_schema.tables and information_schema.columns.
Let's run through each of these three steps to examine the structure of our data.
1. To determine the libraries available at WRDS:
odbc load, exec("select distinct frname from wrds_lib_internal.friendly_schema_mapping;") dsn("wrds-postgres")
This will list all libraries available at WRDS. Though all libraries will be shown, your institution must still have a valid, current subscription for a library in order to access it via Stata, just as with SAS or any other supported programming language at WRDS. If you receive the error Error while executing the query SQLSTATE=42501 or similar, it is because you do not have access to the underlying data you are attempting to query.
2. To determine the datasets within a given library:
odbc load, exec("select distinct table_name from information_schema.columns where table_schema='library' order by table_name;") dsn("wrds-postgres")
Where 'library' is a library such as crsp or comp, as returned from #1 above.
3. To determine the column headers (variables) within a given dataset:
odbc load, exec("select column_name from information_schema.columns where table_schema='library' and table_name='dataset' order by column_name;") dsn("wrds-postgres")
Where 'library' is a library such as crsp, as returned from #1 above, and 'dataset' is a component dataset within that library, such as dsf, as returned from query #2 above. Remember that both the library and the dataset are case-sensitive!
The above queries help to establish the structure of the data we're trying to access, ascertaining the exact dataset and variable names we'll need to use when we go to access the data itself in the next section. Alternatively, a comprehensive list of all WRDS libraries is available via the WRDS Dataset List. This online resource provides a listing of each library, its component datasets and variables, as well as a tabular database preview feature, and is helpful in establishing the structure of the data you're looking for in an easy, web-friendly manner.
NOTE: If you are familiar with the previous SAS-based method of connection, you used the SAS dictionary table for this. That table required exclusively uppercase libnames and memnames, while this PostgreSQL method requires exclusively lowercase libraries and datasets. Be sure to update your code!
Now that we know how to query the metadata and understand the structure of the data, we can construct a data query to start gathering the actual data itself. Unlike the above metadata queries, where table_schema and table_name are used separately to search libraries and datasets respectively, data queries use the two together to identify the data location. So, for example, a data query for the dataset dsf within the library crsp would refer to it as crsp.dsf.
The following demonstrates this example:
odbc load, exec("select * from crsp.dsf limit 10") dsn("wrds-postgres")
IMPORTANT: Setting limit 10 artificially limits the results to 10 observations. The table crsp.dsf is CRSP's Daily Stock File and tracks thousands of stocks over almost a hundred years: a query that returns all observations would take a very long time. In reality, most queries are far more specific, as shown in some of the examples below. Generally, until you are sure that you're getting exactly the data you're looking for, you should limit the number of observations returned to a sensible maximum such as 10 or 100 (by setting limit to such a value). Remember, much of the data provided at WRDS is huge! It is highly recommended to develop your code using such an observation limit, then simply remove that limit (by removing the limit value) when you are ready to run your final code.
NOTE: Unlike querying the PostgreSQL information_schema table, querying tables themselves (such as crsp.dsf) does not require that you adhere to any specific case. However, to keep things the same across the board, WRDS recommends always using lowercase for referencing libraries and their datasets.
Querying Data: Searching by Variable
Datasets often contain a large number of variables (column headers) such as date, ticker, cusip, price, or a host of other values depending on the dataset. Limiting the number of variables returned in queries makes the resulting data neater and easier to read, speeds up query execution, and reduces the size of the returned data. Once we have queried the metadata and understand the available variables, we probably want to specify only those we are interested in. We can do this by specifying each variable to query explicitly in our SELECT statement, instead of selecting all (via the asterisk * which matches all variables). Here is an example where we specify the cusip, permno, date, and two price variables from the CRSP Daily Stock File:
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf limit 10") dsn("wrds-postgres")
We can also further refine our query by selecting data that meets a certain criteria for one or more of these variables. Here we limit the returned results to data which has an askhi value above 2500 and a bidlo value under 2000:
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf where askhi > 2500 and bidlo < 2000 limit 10") dsn("wrds-postgres")
It turns out that (as of 2017) there are only 27 records that match the above criteria. Try running the above query again, this time removing the limit restriction so that the number of returned observations is unlimited. Despite searching through millions of stock trades over almost one hundred years, the above query only takes about fifteen seconds because we have limited both the variables queried and the specific range of those variables.
Querying Data: Searching by Date
One of the more common methods of querying data is by date. WRDS PostgreSQL data uses the date notation convention of yyyy-mm-dd, so January 4th, 2013, the first trading day of 2013, would be formatted as 2013-01-04. Dates in your SQL queries must be surrounded by single quotes. Here's an example:
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf where date = '2013-01-04'") dsn("wrds-postgres")
Searching by date ranges is much the same, as in the following example:
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf where date between '2013-01-07' and '2013-01-08'") dsn("wrds-postgres")
Note how we can leave the limit statement off of the above queries: due to the specificity of our query -- limiting the search to both select variables and specific dates -- we can execute this query quite quickly despite the relative size of the data as a whole. Of course it is advisable to do this only after first showing -- with the limit statement -- that such a query is indeed manageable.
NOTE: If you are familiar with the previous SAS-based method of connection, you may have had to use SAS datetime conventions in the past (where the above date would look like this: '04jan2013'd ). This is not required with the PostgreSQL connection method. Date formats in this approach are always yyyy-mm-dd.
Querying Data: Searching by Multiple Criteria
Any number of qualifiers can be strung together in your SQL statement. Say we wanted to combine the two above sections, and create a data query that restricted results based on date range as well as askhi and bidlo values. The following accomplishes this.
NOTE: this query takes some time to run.
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf where date between '1960-01-01' and '1980-01-01' and askhi > 2500 and bidlo < 2000") dsn("wrds-postgres")
More examples are given in the Example Workflows and Programs section.
Custom Stata Modules
WRDS supports installing your own Stata modules, should you wish to expand your Stata working environment on the WRDS Cloud. This section details how to install and use your own selection of Stata modules on the WRDS Cloud.
First, you'll need to set up your WRDS Cloud home directory to support the installation of your own Stata modules. This begins with the creation of a Stata profile file named profile.do located in your home directory. Create this file with your favorite text editor with the following line:
sysdir set PERSONAL "~/stata/"
This configures Stata, each time it launches, to look in the ~/stata directory for any additional modules you might have added.
Start an interactive session on a compute node with qrsh, and run stata-mp. Use the command personal to check and see if your profile.do was properly read:
your personal ado-directory is ~/stata/
The above is what we expect to see.
NOTE: Stata only looks for profile.do files in the directory it was started from. That means you need to issue the stata-mp command directly from your home directory (~), not from any other location. If you do not see the above output, but created the profile.do file as instructed above, try changing directory to your home directory (cd ~) to be sure you are where you're supposed to be.
Of course, having configured this, we'll also need to create the ~/stata directory, which we can do with the following command.
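The directory is created the same way as the odbc directory earlier:

```shell
mkdir ~/stata
```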
Now that our environment is set up to support additional Stata modules, we are ready to install our first. Many widely-used Stata modules are available at REPEC. We'll use the popular unique module, available at this location, in our example.
IMPORTANT: The WRDS Cloud compute nodes (wrds-sas1 through wrds-sas36, etc.) are not internet-accessible. You must use the two WRDS Cloud head nodes (wrds-cloud-login1 & 2) to download your modules initially. Once downloaded, however, you can easily use them on the compute nodes as normal.
1. Browse REPEC for Stata modules. We'll use unique.
2. On one of the WRDS Cloud head nodes, navigate to your ~/stata directory, and use wget to download the package (and its help file) that you have selected:
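For the unique module, the downloads might look like the following sketch; the exact URLs are an assumption based on the usual Boston College RePEc archive layout, so confirm them on the module's page before downloading:

```shell
cd ~/stata
# Assumed archive URLs for the module and its help file; verify before use
wget http://fmwww.bc.edu/repec/bocode/u/unique.ado
wget http://fmwww.bc.edu/repec/bocode/u/unique.sthlp
```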
That's it. As long as you remember to download on one of the head nodes, and to start Stata from the same directory that contains your profile.do file, you should have no problems using custom Stata modules at WRDS.
Example Workflows and Programs
There are many ways to perform analytical research against data, but most involve the two-step procedure explored above:
Examine the metadata to determine what data is available and by what names, then
Construct a data query to gather precisely the data needed for the research task
The following set of queries represents a sample Stata workflow using WRDS data, mimicking the workflow of a researcher who might be new to a given data library. The commands in this workflow could be run interactively or submitted via a batch job (or jobs).
NOTE: This workflow section is a series of stand-alone queries, issued one after another. In real-world Stata programming, you would send the results of such queries to some additional statistical analysis in your program, but for these examples, with queries issued in serial, you will need to issue the clear command between odbc load commands; otherwise you will receive the error: you must start with an empty dataset.
1. What data libraries can I work with?
odbc load, exec("select distinct frname from wrds_lib_internal.friendly_schema_mapping") dsn("wrds-postgres")
2. OK, I want to work with crsp. What datasets are offered in crsp?
odbc load, exec("select distinct table_name from information_schema.columns where table_schema='crsp' order by table_name") dsn("wrds-postgres")
3. OK great, the Daily Stock File (dsf) is what I want to research. What does the data mean? What are the variables that describe each column of the data?
odbc load, exec("select column_name from information_schema.columns where table_schema='crsp' and table_name='dsf' order by column_name") dsn("wrds-postgres")
4. Perfect, I only care about cusip, permno, date, bidlo, and askhi. Let's see what the data looks like if I grab the first hundred rows filtering for only these variables (but limiting to 100 results):
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf limit 100") dsn("wrds-postgres")
5. Alright, this gives me a good idea of what my data looks like, but will probably return an awful lot of data if I were to remove the artificial limit. Let's filter it by a few permno values I'm interested in and limit to one day to get an idea of what such a snapshot of the market might look like:
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf where permno in (14593, 90319, 12490, 17778) and date='2013-01-04' limit 100") dsn("wrds-postgres")
6. That's more like it. I'm interested in high askhi values; let's get a list of dates where these permnos posted an Ask Price over $2,000 between 2010 and 2013 (now no longer limiting the number of returned rows, as we're getting pretty specific):
odbc load, exec("select cusip,permno,date,bidlo,askhi from crsp.dsf where permno in (14593, 90319, 12490, 17778) and date between '2010-01-01' and '2013-12-31' and askhi > 2000") dsn("wrds-postgres")
7. Looks like only one of my permnos posted such a high Ask price during this window. I wonder how many other companies have similar Ask histories. Let's open the search to all permnos that have ever posted an Ask Price over $2,000 in any date range (I use distinct here to only return one entry per matching permno). Since I'm querying against all permnos and the entire date range of the dataset, I'm prepared for this query to take a little longer:
odbc load, exec("select distinct permno from crsp.dsf where askhi > 2000") dsn("wrds-postgres")
8. Let's retrieve all dates for which an Ask Price over $2000 was posted, along with the permnos that posted them. This will give me a list of dates that match, with an additional entry for that date if additional permnos match as well. By 2011 you can see that we're down to two contenders:
odbc load, exec("select distinct date,permno from crsp.dsf where askhi > 2000 order by date") dsn("wrds-postgres")
9. Just for fun, what was the highest Ask ever posted (searching only through Asks over $2000), on what date was it posted, and which permno posted it? Note how I use limit 1 to speed up my search, since I only want the top value:
odbc load, exec("select permno,askhi,date from crsp.dsf where askhi > 2000 order by askhi desc limit 1") dsn("wrds-postgres")
I think I'm starting to get the picture.
This is just an example of how one might approach an analytics task in Stata. It begins by gathering metadata information from the wrds_lib_internal.friendly_schema_mapping and information_schema tables to learn more about the data available -- the libraries, datasets, and variables -- and then using that information to select meaningful observations from the data itself.
A common next step after the above exploration is to then write a batch program that uses the above one-off queries together. An example might be a program that uses a loop to iterate over each permno that has ever posted an ask price over $2000 and to calculate how long a date range it was able to maintain that height. Or perhaps certain dates were more prolific than others -- tallying the number of high asks per date might be informative.
Joining Data from Separate Datasets
Data from separate datasets can be joined and analyzed together. In this example, we will join Compustat Fundamentals (i.e. comp.funda) with pricing data (i.e. comp.secm), querying for total assets and liabilities mixed with monthly close price and shares outstanding.
odbc load, exec("select a.gvkey, a.datadate, a.tic, a.conm, a.at, a.lt, b.prccm, b.cshoq from comp.funda a join comp.secm b on a.gvkey = b.gvkey and a.datadate = b.datadate where a.tic = 'IBM' and a.datafmt = 'STD' and a.consol = 'C' and a.indfmt = 'INDL'") dsn("wrds-postgres")
The resulting dataset contains data from both source datasets joined by the common gvkey identifier and date. Note that running joins between large sets of data can require vast amounts of memory and execution time. Be careful to limit the scope of your queries to reasonable sizes when performing joins. This query reports on a single company (IBM) with a data point frequency of one year, yielding 55 observations (as of 2017) with a very fast query execution time.
NOTE: Your institution would need to subscribe to both datasets for the above to work.
Stata at WRDS was designed to be straightforward yet powerful, and was implemented and documented with feedback from users in mind. If you have ideas on improving the Stata implementation at WRDS, or recommendations regarding this documentation, please contact WRDS Support. We hope you find using Stata at WRDS an enjoyable experience.