How to access WRDS data directly in your Python program, either on the WRDS Cloud or remotely from your workstation.
Introduction to Python at WRDS
Python is a widely-used high-level programming language that is both powerful and easy to use, and is proving to a major player in large-scale data analytics applications.
WRDS provides a direct interface for Python access, alowing native querying of WRDS data right within your Python program. All WRDS data is stored in a PostgreSQL database, and is available through Python through our custom python module, wrds.
Here is an example of a simple query against the Dow Jones Averages & Total Return Indexes using the wrds python module:
db = wrds.Connection()
db.raw_sql('SELECT date,dji FROM djones.djdaily')
Full usage and many examples are given in this document.
There are two ways to access WRDS data with Python:
Python on the WRDS Cloud - You SSH to the WRDS Cloud and run batch or interactive Python jobs on our high-performance computing cluster. This method benefits from our strong computing resources with direct access to the data, high available memory and CPU count, and powerful job management. This method is appropriate for high data-throughput jobs, long-running jobs (longer than a few hours), or working with large datasets such as TAQ.
Python from your Workstation - You connect remotely to WRDS from a Python IDE or Python directly from your workstation and download the data you need inline. This method is appropriate for shorter jobs or smaller data samples. Ideally you would have a fairly powerful computer and a fast (and reliable!) internet connection to use this method.
Each is described below in its own section. All of the example workflows and programs provided in this document may be used for either access method.
PostgreSQL vs SAS
In the past, WRDS data was only available as SAS flat files. In this format, these files were accessible to non-SAS clients (such as Python) only via a JDBC interface. While this generally worked well, it proved problematic for memory-intensive programs, and proved impossible for extremely large datasets such as TAQ.
WRDS has now made all of our data additionally available as a series of PostgreSQL databases, which has significantly opened up research options for other, non-SAS programming languages such as Python, R, and MATLAB. Java and JDBC drivers are no longer required for Python connectivity, as a native Python wrds module exists to facilitate the connection. Compared to the previous JDBC connection method, the native method is faster, more robust, and capable of handling far larger queries.
Please let us know what you think of the new connection method, and how you're using it in your research!
Python 3 vs Python 2
The WRDS Cloud supports both Python 3 and Python 2; you may use your preference when working with WRDS data. This applies both to working with interactive and batch jobs on the WRDS Cloud, as well as remotely connecting from your workstation using Rodeo or similar.
However, while both are supported, WRDS always recommends the latest version. If you have a choice in your programming, you should use Python 3. If you are just starting out in Python, you should absolutely use Python 3.
All examples in the document are given in Python 3 code.
Across the board at WRDS, Python 3 may be accessed as python3 and Python 2 may be accessed as python2. More about this will be discussed later in this document.
The following resources may be helpful to you as you begin using Python at WRDS.
WRDS Cloud Manual - If you intend to run batch or interactive Python jobs on the WRDS Cloud, this document contains everything you need: from SSH access to job management to quota reporting.
Rodeo Docs - If you intend to run R via Rodeo on your workstation and connect remotely to WRDS, it might be helpful to be familiar with Rodeo itself, though this document will provide enough information to get you started.
Python for Beginners - If you are new to Python or could use a refresher, python.org provides a getting started guide that should point you in the right direction.
Additionally, if you are planning on running interactive or batch jobs on the WRDS Cloud, you should have an understanding of basic UNIX commands and how to use a UNIX-like text editor, such as: nano, emacs, and vi, which are all available on the WRDS Cloud.
Python jobs on the WRDS Cloud can be run either interactively or in batch mode:
Interactive jobs are sessions where you input commands in serial within an interactive shell, receiving a response to each command after you enter it. Interactive sessions are the best choice for quick tests, for accessing and working with straightforward data segments, or when working on a program and trying out commands as you write them.
Batch jobs are longer programs that might need to perform several steps to generate a result. Batch jobs are the best choice when running multi-step jobs that may run over the course of several hours or days.
Jobs on the WRDS Cloud are managed by a process called Grid Engine, which accepts job submissions and distributes them to the least-loaded compute system available. Grid Engine is very powerful and offers many features for fine-tuning your jobs and thier performance, and also allows for robust job management. This document will include just enough to get you started submitting your Python jobs to the WRDS Cloud Grid Engine, but a far greater degree of detail is available for the curious in the WRDS Cloud Manual.
Both interactive jobs and batch jobs are discussed in detail below.
WRDS Cloud: Preparing your Environment
The first step to connectiing to WRDS data from within Python on the WRDS Cloud is setting up your .pgpass file in your WRDS Cloud home directory. This step only needs to be done once.
The .pgpass file includes your WRDS username and password so that you do not need to enter them each time you wish to connect to WRDS within Python.
To create your .pgpass file, SSH to the WRDS Cloud (if you're new to the WRDS Cloud or need a refresher, consult the WRDS Cloud Manual on how to do this), create a new file named .pgpass directly in your WRDS Cloud home directory (you can use any editor to do this, such as nano or vi), and place the following configuration in that file:
Where 'your_username' is your WRDS Username and 'your_password' is your WRDS Password.
Next, secure your .pgpass file with the following command:
chmod 600 ~/.pgpass
Failure to secure the .pgpass file in the above manner will cause connection attempts to fail with a warning message indicating the need to secure this file.
WRDS Cloud: Interactive Jobs
Interactive Python sessions allow you to execute Python code in serial within a shell, receiving a response to each command as you enter it. Many users chose to work in an interactive session to formulate the various data steps that accomplish a given task, before then building a batch job from these data steps and executing the job.
Interactive Python sessions are also a good way to ensure that the .pgpass file we just created is working as expected, so let's start with a Python interactive session on the WRDS Cloud.
In order to run an interactive Python session, you will need to schedule an interactive job with the WRDS Cloud Grid Engine software. The following example shows such a connection, submission, and interactive Python session. Following the example, you'll find an explanation of the steps.
[my_user@my_laptop ~]$ ssh email@example.com
[wrds_user@wrds-cloud-login2-h ~]$ qrsh
[wrds_user@wrds-sas6-h ~]$ ipython3
In : import wrds
In : db = wrds.Connection()
In : db.raw_sql("SELECT date,dji FROM djones.djdaily")
In : quit
[wrds_user@wrds-sas6-h ~]$ logout
A description of the above WRDS Cloud SSH connection and Python interactive session line-by-line:
From my_laptop I make an SSH connection to wrds-cloud.wharton.upenn.edu with my WRDS username wrds_user.
I enter my WRDS password when prompted.
Once logged into the WRDS Cloud, I submit an interactive job to the WRDS Cloud Grid Engine with qrsh.
I am assigned to the interactive compute node wrds-sas6-h and given a prompt. I start an interactive Python session using ipython3. If I wanted to work in Python 2 instead of Python 3 I could instead enter ipython2.
Once Python starts up, I import the wrds Python module, which provides the necessary Python classes and methods to connect to WRDS and access data.
I use wrds.Connection() to initiate a connection to WRDS, which uses the username and password defined in my .pgpass file, and saves it as the db variable.
After I am connected ("Done"), I use the db.raw_sql() function to run a simple SQL query against the Dow Jones dataset and recieve my answer. Other methods of the the wrds module will be discussed later in this document.
When I am done with my interactive Python session, I quit.
I am returned to my prompt at the interactive compute node wrds-sas6-h. I issue the logout command.
I am returned to the WRDS Cloud head node wrds-cloud-login1-h, and am ready to issue another job.
Instructions for accessing WRDS data once connected are given in the Accessing Data From Python section of the document.
WRDS Cloud: Batch Jobs
Batch jobs on the WRDS Cloud allow for long-running, multi-step program execution, and are ideal for more in-depth computational analysis of data. Batch jobs are submitted on the WRDS Cloud to a central Grid Engine, which handles the scheduling, resource management, and execution of the submitted job. For more information on the WRDS Cloud and its Grid Engine, please consult the WRDS Cloud Manual.
A Python batch job is comprised of two files:
The Python program itself, with file extension .py, that contans the Python code to be executed.
A wrapper shell script, with file extension .sh, that submits the Python program to the Grid Engine
A Python batch job relies on the presence of a .pgpass file to connect to WRDS, just like an interactive job does. If you haven't already, be sure you've created this file as described in the WRDS Cloud: Preparing Your Environment section above.
The wrds module does most of the heavy lifting for us, so simply importing it and establishing an initial connection with wrds.Connection() is sufficient to get us programmatic access to WRDS data which we can use throughout our Python program. To demonstrate, let's create a Python program for this batch submission that returns the Dow Jones Index by date. In your WRDS Cloud home directory, create a program named myProgram.py with the following code:
db = wrds.Connection()
data = db.raw_sql("SELECT date,dji FROM djones.djdaily")
We use the dataframe's to_csv() function to give us a quick and easy way to output the results to a file, to prove that the program is working as expected. In actual research programming, we would probably further manipulate the data result extensively before outputting it.
Python jobs cannot be submitted directly to Grid Engine on the WRDS Cloud, but must instead be submitted by a wrapper script: a simple shell script that tells Grid Engine how to call the Python program, and allows the opportunity to provide additional job-control parameters if desired. Recall that we named our program above myProgram.py. The following example is the simplest possible wrapper script to call this program:
Sets the shell of the wrapper script to bash.
Tells Grid Engine to look in the current working directory (cwd) for any referenced files, and to store output files in that same directory as well.
Runs the myProgram.py program in Python.
The following additional lines may also be helpful in your wrapper script:
Tells Grid Engine to look in the current working directory (cwd) for any referenced files, and to store output files in that same directory as well.
Instructs Grid Engine to send a notification Email at stages: before, after, and at error (if applicable).
Specifies the address to send the above Emails to.
Uses the command echo to send a bit of text and the current timestamp to an output file just before starting my Python program.
Runs the myProgram.py program in Python.
Uses the command echo to send a bit of text and the current timestamp to an output file just after finishing my Python program.
WRDS Cloud: Submitting a Batch Job
Once you have both the Python program and its wrapper script created, the job can be submitted via qsub. Let's say I named my wrapper script myProgram.sh. My qsub command would simply be:
The WRDS Cloud Grid Engine will then verify your job submission and schedule it to run on the currently least-utilized compute node in the cluster. When your job completes, all output files will be written to your current working directory (as described in the next section).
To view the status of a currently-running Python job on the WRDS Cloud, use the qstat command. If no results are given, you have no jobs running. Please consult the WRDS Cloud Manual for more information on working with the Grid Engine on the WRDS Cloud, including job monitoring, queue selection, and much more.
WRDS Cloud: Batch Job Output
When your batch job completes, several output files will be created containing the results of your job. Since we used the #$ -cwd parameter in our wrapper script, these files will be created in the same directory as the wrapper script itself. Assuming the job you ran was called 'myProgram.sh' which calls the Python program 'myProgram.py', these output files are:
myProgram.out - The output file of your Python program, containing the collection of observations that your code generated. These are your results, and were created via the to_csv() function. In reality, you can call this whatever you like, or use one of many other ways to output your results.
myProgram.sh.o###### - The Grid Engine file that contains all output from your myProgram.sh wrapper (such as the echo commands).
myProgram.sh.e###### - The Grid Engine file that contains all errors from your myProgram.sh wrapper. This will be empty if your program's execution did not experience any errors.
Where ###### is the Grid Engine job number.
Note that if you run your program again, the second run will overwrite your previous output file myProgram.out with the new results. You may wish to rename the outputfile to something else before you run your Python program again, should you wish to keep it. The easiest way to rename a file is the mv command, like so:
mv myProgram.out previous_results1
The above command renames your myProgram.out file as previous_results1.
You do not need to delete or rename the Grid Engine output or error files, since they contain the job number in their filename, and are therefore unique to all other jobs run.
The WRDS Cloud also supports connecting via a remote Python interface to access WRDS data. This allows you to program from the comfort of your own computer, using your own tools and resources. This can be done with one of many Python interfaces, both graphical and command line, including Jupyter Notebooks, Spyder, Rodeo, Python's own shell, and many others. Jupyter Notebooks are probably the easiest of these to get started with, and are available for Mac, PC, and Linux, so this document will focus on Jupyter -- but instructions should be similar for other Python interfaces as well.
Jupyter Notebooks: Installing Python
The first step to getting Jupyter up and running is to install Python on your workstation. This section applies to all platforms (Windows, Mac OS X, Linux). For installing Python, WRDS recommends using Anaconda, as it bundles up most of the relevant Python packages you'd want to use into one easy installation (including jupyter), and is packaged for installation on Windows, OS X, and Linux. To install Python using Anaconda:
Download the Anaconda installer specific to your platform from Anaconda Downloads. You may choose either the graphical installer or the command-line installer depending on your preference: they both accomplish the same thing. WRDS highly recommends Python 3.5 or later for use with WRDS data, so be sure your download the latest version shown.
When installing, pick a folder to install Anaconda to. You will need to find this folder later for other parts of this document. If you do not otherwise have a preference, WRDS recommends that you install to the Documents folder in your user home directory; the remainder of these instructions will assume you have done so.
NOTE: Depedning on your OS, and your familiarity with Python, you may already have a version of Python installed on your workstation. For the purposes of this document, we will install a new version of Python using Anaconda. If you are a power user, you may use a pre-existing installation of Python if you wish, though the installation of additional Python packages to support the wrds module will be up to you.
Jupyter Notebooks: Configuring Your Environment
The next step is to create a .pgpass file directly in your home directory. This section applies to all platforms (Windows, Mac OS X, Linux). This file allows the wrds module to connect to WRDS on demand, without requiring that you enter your credentials each time you connect. Ensure that the filename is all-lowercase, begins with a period, and does not have an extension. In Mac OS X you can do this easily in Terminal or TextEdit; in Windows it is advised to use a plain-text editor such as Notepad++. Do not use WordPad or Microsoft Word.
The .pgpass file must be located directly in your home directory, as follows:
Mac OS X: /Users/my_user/.pgpass
Note for Windows: %APPDATA% refers to the Application Data subdirectory in your user profile, which you can find by typing %appdata% in your Windows Search function)
Once you have created this file, write the following line to it:
Where 'your_username' is your WRDS Username and 'your_password' is your WRDS Password.
On Mac OS X and Linux systems, you must additionally secure the permissions on this file with the following command:
chmod 0600 ~/.pgpass
NOTE: On Microsoft Windows, it is assumed that the file is stored in a directory that is secure, so no special permissions check is made, and the above command is not required.
Jupyter Notebooks: Installing the wrds Module
The final step is to install the Python wrds itself. This section applies to all platforms (Windows, Mac OS X, Linux).
Open your system's command line interface (CMD for Windows, Terminal for Mac OS X)
Install the wrds module from PyPI: pip install wrds
Fire up a new Jupyter Notebook server: jupyter notebook
This will attempt to launch a new Jupyter Notebook in your web browser of choice. If it does not automatically open, simply copy and paste the URL presented in Terminal into your web browser. You should be presented with a list of files in your home directory.
Next, create a new Jupyter Notebook via the New > Python3 drop-down menu in the upper-right of your Jupyter home page.
Once you are sitting at the In  prompt, you are ready to continue with the Accessing WRDS Data from Python section.
Whether you are accessing data on the WRDS Cloud via an interactive or batch job, or are connecting remotely via Python or Jupyter Notebooks, the method for accessing WRDS data once connected is the same. Since we already created a .pgpass file to store our connection parameters, we merely need to execute the following to import the wrds module and then connect to WRDS:
db = wrds.Connection()
Having done this once at the beginning of our program, we can now use the functionality of the wrds module throughout our Python program.
The wrds module supports many different methods of accessing WRDS data. The best way to learn about them is from the inline documentation available for each of them within a Python console. Having saved our WRDS connection as the variable db, I can see what functions I may use against that connection by typing db. and then hitting the <Tab> key to expand a list of supported functions. Here is the list that appears:
NOTE: The engine(), insp(), and schema_perm() methods are internal methods used by the other methods, and do not provide any functionality on their own. The remainder of the methods listed are described in detail below.
To read the inline documentation for any of these, use the help() function once you've imported the wrds module and made your initial connection, like so:
Accessing Data: pandas and numpy
No matter which wrds method you use, your result is always returned as a pandas dataframe; pandas is a popular Python module that, together with numpy, forms the basis for most numeric data manipulation in Python. The pandas module offers powerful data manipulation capabilities for your returned results, and allows for a granular degree of control over your data output.
In all examples provided in this document, the returned results variable data is a pandas dataframe.
If you are unfamiliar with pandas or numpy, WRDS recommends the following resources:
When working with large data sources, it is important to begin your research with small subsets of the data you eventually want to query. Limiting the number of returned observations is essential while developing your code, as queries that involve large date ranges or query a large number of variables could take a long time and generate large output files.
Generally, until you are sure that you're getting exactly the data you're looking for, you should limit the number of observations returned to a sensibe maximum such as 10 or 100. Remember, much of the data provided at WRDS is huge! It is highly recommended to develop your code using such an observation limit, then simply remove that limit when you are ready to run your final code.
Examples of limiting the number of returned observations are given below in the Querying Data section, for the two methods that supports doing so: get_table() and raw_sql().
Querying Metadata - Examining the Structure of Data
When working with WRDS data, it is often useful to examine the structure of the dataset, before focusing on the data itself. WRDS data is structured according to vendor (e.g. crsp) and referred to as a library. Each library contains a number of component tables or datasets (e.g. dsf) which contain the actual data in tabular format, with column headers called variables.
We can analyze the structure of the data available to us through its metadata via the included methods of the wrds module, as outlined in the following steps:
List all avaialble libraries at WRDS, via list_libraries().
Select a library to work with, and list all available datasets within that library, via list_tables().
Select a dataset, and list all available variables (column headers) within that dataset, via describe_table().
NOTE: When working with the wrds module, libraries and datasets that you provide are case-sensitive, and must be lowercase.
Let's run through each of these three steps to examine the structure of our data. We'll assume that you have already imported the wrds module, and made your initial connection as db.
1. To determine the libraries avaialble at WRDS:
This will list all libraries available WRDS. Though all libraries will be shown, your institution must still have a valid, current subscription for a library in order to access it via Python, just as with SAS or any other suported programming language at WRDS. You will receive a helpful error message indicating this if you attempt to query a table to which your institution does not have access.
2. To determine the datasets within a given library:
Where 'my_library' is a dataset such as crsp or comp as returned from #1 above.
3. To determine the column headers (variables) within a given dataset:
Where 'my_library' is a dataset such as crsp as returned from #1 above and 'my_dataset' is a component database within that library, such as dsf, as returned from query #2 above. Remember that both the library and the dataset are case-sensitive!
The above queries help to establish the structure of the data we're trying to access; ascertaining the exact dataset and variable names we'll need to use when we go to access the data itself in the next section. Alternatively, a comprehensive list of all WRDS libraries is available via the WRDS Dataset List. This online resource provides a listing of each library, their component datasets and variables, as well as a tabular database preview feature, and is helpful in establishing the structure of the data you're looking for in an easy, web-friendly manner.
NOTE: If you are familiar with the previous SAS-based method of connection, you used the SAS dictionary table for this. That table required exclusively uppercase libnames and memnames, while this PostgreSQL method requires exclusively lowercase libraries and datasets. Be sure to update your code!
Now that we know how to query the metadata and understand the structure of the data, we can construct a data query to start gathering the actual data itself. The wrds module provides us several methods that are useful in gathering the data:
get_table() - fetches data by matching library and dataset, with the ability to filter by several parameters. This is the easiest method of accessing data.
raw_sql() - executes a SQL query against the specified library and dataset, allowing for highly-granular data filtration.
get_row_count() - returns the number of rows in a given dataset.
Each of these three is explored below.
Querying Data: get_table()
The method get_table() accepts the following parameters:
library - the library to query
table - the dataset to query
columns - the columns (variables) to include in the query (optional)
obs - the number of observations (rows) to return (optional)
offset - the starting point of for the query (optional)
For additional parameters, and further explanation of each, use the built-in help: help(db.get_table).
Here's an example that returns the first 10 rows of only the date and dji columns from the Dow Jones Index:
data = db.get_table(library='djones', table='djdaily', columns=['date', 'dji'], obs=10)
NOTE: The library and table parameters are required, and are also positional. Therefore, should you chose, you may omit the label for these two parameters as long as you supply them in order.
The following example illustrates this:
data = db.get_table('djones', 'djdaily', columns=['date', 'dji'], obs=10)
Querying Data: raw_sql()
The method raw_sql() accepts the following parameters:
sql - the SQL string to query
date_cols - a list or dict of column names to parse as date (optional)
index_col - a string or list of column(s) to set as index(es) (optional)
For additional parameters, and further explanation of each, use the built-in help: help(db.raw_sql).
Here's an example that does the same thing as get_table() example above, but uses SQL to select the data instead:
data = db.raw_sql('select date,dji from djones.djdaily LIMIT 10;', date_cols=['date'])
Querying Data: get_row_count()
The method get_row_count() only accepts the following two parameters:
library - the library to query
table - the dataset to query
For additional information, use the built-in help: help(db.get_row_count).
Here is an example that returns the row count for the Dow Jones Index:
Python is a popular language for data visualization these days, and the WRDS Cloud supports all the popular python graphing and plotting packages, such as matplotlib. Graphics access on the WRDS Cloud is available through XWindows, which your local machine must support. XWindows allows the data processing to happen on the server side, while the graphics display occurs on the client side (your workstation).
XWindows on Windows 8/10
Users conencting from a Windows workstation will need PuTTY and XMing - both are available for free online. Once both are installed, perform the following steps to be connected to the WRDS Cloud with graphics capabilities:
Lanch PuTTY and check the following option: Connection > SSH > X11 > Enable X11 Forwarding
Specify the Host Name in the Session section as: wrds-cloud.wharton.upenn.edu
Click the Open button to initiate the connection, then authenticate as normal
XWindows on Mac OS X
Users connecting from a Mac OS X workstation will need Terminal and XQuartz - Terminal is included with OS X and can be found in /Applications/Utilities/Terminal and XQuartz is available for free online. Once you've installed XQuartz (you may need to restart your computer after installing it), perform the following steps to be connected to the WRDS Cloud with graphics capabilities:
Launch Terminal from /Applications/Utilities/Terminal
SSH to the Wrds Cloud with the following command: ssh -X wrds-cloud.wharton.upenn.edu
XQuartz will automatically launch
Authenticate in Terminal as normal
NOTE: The -X flag in step #2 above is uppercase.
Generating a Graph
Once connected with graphical capabilities, graphical Python packages can be used within an interactive Python session as normal (as if you were running your Python session locally on your workstation). The following workflow will generate an example graph of the performance of the Dow Joes Industrial Index over time:
Start an interactive job on the WRDS Cloud: qrsh
After being assigned to a compute node, start an interactive Python 3 session: ipython3
Within your python session, import the wrds module: import wrds
Import the pyplot module, which we'll use for plotting: import matplotlib.pyplot as plt
Make your initial connection: db = wrds.Connection()
Submit your query: data = db.raw_sql("select date,dji from djones.djdaily", date_cols=['date'], index_col=['date'])
Alternatively, you could use get_table(): data = db.get_table("djones", "djdaily", columns=['date','dji'], date_cols=['date'], index_col=['date'])
Plot the results: plt.plot(data)
Label the X-Axis: plt.xlabel('date')
Label the Y-Axis: plt.xlabel('dji')
Draw the graph using XWindows: plt.show()
Python will draw your graph through the X Windows subsystem, plotted by date against the Dow Jones Index, and label both axes.
Should I wish to save the graph to a file, I can use: plt.savefig('my_graph.png'). This will save the graph to the filename specified in my current working directory on the WRDS Cloud. I could then transfer the file to my workstation using SCP or SFTP as described in the WRDS Cloud Manual, in the section Connecting to the WRDS Cloud.
WRDS has made many popular statistical Python packages available on the WRDS Cloud. To see the list of packages currently installed on the WRDS Cloud, enter pip3 list on the command line (i.e. not in an iPython session). If you find that you would like to use additional Python packages not presently supplied by WRDS, or a different version of one that is installed, you may install your own packages in your home directory and use them in your Python programming. This section details how to do this.
We'll use virtualenv to create a virtual Python environment that references both the central, WRDS-supplied package repositiory and your own custom packages.
NOTE: These procedures detail installing packages on the WRDS Cloud. If instead you are using Jupyter Notebooks from your workstation, you can install packages from Anaconda.
Installing Python Packages
To install Python packages on the WRDS Cloud:
From a WRDS Cloud head node, create a virtualenv directory in your home directory: virtualenv3 --system-site-packages ~/virtualenv
Activate your new virtualenv: source ~/virtualenv/bin/activate.csh
Install your desired package ('numexpr' for example): pip3 install numexpr
Now that you've installed an additional package, you'll need to use the virtualenv virtual environment every time you wish to use it. The following steps show you how to do so:
From a WRDS Cloud head node, start an interactive session: qrsh
Activate your virtualenv: source ~/virtualenv/bin/activate.csh
Launch iPython: ipython3
Import your package: import numexpr
You can now use the functionality of your custom Python package.
The same can be done for batch jobs as well. Simply activate your virtualenv in your wrapper script file to make the wrapped Python job aware of any modules you've installed.
There are many ways to perform analytical research against data, but most involve the two-step procedure explored above:
Examine the metadata to determine what data is available and by what names, then
Contruct a data query to gather precisely the data needed for the research task
The following set of Python queries represents a sample Python workflow using WRDS data, mimicking the workflow of a researcher who might be new to a given data library. The commands in this workflow could be run interactively, submitted via a batch job (or jobs), or run in a Jupyter Notebook.
The following assumes that you have already imported the wrds module and made your initial connection to WRDS as db.
1. What data libraries can I work with?
2. OK, I want to work with crsp. What datasets are offered in crsp?
3. OK great, the Daily Stock File (dsf) is what I want to research. What does the data mean? What are the variables that describe each column of the data?
4. Perfect, I only care about cusip, permno, date, bidlo, and askhi. Let's see what the data looks like if I grab the first hundred rows filtering for only these variables.
Both get_table() and raw_sql() can be used for this:
db.raw_sql('select cusip,permno,date,bidlo,askhi from crsp.dsf LIMIT 100')
5. Alright, this gives me a good idea of what my data looks like, but will probably return an awful lot of data if I were to remove the artificial limit. Let's filter it by a few permno values I'm interested in and limit to one day to get an idea of what such a snapshot of the market might look like.
Since we are now filtering by specific values for variables, we need to use raw_sql() exclusively. All future examples will be shown using raw_sql().
NOTE: Quotation marks are important. Because the date values in our query require quotes around them, we move to using double-quotes to surround the larger SQL string itself.
db.raw_sql("select cusip,permno,date,bidlo,askhi from crsp.dsf where permno in (14593, 90319, 12490, 17778) and date='2013-01-04'")
6. That's more like it. I'm interested in high askhi values; let's get a list of dates where these permnos posted an Ask Price over $2,000 between 2010 and 2013 (now no longer limiting the number of returned rows, as we're getting pretty specific):
db.raw_sql("select cusip,permno,date,bidlo,askhi from crsp.dsf where permno in (14593, 90319, 12490, 17778) and date between '2010-01-01' and '2013-12-31' and askhi > 2000")
7. Looks like only one of my permnos posted such a high Ask price during this window. I wonder how many other companies have similar Ask histories. Let's open the search to all permnos that have ever posted an Ask Price over $2,000 in any date range (I use distinct here to only return one entry per matching permno). Since I'm querying against all permnos and the entire date range of the dataset, I'm prepared for this query to take a little longer:
db.raw_sql('select distinct permno from crsp.dsf where askhi > 2000')
8. Let's retrieve all dates for which an Ask Price over $2000 was posted, along with the permnos that posted them. This will give me a list of dates that match, with an additional entry for that date if additional permnos match as well. By 2011 you can see that we're down to two contenders:
db.raw_sql("select distinct date,permno from crsp.dsf where askhi > 2000 order by date")
9. Just for fun, what was the highest Ask ever posted (searching only through Asks over $2000), on what date was it posted, and which PERMNO posted it? Note how I use LIMIT to only return the top value:
db.raw_sql('select permno,askhi,date from crsp.dsf where askhi > 2000 order by askhi desc LIMIT 1')
I think I'm starting to get the picture.
This is just an example of how one might approach an analytics task in Python. It begins by gathering metadata information using list_libraries() and list_tables() to learn more about the data available -- the libraries, datasets, and variables -- and then using that information to select meaningful observations from the data itself using get_table() and raw_sql().
A common next step after the above exploration is to then write a batch program that uses the above one-off queries together. An example might be a program that uses a loop to iterate over each permno that has ever posted an Ask price over $2000 and to calculate how long a date range it was able to maintain that height. Or perhaps certain dates were more prolific than others -- tallying the number of high asks per date might be informative.
Joining Data from Separate Datasets
Data from separate datasets can be joined and analyzed together. In this example, we will join Compustat Fundamentals (i.e. comp.funda) with pricing data (i.e. comp.secm), querying for total assets and liabilities mixed with monthly close price and shares outstanding.
db.raw_sql("select a.gvkey, a.datadate, a.tic, a.conm, a.at, a.lt, b.prccm, b.cshoq from comp.funda a join comp.secm b on a.gvkey = b.gvkey and a.datadate = b.datadate where a.tic = 'IBM' and a.datafmt = 'STD' and a.consol = 'C' and a.indfmt = 'INDL'")
The resulting data frame data contains data from both datasets joined by a common gvkey identifier and date. Note that running joins between large sets of data can require vast amounts of memory and execution time. Be careful to limit the scope of your queries to reasonable sizes when performing joins. This query reports on a single company (IBM) and with a data point frequency of one year, resulting in a result of 55 observations (as of 2017) with a very fast query execution time.
NOTE: Your institution would need to subscribe to both datasets for the above to work.
Python at WRDS was designed to be straightforward yet powerful, and was implemented and documented with feedback from users in mind. Specifically, we wrote the wrds Python module to make programming your WRDS research in Python a relative breeze! If you have ideas on improving the Python implementation at WRDS, or recommendations regarding this documentation, please contact WRDS Support. We hope you find using Python at WRDS an enjoyable experience.