This next installment begins with my personal intellectual curiosity about big data, offers resources for using the Contingent Faculty Index, and concludes with a practical how-to about extracting data from pdf tables into csv format (for use in Excel, SPSS, Google Fusion Tables, etc.).
How to Get a Humanist-Sociologist Excited about Big Data
Wendy Hsu, my partner in life and thought, has used big data as grounds for ethnographic inquiry by using maps to generate new sites of musical and cultural analysis [see On Digital Ethnography: Mapping as a Mode of Discovery]. I’m a cultural sociologist and as such, I’ve generally been able to avoid having to deal with the anxieties I have over statistics or numbers of any kind. But Wendy’s post provided me an entry point to a new methodological approach.
In my field, qualitative analysis (interviews, field observation, focus groups, etc.) is the little sibling of quantitative analysis (e.g., statistical analysis of survey data). The most judicious methods texts teach students that qualitative analysis is creative, engaging, and useful as a tool for exploring social phenomena, but that it is always a precursor to more rigorous statistical analysis, which can explain those phenomena.
We are at a moment when researchers can invert the normative relationship between qualitative and quantitative inquiry. The analysis of big data can help researchers explore new questions and find relevant cases for in-depth qualitative inquiry. Crowd-sourced data, open data sharing, and web-based visualization tools also enhance our ability to give voice to outliers.
I begin my exploration of faculty contingency with this perspective at the front of my mind. Until the big data movement, and particularly the development of web-based data visualization tools that help make meaning out of big data, I’ve generally been quant-phobic. Now I’m discovering that quantitative data can actually help me construct new questions and find appropriate samples (i.e., cases) for qualitative analysis. I can’t avoid numbers anymore.
I teased in the previous post of this series that the AAUP’s Contingent Faculty Index can help me — a cultural sociologist and qualitative researcher — generate new questions and identify appropriate cases for local-level analysis. In order to make progress on the social problem of faculty contingency, I believe we need to answer small, cultural, local-level questions about the adjunct crisis. Big data and data visualizations help me identify where to begin.
Opening Up AAUP’s CFI Data
IPEDS data was difficult to access and parse in 2006, despite its existence in a freely accessible repository. The AAUP Contingent Faculty Index visualizes this data (N=2,617) for its readers in the form of tables in a pdf. As an index, it is a useful lookup tool for single institutions, but the reader cannot, of course, interact with the tables to re-sort the data by institution type, location, size, or % contingent, or to calculate descriptive statistics.
Since 2006, readers have become more accustomed to interactive data and the use of big data to help answer questions large and small. For the CFI to be read in this way, the data need to be “opened” by changing their format. Luckily, the CFI is already well on its way to containing open data. The index is already freely available and machine readable thanks to the hard work of the CFI authors John W. Curtis and Monica Jacobe and the support of AAUP.
“Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.” OpenDefinition.org – Find an overview of the concept of Open Data at Open Knowledge Foundation
To get started, I first made the index more easily readable by tweaking the pdf formatting. I then extracted the data into a spreadsheet format and lightly cleaned it for use in Excel (.csv).
Reading and Searching the CFI’s Machine Readable Data
Use the Index to search for specific schools
The appendices offer a school-by-school analysis of contingent counts and percentages across full-time instructional, full-time research, part-time and graduate faculty along with totals for each institution.
The original document is a pdf and is machine-readable. This allows you to find your school using
Control + F on a PC and
Command + F on a Mac.
Note that each table is split across two pages, making it difficult to read electronically.
I wanted to fix this inconvenience, so I modified the pdf [download modified pdf here] to include an extra blank page at p. 19 so that the charts display across from each other (as likely intended by the authors). I also rotated the charts on pp. 17 and 18 to landscape view so no one strains their neck trying to read sideways.
Open the modified document in Adobe Acrobat Reader…[freely download Adobe]
Go to “view” → “page display” and choose “Two Up” or “Two Up Continuous.”
Now you can easily read across to find all data for a particular school.
The New Table of Contents
→ Doctoral programs begin on p. 20
→ Masters programs begin on p. 38
→ Baccalaureate programs begin on p. 74
→ Associates programs begin on p. 110
Extracting and Sharing the Data
I wanted to extract the data from the pdf to:
- Examine some patterns that were beyond the scope of the original study
- See what else could be gleaned from the AAUP’s existing, well-formatted data
- Prepare data for comparison with more recent data from IPEDS and other sources
I extracted the Baccalaureate data from the pdfs into a spreadsheet [Baccal_Pages_AAUPContingentFacultyIndex2006].
The spreadsheet is also now publicly available in Google Fusion Tables.
I document my extraction process below in case these steps prove useful to other extraction and sharing efforts.
Things that didn’t work
A. Copy paste from pdf to Excel resulted in all data in one column 😦
B. Export to XML resulted in a muck of code 😦
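For readers without Acrobat Pro, the one-column paste from attempt A can sometimes be salvaged with a short script. What follows is only a minimal sketch in Python, not the method documented in this post. It assumes that each pasted line ends with a fixed number of numeric fields and that everything before them is the institution name. The function name split_row, the field count of 4, and the sample rows are all hypothetical and are not taken from the CFI.

```python
import re

# Matches a number such as 1,234 or 56.7, with an optional trailing % sign.
NUMERIC = re.compile(r"^\d[\d,]*(\.\d+)?%?$")


def split_row(line, n_numeric):
    """Split one pasted line into [name, num1, ..., numN].

    Assumes (hypothetically) that the last n_numeric whitespace-separated
    tokens are numbers and everything before them is the institution name.
    Returns None when the line does not fit that pattern.
    """
    tokens = line.split()
    if len(tokens) <= n_numeric:
        return None
    name_tokens, numbers = tokens[:-n_numeric], tokens[-n_numeric:]
    if not all(NUMERIC.match(tok) for tok in numbers):
        return None
    return [" ".join(name_tokens)] + numbers


# Made-up rows for illustration only; these are not values from the CFI.
pasted = [
    "Example State University 120 45 300 64.5",
    "Sample College of Arts 1,050 12 88 7.7",
    "Institution Full-time Part-time Total Percent",  # header row, skipped
]

rows = [r for r in (split_row(line, 4) for line in pasted) if r is not None]
for row in rows:
    print(row)
```

This approach breaks down when an institution name itself ends in a number or when a cell is blank, so any output would still need a manual check against the pdf.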
Things that did work
Using Adobe Acrobat Pro (I’m still on version 9.0)…
1. Extract the desired page(s) that contain(s) the table data you want
2. From your new pdf of extracted pages you can either…
…Export the pdf to HTML, open Excel, and import the HTML file
…Save the pdf as “tables in Excel.” This produces an XML file that Excel can read [recommended]
3. Manually move the data into a single sheet using copy/paste because each page will appear on a separate sheet in the same workbook.
4. To delete formatting, save the Excel file as .csv and then reopen the .csv file in Excel or another analysis program. This will allow you to use filtering and sorting functions without running into errors due to merged table cells.
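Steps 3 and 4 can also be done outside Excel. Here is a minimal sketch using Python’s standard csv and statistics modules, assuming each page has been saved as its own .csv with a repeated header row. The column names (Institution, Percent Contingent), the 50% threshold, and the sample rows are placeholders made up for illustration, so substitute the headers from your own file.

```python
import csv
import io
import statistics

# Placeholder data standing in for two exported pages; not real CFI values.
# Each page repeats the header row, as separately exported sheets would.
PAGE_1 = "Institution,Percent Contingent\nExample College,40.0\nSample University,62.5\n"
PAGE_2 = "Institution,Percent Contingent\nHypothetical Institute,55.0\n"


def combine_pages(pages):
    """Merge per-page CSV text into one list of dicts (step 3)."""
    rows = []
    for text in pages:
        # DictReader consumes each page's header row, so headers never
        # end up mixed in with the data rows.
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def to_percent(row, column="Percent Contingent"):
    """Convert a cell such as '62.5' or '62.5%' to a float."""
    return float(row[column].strip().rstrip("%"))


rows = combine_pages([PAGE_1, PAGE_2])

# Sort from highest to lowest share of contingent faculty.
ranked = sorted(rows, key=to_percent, reverse=True)

# Filter: keep institutions above a threshold chosen for illustration.
majority = [r["Institution"] for r in ranked if to_percent(r) > 50]

values = [to_percent(r) for r in rows]
print("Highest:", ranked[0]["Institution"])
print("Above 50%:", majority)
print("Mean:", statistics.mean(values))
print("Median:", statistics.median(values))
```

To run this against real files, replace each io.StringIO(text) with a file opened via open(path, newline=""), where path points to one of your own exported .csv files.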