BBL on "terabytes of data"
"Incremental Internet data collection for the systematic study of the diffusion of policy innovations through international NGO networks"
On Tuesday, November 27th, Thomas Hannan, a UCLA PhD candidate who has been working with Global Integrity this fall, spoke at an OpenGov Hub Brown Bag Lunch (BBL) session about a data collection project he has been working on. The session covered:
1. An outline of the method that Thomas and his co-authors have developed for systematically studying the diffusion of policy innovations through international networks of HIV/AIDS NGOs and their donors.
2. Impressions of a major stumbling block that has stumped him (not particularly tech-savvy) and his colleagues: how to quickly, efficiently, and ideally inexpensively index the enormous amount of data they have collected so far (terabytes' worth of plain text files).
3. An open discussion and suggestions on how to troubleshoot this indexing problem together.
For some background to the BBL, feel free to read about the team's research method and the problem they are facing below.
We wrote a piece of computer code designed to act as an automated Internet (hereafter, “web”) ‘crawler’. Once a week, the web crawler automatically visits, in turn, each website on a list comprising the websites of over 600 HIV/AIDS NGOs, INGOs, advocacy groups, and funders/donors. It ‘scrapes’ the text of each website to a “recursion” level of three; that is, it saves a copy of the full text of the website’s pages, down to our pre-determined recursion level, as a plain text file on a local hard drive. Each week’s saved plain text file for each website is time- and date-stamped. The web crawler will be left to do this every week for a long time: long enough that when, years later, we hear of the latest HIV/AIDS policy innovation that has gained widespread prominence and legitimacy, we can turn to our saved and “indexed” web crawler data and search for the name of the new policy innovation. (By “indexed” we mean quickly and efficiently cataloged so as to be almost instantaneously searchable, similar to a library's cataloging of its collection of books or Google's indexing of websites for its search engine.)
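To make the idea of “indexing” concrete, here is a minimal sketch in Python of a positional inverted index over weekly snapshots. It is not the team's code. The class name SnapshotIndex, the site names, and the use of ISO date strings as week labels are all assumptions made for illustration, and an in-memory structure like this would not scale to terabytes of text. It only shows the principle: process each saved file once, then answer phrase searches without rereading the files.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return TOKEN.findall(text.lower())


class SnapshotIndex:
    """Hypothetical in-memory index: term -> {(site, week): [word positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, site, week, text):
        """Index one weekly snapshot; week is an ISO date string (assumption)."""
        for pos, term in enumerate(tokenize(text)):
            self.postings[term].setdefault((site, week), []).append(pos)

    def search_phrase(self, phrase):
        """Return (site, week) pairs whose text contains the exact phrase."""
        terms = tokenize(phrase)
        if not terms:
            return []
        hits = []
        for doc, positions in self.postings.get(terms[0], {}).items():
            for start in positions:
                # Every following word must appear at the next position.
                if all(
                    start + i in self.postings.get(term, {}).get(doc, ())
                    for i, term in enumerate(terms[1:], 1)
                ):
                    hits.append(doc)
                    break
        # Sort by week first, then by site name.
        return sorted(hits, key=lambda doc: (doc[1], doc[0]))
```

In practice, a dedicated full-text search engine would likely play this role; the sketch is only meant to show what such a tool does under the hood.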
The result of this search will be a list of all the HIV/AIDS NGO websites on which the searched-for phrase (that is, the name of the policy innovation) appears, sorted by the exact week that the phrase appeared on each website. This list, then, should allow us to trace: (i) exactly where and when the policy innovation originated; (ii) exactly how the policy innovation diffused throughout our network of HIV/AIDS NGOs, INGOs, advocacy groups, and funders/donors (as indicated by when the policy innovation was first mentioned on each organization’s website); (iii) the relative speed with which the innovation diffused; and (iv) its precise “diffusion curve.” In short, this method should allow us to trace the origins of policy innovations and, for each policy innovation, the exact process, week by week, through which it rose from obscurity to ubiquity, and the route through which it rose.
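As a rough illustration of how such a search result could be turned into a diffusion curve, the Python sketch below takes a list of (site, week) hits, finds the first week each site mentioned the phrase, and counts the cumulative number of adopting sites week by week. The function names and the ISO date week labels are assumptions for illustration, not part of the team's method.

```python
from collections import Counter


def first_mentions(hits):
    """Map each site to the earliest week in which the phrase appeared."""
    first = {}
    for site, week in hits:
        # ISO date strings compare correctly as plain strings.
        if site not in first or week < first[site]:
            first[site] = week
    return first


def diffusion_curve(first):
    """Return (week, cumulative number of sites) pairs in week order.

    Weeks in which no new site adopted the phrase are omitted.
    """
    new_per_week = Counter(first.values())
    curve, total = [], 0
    for week in sorted(new_per_week):
        total += new_per_week[week]
        curve.append((week, total))
    return curve
```

The site with the earliest first mention is a candidate origin, and the slope of the cumulative count gives a rough sense of how fast the innovation spread.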
Interested in running your own BBL session at the OpenGov Hub? Send an email to email@example.com.