Virginia Tech

The 4/16 Digital Library

A digital library for recovery, research, resources, analysis and community relating to 4/16/2007, and after, at Virginia Tech



Research Plan

Below is an updated summary of our project plans as proposed to the US National Science Foundation. See also the actual proposal, less the budget, here.

Introduction

This project will support a wide range of research studies, as well as inquiries from the general public, related to the tragedy that occurred during the morning of April 16, 2007 on the Virginia Tech campus in Blacksburg, VA. The target audience includes those interested in how technology aids in detecting, preventing, and responding to disasters in highly connected settings. Concern with many other issues, such as social support, psychological health, and coping, leads behavioral and social scientists to request support for data curation as well as special services involving data mining and information visualization. A key question is how digital libraries can work in rapid-response settings, as well as for studying the aftermath of tragedies: testing hypotheses, collecting and analyzing related data, visualizing findings, discovering trends and patterns, modeling and simulation, and building and validating improved theories and models. This project will lead to further development of the theory and software support for large-scale digital libraries that also allow researchers to apply closely coupled data mining and visualization services, so that archived content can be analyzed efficiently and conveniently, and so that trends and outliers can be spotted. Computer and information scientists, following legal, policy, and human-subject guidelines, can study portions, or the whole, of the resulting testbed of content, services, usage logs, etc.

Project Description

Our multidisciplinary team will research how digital libraries (DLs) [1, 2] can provide immediate and ongoing support during crises and their aftermath, especially on university campuses. We will develop a testbed supporting a wide range of research studies including those aided by data mining, visualization, and social network analysis. We will validate our approach with data and multimedia information related to the events on 4/16/2007 at Virginia Tech (VT-416), when 33 members of the university community were killed by a student turned gunman. As can be seen from Figure 1, our DL will be at the heart of our research activities and should have significant broader impacts when used by large numbers of scholars, as well as the general public. The specification of the DL will draw on suggestions we receive, including from focus groups that will convene during the 2007/2008 academic year.

Project Overview Diagram

It is extremely important that our digital library be put into operation as soon as possible, so that data now being captured by various parties to assist in understanding VT-416 can be brought together and used to support the research of sociologists, psychologists, and others interested in crises, tragedies, stress, grief, coping, and many related topics. This must be done in a flexible way, so that we can adapt the DL to ever-changing needs.

DL Generation

DL-VT-416 will be built using our semi-automatic approach to rapid development of digital libraries [3, 4], which PI Fox and his colleagues have been developing for over five years in connection with the 5S framework [5]. In addition to a number of related doctoral dissertations [6-8], several Master's theses [9, 10] have contributed to this approach; the most recent is by Gorton [11], who summarized the overall situation. We will generate a DL, and revise it as needed, building initially upon the popular DSpace system [12, 13]. We will connect it with data mining, social network analysis, and visualization capabilities [14], leading to a flexible support infrastructure in which rapid testing of hypotheses will be made possible, as follows.

Supporting Systems-Level Science

Our goal is to support "systems-level" science on the social dynamics associated with crisis events. Inspired by current research trends in biology and the life sciences, systems-level science seeks to understand the functioning of very large and complex systems, and all the interactions therein, in a holistic fashion. This is in contrast to more traditional reductionist science, which narrows the problem down to focus on specific individual variables. Systems-level science is more exploratory in nature, and encourages the development of new hypotheses. In the life sciences, the "system" of systems-level science refers to the functioning of complex biological organisms. In Virginia Tech's case, the "system" under study comprises complex communities of people and how they respond to crises. Systems-level science is more challenging to implement than other approaches, but can offer deeper insights into the underlying phenomena. Systems-level science requires:

- Rapid and continuous collection of massive data:
The recent growth of systems-level science in biology was supported by the invention of microarray and similar instrumentation that enables simultaneous collection of data about thousands of genes and proteins at the cellular level, thereby offering a picture that is both detailed and broad. Similarly, our digital library will be organized to ingest large collections, such as email logs, that are collected and curated over time.
- Integration of diverse, heterogeneous data:
Our digital library must be able to bridge diverse data sources, including vast electronic logs such as Google searches and email logs, as well as rich personal sources such as Facebook pages, surveys, and interviews. We will investigate new kinds of data that could be captured, such as through volunteer tracking and deployment of our two Microsoft SenseCam/Memex units.
- Realtime exploratory analysis:
To gain deep insight, our digital library must link with analysis tools to support rich exploratory analysis of complex patterns and interrelationships for theory development. Realtime analysis must maintain synchronization with incoming information to enable awareness of breaking hypotheses and quick response. Access, analysis, dissemination, and utilization will be supported by visualizations and data mining tools.

Visualization

For access, analysis, and dissemination of our library contents, we will integrate a set of visualization tools, in an effort led by Dr. North. To support heterogeneous and dynamic data collections, flexible visualization capabilities are required. We will integrate our Visualization Schemas framework with our 5S digital libraries framework to offer visualization capabilities that users can model and curate in much the same way that they do for the digital library content. We will liaise with other institutions that offer diverse library visualization tools that could plug into this environment, such as Pacific Northwest National Laboratories' InSpire system and Penn State's Improvise system. To enable new social science research through realtime analysis of our continuously changing library content, we will link the digital library with Virginia Tech's GigaPixel Display. The GigaPixel Display project ( http://infovis.cs.vt.edu/gigapixel/ ) offers nearly 200 million pixels of display space for massive data visualization and situational awareness. Recent research results indicate that such large displays significantly expand and enhance human abilities for visualizing large data and maintaining awareness of dynamic data. This can enable a new form of social research that occurs in realtime. Scientists can examine trends as they occur, such as changes in later crisis reactions by certain population groups, and potentially work to affect outcomes. The GigaPixel Display will give the library a living presence.

Data Mining

We will investigate a multi-pronged approach to mining and harnessing the collection of information brought together in DL-VT-416. First, led by Dr. Ramakrishnan, we will mine the time-stamped series of documents to uncover the key trends that characterized the tragedy and the ensuing response. Next, we will mine the network of relationships induced by communications as recorded on various social networking sites, and characterize this network temporally in light of the trends identified in the first step. This will aid in understanding whether particular forms of communication were especially prevalent during different stages of the unfolding sequence of events. Finally, we can explore different projections of the multi-dimensional data space and determine whether trends manifesting globally are also reflected in the local views. The results of data mining will ideally be parameters of information diffusion that can then be used to drive a system-wide model of human-human communication, which in turn can be used for simulating synthetic scenarios. The algorithms we will explore include Kleinberg's burst detection algorithm; storyline extraction from collections of documents; graph characterizations of networks, such as connectedness, average shortest path length, and clustering coefficient; and multi-dimensional aggregates and views.
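The graph characterizations named above (connectedness, average shortest path length, clustering coefficient) can be illustrated with a minimal sketch. It assumes communications have already been reduced to (sender, receiver) pairs and treats the network as undirected and unweighted; the function names and the sample pairs are hypothetical and are not taken from the project's software.

```python
from collections import defaultdict, deque


def build_graph(pairs):
    """Build an undirected adjacency map from (sender, receiver) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:  # ignore messages a person sends to themselves
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def bfs_distances(graph, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def is_connected(graph):
    """True if every node can be reached from an arbitrary start node."""
    if not graph:
        return True
    start = next(iter(graph))
    return len(bfs_distances(graph, start)) == len(graph)


def average_shortest_path_length(graph):
    """Mean hop count over all ordered pairs of distinct, mutually reachable nodes."""
    total = 0
    count = 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                count += 1
    return total / count if count else 0.0


def average_clustering(graph):
    """Mean of local clustering coefficients (nodes with < 2 neighbors count as 0)."""
    coeffs = []
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nbr_list = list(nbrs)
        # Count edges that exist between pairs of this node's neighbors.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbr_list[j] in graph[nbr_list[i]]
        )
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# Hypothetical sample: four people, one triangle plus one pendant contact.
sample = build_graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
print(is_connected(sample))
print(round(average_shortest_path_length(sample), 3))
print(round(average_clustering(sample), 3))
```

Computing these measures separately for successive time windows would be one way to characterize the network temporally, though the windowing scheme is a design choice the plan does not specify.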

Social Network Analysis

Dr. Fan has extensive expertise in focused crawling, text mining, and social network analysis [15-23]. He will help with crawling data from different sources and with text mining and social network analysis to analyze emerging patterns from the testbed. Social network analysis is a proven technique widely used in social science to understand properties related to a social system. It will help us understand not only the global properties of a network, such as the average betweenness of two nodes or the average in-degree and out-degree among all nodes in a network, but also the properties of individual nodes, such as centrality in the network, in-degree, and out-degree. Many data sets from our data collection will have networks of relationships. For example, each email exchange (which we can study with IRB approval if released for research by all parties involved) will set up a link between a sender and a receiver. Similarly, in the popular Facebook discussion forum, every message will include a message originator and a respondent. Analyzing these kinds of social exchange data using social network analysis combined with text mining techniques will help us answer interesting research questions related to the 04/16/2007 VT tragedy such as:

Sustainability and Broader Impacts

Virginia Tech University Libraries, along with the various branches concerned with Special Collections and Archives, has been in touch with the Library of Congress and other groups, and will maintain into the future an archive related to VT-416. We are coordinating closely with them, as well as with the Center for Digital Discourse and Culture, and will ensure the sustainability of the DL. Our focus will be on handling digital information, collecting it as quickly as possible, and supporting the broader impacts of such information through a variety of services aimed at the needs of researchers and the general public (see the figure above). Our approach, partially described above, will be refined as we collaborate with those forming a research agenda for this field, those providing data, and those engaged in research and education activities. We will have a web site and widely disseminate our findings through online, conference, and journal venues. We also will build upon work previously supported by NSF, including studies described below.

Results from Prior NSF Support

Drs. Fan and Fox are completing four years of work with NSF (ITR) funding, through grant IIS-0325579, entitled "Information Technology Research: Managing complex information applications: An archaeology digital library". This was launched by an archaeologist, Project PI James Flanagan (CWRU), with the IT aspects led by VT PI Fox and co-PI Fan. VT's subcontract was for $189,500, covering 9/1/03-12/31/05, but a no-cost extension allowed continuation of research into the summer of 2007. The ETANA-DL (digital library - see http://etana.dlib.vt.edu for publications, presentations, and a link to the system) provides an integration framework and broad set of services operating on data from sites in Jordan and Israel. Two dissertations, two theses, and a number of papers have been published [4, 24-38]. Tools have been developed for schema mapping and integration, and the system supports search, multi-dimensional browsing, visualization, comparison, data export, and extensibility to more sites and other domains.

Dr. North, serving as PI working with colleagues Doug Bowman, Roger Ehrich, and Steve Harrison, is completing work on "Towards Boundless Display: Developing a Reconfigurable Research Testbed for Large-scale, High-resolution Visual Displays". NSF #CNS-04-23611 (08/16/04 - 08/15/07) supported the construction of the GigaPixel Display Laboratory ( http://infovis.cs.vt.edu/gigapixel/ ), hosted by Virginia Tech's Department of Computer Science and the Center for Human-Computer Interaction (CHCI), and directed by Dr. Chris North. This NSF-funded facility contains reconfigurable ultra-high resolution displays, totaling approximately 200 million pixels, one of the highest resolutions in the world. In addition to resolution, a unique aspect of this facility is its diversity of technologies and reconfigurability. Display technologies include rear-projection blocks and LCD panels. Interactive devices include touch panels, 6 DoF trackers, laser trackers, RFID, and various handhelds. Reconfigurability enables the display blocks and panels to be rearranged into arbitrary form factors, with plug-n-play flexibility of input devices. The facility is supported by computational clusters and by software that supports rapid reconfiguration. The facility is co-located with the CHCI's AwareLab, providing VICON vision-based tracking for interactive input, and the 3Di Laboratory, providing immersive 3D displays.

This facility provides an ideal research testbed for exploring fundamental questions of the design of future human-computer interfaces. It also provides a resource for advanced visualization and analysis of very large data. The massive number of pixels enables analysts to efficiently visualize much larger quantities of data than on traditional desktop displays. A current project resulting from this facility is designing and evaluating visualizations for intelligence data analysis for the National Geospatial-Intelligence Agency. Research results have shown significant user performance advantages of such large-scale visualizations over their small-scale counterparts. Initial results are published in [39-45].

References

Documents