Virginia Tech

The 4/16 Digital Library

A digital library for recovery, research, resources, analysis and community relating to 4/16/2007, and after, at Virginia Tech



Analysis

In the future, we hope to provide a number of services from this site, including queries, visualizations, data mining, semi-automatic classification, grouping, and tagging. Our goal is to support collaborative analysis and visualization in a similar vein to the services provided by the ManyEyes project.

We are just beginning to analyze some of the data available to us, and just starting to explore hypotheses and models that have been proposed with regard to April 16. Below we discuss some preliminary observations, to illustrate the potential of the approach.

April16Archive.org, maintained by Virginia Tech's Center for Digital Discourse and Culture (CDDC), contains a collection of digital media artifacts, including pictures, videos, and text articles such as blogs and stories. The metadata contains information about both the contributors (e.g., age, gender, IP address, and sometimes occupation) and the objects they contributed (e.g., description, title, type, tags, contributor, and date added). At the time of this analysis, the archive held 737 objects contributed by 98 contributors, with 904 tags. Below is a collection of visualizations of the archive metadata, generated with Spotfire and IN-SPIRE.

Demographics

This graph shows the contributor's birth year (when reported) with respect to the date that each artifact was added. The gender of the contributor of the artifact is shown by the color (red: female, blue: male, grey: not provided) and the general type of submission is indicated by the shape (square: story, circle: binary file, triangle: image). There is a wide spectrum of submissions up until about May 12th, at which point there is a week-long gap. After May 21st, there are two main contributors, both members of the CDDC. They produced the lower collection of mostly squares and the upper collection of triangles. Contributions from other submitters after that date are only sporadic. The graph is jittered to show more of the overlapping objects.

Age x Submissions

The number of submissions per contributor shows power-law behavior: most contributors made only one or two submissions, with the exception of the members of the CDDC staff:

Submissions per contributor
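As a minimal sketch of how such a distribution can be tabulated, the following counts submissions per contributor and then counts how many contributors share each submission count. The records and the function names are hypothetical and for illustration only; they are not taken from the archive metadata or from the tools used to produce the graph.

```python
from collections import Counter

# Hypothetical (contributor, object_id) records; not actual archive data.
submissions = [
    ("staff_a", 1), ("staff_a", 2), ("staff_a", 3), ("staff_a", 4),
    ("staff_b", 5), ("staff_b", 6), ("staff_b", 7),
    ("visitor_1", 8), ("visitor_2", 9), ("visitor_2", 10), ("visitor_3", 11),
]


def submissions_per_contributor(records):
    """Count how many objects each contributor submitted."""
    return Counter(contributor for contributor, _ in records)


def contributors_by_submission_count(per_contributor):
    """Map each submission count to the number of contributors with that count."""
    return Counter(per_contributor.values())


per_contributor = submissions_per_contributor(submissions)
histogram = contributors_by_submission_count(per_contributor)

# Most active contributors first.
for contributor, count in per_contributor.most_common():
    print(contributor, count)
```

In a power-law-like distribution, the second table is dominated by the entries for one or two submissions, with only a few contributors at the high end.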

The majority of the contributions are from Virginia, but there are also submissions from a variety of other states and countries. The labels, from left to right, are: Arizona, California, Colorado, Connecticut, District of Columbia, Dominican Republic, Georgia, Illinois, Maryland, Massachusetts, Michigan, Missouri, New Jersey, New York, North Carolina, Ontario, Pennsylvania, Sweden, Texas, Ukraine, Virginia, and Western Australia:

Submitters by state

Tags

One line of research has to do with topical categories of collected content. The site allows people to tag uploaded files. This table shows pairs of tags that co-occurred often in the collection of artifacts. Thus, for 55 of the artifacts in the collection, "Blog" and "Commentary" both appeared, showing the popularity of blogs for the purpose of making comments. As a second example, we note that the co-occurrence of Professor Librescu's name and the name of a leading Romanian magazine can help guide us to the news coverage in the country of origin regarding one of the heroes killed on April 16.

Tag correlation table
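The co-occurrence counts described above can be sketched as follows, assuming each artifact's tags are available as a set. The tag sets and the function name are hypothetical examples, not archive data, and this is not necessarily how the table itself was produced.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag sets, one per artifact; illustrative only.
artifact_tags = [
    {"Blog", "Commentary"},
    {"Blog", "Commentary", "memorial"},
    {"memorial", "vigil"},
    {"Blog", "memorial"},
]


def tag_cooccurrence(tag_sets):
    """Count, for each unordered pair of tags, the artifacts carrying both."""
    pairs = Counter()
    for tags in tag_sets:
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(tags), 2):
            pairs[pair] += 1
    return pairs


cooccurrence = tag_cooccurrence(artifact_tags)

# Most frequent pairs first.
for (tag_a, tag_b), count in cooccurrence.most_common(3):
    print(tag_a, tag_b, count)
```

The same counter, laid out with tags on both axes, gives the entries of a full co-occurrence matrix.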

The following graph shows the full co-occurrence matrix, with larger dots representing more frequent co-occurrences (click on the graph for a high resolution version).

Tag correlation matrix

The graph below shows the frequency of individual tags. Tags are sorted alphabetically on the Y axis, and frequency of use is shown on the X axis. The most popular tag is "memorial". We see that people were involved not only in events like the campus vigil, but also in electronic communication, using blogs and other mechanisms to prepare commentaries.

Tag count graph

The next two graphs show the lifetime of each tag. In the first graph, each pair of points connected by a line is a tag; the two points show the first and last usage of that tag. The second graph shows the first usage (X axis) with respect to the last usage (Y axis) of each tag. The color and size indicate the number of times the tag was used. This is an interesting way to see how focus changed over time. (Click on the graphs for high resolution versions.)

Tag lifetime

Tag usage graph
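The first usage, last usage, and usage count plotted in these graphs can be derived from a list of tag uses with dates. The following is a minimal sketch under that assumption; the sample dates and the function name are hypothetical and not taken from the archive.

```python
from datetime import date

# Hypothetical (tag, date_added) pairs; not actual archive data.
tag_uses = [
    ("memorial", date(2007, 4, 17)),
    ("vigil", date(2007, 4, 18)),
    ("memorial", date(2007, 5, 30)),
    ("Blog", date(2007, 4, 20)),
    ("memorial", date(2007, 4, 25)),
    ("Blog", date(2007, 5, 2)),
]


def tag_lifetimes(uses):
    """Return {tag: (first_use, last_use, use_count)} for each tag."""
    lifetimes = {}
    for tag, used_on in uses:
        if tag not in lifetimes:
            lifetimes[tag] = (used_on, used_on, 1)
        else:
            first, last, count = lifetimes[tag]
            lifetimes[tag] = (min(first, used_on), max(last, used_on), count + 1)
    return lifetimes


lifetimes = tag_lifetimes(tag_uses)

# One row per tag: first use, last use, lifetime in days, number of uses.
for tag, (first, last, count) in sorted(lifetimes.items()):
    print(tag, first, last, (last - first).days, count)
```

A tag used only once has identical first and last dates, so it appears as a single point in the first graph and on the diagonal in the second.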

Clusters of Text Words

These visualizations, produced with the IN-SPIRE software from all the text of all the contributions to the April16Archive, illustrate how texts in a digital library can be analyzed and presented. The contributions are clustered according to a textual analysis of words and phrases in the text, and the clusters are labeled with frequently used words within each cluster. The first is a ThemeScapes view, where clusters of contributions are shown as mountains that form a landscape of themes. The second is a GalaxyView, where each dot represents one contribution. Other than the differences in visual representation, these two views present essentially the same analysis and clustering layout. In particular, the placement of contributions and their groupings reflect a multi-dimensional scaling analysis, whereby closely related works are placed near each other, and sections of the visualization reflect aspects or issues popular in the collection.

Thus, for example, on the right side is a large cluster about the student community, with some sub-clusters for the vigil and memorials. In the top-right is a cluster about societal issues. The middle areas are mainly about various events and memorials, like the convocation, drillfield, etc. The left side covers external issues and foreign languages. Interest in Professor Librescu (an April 16 hero) led to a sizable cluster here. Issues related to Korea are concentrated toward the top and left. (Click on the graphs for high resolution versions.)

Themescape

Galaxyview
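The details of the IN-SPIRE text analysis are not described here. As a much simpler stand-in, the sketch below computes bag-of-words cosine similarity between contributions, which is one common way to quantify how "closely related" two texts are; pairwise measures of this kind are what a multi-dimensional scaling layout takes as input. The sample texts and function names are hypothetical, and this is not the method used by IN-SPIRE.

```python
import math
from collections import Counter

# Hypothetical contribution texts; not actual archive content.
documents = {
    "a": "candlelight vigil on the drillfield",
    "b": "students gathered for the vigil on the drillfield",
    "c": "commentary on media coverage",
}


def word_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = {name: word_counts(text) for name, text in documents.items()}
sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])

print("a vs b:", round(sim_ab, 3))
print("a vs c:", round(sim_ac, 3))
```

Here the two vigil-related texts score higher against each other than against the commentary text, so a layout based on these scores would place them closer together.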