Teraflow Data Services for the Sloan Digital Sky Survey (SDSS)
Astronomical data is growing at an exponential rate, doubling
approximately every year. The main reason for this trend is Moore's
Law: the power of the underlying hardware used for data collection
and processing grows at that rate. As one example, the
Sloan Digital Sky Survey (SDSS) is mapping in detail one-quarter of
the entire sky, determining the positions and brightnesses of more
than 300 million celestial objects. It will also measure the distances
to more than a million galaxies and quasars.
A research group led by Alex Szalay from Johns Hopkins University, in
collaboration with Jim Gray from Microsoft, is building the science
archive for the project. The first data from this project was
released in 2001 and was about 80 GB in size. The second release
(DR1) took place in 2003 and consisted of about 1 TB of data.
The third release (DR2) took place in 2004 and consisted of
about 1.7 TB.
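
As a rough check on the growth claim, the release sizes quoted above can be turned into implied doubling times. The sketch below is illustrative only: it assumes exponential growth between consecutive releases, treats 1 TB as 1000 GB, and uses a helper name (doubling_time_years) that is not part of any SDSS or teraflow software.

```python
import math

# Release sizes quoted in the text, in gigabytes (1 TB taken as 1000 GB).
releases = [(2001, 80.0), (2003, 1000.0), (2004, 1700.0)]


def doubling_time_years(year_a, size_a, year_b, size_b):
    """Doubling time implied by exponential growth between two releases."""
    growth_rate = math.log(size_b / size_a) / (year_b - year_a)
    return math.log(2) / growth_rate


# Compare each release with the one that follows it.
for (y0, s0), (y1, s1) in zip(releases, releases[1:]):
    t = doubling_time_years(y0, s0, y1, s1)
    print(f"{y0} -> {y1}: doubling time of about {t:.2f} years")
```

Under these assumptions, the first interval implies a doubling time of roughly half a year and the second roughly 1.3 years. Release sizes also reflect decisions about what to publish, so these figures are only a loose indication of the underlying data growth.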
At a technical demonstration at the SC 04 meeting in 2004, this data
was distributed using the UDT high-performance data transport protocol
(part of the teraflow data services). This was the first time the
data was distributed over the network instead of by shipping disks.
With UDT and high-performance networks, the data could be
transported over 1000x faster than with TCP as commonly
deployed over today's networks.
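
To give a sense of what a 1000x difference in throughput means for a data set of this size, the following sketch computes how long the 1.7 TB DR2 release would take to move at a few sustained rates. The rates are placeholders chosen for illustration, not measurements of UDT or TCP, and the function name is hypothetical.

```python
def transfer_time_hours(size_terabytes, throughput_mbps):
    """Hours needed to move size_terabytes at a sustained rate in Mb/s."""
    size_bits = size_terabytes * 1e12 * 8  # decimal terabytes to bits
    seconds = size_bits / (throughput_mbps * 1e6)
    return seconds / 3600.0


DR2_SIZE_TB = 1.7  # size of the DR2 release quoted in the text

# Hypothetical sustained rates in Mb/s, spanning a 1000x range.
rates_mbps = {"1 Mb/s": 1.0, "100 Mb/s": 100.0, "1 Gb/s": 1000.0}

for label, rate in rates_mbps.items():
    hours = transfer_time_hours(DR2_SIZE_TB, rate)
    print(f"{label}: about {hours:.1f} hours ({hours / 24:.1f} days)")
```

At the assumed 1 Gb/s the transfer takes a few hours, while at 1 Mb/s it takes several months, which is the regime in which shipping disks is the more practical option.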
The goals of this project are 1) to use teraflow data services to
distribute SDSS data; and 2) to use teraflow data services to process
the data continuously, so that the release of data becomes a continuous
process instead of an episodic one.
Up-to-date information on the SDSS Project can be found at the SDSS Wiki.
Pantheon Gateway Testbed
Today, research in data integration and data assimilation is
hindered because researchers lack access to large collections of
heterogeneous data that can be used for developing and testing new
technologies. In this project, we are archiving highway sensor data,
overhead imagery, text-based data about special events that may affect
traffic, and weather-related data. These resources will be archived
each day and made available to the community for testing novel data
integration and assimilation strategies.
Today, this data is collected, but not archived, by the Gateway
System that covers the three-state, fifteen-county
Gary-Chicago-Milwaukee (GCM) corridor. The Gateway System uses fixed
traffic sensors in addition to other data sources to compute real-time
traffic congestion data and displays this data to the public at two
websites:
http://www.gcmtravel.com and http://www.travelinfo.org.
The Pantheon Gateway Testbed archives this data, overlays additional
data, and makes the result available to the community as a resource.
Chicago Biomedical Consortium
Bioinformatics Data Integration Testbed
The Chicago Community Trust and Searle Family Foundation have
recently awarded funds to UIC, UC and NW to create the Chicago
Biomedical Consortium (CBC). The focus of the CBC will be on
proteomics. In the first year of the project, the Chicago Biomedical
Consortium will purchase advanced mass spectrometers, such as
Time-of-Flight or Fourier-Transform Mass Spectrometers.
In this project, we are developing a data integration
infrastructure for mass spectrometer data, which will archive mass
spectrometer data and make it available as an open community resource
in a format that facilitates its integration and increases its
potential to contribute to new discoveries. In particular, we are
developing open source repositories for mass spectrometer data and
teraflow-based services for the real-time discovery of proteins and
for the integration of third-party protein, text, and pathway
databases.