The OSEDA/Missouri Census Data Center Archive is a collection of data files created over a period of more than 30 years. Most of the work has been done by programmers working at OSEDA (Office of Social and Economic Data Analysis, part of the University of Missouri Columbia campus) under contract with the Missouri Census Data Center (part of the Missouri State Library within the office of the Missouri Secretary of State). This informal document provides some insight for those who want to know what the archive contains and how they might be able to access it.
When you have this much data you have to be at least a little bit organized or you'll never be able to find anything. So we have tried to use directories and file naming conventions to make it easy (or at least easier) for us (both the programmers creating it and the users using it, which also includes those programmers) to find things.
People who want to access "census data" know (or need to know) that this almost always means accessing these summary files, or something based upon them. Summary files always have numbers (1 through 4 per decade, typically). Summary File 1 is the first of the SF products to be released, and contains detailed tables based on the data collected on the short form questionnaire in the census. So it turns out that this "sf12010x" is a collection of data that is a "standard eXtract" based on this Summary File 1 collection of detailed tables based on the short form in the 2010 decennial census. Once you figure out what that means, then you'll feel comfortable when you see a filetype of sf32000x which you will not be surprised to learn is a standard extract based on the "Summary File 3" data product from the 2000 decennial census.
There are some important exceptions to this general rule. For example, the various Public Use Microdata Sample ("PUMS") filetypes (acsums, pums2000, etc.) contain data sets that describe individual persons or housing units.
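The "sfN&lt;year&gt;[x]" naming convention is regular enough to decode mechanically. Here is a hypothetical Python helper (our own illustration, not an MCDC tool) that splits a standard summary-file filetype name into its parts; it deliberately does not cover exceptions such as the PUMS filetypes:

```python
import re

def decode_filetype(name):
    """Split an MCDC summary-file filetype name like 'sf32000x' into its parts.

    Hypothetical helper: the pattern covers only the 'sfN<year>[x]' convention
    described above, not exceptions such as the PUMS filetypes.
    """
    m = re.fullmatch(r"sf(\d)(\d{4})(x?)", name)
    if not m:
        return None  # not a standard summary-file filetype name
    sf_number, year, extract = m.groups()
    return {
        "summary_file": f"Summary File {sf_number}",
        "census_year": int(year),
        "standard_extract": extract == "x",
    }

print(decode_filetype("sf12010x"))
print(decode_filetype("sf32000x"))
```

So `sf12010x` decodes to Summary File 1, census year 2010, standard extract; `sf12010` (no trailing "x") would be the complete-table collection itself.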
State Equals 06

The summary level variable appears under the name slvl in many of our 1980 and 1990 decennial census datasets.
With the latter (i.e. the Datasets.html page) you get
Rank  filetype  dset               # Times Accessed
   1  georef    zcta_master                  1,603
   2  sf12010x  moselectedinv                  872
   3  acs2011   usstcnty5yr                    716
   4  georef    zipcodes                       631
   5  pl942010  uscounties                     415
   6  acs2011   ustracts5yr                    409
   7  acs2012   uszctas5yr                     392
   8  acs2011   uszctas5yr                     356
   9  acs2012   usmcdcprofiles3yr              348
  10  acs2012   usstcnty5yr                    344
  11  acs2012   usmcdcprofiles                 322
  12  sf12010x  usstcnty                       315
  13  acs2011   usbgs5yrtemp                   284
  14  corrlst   zip07_cbsa06                   235
  15  sf12010   uszips                         233
  16  sf32000x  ustracts                       229
  17  sf12010x  moblocks                       205
  18  acs2012   usbgs5yr                       201
  19  sf12010   uscounties                     198
  20  sf12010x  uszips871                      178
  21  sf32000   usgeos                         172
  22  sf12010   moinventory                    162
  23  acs2012   uscdslds5yr                    158
  24  corrlst   us_stzcta5_county              157
  25  corrlst   uscdslds2012                   155
The entry for the stf903 filetype on the Uexplore/Dexter home (directory) page reads as follows:
stf903/ 1990 Summary Tape File 3 Each dataset here contains over 3300 cells of pre-tabulated data based on the 1990 census long-form questionnaires. Each observation contains data for a single geographic area. We have complete "A" files for Missouri, Illinois and Kansas plus a few other states; we also have the complete "C" file (national) with summaries for the country, states, counties and larger cities. And, we have the "B" file - ZIP level summaries. This filetype has been made accessible at the table level from Dexter. As with any of the census summary file filetypes, you really need to have access to the technical documentation -- available in the stf903/Docs subdirectory of this archive -- before attempting to use these data. The stf903x and stf903x2 filetypes are derived from these files and are appropriate for quick overviews or access to frequently-used variables.
That's a bit more information than we typically provide for a filetype, but there was a time when this was one of our most frequently accessed filetypes and we wanted to provide users with some guidance. The important thing to understand here is that the datasets in this collection can be thought of as data tables, with each row a geographic area and each column containing either some kind of geographic identifier or a count of persons or households or a mean or median measure of some sort. Each of the latter columns is actually a cell of one of the Summary File tables. So we have tables within tables. Instead of variables with mnemonic names such as TotPop, Age0_4, or Hispanic you have variable names such as P6I1, P6I2, P6I3 and P6I4. These four variables correspond to the SF3 table called P6. The letter "I" in the variable names stands for "Item"; so the variable P8I4 would be the 4th cell in table P8.

So how do you know what the tables are? That's where the Docs subdirectory becomes important. This subdirectory contains the complete official technical documentation of the STF3 data product as distributed by the Census Bureau. We have created a series of files representing the chapters and appendices of the original 464-page PDF document distributed by the Bureau. These "tech docs" are long and complex, but they are consistent in structure and content across all the Bureau's summary data files. So once you figure out what a "Summary Level Sequence Chart" is about (it tells you what geographic entities are summarized on the various "files") and where to look for the table matrix outline information, you should be able to find these key sections and use them for reference. In this case you can access an index.html file that makes it quite easy to follow links to the various components, such as Chapter 6 - Summary Level Sequence Charts and Chapter 5 - Table Outlines. We have also provided a set of "ascii" (plain text) files within the Docs subdirectory.
The tbl_mtx.asc file is probably the most valuable single file in this collection of documents. It lets you see what tables are available and helps you see what the variable names are going to be corresponding to the cells of those tables. For example:
P27. SEX(2) BY MARITAL STATUS(6)
Universe: Persons 15 years and over
  Male:
    Never married                      P0270001   9  N  1,1
    Now married:
      Married, spouse present          P0270002   9  N  1,2
      Married, spouse absent:
        Separated                      P0270003   9  N  1,3
        Other                          P0270004   9  N  1,4
    Widowed                            P0270005   9  N  1,5
    Divorced                           P0270006   9  N  1,6
  Female: (Repeat MARITAL STATUS)      P0270007  54  N  2,1

is part of this file and defines Table P27. The column containing the database names (as used by the Bureau on their CD-ROM database files) can be rather easily translated into the variable names used on our datasets. If you wanted to get a count of divorced persons in an area you would want to access the 6th cell of this table, and that variable would be P0270006 (Census name) or p27i6 (our name). To get the number of divorced females you would need to access cell (variable) p27i12 (the female counts are in cells 7 to 12, so you add 6 to the corresponding male cell number). If you go back up one directory level to the stf903 data directory you will find files Varlabs.sas and Varlabs.txt that look like this:
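The translation from the Bureau's database names to our names is mechanical enough to script. A sketch (our own hypothetical helper; it assumes the letter-plus-3-digit-table-plus-4-digit-cell pattern seen in names like P0270006, and would not handle tables with letter suffixes):

```python
import re

def census_to_mcdc(name):
    """Translate a Census CD-ROM cell name like 'P0270006' into the MCDC
    dataset variable name ('p27i6'). Hypothetical helper illustrating the
    convention described above; assumes a 1-letter, 3-digit-table,
    4-digit-cell name."""
    m = re.fullmatch(r"([PH])(\d{3})(\d{4})", name.upper())
    if not m:
        raise ValueError(f"unrecognized cell name: {name}")
    prefix, table, cell = m.groups()
    return f"{prefix.lower()}{int(table)}i{int(cell)}"

# Divorced males is cell 6 of table P27; the female cells are 7-12,
# i.e. the corresponding male cell number plus 6.
print(census_to_mcdc("P0270006"))   # p27i6  (divorced males)
print(census_to_mcdc("P0270012"))   # p27i12 (divorced females)
```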
P27I1  /* MALE:NEVER MARRIED */
P27I2  /* :NOW MARRIED:MARRIED, SPOUSE PRESENT */
P27I3  /* ::MARRIED, SPOUSE ABSENT:SEPARATED */
P27I4  /* :::OTHER */
P27I5  /* :WIDOWED */
P27I6  /* :DIVORCED */
P27I7  /* FEMALE:NEVER MARRIED */
P27I8  /* :NOW MARRIED:MARRIED, SPOUSE PRESENT */
P27I9  /* ::MARRIED, SPOUSE ABSENT:SEPARATED */
P27I10 /* :::OTHER */
P27I11 /* :WIDOWED */
P27I12 /* :DIVORCED */

This is pretty terse, but you might see how it could help you locate names for cells once you had viewed the table outline matrix. (This reflects a not fully mature approach to presenting table outline metadata; we do better in our later files for the 2000 and 2010 censuses and for the ACS summary tables.)
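The leading colons in those labels appear to abbreviate repeated parent categories: n leading colons seem to mean "reuse the first n segments of the previous label." Assuming that reading (our inference from the sample above, not anything documented), a Python sketch that expands the shorthand back into full labels:

```python
def expand_labels(labels):
    """Expand colon-shorthand Varlabs labels, assuming n leading colons
    mean 'reuse the first n segments of the previous expanded label'.
    That reading is inferred from the P27 sample, not from official docs."""
    expanded = []
    prev = []
    for lab in labels:
        n = len(lab) - len(lab.lstrip(":"))       # count leading colons
        segs = prev[:n] + lab.lstrip(":").split(":")
        expanded.append(":".join(segs))
        prev = segs
    return expanded

sample = ["MALE:NEVER MARRIED",
          ":NOW MARRIED:MARRIED, SPOUSE PRESENT",
          "::MARRIED, SPOUSE ABSENT:SEPARATED",
          ":::OTHER",
          ":WIDOWED",
          ":DIVORCED"]
for var, lab in zip(["P27I1", "P27I2", "P27I3", "P27I4", "P27I5", "P27I6"],
                    expand_labels(sample)):
    print(var, lab)
```

Under that assumption, P27I4 expands to MALE:NOW MARRIED:MARRIED, SPOUSE ABSENT:OTHER, which matches the table outline.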
So let's say you actually needed to access these data. You want to look at the distribution of divorced females by county for the state of Missouri, and you know about Table P27 (say you went to the Docs directory and did some searching to find it on your own). So now you need to find a dataset that has county-level summaries for Missouri. There are two possibilities. The moi dataset shows "Inventory" as the value of Units in the Datasets.html metadata, and the standard list of inventory summary levels includes counties. You could also use the uscntys dataset, which has only county-level summaries but has them for the entire country. (You would use that dataset to do this same extract for any other state.) So I go ahead and click on the moi dataset to invoke Dexter and access it. I can then access the link to "detailed metadata" for the set and then follow the link to get the "key values" for the slvl variable (remember, we said that we used this alternate name for the SumLev variable on earlier datasets). I see from the key values report that 050 is the code for county, so I can now go back and code my filter in Sec. II of the form:
County Equals 050

Now comes the tricky part, Section III, where I get to choose my variables. The Identifiers choice is pretty simple: just go with FIPCO and AREANAME. And on the right I get to choose my numeric variables. But - surprise! - the select list on the right does not have the usual "Numerics" label at the top; instead it says "Tables", and the entries are not variables, they are table descriptions. This means that the system has flagged this as a table-based summary file and has made access available at the table rather than the variable level. That makes it a little easier, since all I have to do is click on the P27 entry. Of course all I really need is the last cell in this table (the count of divorced females), but I can always take the whole table and throw away what I don't need once I get it into Excel.
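In spirit, the Section II filter plus the Section III variable selection amount to something like the following Python sketch. Dexter itself is a web form, not a script, and the rows below are made-up illustrations, not real census values:

```python
# Made-up county rows imitating an stf903 dataset: a summary-level code,
# identifiers, and the P27 cell we care about (divorced females).
rows = [
    {"slvl": "040", "fipco": "29000", "areaname": "Missouri",       "p27i12": 999},
    {"slvl": "050", "fipco": "29019", "areaname": "Boone County",   "p27i12": 123},
    {"slvl": "050", "fipco": "29510", "areaname": "St. Louis city", "p27i12": 456},
]

keep = ["fipco", "areaname", "p27i12"]            # identifiers + divorced females
extract = [{k: r[k] for k in keep}
           for r in rows if r["slvl"] == "050"]   # county-level summaries only

for r in extract:
    print(r)
```

The filter (Section II) drops every row that is not a county summary; the column selection (Section III) keeps only the identifiers and the table cells requested.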
So now what about the companion filetype, stf903x? In the interest of keeping it brief (too late?) we'll just say that the data in these datasets are not tables - they are variables derived from those tables. These are the data we use for our Profile applications, the data that can be used to answer 80% of the questions asked with only a fraction of the number of data items to slog through. Most users, for most applications, will be using an extract collection rather than a complete table collection. Look back at the list of most-frequently-accessed datasets, above, and you'll see only 3 table summary data sets (all in the sf12010 filetype collection): uscounties, uszips and moinventory.
You'll note that the metadata provided for the extract collections is a lot shorter and is not written by anyone at the Census Bureau. These extracts are certainly based on census data files and you really need to understand about the "parent" complete-table STF collection in order to understand the extract, but we keep the two levels of documentation separate.
The margin of error measures in the ACS files require that we carry these measures (columns, variables) as well as the usual table entries. Our naming convention for these is similar to our table-cell naming convention; we just use the letter "m" instead of "i". So on the 5-year ACS base tables datasets we have the variables b01001i2 and b01001m2, where we store the second cell of table B01001 ("Males") and the corresponding margin-of-error value.
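That estimate-vs-MOE naming rule is easy to express in code. A hypothetical helper (our illustration of the convention, not an MCDC utility):

```python
def acs_variable(table, cell, moe=False):
    """Build an MCDC ACS base-table variable name: the table id, lowercased,
    plus 'i' (estimate) or 'm' (margin of error) and the cell number.
    Hypothetical helper reflecting the naming convention described above."""
    return f"{table.lower()}{'m' if moe else 'i'}{cell}"

print(acs_variable("B01001", 2))            # b01001i2 -- estimate (Males)
print(acs_variable("B01001", 2, moe=True))  # b01001m2 -- margin of error
```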
Because of the large number of tables (and table cells) on the ACS base tables, we partitioned them based on topics. This gives us dataset names such as usstcnty17_20, which contains, for every state and county in the U.S., all the tables associated with topics 17, 18, 19 and 20. The first 2 digits of a table name constitute the topic code, and we have numerous TableTopicCodes.txt files in our basetbls/btabs5yr directories. If it is your first time accessing our base tables, it is strongly suggested that you read the Readme.html file in the acs2012/btabs5yr directory. (We may or may not create a new version for each new ACS vintage.)
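Since the topic code is just the first two digits after the leading letter(s) of the table name, pulling it out is a one-liner. A sketch (our own illustration of the partitioning rule; the example table names are illustrative):

```python
import re

def table_topic(table):
    """Return the 2-digit topic code embedded in an ACS table name:
    the first two digits after the leading letter(s), per the
    partitioning convention described above. Returns None if the
    name does not fit that pattern."""
    m = re.match(r"[A-Z]+(\d{2})", table.upper())
    return m.group(1) if m else None

# Tables in topics 17-20 would land in a dataset such as usstcnty17_20.
print(table_topic("B17001"))  # 17
print(table_topic("C18108"))  # 18
```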
See the various training modules related to Uexplore/Dexter at mcdc.missouri.edu/tutorials/uexploreDexter (which is linked to from the Uexplore/Dexter home page). Note that there is a PowerPoint module specifically on the topic of the MCDC Data Archive. Most of the other modules focus on how to use the software to access the data.