| Data Submission Requirements

Introduction

The primary purpose of this document is to describe the format that the UPCI Biostatistics Facility requires for data sets that are submitted to us for analysis. Our intent is twofold:

1) To minimize delays. We do not have the personnel to provide data management, and will request that you re-format data submitted in an intractable format.
2) To encourage good data management practice. We suggest that you consult with us on the data format before you begin data collection; we will be happy to assist in design of an appropriate data structure.

We urge that data be submitted on Excel worksheets, since 1) use of a spreadsheet program facilitates data entry, editing and display, 2) our statistical packages can read Excel data directly if our format specifications are followed, and 3) you probably have access to Excel. If you cannot supply data on Excel worksheets, or if your data format does not conform to the requirements below, see the Additional Information section at the end of this document. If your data are for a clinical protocol, please see the Additional Information section for discussion of additional data that should be provided.

In addition to the data, we require a Data Dictionary that defines the variables.

Please keep in mind that, although we do consistency checks on data, we expect data submitted to us for analysis to be accurate and ready for analysis.

Data Format

Columns: Each column must contain the values of a single variable; one column must contain the values of a subject identifier. The dataset may be augmented with any number of columns containing descriptive text, but only if the text is not required for analysis.

Row #1: The first row (row #1) must contain variable names.

Rows: Each row except row #1 must contain values of variables for a single subject, and an entry must be made for each variable listed in row #1. A variable may be repeatedly measured for each subject.
If the number of measurements varies from subject to subject, it is best to use example 2 as a template; if not, you can use either example 1 or 2.

Cells: A cell must contain the value of a single variable or a variable name; cells should not be left empty. Formats for specific kinds of data follow:

Variable names: The first row of cells must contain variable names. We prefer a single word or string of characters without spaces, but can accommodate almost anything. There can be no duplicate names.

[...] the variables and the units in which they are measured. We suggest that, if you use Excel, the data dictionary be included with the data on a separate Excel worksheet in the same workbook. Any additional information about the variables that you believe might be useful to us can be included in the data dictionary. Examples of data dictionaries are provided below; however, we do not require that you use the specific format.
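The layout rules above can be checked mechanically before a file is submitted. The following is a minimal sketch in Python, assuming the worksheet has been exported to a comma-separated text file. The function name `check_layout` and the identifier column name `ID` are hypothetical choices made for illustration; they are not part of the requirements described here.

```python
import csv
import io


def check_layout(csv_text, id_column="ID"):
    """Return a list of layout problems found in a dataset given as CSV text.

    Covers only the rules described above: variable names in row 1,
    no duplicate names, a subject identifier column, no empty cells.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    names = [name.strip() for name in rows[0]]
    problems = []

    # Row 1 must hold a name for every column, with no duplicates.
    for position, name in enumerate(names, start=1):
        if not name:
            problems.append(f"column {position} has no variable name")
    duplicates = sorted({n for n in names if n and names.count(n) > 1})
    for name in duplicates:
        problems.append(f"duplicate variable name: {name}")

    # One column must identify the subject (the name "ID" is an assumption).
    if id_column not in names:
        problems.append(f"no subject identifier column named {id_column}")

    # Every later row needs one non-empty entry per variable in row 1.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(names):
            problems.append(
                f"row {row_number} has {len(row)} cells, expected {len(names)}"
            )
            continue
        for name, cell in zip(names, row):
            if not cell.strip():
                problems.append(f"row {row_number}: empty cell for {name}")
    return problems
```

Calling `check_layout` on the text of a file returns an empty list when none of these problems is found. It does not check whether the values themselves are accurate.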
Examples

Additional Information

Alternative Formats: Contact us if your data are not in Excel worksheets. We can analyze data from most spreadsheet and database programs, and from most statistical packages; however, data should be formatted according to the rules described above. We also accept properly formatted text files. Data embedded in Word or WordPerfect documents would generally need to be re-formatted. For large datasets with repeated measurements, it is often more efficient to record baseline data in one file, and the repeatedly measured data in another file. However, both files must have identical patient identifiers. See example 4.

Use of Text for Values of Categorical Variables: We discourage use of text for the values of a categorical variable; our experience is that text is much more subject to various data entry errors than numeric or character codes. However, if your dataset already contains text entries, we can analyze it under the following conditions: a) there is no evidence of data entry errors, and b) each value of a categorical variable is identified by a single unique text string.

Alternative Date Formats: We can handle most date formats, but a single format should be used for an entire dataset. We prefer mm/dd/yyyy since it is less prone to data entry errors than most other formats. The year must be indicated with 4 digits.

Numeric Formats: Scientific notation (e.g., 1.3E+10, 5.6E-5, 2.3E+1) is an acceptable format, but may be more prone to data entry errors than decimal or integer formats. If you use Excel, you may mix the three numeric formats in a dataset, even for a single variable.

Missing Data Codes: Missing data codes should generally be used to identify the reason why data are missing, unless that information is of no interest or would be redundant. For example, a woman's age at first pregnancy could be missing either because the age is unknown, or because the woman was never pregnant. (In this case, one should use two numbers that could not possibly be a woman's age at first pregnancy to code the missing values, e.g., 98 = age unknown and 99 = never pregnant.) If the number of children were also provided for all women, a missing data code to distinguish between these two cases would be redundant. However, if the age is unknown, it may still be useful to distinguish between the following two situations: 1) the age is expected to be determined in future follow-up, and 2) the age is not expected to be determined.

Additional Data that You Should Provide for a Clinical Protocol:

1. There should be at least one data file that includes every experimental unit (subject, patient, etc.) entered into the study, regardless of whether complete data were obtained for that experimental unit. (For example, even patients who drop out of the study just after randomization should be included in at least one of the data files.)
2. When the protocol calls for a measurement to be taken in a certain week or month of the study (say, at baseline or 1 month after surgery), include this information (e.g., coded as: 0 = baseline, 1 = 1 month) as well as the actual calendar date on which the visit occurred.
3. When information on patient survival, duration of response, or time to disease progression is to be analyzed, the date of last contact (last follow-up) must be provided for all patients. (This should be in a separate column.)

Example 4: Same data as in example 2, but separated into 2 files. |
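When baseline and repeated measurements are kept in two files, as in example 4, it may help to confirm before submission that both files use the same patient identifiers, that dates follow the mm/dd/yyyy convention, and that missing data codes are kept distinct from real values. The sketch below shows one possible way to do this in Python. The function names are hypothetical, and the codes 98 and 99 are taken from the age-at-first-pregnancy illustration above; they are assumptions for this example, not prescribed codes.

```python
import re
from datetime import datetime

# Hypothetical missing data codes, following the illustration in the text.
MISSING_AGE_CODES = {98: "age unknown", 99: "never pregnant"}

# One or two digits for month and day, exactly four digits for the year.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def unmatched_ids(baseline_ids, repeated_ids):
    """Return identifiers that appear in only one of the two files."""
    baseline, repeated = set(baseline_ids), set(repeated_ids)
    return sorted(baseline ^ repeated)


def is_valid_date(text):
    """True if text is a real calendar date written as mm/dd/yyyy."""
    if not DATE_PATTERN.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True


def describe_age(value):
    """Split a recorded age into (age, reason missing); one of the two is None."""
    if value in MISSING_AGE_CODES:
        return None, MISSING_AGE_CODES[value]
    return value, None
```

An identifier reported by `unmatched_ids` is not necessarily an error (a patient may have baseline data only), but it is worth checking against the study records before the files are sent.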