The article by Kelly McGuire summarizes how to make data useful and how to find the right kind of data. She says that the first step in evaluating data is to determine what value you can gain from it. The three questions you should ask yourself are: What is the data? How is it collected? How often is it updated? The last step is to determine whether you will need additional resources to manage the data. When managing the data, one must keep Occam’s Razor in mind, which cautions that more data will not necessarily produce a better answer. It states that simplicity is sometimes better than complex ideas or data. McGuire states that the things to watch out for when creating a data set are overfitting and multicollinearity. Overfitting happens when there are too many parameters relative to the number of observations. Multicollinearity happens when two or more predictor variables are highly correlated and do not provide independent information.
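To make the two warnings more concrete, here is a minimal sketch of how one might screen a data set for them. It is my own illustration, not a method from McGuire's article. The predictor names and values are made up, the 0.9 correlation cutoff is an arbitrary assumption, and the helper names (`pearson`, `correlated_pairs`, `observations_per_parameter`) are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(predictors, threshold=0.9):
    """Return predictor pairs whose |correlation| meets the threshold.

    The threshold is an assumed cutoff for illustration only.
    """
    names = list(predictors)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(predictors[first], predictors[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


def observations_per_parameter(n_observations, n_parameters):
    """Rough overfitting check: a low ratio means few rows per parameter."""
    return n_observations / n_parameters


# Made-up predictors: height is recorded twice in different units,
# so those two columns carry no independent information.
heights_cm = [150, 160, 170, 180, 190]
predictors = {
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],
    "age": [30, 22, 41, 35, 27],
}

flagged = correlated_pairs(predictors)
ratio = observations_per_parameter(len(heights_cm), len(predictors))
print(flagged)
print(ratio)
```

In this toy example the two height columns are flagged as redundant, and the ratio of five observations to three predictors is low enough to suggest a risk of overfitting.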
The data set that I looked at was My Brother’s Keeper Key Statistical Indicators on Boys and Men of Color. I chose to research this dataset because I want to shed light on the stereotypes that are commonly associated with African Americans. I chose men of color specifically because I feel that they are more misunderstood and stereotyped than women of color. The dataset relates boys and girls of different ethnicities to the year they were born and to the birth rate for women ages 18-19 and 20-24. It shows the distribution of boys and girls of a specific race in a given year.
- Source: Data.gov
- Who made/published it: The data set was published by the National Center for Education Statistics.
- How was it collected? The information was collected by the My Brother’s Keeper task force. They collaborated with the Interagency Forum on Child and Family Statistics and federal statistics agencies to form new statistics from key indicators.
- How old is it? > Why is it old? The data set was created on May 29, 2014, and updated on March 31, 2016.
- What’s the format? (file format, columns, and rows) The file format is an Excel spreadsheet. The columns show the ethnicity of the people being studied. The rows show the year, the rate of birth to women ages 18-19, the rate of birth to women ages 20-24, and the distribution of male and female children born to women ages 18-19 or 20-24.
- What’s the type? (time series or something else like geographical?) I think that the type is a time series because it shows how the rates and distributions change over the years.
- How would you need to transform the data to make it useful? I could use a graph to show how many African American boys are born to women ages 18-19 and 20-24, and then look at how the mother’s age affects the environment they grow up in and the factors that play into their upbringing. That would lead into how stereotypes come into play.
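
One way the transformation could look is sketched below. I am assuming the spreadsheet has been exported to CSV and that it follows the layout described above, with one column per ethnicity and one row per year and measure. The column names (`year`, `measure`, `Group A`, `Group B`) are hypothetical, and the numbers are placeholders, not real values from the dataset.

```python
import csv
import io

# Placeholder data in the assumed wide layout (NOT real figures).
SAMPLE = """year,measure,Group A,Group B
2000,birth_rate_18_19,1.0,2.0
2000,birth_rate_20_24,3.0,4.0
2001,birth_rate_18_19,1.5,2.5
2001,birth_rate_20_24,3.5,4.5
"""


def wide_to_long(csv_text, id_columns=("year", "measure")):
    """Turn one-column-per-group rows into one record per group and value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        for column, value in record.items():
            if column in id_columns:
                continue  # identifier columns are copied, not unpivoted
            rows.append({
                "year": int(record["year"]),
                "measure": record["measure"],
                "group": column,
                "value": float(value),
            })
    return rows


def series_for(rows, group, measure):
    """Extract a (year, value) time series for one group and one measure."""
    return sorted(
        (row["year"], row["value"])
        for row in rows
        if row["group"] == group and row["measure"] == measure
    )


long_rows = wide_to_long(SAMPLE)
series = series_for(long_rows, "Group A", "birth_rate_18_19")
print(series)
```

With the data in this long format, each group and measure becomes a simple list of (year, value) pairs that could be passed to a plotting tool.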