Data Exploration & Storytelling MOOC
Week 1: Finding and understanding data
What I learned:
- There are different classifications of data, including metadata, aggregate data, and microdata
- Module 1 was a good introduction to what goes into, and what is necessary for, finding and preparing data, but I don’t feel adequately equipped to do it myself yet.
What surprised me:
- How little I knew about the process of cleaning data
- What makes data usable, and ways to find already available data
- How daunting it is to clean and prepare data
What I wrote in the discussion section:
- Would you have reported using the Ashley Madison data?
- “The case of the leaked Ashley Madison data brings me back to some questions pertaining to media ethics and privacy. The concepts of want to know vs. need to know come to mind. The Ashley Madison data does not seem to qualify as "need to know" information in my opinion as it does not directly threaten safety or public well-being. It is personal information that affects individual people and families. Though I may ethically disagree with the nature of the site and the behavior that ensues because of it, I do not think that is reason enough to invade members' privacy by leaking their information. Some exceptions to this line of thought would be if there are people on the site who are elected into positions by the public or held to some other public standard of accountability. Then it would be relevant information for the public to know on a bigger scale affecting the greater good, rather than personal, individual drama that does not warrant publishing. Do you think the Ashley Madison data qualifies as "need to know" for the public? Why or why not?”
Week 2: Character development for your data story
What I learned:
- How to get started with Excel and Tableau Public
- What Tableau is
- What the first steps to cleaning and auditing data are
- How to categorize data
- The ethics around cleaning data
- The four principles behind data visualization
- 1. Make sure you have good data (reliable, etc.)
- 2. Data visualizations need to attract attention
- 3. They need to be clear and understandable
- 4. Show the right amount of data: data visualization is about clarification, not simplification, and you need to contextualize numbers/data or they will be misleading
- How to input data into Excel or Tableau, and how to make sure variables match up
- How to clean data
- Reformat/reshape data
- Familiarize yourself with every row
- Know what is numeric/non-numeric
- Change your numbers to have two decimal places
- Look at non-numeric variables
- Look at your dates
- Figure out how missing data is labeled
- Make sure your variables are in consistent units
- Check for outliers
- Make sure your process is clear and replicable
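As a sketch of how the cleaning checklist above might look in practice, here is a minimal pandas pass. The column names (`score`, `date`, `state`), the `-999` missing-data marker, and the sample rows are all hypothetical stand-ins for whatever a real data set uses:

```python
import numpy as np
import pandas as pd

def clean(df):
    """A minimal cleaning pass following the week-2 checklist."""
    df = df.copy()
    # Figure out how missing data is labeled, and convert it to NaN
    df = df.replace(-999, np.nan)
    # Look at your dates: parse them into a real datetime type
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    # Know what is numeric/non-numeric, and round numbers to two decimal places
    df["score"] = pd.to_numeric(df["score"], errors="coerce").round(2)
    # Look at non-numeric variables: make labels consistent
    df["state"] = df["state"].str.strip().str.upper()
    # Check for outliers: flag scores more than 3 standard deviations out
    mean, std = df["score"].mean(), df["score"].std()
    df["outlier"] = (df["score"] - mean).abs() > 3 * std
    return df

raw = pd.DataFrame({
    "score": ["88.333", "91.2", -999],  # mixed types, made-up values
    "date": ["2015-09-01", "2015-09-02", "2015-09-03"],
    "state": [" wi", "WI ", "wi"],
})
cleaned = clean(raw)
```

Printing `cleaned` (or `cleaned.dtypes`) afterward is a quick way to confirm every column ended up with the type you expect, which also helps keep the process clear and replicable.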
What surprised me:
- How much a simple slip or mistake can completely skew a data set and ruin the credibility and truth of a story
- Just as you have to be careful about the ethics of whose story you’re telling and which perspectives and voices you include, you have to be equally conscious, when working with data, of what your raw data actually represents or stands for in analysis
What I wrote in the discussion section:
- You’re doing a story on school performance. The data shows that one of the schools has standardized scores that have increased much faster than all other schools in that state. How do you proceed?
- John Osborne: Put on my reporter’s hat (a classmate's response to the thread)
- This sounds like a classic instance of an outlier that cloaks a potentially interesting story. Examining the data set using the tools suggested in the module and determining the data was valid and consistent with data collected for other schools, I would begin looking for an explanation. I might start with a data expert, perhaps someone in the state school department who could tell me what might be going on. I would want to be sure the trend didn't mask widespread cheating within the district, something a statistician might be able to investigate.
- But perhaps the demographics of the school district are changing, with more affluent families moving into town, a trend known to raise school performance. Such a suspicion would call for looking at census data for the district and similar towns to spot changes in race, income, and other indicators (though the American Community Survey has margins of error at the school district level, and the 10-year gap between full censuses might not be fine-grained enough). Should this story be about more than student performance, I would want to step outside the newsroom and talk to administrators and teachers in the district, as well as parents. In the end, I suspect, there would be no way to escape "shoe-leather" reporting, no matter how compelling the data.
- An opportunity to explore positive deviation (my response):
- "I agree in the sense that my first instinct after reading this question was that the outlier school’s performance has the potential to be a story. Similar to John’s response, I think that it would be an important and necessary first step to contextualize the high test scores. I would verify to make sure that the data is legitimate, accurate and credible. Then, if verified, I would look into the potential reasons for why the test scores are higher at that school. What are the influencing factors? Is the school taking a different approach to standardized test prep or to the way they are educating their students or running their school? This is an example of a positive deviance and could be the start of a solutions-focused story, highlighting the practices and approach of this school in a way to show how it could potentially benefit other schools and students hoping to achieve higher standardized test scores."
Week 3: Basic plot elements of your story
What I learned:
- This week, we learned about different types of data and how to focus on certain variables, looking at one variable at a time, to tell a story more effectively.
- We began to learn how to analyze the data in different ways including noticing trends and seeing how the data is distributed.
- We also continued to explore the ethics involved in data journalism.
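The one-variable-at-a-time analysis described above usually starts with summary statistics and a quick check of how the variable is distributed. A plain-Python sketch, using made-up test scores:

```python
import statistics

# Hypothetical standardized test scores: one variable, examined on its own
scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 99]

# Summarize the distribution before reaching for a chart
summary = {
    "n": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),
    "min": min(scores),
    "max": max(scores),
}

# A quick skew check: a mean noticeably above the median hints that
# a few high values (like the 99 here) are pulling the average up
skewed_high = summary["mean"] > summary["median"]
```

Tableau or Excel will compute the same summaries, but seeing them laid out makes it easier to notice which single numbers (mean vs. median, min vs. max) actually tell the story.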
What surprised me:
- Data visualization was created for analysis, not communication
- Different patterns appear based on looking at each of the different variables in the same data set
- There are so many features of Tableau and how the program can arrange your data for you
What I wrote in the discussion section:
- As a data storyteller, is it your job to use the data in a way that will be understood by the average audience? Or is it your job to use the data in a way that is technically correct? How do you balance these two?
- “I think it is not just the goal of a good journalist to balance these two qualities but rather it is the duty of a good journalist to execute them both well all the time. Your data should never be technically incorrect, period. Similarly, your audience’s ability to comprehend the data that you are presenting them with should always be a priority informing and directing the way that you present said data. If your audience cannot understand your data, or if your data is factually incorrect in any way, you have not done your job well as a journalist.”
Week 4: Advancing the plot of your story
What I learned:
- Whereas last week we looked at how to analyze one variable at a time to find patterns, this week we talked about using two variables at a time to pull trends from the data
- We addressed the importance of a change being significant enough to report
- A p-value is a pretty good estimate of whether the trends and results you’re seeing are happening in the real world or are caused by chance.
- The smaller the p-value, the more likely it is that the results you’re seeing reflect something real rather than chance variation in your data.
- Classic correlation does not equal causation
- Visualizations naturally suggest causation, even if that is not necessarily the case
- Looking at the data over time is one of the best ways to talk about causality
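One way to build intuition for the p-value idea above, without statistical software, is a permutation test: shuffle one variable many times and count how often chance alone produces a correlation as strong as the one observed. The prep-hours/score data below is invented purely for illustration:

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def permutation_p(x, y, trials=2000, seed=0):
    """Fraction of shuffles whose |r| meets or beats the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical data: hours of test prep vs. test score
prep_hours = [1, 2, 3, 4, 5, 6, 7, 8]
test_score = [55, 60, 58, 65, 70, 72, 75, 80]

r = pearson_r(prep_hours, test_score)
p = permutation_p(prep_hours, test_score)
```

With strongly related variables like these, very few shuffles beat the observed correlation, so the estimated p-value comes out near zero; with weakly related variables it climbs toward 1. And as the notes stress, even a tiny p-value only says the relationship is unlikely to be chance, not that one variable causes the other.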
What surprised me:
- Just because something is statistically significant does not mean that it is practically significant
- It is best to avoid using the word “significant” unless you are working directly with a statistician. Many other words convey a more accurate meaning: whether a difference is substantial, meaningful, or consequential may be more important than whether it is statistically significant.
- If you have a randomized controlled trial or a strict scientific experiment, there are ways to talk about causality based on what the data can tell you. But if you have data from anything other than a randomized controlled trial, which is what you’re working with ninety-nine percent of the time, you can’t tell from the data alone whether causality is real.
- Exploring causality requires reporting, not just strong data
What I wrote in the discussion section:
- So we know that correlation is not causation. In a story that included data, is it the writer’s job to make sure the audience doesn’t make causal assumptions just because the story connects two variables – or is it enough for the journalist to state that the data is related but not necessarily causal and hope the audience doesn’t make the incorrect leap? How does the context of the data story influence this?
- “As has been presented consistently throughout this MOOC in several modules and lessons within each of the modules, it is a journalist’s job to contextualize their data to the best of their ability in a way that will best serve their readers’ understanding and the good of the public overall. Allowing causation to be falsely assumed due to a lack of thorough reporting is not editorial. It is lazy and unethical. By not adequately contextualizing the nature of the relationships between your variables, you could severely mislead your reader, which would make your data journalism extremely ineffective. As Heather discussed in the module, reporting offers the opportunity to supplement the data by talking to the people who the numbers represent. This is just one way for the journalist to explore causality. How correlation vs. causality is being conveyed to the readers is something that should stay at the forefront of the mind of the journalist as they create their data journalism.”
Week 5: The plot thickens in your data story
What I learned:
- Moderators, mediators, and confounders: how data can demonstrate different kinds of relationships between variables
- Confounding
- Relationship being caused by a third factor
- Example: sunscreen use and cancer appear related because people with higher cancer risk factors (like heavy sun exposure) also use more sunscreen
- Mediating
- Sunscreen has chemical, chemical causes cancer, sunscreens without chemical don’t cause cancer
- Sunscreen itself doesn’t cause the cancer, but the chemical in it
- Moderating
- Sunscreen acts differently for different kinds of people
- Changes the relationship status depending on the person
- Questions to ask:
- What other things can be causing this relationship? (confounding)
- What’s the behind-the-scenes mechanism that might be causing the relationship? (mediating)
- Is the relationship different in different situations? (moderating)
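The confounding pattern in the notes, where sunscreen and cancer are both driven by an underlying risk factor, can be simulated to show the mechanism. All numbers here are invented to illustrate the statistical pattern, not real epidemiology:

```python
import random

rng = random.Random(42)

# A confounder: sun exposure drives BOTH sunscreen use and cancer risk
n = 5000
sun_exposure = [rng.random() for _ in range(n)]  # 0 = none, 1 = heavy
sunscreen = [s + 0.2 * rng.random() for s in sun_exposure]    # more sun -> more sunscreen
cancer_risk = [s + 0.2 * rng.random() for s in sun_exposure]  # more sun -> more risk

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Sunscreen and cancer risk correlate strongly, yet neither causes the
# other: the shared cause (sun exposure) produces the relationship
r_spurious = pearson_r(sunscreen, cancer_risk)

# Holding the confounder roughly constant (a narrow sun-exposure band)
# makes the correlation drop sharply
band = [(sc, cr) for sc, cr, s in zip(sunscreen, cancer_risk, sun_exposure)
        if 0.45 < s < 0.55]
r_controlled = pearson_r([b[0] for b in band], [b[1] for b in band])
```

Controlling for a suspected confounder, even crudely by slicing the data to a narrow band of the third variable, is one practical check: if the correlation largely vanishes, that's a strong hint the third factor, not the headline variable, drives the relationship.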
What I wrote in the discussion section:
- Many times you cannot tell from looking at the data which variables are moderators, mediators, or confounders. What practical steps can you take to help figure this out? When does a relationship between two variables indicate that one causes the other?
- Response to post “Research, research, research” by Sheiva Rezvani (classmate's post)
- “As an interdisciplinary practitioner myself, I agree with what's been said on this forum about using tools available to you as a generalist in the field. One of the many strengths that generalists have is that while our knowledge may only be an inch deep it is also a mile wide. As a result we may see possible relationships outside the scope of an expert who has built-in biases in their field. That said, I also agree that it's imperative to consult with experts in the corresponding subject matter of your piece to fully understand the material and the details of the data at hand (i.e. nature of the collection, the population(s), the political and social environments, etc). Once the overall landscape of the data is established, it's up to us to determine if there may be other areas of that "mile" of knowledge that may require deeper digging and expert consultation.”
- “I agree completely that research should be the first step. I think that just as journalists think through the various angles and lines of questioning we can take with our sources, we need to seriously consider the lines of questioning that we should take with our data sets. Heather offered some good starting points in the video as far as questions we can ask ourselves, such as: What are some other factors that could be causing this relationship? Does the relationship vary among different situations? These types of questions are the starting point for exploring the kinds of relational factors discussed in this module, and if reported on thoroughly they can lead the journalist to a better idea of the nature of the data. Before publishing a data story, I believe it is essential to check it from different angles for these types of other relationships that could greatly affect the direction of your story.”
Week 6: Putting the data story together
What I learned:
- It is important to include discussion of uncertainty in your data stories
- How to begin to narrow in and focus your data-driven story
- How to incorporate data into a narrative story
- "Treat your data like a character in your story."
- Make it approachable and interesting.
- What is the back story?
- What are the strengths and weaknesses?
- You should always have more questions to follow up on so you could continue reporting if you wanted to.
What surprised me:
- How similar working with data is to working with other elements that make up a good story.
What I wrote in the discussion section:
- What are you going to do if you work on a story for a long time and the data results turn out to be either the opposite of what you thought or show no real trends at all?
- “This question reminded me of two different things. One, it reminded me of just a few weeks ago when Greg Borowski of the Milwaukee Journal Sentinel visited our class. One of his first pieces of advice for undertaking a data-driven investigative project was to identify a minimum and maximum story early on. By this, he meant that before a journalist invests a great deal of time, energy and resources into a story, they should have an idea of at least a minimum story that they will be able to tell that will be powerful and effective regardless of the strength of the data that surfaces. The maximum story would be enhanced by strong data that shows strong patterns, etc. I experienced this kind of situation in real time last year when I was working on a special report about human trafficking for the Milwaukee Neighborhood News Service and anxiously awaiting records that I had requested from the Milwaukee Police Department. Throughout that waiting period, I worked with my editor to develop plans and methods for how to proceed and continue working with the story depending on what the data revealed. If and when I face this kind of situation in the future, I want to be preemptive in considering different possibilities, and after receiving data that is different from what I expected, I hope that I would still be able to find a powerful story somewhere within the results that I did gather.”
I am looking forward to learning how to implement and apply these lessons and takeaways to our project for the remainder of the semester!