Microsoft DAT208x: Introduction to Python for Data Science, a review

In my quest to complete the Microsoft Professional Program for Data Science, I took their course Introduction to Python for Data Science earlier this month, with disappointing results.

It could be that I had very different expectations, or that I already have too much background in Python for another introductory course, but I certainly wasn’t impressed and I’m loath to pay for the verified certificate.

In a nutshell: This felt more like an overview than a proper introduction. If this were a university setting, it would have been the first day, when the instructor hands out the syllabus and walks through the course expectations. A smart-aleck instructor might even force an awkward icebreaker.

Would I discourage you from taking the course? Yes, actually.

(To follow my progress on the program, check out the Microsoft Professional Program tag)

 

The Structure

DAT208x claims to “cover Python basics and prepare you to undertake data analysis using Python”. Similar to the Microsoft courses that come before it, it is a self-paced course composed of modules that consist of video lectures and lab exercises.

The modules are as follows:

  1. Python Basics
  2. Lists
  3. Functions and Packages
  4. Numpy
  5. Plotting with Matplotlib
  6. Control Flow and Pandas

This course is brought to you by a partnership between Microsoft and DataCamp, an online data science school similar to DataQuest. In an old post I mentioned my apprehension about DataCamp, as I’ve heard they favor R over Python, but I decided to give them the benefit of the doubt and try their Python course.

It’s due to this partnership that most of the lab activities are outside of edX; i.e., we’re redirected to DataCamp’s interface for the lab exercises.

These exercises are the meat of the course. If you’ve tried DataQuest before then the DataCamp interface should be familiar:

Instructions to the left, interactive Python shell to the right.

Unlike other Microsoft courses I’ve tried, this one has a final exam. You are given 4 hours to answer 50 questions: a mixture of knowledge checks, pseudo coding, and actual coding.

Considering the knowledge checks, exercises, and final exam, you need to score at least 70% to pass the course. An easy feat considering 40% is just course surveys: assuming full credit for those, you only need 30 of the remaining 60 points.

 


Storytelling with Data: a book review and my takeaways

As a child, I loved telling stories. I’d take my favorite book and TV characters and create a world where they would oh-so-conveniently meet. Say, a magical anime girl wanders Narnia until she encounters the now-villainous Power Rangers.

As an adult in the corporate world, I still want to tell stories. But now I find that people are more critical of which stories I tell them.

It must be in the form of numbers, they said.

It’s a data-driven world, they said.

In Cole Nussbaumer Knaflic’s book Storytelling with Data, she argues we can do just that: tell stories with numbers.

language + math = data storytelling

She takes traditional storytelling concepts then re-interprets them for “adult-appropriate” tables and charts. She teaches us to edit our charts, the same way authors do their stories, by borrowing principles of visual design.

My key takeaways from the book can be found in my notes below, but they can be summarized as follows:

  1. Context is king. The form your data will take depends on your audience and what you want them to do with the data.
  2. Choose the right graph to best express the key message (I’ve made a flowchart in my notes to help with that).
  3. Following on #1, design around this message.
  4. Present your data as you would a story, with a beginning, middle, and end.
“Storytelling with Data” notes, by dannaisadork

P.S. Sorry about the terrible handwriting. My normal penmanship’s already pretty bad, but writing on a tablet made it worse!

 


A data journalism peg: NY Times on Uber’s psychological mind games.

The New York Times is right up there with the Guardian’s Datablog in my data journalism aspirations.

One of my favorite posts of theirs is Snow Fall, their coverage of the 2012 Tunnel Creek avalanche. It’s a wonderful mixture of storytelling, visualizations, and traditional journalistic interviews.

Go check it out first, I promise you won’t regret it. Just don’t forget to come back.

Unlike the Datablog, however, the Times doesn’t collate their data viz content into a single page (IKR? Not even a tag!), so I often miss out on great content unless it goes viral.

(Before you suggest I subscribe to the Times, did you know they publish about 230 pieces of content daily? I’m not willing to sift through that!)

So I’m glad I didn’t miss out on this latest one: their coverage on How Uber Uses Psychological Tricks to Push Its Drivers’ Buttons.

This is a serious journalism piece. Not a game. I think.

What’s to like:

  • Interactive simulations!
  • The feature viz is a throwback to the 8-bit games of the 80s, which is kind of meta given the post talks about how Uber experimented with video game techniques to maximize profit.
  • Charts. Charts. Charts. And interactive ones at that.
  • A union of social science with data science. How exciting! I like how they incorporated psychological vocabulary into the piece (e.g. loss aversion, ludic loop, binge-watching).
  • “Uber exists in a kind of legal and ethical purgatory.” Please excuse me while I writer-geek out over this analogy.

It’s a pretty lengthy piece which will take about half an hour to get through, but I’d argue it’s worth it.

.xlsx files are secretly compressed!

There. I spilled the not-so-big secret. Excel files from Excel 2007 and above (.xlsx) are automatically compressed: each one is really a ZIP archive of XML files. A feature which, in all my years of using Excel, I never knew about.

I once received a large Excel file from finance for analysis. Normally I would convert said file to CSV (comma-separated values), as the latter:

  1. …is just data, no formatting. Exactly what I need for a data extract and nothing more.
  2. …tends to be more malleable across multiple applications.
  3. …and because of #s 1 and 2, tends to have a smaller file size.

So imagine my surprise when, upon converting to CSV, my 29 MB file ballooned to 115 MB.

Whuuuutttt???

Usually it’s the other way around. With all the formatting and formulas removed, the file size usually shrinks.

But apparently this is no longer the case when you have a lot of data. Once you go past a certain point, the sheer volume of data matters more than the formatting, and a compressed .xlsx beats a plain-text CSV.
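If you’d like to see the effect for yourself, here is a minimal Python sketch using only the standard library’s zipfile module. It is not how Excel itself writes files: the sample rows and the function names (build_csv_text, deflate_size, list_xlsx_parts) are made up for illustration, and real workbooks will compress by a different ratio. It simply shows that repetitive, table-like text shrinks a lot once zipped, and that an .xlsx can be opened like any other ZIP archive.

```python
import io
import zipfile


def build_csv_text(n_rows: int) -> str:
    """Generate repetitive CSV-style rows (hypothetical sample data)."""
    lines = ["region,year,amount"]
    for i in range(n_rows):
        lines.append(f"North,2017,{i % 100}.00")
    return "\n".join(lines)


def deflate_size(raw: bytes) -> int:
    """Return the size of `raw` once stored in a DEFLATE-compressed ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.csv", raw)
    return len(buffer.getvalue())


def list_xlsx_parts(path: str) -> list[str]:
    """List the files packed inside an .xlsx workbook, which is a ZIP archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Compare the plain-text size against the zipped size of the same data.
raw = build_csv_text(50_000).encode("utf-8")
packed = deflate_size(raw)
print(f"plain text: {len(raw):,} bytes, zipped: {packed:,} bytes")
```

Pointing list_xlsx_parts at a real workbook (say, a hypothetical "finance_extract.xlsx") should list entries such as xl/workbook.xml, which is the zipped XML hiding behind the .xlsx extension.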

Fortunately .xlsx is compatible with Power BI, which is where I was going to plug the data into anyway. I let the file type stay as is.

Makes for a convincing argument for using the Microsoft suite, eh?

(And in case your answer is no, let me argue that even technology research group Gartner agrees with me by crowning Microsoft king in business intelligence and analytics platforms.)

The best path to data science starts with the problem.

In the third grade, my science teacher sent shockwaves when she failed the final projects of more than half the class (thankfully I was in the minority).

This is it??? This is all you have?!

You can do better than this. These are too easy.

Give me something that’s actually worth… something!

Let me remind you: WE WERE THIRD-GRADERS. We were little brats who had never been told we sucked, much less failed.

Stricken by this failure, one classmate approached me after class to ask for advice. He had always been in the top 10 of the class. This must have devastated him.

Too bad I was never good at consoling, even as a kid. So instead I told him a story.

Of how I was playing outdoors the day before and was bothered by mosquitoes. Of how, try as I might, I couldn’t find where my mom hid the insect spray.

So I just used the first thing I found in the kitchen: Maggi savor.

(For those outside the Philippines, Maggi savor is a liquid seasoning, something like soy sauce but with garlic and lime.)

And to my surprise it worked. Not as effectively as insect spray, but the mosquitoes no longer buzzed as actively as before.

You can guess what happened next: Classmate wins title of “Best Project” for his study on The feasibility of soy sauce as a mosquito repellent alternative. I was… well, I passed so all was well.

 

Why am I sharing this story?

Because to me, my experiment had been nothing more than a curious way to keep playing outdoors.

But to my friend, and to my science teacher, it was a problem worth solving.

And as it turns out, that’s how to become a data scientist.

 

 

One of the most popular posts I’ve written on this blog is Getting started with Data Science, for the complete beginner. It’s also one of my first posts.

Since then, many articles on the same topic have come up. But of note is this one published in Forbes (originally from Quora). It answers the question, “What’s the best path to becoming a data scientist?”

  1. Pick a topic you’re passionate or curious about.
  2. Write the tweet first.
  3. Do the work.
  4. Communicate.

 

Where I said to have a personal project, the writer took it to the next level by recommending a public portfolio:

I recommend building up a public portfolio of simple but interesting projects. You will learn everything you need in the process, perhaps even using all the resources above.

Makes sense, right? More and more we’re judged by what we can do, no longer by the credentials we have. Artists, architects, and now programmers and developers… more and more jobs require having a portfolio.

 

What I hadn’t considered is writing the tweet first.

Is the project even worth pursuing?

It sounds obvious, but people are eager to jump into a random tutorial or class to feel productive and soon sink months into a project that is going nowhere.

Ouch. I think she’s talking about me.

She’s got a good point though.

 

So. I now know I have to revisit my projects and write their tweets… but how do I talk about that portfolio?

If you’re like me and data science isn’t your day job, how do you talk about what are, essentially, your side projects?

It’s unfortunate that side projects are often overlooked by the people who aren’t actively working on them. Side projects can be immensely rewarding to talk about. They demonstrate a lot about how you work.

 

Thankfully LinkedIn has the ability to showcase projects. It’s the perfect avenue to showcase your portfolio.

In person though, you may want to try this approach:

  1. Start with the problem
  2. Define your approach
  3. Share the challenges you faced
  4. End with the results
  5. Follow up with what you would do differently

Again, it starts with the problem.

 

Like most things, the start is the most difficult step.

Finding the right problem is hard. But it might not need to be. It might already be there, right in front of you, just under your nose… and you just haven’t recognized it as a problem yet. Just like Maggi savor.

To re-chart my path to data science, the first thing I’m doing is taking a second look. But this time with a fresh set of eyes.