the equivalator rises!

equivalator

for the past few weeks, i have been working on a data visualization project that i’ve called the “equivalator”.  it came out of an idea that i had to build a context for a single measurement or a pair of measurements out of a bunch of random comparisons that use the same dimension of measurement.  you know… i REALLY wish that i had a more enticing way of describing this!

how it works

equivalator screenshot

so the basic idea is that users will enter three pieces of information – the name of an object, the number of objects in the comparison, and then a quantification for that number of objects.  for example, i could enter “my cat”, “1”, and “8 pounds”.  the application takes what you’ve told it (that one of “my cat” weighs “8 pounds”), goes to the database of other user-entered items, pulls a sampling of totally unrelated items that also have their mass stored there, and tells you what other things weigh 8 pounds.  really – the simplest way to see what it does is to try it out for yourself.  the fun part of it is that you get a little information graphic with it to help you visualize the comparison in a dimension that you would not normally think of.
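
the equivalator itself isn’t shown here, but the core of the comparison is simple enough to sketch.  in the toy processing/java snippet below, every stored item is normalized to a common base unit (grams), and the equivalent count of any other item is just the query’s total mass divided by that item’s unit mass.  all names, numbers and field names are made up for illustration – this is not the real code.

```java
// toy sketch of the comparison at the heart of the equivalator (illustration only, not the real code)

class Item {
  String name;
  float unitGrams;   // mass of ONE of this item, normalized to grams

  Item(String name, float unitGrams) {
    this.name = name;
    this.unitGrams = unitGrams;
  }
}

// a tiny stand-in for the database of user-entered items
Item[] database = {
  new Item("hockey puck", 170),
  new Item("canadian nickel", 3.95f),
  new Item("bag of flour", 2500)
};

// "my cat", quantity 1, weighs 8 pounds -> normalize to grams
float queryGrams = 1 * 8 * 453.6f;

void setup() {
  for (Item item : database) {
    float count = queryGrams / item.unitGrams;
    println("1 of \"my cat\" weighs about the same as " + nf(count, 1, 1) + " of " + item.name);
  }
}
```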

ok… soooooo… what’s the point to this?

this is a particularly excellent question.  i’m not entirely sure.  i’ve thought that it would be fun and whimsical to build something like this.  but as i started building my original design… and discovering that it was a pretty cranky interface that no one would ever like or use… i found that the project was gaining a “personality”.  maybe not in the kind of killer-talking-artificial-intelligence-construct way, but it was definitely struggling to find its own identity.  it didn’t like this kind of data entry or that.  it didn’t know what to do with the data i was sending to it.  it was trying to tell me that there was more stuff that it could be doing if i just made it simpler. and then once i got it actually working, it started showing that surprise was a great part of its allure.  it showed me comparisons in orders of magnitude that i had never considered interesting but suddenly found… well… entertaining.  and when i showed it to friends, they immediately found it amusing (albeit, with its fairly limited dataset, the amusement is fairly short-lived).

where i hope that this goes is that this personality continues to grow as i add new dimensions to the equivalator and better intelligence for drawing comparison across multiple dimensions.  for example, if i know the mass of an object, and correlate that with some distance between two points, then i can calculate the work that is required to move that object between those two points.  and THEN i can compare THAT WORK to some other totally unrelated and unexpected quantity of energy…  *MINDBLOWN*
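
as a quick, hedged illustration of that mass-plus-distance idea (none of this is in the equivalator yet): work against gravity is just mass × g × height, so an 8-pound cat carried up three metres of stairs costs roughly 3.6 kg × 9.81 m/s² × 3 m ≈ 107 joules, which could then be re-expressed in terms of any other stored quantity of energy.  in code:

```java
// hedged illustration only: combining mass and distance into work (energy)
float massKg  = 8 * 0.4536f;   // "my cat", 8 pounds, converted to kilograms
float heightM = 3.0f;          // some vertical distance between two points, in metres
float g       = 9.81f;         // gravitational acceleration, m/s^2

void setup() {
  float workJ = massKg * g * heightM;   // work to lift the cat: roughly 107 joules
  println("carrying my cat up the stairs takes about " + nf(workJ, 1, 1) + " joules");
}
```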

near future

this application is still in shakedown mode currently.  my hope is to create a community around the tool to help validate user input and moderate the vast oceans of input that i am unrealistically assuming will follow its final release.  i would love to develop an API that would allow developers and bloggers to simply add equivalations to their own websites and offer visualizations of live data outside of the equivalator’s domain.  maybe a mobile app?  this started out as a simple thought experiment, but the equivalator’s personality is quickly winning me over.  i hope to have more news about it as the project matures.  stay tuned!

Why Dataviz Is Important To Me…

There is a part of me that sometimes feels that dataviz might be an informational extravagance – a pumping up of statistics with unnecessary decoration, or at worst an intellectual trompe l’oeil for conning people into a distorted view of facts.  But there is a much larger and more influential part of me that thinks that dataviz is an important part of communication that very sincerely aspires to illuminate conclusions that would be too esoteric or incomprehensible in the form of raw data.  I strongly suspect that my opinion and my overall affinity for dataviz are the results of a lifelong affection for comics and graphic design, my appreciation of visual thinkers, and a sincere belief that effective communications can affect lives.

The one that got it all started

Like many geeks, the reality of my childhood was enhanced by the super-reality of comic books.  The CMYK-rendered world of comics grabbed my interest and imagination in a way that humdrum, normal reality never could.  After all, from the moment we are born, we are surrounded by mundane, everyday reality, as miraculous and astonishing as it is.  So it seems perfectly normal that as soon as we are cognitively capable, we would seek to escape it through representations of alternative realities.  Continuously shattering box office records with releases like Harry Potter, the Hunger Games, and the Avengers, Hollywood films demonstrate that the general public shares this interest in escaping from reality.  For me, the convergence of talents specifically required to create comics – from the inception of writing a fantastic story, to the imagination and envisioning of compelling visuals, to the polarizing and stylizing of pencil renderings through inking, and finally the enriching, emotion-establishing coloring of panels – creates a unique form of magic to engage and entertain us.  The abstraction and distillation of reality into some new kind of visual representation of imaginary worlds provides a very intense and engaging interaction with the medium, and I think you either feel that way or you don’t.  In the way that you either get Justin Bieber or you don’t.  Or maybe Wittgenstein, rather.  Same idea.

Important visual communications obviously predate the day of enormously successful pop-culture franchises.  From cave paintings to illuminated manuscripts, engaging visuals have been critical to communication since the technologies and techniques emerged to give them expression.  Consider Leonardo DaVinci.  DaVinci represents unparalleled genius immediately perceivable through his illustrated works.  He wrote prolifically and had the talent to express his genius with illustrations that remain unrivalled examples of creative-thinking and innovation.  The visual nature of his creative expression immediately fills the viewer with the understanding of DaVinci’s creativity, and that is the essence of brilliant communication.

So thinking about the ever-growing amount of RAW DATA that our electronic world generates on a daily basis, I think there’s a case to be made that explorations in dataviz are what make that data relevant – they make fantastic what would otherwise be the impossibly boring task of evaluating context-less statistical data.  For example, the Japanese have been using manga as a vehicle for corporate communications and education for decades, because an illustrative context evolves the expression of mere facts and statistics into an engaging and relatable narrative.  The imagination engages with the visuals while one’s rational mind follows the written dialogue – it’s a communicative experience that is unique in that it can be reproduced in printed form for offline consumption at liberty, as opposed to a DVD or YouTube video.  It is consumed at exactly the right pace for each consumer, because he or she can read or re-read the content as often as is needed to achieve comprehension.

This leads me to why I think dataviz is so important.  If you have a message that you need to communicate – some serious reality that you need to share with the public – you need to think seriously about how to deliver it effectively.  Consider the following article in which Dr. Kumar makes the case that 39% of ICU patients with sepsis die compared to 18% with heart attacks. The message formulated in 2006 seems like a pretty straightforward one – that the mortality rate for sepsis is over twice that of heart attacks when one enters a hospital. Yet it has taken years and thousands of deaths to move thinking about early administration of antibiotics to patients.

Early antibiotics for sepsis called a lifesaver

 “Nobody really realized how important the antimicrobials are” ends the last paragraph of the article. Therein lies the failure of communication to be persuasive, and the consequence of that failure is a lack of mobilization by medical boards to consider changing early treatment patterns, allowing for possibly preventable deaths. I am reminded of that line from the Talmud, “Whoever destroys a soul, it is considered as if he destroyed an entire world. And whoever saves a life, it is considered as if he saved an entire world.” That is why dataviz is important to me and why it is probably important to you too.  If you have a message that you need to communicate – some serious reality that you need to share with the public – you need to think seriously about how to deliver it effectively.

Processing and XML – A Beginner’s Tale

As I spend more time with Processing, I become more and more impressed with its ease of use and its adaptability.  I set a new homework assignment for myself which was to parse some XML that represents project data for a group over the course of a year into a meaningful visualization.  Since I had a year’s worth of data, I thought it would be valuable to represent how much effort each project consumed over the course of the year.  

I could have made a simple pie chart that represented each project’s contribution to the total effort of the year, but that’s pretty boring.  It would give a sense of the percentage of effort for the year, but it would not have had any real informational density.  I’m not sure whether “informational density” is a real thing or not, but I intuitively feel that the difference between a data visualization and a simple chart is that a visualization consumes more than one or two dimensions of information.  A good visualization should incorporate as many axes of interpretation as possible and should represent them with as little cognitive effort as possible.  A really good visualization should illuminate traits that would not be apparent (or would at least be harder to observe) otherwise, and foster insights and conclusions that the raw data alone would not offer.

So I set out on my project.  As I am still relatively new to the tool, I wanted to get a feel for it by using as few plug-ins and helpers as possible, preferring to build from scratch whatever routines I needed to get my visual across.  As one might expect, I made a lot of initial mistakes!  I started by creating a series of arrays, loading each column of the data into its own array so that I could use each array individually as needed.  While this is one possible approach, it quickly became unwieldy and confusing to keep straight which array represented what data, and of course the design was brittle because a change to the schema would mean massive changes to the program.

I got smart after a few sessions and created a class to represent the units of data, loading a single array of typed data.  From here, I was able to pull out the data I needed when I needed it.  It also meant that instead of using many large arrays of data, I could have one source array and create sub-sets of the main array that could be iterated more quickly and efficiently!  This is definitely a good lesson learned!  I added comparators to my class so that I could sort the arrays by date or index, and that too was a huge win for manageability.

Rendering Attempt 1

Once I had the data in a form I could use, I started messing around with the rendering methods.  My first forays were pretty dismal.  I tried relating effort and time sorted by date and sequence and… well, as you can see from the rendering… it was a mess.  There’s a suggestion of association and extension, but no information value at all.  It was just a mash of lines.  So no points for this first effort.
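
Before moving on to the next attempt, here is a minimal reconstruction of that typed-array-plus-comparators approach, written as a Processing sketch.  The class name, field names and XML schema are all assumptions for illustration – the original data and code aren’t reproduced here – and it uses loadXML(), which is available in recent versions of Processing.

```java
import java.util.Collections;
import java.util.Comparator;

// One unit of project data. Field names and the XML schema are assumptions, not the real data.
class Effort {
  int index;        // original row order in the XML
  String project;   // project name
  int week;         // week of the year, 1..52
  float hours;      // effort recorded for that project in that week

  Effort(int index, String project, int week, float hours) {
    this.index = index;
    this.project = project;
    this.week = week;
    this.hours = hours;
  }
}

ArrayList<Effort> efforts = new ArrayList<Effort>();

// Comparators make the single typed array easy to re-order on demand
Comparator<Effort> byIndex = new Comparator<Effort>() {
  public int compare(Effort a, Effort b) { return a.index - b.index; }
};
Comparator<Effort> byWeek = new Comparator<Effort>() {
  public int compare(Effort a, Effort b) { return a.week - b.week; }
};

void setup() {
  // assumed layout: <projects><entry project=".." week=".." hours=".."/> ... </projects>
  XML xml = loadXML("projects.xml");
  XML[] rows = xml.getChildren("entry");
  for (int i = 0; i < rows.length; i++) {
    efforts.add(new Effort(i,
                           rows[i].getString("project"),
                           rows[i].getInt("week"),
                           rows[i].getFloat("hours")));
  }
  Collections.sort(efforts, byWeek);   // or byIndex, depending on the rendering pass
  println("loaded " + efforts.size() + " rows");
}
```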

Rendering Attempt 2

Second attempt – I tried a different approach to grouping the data and then calculating graph points.  Things worked a little better in that data points were no longer floating around meaninglessly, drifting off in senseless directions.  In fact, it started to look a little like the good old western blots I used to run in the biochemistry lab.  But once again, I felt like this was pretty meaningless, and I was pretty sure that the marks were not actually representative of what I was intending.  So I tried again.

Rendering Attempt 3

This is around the time that I really started to grasp the value of my typed data array, and I set up a few tricks that helped me to better sort my data when I was ready to render it.  I thought about how I wanted the data to expose itself.  First, I wanted to show the total effort week by week, which would look a little like a simple line chart.  So the week was my first sorting variable.  I then wanted to slice out each project, so that was my second sorting variable.  Rendering that, I was starting to get something that looked vaguely representative of the data… progress was being made!
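
A rough sketch of that first pass, reusing the Effort class and efforts list from the sketch above: sum each week’s hours into a total and draw a simple polyline of total effort per week.  The number of weeks, the scaling ceiling and the margins are assumed values.

```java
// Assumes the Effort class and the efforts list (sorted by week) from the previous sketch
int   numWeeks = 52;
float maxHours = 400;   // assumed ceiling for scaling the y-axis

void drawWeeklyTotals() {
  float[] totals = new float[numWeeks + 1];
  for (Effort e : efforts) {
    totals[e.week] += e.hours;   // total effort per week, all projects combined
  }

  stroke(40);
  noFill();
  beginShape();
  for (int w = 1; w <= numWeeks; w++) {
    float x = map(w, 1, numWeeks, 40, width - 40);
    float y = map(totals[w], 0, maxHours, height - 40, 40);
    vertex(x, y);
  }
  endShape();
}
```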

Rendering Attempt 4

I was getting where I wanted to go, but what I really wanted to do was to show the persistence of each project across each weekly period – to give a volume to each project that represented the continuity of effort involved.  This meant that I would have to draw areas across the X-axis, rather than plotting individual points.  This is where the value of the sub-sets of arrays really paid dividends.  My sorting variables let me process the data in the proper sequence… first the big array of projects per week, then cutting that set up into individual projects… doing everything in order meant that project A in week 1 could be related easily to project A in weeks 2 or 3 and so on, and I could render areas instead of lines.  As you can see, it was working… sort of.
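
I’m reconstructing this from memory, so treat the sketch below as an approximation of the idea rather than the code I actually ran: for each project, walk forward across the weeks adding vertices along the top edge of its band, then walk backward along its bottom edge (the running baseline of the projects already drawn), and close the shape – essentially a stacked-area chart.  It continues from the two sketches above.

```java
// Approximate reconstruction of the stacked-area pass; continues from the previous sketches
void drawStackedAreas(String[] projectNames) {
  float[] baseline = new float[numWeeks + 1];   // hours already stacked under each week

  for (int p = 0; p < projectNames.length; p++) {
    // slice this project's hours out of the one big typed array
    float[] hours = new float[numWeeks + 1];
    for (Effort e : efforts) {
      if (e.project.equals(projectNames[p])) hours[e.week] += e.hours;
    }

    fill(map(p, 0, projectNames.length, 60, 220), 130, 200, 180);
    noStroke();
    beginShape();
    // forward along the top edge of this project's band...
    for (int w = 1; w <= numWeeks; w++) {
      vertex(weekX(w), hoursY(baseline[w] + hours[w]));
    }
    // ...then backward along the bottom edge (the previous baseline), and close the shape
    for (int w = numWeeks; w >= 1; w--) {
      vertex(weekX(w), hoursY(baseline[w]));
    }
    endShape(CLOSE);

    // raise the baseline so the next project stacks on top of this one
    for (int w = 1; w <= numWeeks; w++) baseline[w] += hours[w];
  }
}

float weekX(int w)    { return map(w, 1, numWeeks, 40, width - 40); }
float hoursY(float h) { return map(h, 0, maxHours, height - 40, 40); }
```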

Rendering Attempt 5

Out of frustration, I put the project aside for a week and went out to enjoy running, playing with my cats, and visiting with friends.  Too much time in front of a computer is rarely the way to break through a mental block!  When I returned, I quickly identified why my objects were not closing properly and looked like they were being drawn by a meth-head from the TRON universe.

It’s funny sometimes how very minor tweaks can make the difference between total catastrophe and successful joy!

Rendering Attempt 6

For the sake of saving processing cycles, I had been working with a subset of the total data available – about a quarter.  Now I was feeling confident that I was close to where I wanted to be, and I loaded the whole set.  I tweaked some colours and changed the scales and held onto my butt…  Success.  I had got the effect that I wanted… a visualization that showed me the contribution of each project to the total workforce capacity, week by week over the whole year.
Final Rendering

What the visualization gives me (over a simple pie chart) is a sense of just how busy the entire group was, week over week, on project work.  It shows when times were busy and when times were slow.  It shows which projects had priority week by week, and it shows when they started and ended… I had achieved informational density!  After some data cleansing and labelling, I had achieved my goal.

Like I said at the beginning, I think that the value of Processing is that it’s such a great tool for experimenting with visualizing data.  It’s quick and easy to pick up if you have any experience with Java or JavaScript (object-oriented programming experience is clearly a benefit), and it allows you to very quickly sculpt the data that you process in different ways.  It’s obvious to me that time and experience let you develop reusable approaches and code that cut down development time.  But even for a newcomer, the tool is fantastic for converting data into compelling visual stories.  Next on the list of things to do is to learn how to make these renderings interactive!

Processing Tweets

There’s a great open source tool available for data visualization that has been around for years called Processing (available at http://www.processing.org).  It’s dead simple to set up and start working, and with some fairly basic programming skills in Java or JavaScript, the results can be immediate and fairly gratifying.
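
For anyone who has never seen it, a Processing sketch is just a setup() block that runs once and a draw() block that runs every frame; the few lines below are a complete, runnable program.

```java
// A complete Processing sketch: translucent dots that follow the mouse
void setup() {
  size(640, 360);    // create the window once
  background(255);
}

void draw() {
  noStroke();
  fill(30, 120, 200, 60);
  ellipse(mouseX, mouseY, 20, 20);   // draw() runs ~60 times per second
}
```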

I have an application that I built last year to import and normalize RSS feeds from the likes of Twitter and Last.fm, but I had some trouble rolling my own data visualization framework to render the data efficiently.  Enter Processing… within a day, I had imported my data and was able to parse, display and animate it with some really simple programming.  Once the basics were in place, I added some code to compare date-times, and some rudimentary logic to highlight tweets about “cat”s and “superman” (fairly common tweeting subject matter for me), and voilà!  My first animated data visualization!

My Tweets from 2008 to the Present

Here we see the tweets I’ve made from my personal twitter account from 2008 to the present, with the tweets spaced apart in proportion to the time that elapsed between them.  Tweets about cats are in orange, and the tweets about Superman are in red.  As you can clearly see – I tweet a lot more about cats than Superman.  And the display is so hideous as to be completely unpresentable.  However, the main point for me is that with a great tool and some fairly basic programming, we can get a visualization that is far easier to comprehend and is far more descriptive than a mere list of 900 tweets.
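
My own sketch isn’t posted here, but the gist of it is easy to show.  The toy version below assumes the normalized feed has already been exported to a simple tab-separated file of timestamps and tweet text (that file format is an assumption, not what my importer actually produces); it spaces each tweet along the x-axis in proportion to elapsed time and colours anything mentioning “cat” or “superman”.

```java
// Toy reconstruction of the tweet timeline; the tweets.tsv format is assumed
// Each line of the file: "<epochSeconds>\t<tweet text>"
void setup() {
  size(1000, 200);
  background(255);

  String[] rows = loadStrings("tweets.tsv");
  long first = Long.parseLong(split(rows[0], '\t')[0]);
  long last  = Long.parseLong(split(rows[rows.length - 1], '\t')[0]);

  for (String row : rows) {
    String[] parts = split(row, '\t');
    long t = Long.parseLong(parts[0]);
    String text = parts[1].toLowerCase();

    // spacing is proportional to the time elapsed between tweets
    float x = map(t, first, last, 20, width - 20);

    if (text.contains("cat"))           stroke(255, 140, 0);   // cats in orange
    else if (text.contains("superman")) stroke(200, 0, 0);     // superman in red
    else                                stroke(180);

    line(x, 40, x, height - 40);
  }
}
```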

For bonus fun-good-times, Processing also outputs to MOV so that you can see the rendering in real-time!  Like I’ve said before, I’m going to love working with this tool!
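
If your version of Processing doesn’t include the movie export library, a version-independent route is to save one numbered image per frame with saveFrame() and stitch them together with the Movie Maker tool in the Processing IDE’s Tools menu.  A minimal example:

```java
// Save numbered frames that the Movie Maker tool can later stitch into a .mov
void setup() {
  size(400, 200);
}

void draw() {
  background(255);
  fill(30, 120, 200);
  ellipse(map(frameCount % 120, 0, 120, 0, width), height / 2, 20, 20);

  saveFrame("frames/timeline-####.png");   // #### is replaced by the zero-padded frame number
  if (frameCount >= 120) noLoop();         // stop after 120 frames
}
```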

Timeline of Tweets (.mov)

 

Convergent Evolution

As I was in my car this morning, I heard a story on the radio about an interesting population genetics study based on the Melanesian people of the Solomon Islands in the South Pacific. It turns out that some 10-15% of the children of this predominantly dark-skinned, dark-haired group of people have bright blonde hair traditionally associated with people of northern European descent. Through careful analysis, it has been concluded that the blonde gene that expresses itself in the Melanesians is unique and distinct from any of the blonde genes in Europe.  This proves that while the two traits appear very nearly the same, they are genetically unrelated and arose independently of one another. The term for that in genetics is “convergent evolution” and while it is uncommon and often unexpected, it is a real thing.

Later on this same afternoon, I was confronted with another case of convergent evolution, but on a much more personal level. As I explore the dataviz field (as the embodiment of the term “neophyte”), I realize that so many of the great epiphanies about data that I’ve experienced in my professional and academic contexts have already been expressed incredibly well by many of the giants of the field.  On the one hand, as I learn of expert after expert who already shines so brightly in this field and listen to them expound the same concepts that I (in my very hand-wavey, inexperienced, undisciplined fashion) try to communicate to my friends, I feel like a charlatan.  But on the other hand, convergent evolution explains the potential for two equivalent and equally useful occurrences to arise independently, and I suddenly feel slightly less utterly fraudulent.

Jer Thorp

I spent a good deal of time this afternoon immersing myself in the works of one Jer Thorp, Data Artist in Residence at the New York Times (and fellow Canadian), one of the visionaries who has already established himself as a groundbreaker in almost all of the exact same fields of data visualization that I have been aching to carve out.  A key tenet of his is the need for better tools to make the ever-inflating cloud of data growing all around us sensible and useful, and he has done much to serve that need.  He expresses the urgency of this exploration so well in his TED Talk in Vancouver in November 2011 (click here to watch it on YouTube).

My OpenPaths Map View

Jer has been involved in some pretty exciting projects. For example, OpenPaths is a project that allows users to exploit the historical locational data stored on their iPhones or Android devices to visualize where they (or at least their phones) were and when they were there. Said like that, it sounds fairly “Big Brother-ish.” However, if you think about how valuable that information could be to YOU, and not just to an application developer or Google or Facebook or Apple, the outcome seems less sinister and more personally useful. If YOU could better visualize that data then you could relate it back to the personal narrative of your own life, thereby enriching your ability to recall and express those moments in the future. I can’t put it any better than Jer does in his Talk:

“What we didn’t expect was how moving this experience would be. When I uploaded my data, I thought ‘big deal, I know where I live, I know where I work… what am I gonna see here?’

Well, it turns out, what I saw was that moment when I got off the plane to start my new life in New York… that restaurant where I had Thai food that first night thinking about this new experience of being in New York. The day that I met my girlfriend… right?” (@13:50 – 14:12)

The convergent aspect for me is that this is the exact same reason why I still use Foursquare and Google Latitude and Facebook Places to check in to places… so that I can bookmark moments of my life for MY OWN personal consumption and use.  In 2011, I moved from one side of the country to the other… and then back again, so I have a very sincere interest in being better able to use any of the available data regarding that journey to help me chronicle that part of my life’s history.  I understand that this opens me up to a couple of very undesirable potential scenarios – I like to think of these as the “Enemy of the State/Eagle Eye” scenarios.  But the likelihood of those scenarios coming to pass seems too remote to outweigh the value of this rich source of passively-collected data.  Jer’s prototype proves to me that other people feel the same way about their personal data, which is very encouraging.

Avengers Assembled

What else has Jer done that is awesome? Well, shamefully, the whole reason that I found Jer in the first place was that I am supposed to go to see The Avengers tonight, and FlowingData had a striking visualization of the first appearances of each Avenger (apparently, there have been over 120 Avengers introduced in the 570-some issues). This one post blew my mind – not just for the clever and beautiful visualizations, but also because it exposed me to a great open data source for comics that seems pretty sophisticated and complete! As if all of this goodness weren’t enough, Jer has shared some of his visualization tools and prototypes with the community. Incredible guy!

Theories and ideas about DataViz are much easier to explore and have a far shorter life span than Melanesians from the South Pacific, so I don’t feel as fraudulent as I did earlier this afternoon about convergently evolving so many of Jer’s conclusions.  On the contrary, I’m actually relieved and a little self-satisfied to have landed so squarely on the right path, facing the right direction to pursue my interests, and to have even been pushed forward quite a number of leagues ahead of where I could get on my own.  It’s almost like I’ve made a quantum leap in my dataviz evolution… but then… that would be more like mutation and the X-Men, and tonight, it’s all about the Avengers! 😉

Edward Tufte – Father of DataViz

It’s funny that literally within hours of relaunching this site, I had two or three friends ping me to ask if I was familiar with the works of Edward Tufte. As my mind does, it went immediately to a favourite film quote, “People are always asking me if I know Tyler Durden.” Edward Tufte is a statistician and political scientist who pioneered the study of data visualization and is perhaps most singularly responsible for elevating the study above merely giving a pretty face to a boring chart.

Visual Explanations

His books are the definitive texts on effective presentation of statistical data. As he first explored his field in the early 80s, he surveyed hundreds of examples of data presentation, evaluating different methods of selecting and modelling data, and how those different methods work together to create effective or failed communication.

The thing that really makes his work fascinating is the breadth of scope that he encompasses in his exploration. Tufte evaluates examples of data visualization from all across history, and across such diverse fields as meteorology, photography, modern art, kinesthetics, cartography, epidemiology, and of course really obvious fields like marketing and propaganda.

A very dear friend lent me a copy of Visual Explanations a year or so ago and I immediately became a disciple, so I owe as huge a debt of gratitude to her for turning me on to this brilliant source of inspiration as I do to him. It will truly be a challenge for me to share my own thoughts and views without seeming derivative of Tufte’s life work.

 

Data Visualization and Open Data

COUNT ALL THE THINGS!

Last November, I had the great idea to start a blog devoted to excellence in data visualization.  “DataViz” is an increasingly important field as the amount of raw data to which we are exposed and that we generate daily grows and grows.  From searches on Google to trends on Twitter to likes on Facebook, more and more of us become increasingly involved in the generation and collateral consumption of large sets of data.  We are consciously aware of only superficial manifestations of those activities, such as the accuracy of our search results, or the popularity of certain topics of interest, but there is a gaping black hole of awareness between what we as social networkers contribute to these systems and what we harvest from them.  DataViz is one of the tools available to us to try and illuminate that mystery for ourselves.  It allows us to wrap information in (what is typically hoped to be) an aesthetically appealing presentation to derive meaning or to present a position on the basis of empirical data.  To put it simply, it provides a picture of reality based on objective sampling to tell some kind of a story.  Hence, the name of this blog – numbers made to be pretty, or “prettynumbers”.

One of the reasons that it has taken me so long to get this blog going is that I have become professionally involved in a pretty significant data-sharing initiative.  The project deals with Open Data, or the presentation of publicly available sets of data for public consumption.  In principle, it is the release into the public domain of the statistics and measurements that organizations track and use to make informed policy decisions, for the public to do with as it will.  This data is managed or owned by governments, corporations, not-for-profit groups, schools, individuals – anyone at all with a collection of numbers.  The reason that Open Data is so appealing is that if we can agree on a standard format for expressing sets of data, then we can also develop really useful tools to visualize any of those sets of data creatively and usefully, so as to make them comprehensible, appealing and, perhaps most importantly, relevant for consumption.

In my mind, it’s the difference between taking all of your clothes from your dressers and closets and throwing them on the floor in a huge messy pile and saying that is your collective “wardrobe and personal sense of style”, versus choosing the demonstrative outfits that best describe your wardrobe and style.  Or better yet, allowing your friends or complete strangers to come into your room and rifle through your clothes themselves and allowing them to draw their own conclusions!  Undeniably, there is an opportunity for bias to be applied in the process in either case, but it makes the overall task of assessing the value of the data far more manageable (and in this example, far more entertaining).  But we’ll deal with the bias issue more in subsequent posts.

Hands down, one of the most exciting partners in the movement to expand Open Data is a company called Socrata.  They have created an unbelievable set of easy-to-use tools designed to simplify the conversion of raw data into useful web applications that make sense to humans rather than to spreadsheet programs, and it has had tremendous uptake.  One of my favourite implementations of the Socrata toolkit belongs to my home town’s government website, the City of Edmonton.  data.edmonton.ca offers over a hundred sundry data sets, all coupled with useful (to varying degrees) visualizations that encourage citizens to explore the data that has been captured, rather than relying on the accounts of news agencies or even the government itself.  Giving people the purest, most raw form of data available, as well as the tools to explore and interact with that data, is the best way I can imagine of removing bias from our understanding of reality, short of going out to a field and observing all of the phenomena for oneself.  It is at once empowering and democratizing, manifesting real operational transparency and maximizing opportunities for discourse in a way with which sitting in a crowded bar or pub and exchanging misinformed opinions can’t even begin to compete.
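
To give a concrete (and hedged) sense of what “easy to use” means here: Socrata datasets are exposed as plain JSON endpoints, so a few lines of Processing can pull rows straight off a portal.  The dataset identifier and column name below are placeholders, not a real Edmonton dataset, and loadJSONArray() requires a recent version of Processing.

```java
// Hedged sketch: pull rows from a Socrata JSON endpoint and print one column.
// "xxxx-xxxx" and "name" are placeholders -- substitute a real dataset id and column name.
void setup() {
  JSONArray rows = loadJSONArray("https://data.edmonton.ca/resource/xxxx-xxxx.json");
  for (int i = 0; i < rows.size(); i++) {
    JSONObject row = rows.getJSONObject(i);
    println(row.getString("name"));
  }
}
```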

Open Data as a concept has been around forever – since the first sentient being looked under a rock.  However, the technology to make all of a government’s spending patterns available to every citizen is incredibly new.  My smartphone has thousands of times the processing power of yesteryear’s supercomputers, meaning that as easily as I can update my Facebook status, I can explore the population movements in my country over the past four decades – so long as that data is available.  Open Data solves that last problem.

I can’t get excited enough about the possibilities of this technology! With all of the conceivable opportunities to misinform and misdirect public opinion in today’s mass media channels, Open Data stands as a force for unequivocal good in the search for truth in an increasingly complicated and confusing age.  I hope to share more of my experiences, insights and examples with you over time.  In the meantime, check out the thousands and thousands of examples available on Socrata sites like https://opendata.socrata.com/ or https://nycopendata.socrata.com/ to get a sense for the breadth of this cool new approach to sharing information.