Processing and XML – A Beginner’s Tale

As I spend more time with Processing, I become more and more impressed with its ease of use and adaptability.  I set myself a new homework assignment: parse some XML representing a year of project data for a group into a meaningful visualization.  Since I had a year’s worth of data, I thought it would be valuable to represent how much effort each project consumed over the course of the year.
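As an aside for anyone following along: pulling values out of XML like this takes only a few lines with Java’s built-in DOM parser. Here’s a minimal sketch of the idea; the `<entry>` schema and the names are simplified for illustration, not my actual file format:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EffortXml {
    // Read each <entry project=".." week=".." effort=".."/> element
    // into a simple "project,wWEEK,effort" string.
    static List<String> parseEntries(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("entry");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                out.add(e.getAttribute("project") + ",w"
                        + e.getAttribute("week") + ","
                        + e.getAttribute("effort"));
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<year>"
                + "<entry project=\"Alpha\" week=\"1\" effort=\"12\"/>"
                + "<entry project=\"Beta\" week=\"1\" effort=\"8\"/>"
                + "</year>";
        System.out.println(parseEntries(xml));
    }
}
```

Processing also has its own XML helpers, but the plain-Java route works anywhere.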

I could have made a simple pie chart showing each project’s contribution to the total effort of the year, but that’s pretty boring.  It would give a sense of the percentage of effort for the year, but it would not have any real informational density.  I’m not sure whether “informational density” is a real term or not, but I intuitively feel that the difference between a data visualization and a simple chart is that a visualization conveys more than one or two dimensions of information.  A good visualization should incorporate as many axes of interpretation as possible and should present them with as little cognitive effort as possible.  A really good visualization should illuminate traits that would not otherwise be apparent (or at least would be harder to observe) and foster insights and conclusions that would not otherwise be available.

So I set out on my project.  As I am still relatively new to the tool, I wanted to get a feel for it by using as few plug-ins and helpers as possible, preferring to build from scratch whatever routines I needed to get my visual across.  As one might expect, I made a lot of initial mistakes!  I started by creating a series of arrays, loading each column into its own array so that I could use each one individually as needed.  While this is one possible approach, it quickly became unwieldy and confusing to keep straight which array represented what data, and of course the design was brittle: any change to the schema would mean massive changes to the program.

I got smart after a few sessions: I created a class to represent the units of data and loaded a single array of typed objects.  From there, I was able to pull out the data I needed when I needed it.  It also meant that instead of juggling many large arrays of data, I could keep one source array and create sub-sets of it that could be iterated more quickly and efficiently.  Definitely a good lesson learned!  I added comparators to my class so that I could sort the arrays by date or index, and that too was a huge win for manageability.

Rendering Attempt 1

Once I had the data in a form I could use, I started experimenting with the rendering methods.  My first forays were pretty dismal.  I tried relating effort and time sorted by date and sequence and… well, as you can see from the rendering, it was a mess.  There’s a suggestion of association and extension, but no informational value at all.  It was just a mash of lines.  So no points for this first effort.
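For the curious, here is the shape of that typed-class-plus-comparators idea in plain Java. This is a sketch of the pattern rather than my exact code, and the class and field names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class EffortRecord {
    final int index;        // original row order from the XML load
    final int week;         // week number of the entry
    final String project;
    final double hours;

    EffortRecord(int index, int week, String project, double hours) {
        this.index = index;
        this.week = week;
        this.project = project;
        this.hours = hours;
    }

    // Comparators live alongside the class, so any array (or sub-set of
    // one) can be re-sorted by date or by original load order on demand.
    static final Comparator<EffortRecord> BY_WEEK =
            Comparator.comparingInt(r -> r.week);
    static final Comparator<EffortRecord> BY_INDEX =
            Comparator.comparingInt(r -> r.index);

    public static void main(String[] args) {
        List<EffortRecord> data = new ArrayList<>(List.of(
                new EffortRecord(0, 3, "Alpha", 10),
                new EffortRecord(1, 1, "Beta", 6),
                new EffortRecord(2, 2, "Alpha", 4)));
        data.sort(BY_WEEK); // now iterates week 1, 2, 3
        for (EffortRecord r : data) {
            System.out.println("week " + r.week + ": " + r.project);
        }
    }
}
```

The win is that one source array plus a comparator replaces a pile of parallel column arrays that all have to be kept in sync by hand.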

Rendering Attempt 2

Second attempt: I tried a different approach to grouping the data and then calculating graph points.  Things worked a little better in that the data points were no longer floating around meaninglessly, drifting off in senseless directions.  In fact, it started to look a little like a good old western blot from my days in the biochemistry lab.  But once again, it felt pretty meaningless, and I was fairly sure that the marks were not actually representative of what I intended.  So I tried again.

Rendering Attempt 3

This is around the time that I really started to grasp the value of my typed data array, and I set up a few tricks that helped me sort my data when I was ready to render it.  I thought about how I wanted the data to expose itself.  First, I wanted to show the total effort week by week, a little like a simple line chart of total effort per week, so the week became my first sorting variable.  I then wanted to slice out each project, so that was my second sorting variable.  Rendering that, I was getting something that looked vaguely representative of the data… progress was being made!
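That two-level sort is really just a grouping: week first, then project within the week. In plain Java it might look like this (the `{week, project, hours}` row layout is my own illustration, not my actual data format):

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class WeeklySlices {
    // Group rows of {week, project, hours} into a sorted map of
    // week -> (project -> total hours), summing duplicate entries.
    static SortedMap<Integer, SortedMap<String, Double>> group(
            List<Object[]> rows) {
        SortedMap<Integer, SortedMap<String, Double>> byWeek = new TreeMap<>();
        for (Object[] r : rows) {
            int week = (Integer) r[0];
            String project = (String) r[1];
            double hours = (Double) r[2];
            byWeek.computeIfAbsent(week, w -> new TreeMap<>())
                  .merge(project, hours, Double::sum);
        }
        return byWeek;
    }

    public static void main(String[] args) {
        List<Object[]> rows = List.of(
                new Object[]{1, "Alpha", 10.0},
                new Object[]{1, "Beta", 6.0},
                new Object[]{2, "Alpha", 4.0});
        // Iterating the outer map walks the weeks in order; the inner
        // map is the per-week slice, ready to render.
        System.out.println(group(rows));
    }
}
```

Once the data sits in that shape, “total effort per week” is just the sum of each inner map, and each project slice is a single lookup.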

Rendering Attempt 4

I was getting where I wanted to go, but what I really wanted was to show the persistence of each project across each weekly period, to give each project a volume that represented the continuity of effort involved. This meant drawing areas across the X-axis rather than plotting individual points.  This is where the sub-sets of arrays really paid dividends.  My sorting variables let me process everything in the proper sequence: first the big array of projects per week, then that set cut up into individual projects.  Doing everything in order meant that project A in week 1 could easily be related to project A in weeks 2 or 3 and so on, and I could render areas instead of lines. As you can see, it was working… sort of.
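The stacking arithmetic behind those areas is simple: within a week, each project’s band begins where the previous band ends. A small sketch of that idea, with made-up numbers:

```java
import java.util.Arrays;

public class StackedBands {
    // For one week's project hours (in a fixed project order), return
    // the [bottom, top] span of each band so the areas stack without
    // gaps or overlaps.
    static double[][] stack(double[] hours) {
        double[][] bands = new double[hours.length][2];
        double base = 0;
        for (int i = 0; i < hours.length; i++) {
            bands[i][0] = base;            // bottom of this band
            bands[i][1] = base + hours[i]; // top = bottom + effort
            base = bands[i][1];            // next band starts here
        }
        return bands;
    }

    public static void main(String[] args) {
        // Three projects with 10, 6 and 4 hours in one week.
        double[][] b = stack(new double[]{10, 6, 4});
        System.out.println(Arrays.deepToString(b));
        // [[0.0, 10.0], [10.0, 16.0], [16.0, 20.0]]
    }
}
```

Keeping the project order fixed from week to week is what lets project A’s band in week 1 join up with its band in week 2 when the edges are drawn as areas.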

Rendering Attempt 5

Out of frustration, I put the project aside for a week and went out to enjoy running, playing with my cats, and visiting with friends.  More time in front of the computer is rarely the way to break through a mental block!  When I returned, I quickly identified why my objects were not closing properly and looked like they were being drawn by a meth-head from the TRON universe.

It’s funny sometimes how very minor tweaks can make the difference between total catastrophe and success!

Rendering Attempt 6

For the sake of saving processing cycles, I had been working with a subset of the total data available, about a quarter.  Now that I felt confident I was close to where I wanted to be, I loaded the whole set.  I tweaked some colours, changed the scales, and held onto my butt…  Success.  I had the effect I wanted: a visualization that showed me the contribution of each project to the total workforce capacity, week by week, over the whole year.
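Changing the scales, by the way, is mostly just linear re-mapping, which Processing’s built-in map() function does for you. For reference, a plain-Java equivalent of the same contract:

```java
public class Scale {
    // Same contract as Processing's map(value, start1, stop1, start2,
    // stop2): linearly re-map value from one range onto another.
    static float map(float value, float start1, float stop1,
                     float start2, float stop2) {
        return start2
                + (stop2 - start2) * ((value - start1) / (stop1 - start1));
    }

    public static void main(String[] args) {
        // e.g. week 26 of 52 mapped onto an 800-pixel-wide canvas
        System.out.println(map(26, 0, 52, 0, 800)); // 400.0
    }
}
```

Going from a quarter of the data to the full year then only means changing the input range; the drawing code stays the same.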
Final Rendering

What the visualization gives me (over a simple pie chart) is a sense of just how busy the entire group was, week over week, on project work.  It shows when times were busy and when times were slow.  It shows which projects had priority week by week, and it shows when they started and ended… I had achieved informational density! After some data cleansing and labelling, I had reached my goal.

Like I said at the beginning, I think the value of Processing is that it’s such a great tool for experimenting with visualizing data.  It’s quick and easy to pick up if you have any experience with Java or JavaScript (object-oriented programming techniques are clearly a benefit), and it allows you to very quickly sculpt the data that you process in different ways.  It’s obvious to me that time and experience let you develop reusable approaches and code that decrease development time.  But even for a newcomer, the tool is fantastic for converting data into compelling visual stories.  Next on the list of things to do: learn how to make these renderings interactive!

Processing Tweets

There’s a great open-source tool for data visualization that has been around for years called Processing.  It’s dead-simple to set up and start working with, and with some fairly basic programming skills in Java or JavaScript, the results can be immediate and fairly gratifying.

I have an application that I built last year to import and normalize RSS feeds from the likes of Twitter, but I had some trouble rolling my own data visualization framework to render the data efficiently.  Enter Processing… within a day, I had imported my data and was able to parse, display and animate it with some really simple programming.  Once the basics were in place, I added some code to compare date-times, and some rudimentary logic to highlight tweets about “cats” and “superman” (fairly common tweeting subject matter for me), and voilà!  My first animated data visualization!

My Tweets from 2008 to the Present

Here we see the tweets I’ve made from my personal twitter account from 2008 to the present, with the tweets spaced apart in proportion to the time that elapsed between them.  Tweets about cats are in orange, and the tweets about Superman are in red.  As you can clearly see – I tweet a lot more about cats than Superman.  And the display is so hideous as to be completely unpresentable.  However, the main point for me is that with a great tool and some fairly basic programming, we can get a visualization that is far easier to comprehend and is far more descriptive than a mere list of 900 tweets.
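The layout logic here is nothing fancy: proportional spacing along the x-axis plus a keyword check for the highlight colour. A rough plain-Java sketch of the idea (the names and the colour strings are illustrative, not my actual sketch code):

```java
import java.util.Arrays;

public class TweetTimeline {
    // Place each tweet on the x-axis in proportion to its timestamp
    // (epoch seconds), scaled into [0, width] pixels.
    static float[] xPositions(long[] times, float width) {
        long min = Arrays.stream(times).min().getAsLong();
        long max = Arrays.stream(times).max().getAsLong();
        float[] xs = new float[times.length];
        for (int i = 0; i < times.length; i++) {
            xs[i] = width * (times[i] - min) / (float) (max - min);
        }
        return xs;
    }

    // Rudimentary keyword highlighting: cats in orange, Superman in
    // red, everything else in a neutral grey.
    static String colourFor(String tweet) {
        String t = tweet.toLowerCase();
        if (t.contains("cat")) return "orange";
        if (t.contains("superman")) return "red";
        return "grey";
    }

    public static void main(String[] args) {
        float[] xs = xPositions(new long[]{0, 50, 200}, 800);
        System.out.println(Arrays.toString(xs)); // [0.0, 200.0, 800.0]
        System.out.println(colourFor("my cat is great"));
    }
}
```

In the actual sketch, each x position becomes a mark on the canvas and the colour string becomes a fill, but the comprehension win comes entirely from this simple proportional spacing.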

For bonus fun-good-times, Processing also exports to MOV so that you can watch the rendering play back as an animation!  Like I’ve said before, I’m going to love working with this tool!

Timeline of Tweets (.mov)


Convergent Evolution

As I was in my car this morning, I heard a story on the radio about an interesting population genetics study of the Melanesian people of the Solomon Islands in the South Pacific. It turns out that some 10-15% of the children of this predominantly dark-skinned, dark-haired group have the bright blonde hair traditionally associated with people of northern European descent. Careful analysis has concluded that the gene that produces blonde hair in the Melanesians is distinct from any of the genes behind blonde hair in Europe.  The two traits appear very nearly the same, yet they are genetically unrelated and arose independently of one another. The term for that in genetics is “convergent evolution,” and while it is uncommon and often unexpected, it is a real thing.

Later that same afternoon, I was confronted with another case of convergent evolution, but on a much more personal level. As I explore the dataviz field (as the embodiment of the term “neophyte”), I realize that so many of the great epiphanies about data that I’ve experienced in my professional and academic contexts have already been expressed incredibly well by many of the giants of the field.  On the one hand, as I learn of expert after expert who already shines so brightly in this field and listen to them expound the same concepts that I (in my very hand-wavey, inexperienced, undisciplined fashion) try to communicate to my friends, I feel like a charlatan.  But on the other hand, convergent evolution explains how two equivalent and equally useful ideas can arise independently, and I suddenly feel slightly less utterly fraudulent.

Jer Thorp

I spent a good deal of time this afternoon immersing myself in the work of one Jer Thorp, Data Artist in Residence at the New York Times (and fellow Canadian), one of the visionaries who has already established himself as a groundbreaker in almost all of the exact same fields of data visualization that I have been aching to carve out.  A key tenet of his is the need for better tools to make the ever-inflating cloud of data growing all around us sensible and useful, and he has done much to serve that need.  He expresses the urgency of this exploration so well in his TED Talk in Vancouver in November 2011 (click here to watch it on YouTube).

My OpenPaths Map View

Jer has been involved in some pretty exciting projects. For example, OpenPaths is a project that allows users to exploit the historical locational data stored on their iPhones or Android devices to visualize where they (or at least their phones) were and when they were there. Said like that, it sounds fairly “Big Brother-ish.” However, if you think about how valuable that information could be to YOU, and not just to an application developer or Google or Facebook or Apple, the outcome seems less sinister and more personally useful. If YOU could better visualize that data then you could relate it back to the personal narrative of your own life, thereby enriching your ability to recall and express those moments in the future. I can’t put it any better than Jer does in his Talk:

“What we didn’t expect was how moving this experience would be. When I uploaded my data, I thought ‘big deal, I know where I live, I know where I work… what am I gonna see here?’

Well, it turns out, what I saw was that moment when I got off the plane to start my new life in New York… that restaurant where I had Thai food that first night thinking about this new experience of being in New York. The day that I met my girlfriend… right?” (@13:50 – 14:12)

The convergent aspect for me is that this is the exact same reason why I still use Foursquare, Google Latitude and Facebook Places to check in to places: so that I can bookmark moments of my life for MY OWN personal consumption and use.  In 2011, I moved from one side of the country to the other… and then back again, so I have a very sincere interest in being better able to use any available data about that journey to help me chronicle that part of my life’s history.  I understand that this opens me up to a couple of very undesirable potential scenarios – I like to think of these as the “Enemy of the State/Eagle Eye” scenarios.  But those scenarios seem far too unlikely to outweigh the value of this rich source of passively collected data.  Jer’s prototype proves to me that other people feel the same way about their personal data, which is very encouraging.

Avengers Assembled

What else has Jer done that is awesome? Well, shamefully, the whole reason I found Jer in the first place was that I am supposed to go see The Avengers tonight, and FlowingData had a striking visualization of the first appearance of each Avenger (apparently, over 120 Avengers have been introduced across the 570-some issues). This one post blew my mind, not just for the clever and beautiful visualizations, but also because it exposed me to a great open data source for comics that seems pretty sophisticated and complete! As if all that goodness weren’t enough, Jer has shared some of his visualization tools and prototypes with the community. Incredible guy!

Theories and ideas about dataviz are much easier to explore, and have a far shorter life span, than the Melanesians of the South Pacific, so I don’t feel as fraudulent as I did earlier this afternoon about convergently evolving so many of Jer’s conclusions.  On the contrary, I’m actually relieved and a little self-contented to have landed so squarely on the right path, facing the right direction to pursue my interests, and to have been pushed quite a number of leagues ahead of where I could get on my own.  It’s almost like I’ve made a quantum leap in my dataviz evolution… but then, that would be more like mutation and the X-Men, and tonight, it’s all about the Avengers! 😉