Processing and XML – A Beginner’s Tale

As I spend more time with Processing, I become more and more impressed with its ease of use and its adaptability.  I set myself a new homework assignment: take some XML that represents a year of project data for a group and turn it into a meaningful visualization.  Since I had a year’s worth of data, I thought it would be valuable to represent how much effort each project consumed over the course of the year.

I could have made a simple pie chart that represented each project’s contribution to the total effort of the year, but that’s pretty boring.  It would give a sense of the percentage of effort for the year, but it would not have had any real informational density.  I’m not sure whether “informational density” is a real thing or not, but I intuitively feel that the difference between a data visualization and a simple chart is that a visualization conveys more than one or two dimensions of information.  A good visualization should incorporate as many axes of interpretation as possible and should represent them with as little cognitive effort as possible.  A really good visualization should illuminate traits that would otherwise be hidden (or at least harder to observe), and foster insights and conclusions that would not be available any other way.

So I set out on my project.  As I am still relatively new to the tool, I wanted to get a feel for it by using as few plug-ins and helpers as possible, preferring to build whatever routines I needed to get my visual across from scratch.  As one might expect, I made a lot of initial mistakes!  I started by creating a series of arrays and loading each column of data into its own array so that I could use each array individually as needed.  While this is one possible approach, it quickly became unwieldy and confusing as I tried to keep straight which array represented what data.  And of course, the design was brittle, because a change to the schema would mean massive changes to the program.

I got smart after a few sessions and created a class to represent the units of data, then loaded everything into a single array of typed objects.  From here, I was able to pull out the data I needed when I needed it.  It also meant that instead of using many large arrays of data, I could have one source array and create sub-sets of the main array that could be iterated more quickly and efficiently!  This is definitely a good lesson learned!  I added comparators to my class so that I could sort the arrays by date or index, and that too was a huge win for manageability.

Rendering Attempt #1

Once I had the data in a form I could use, I started messing around with the rendering methods.  My first forays were pretty dismal.  I tried relating effort and time sorted by date and sequence and … well, as you can see from the rendering… it was a mess.  There’s a suggestion of association and extension, but no information value at all.  It was just a mash of lines.  So no points for this first effort.
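
For anyone curious what the typed-data approach might look like, here is a minimal sketch in plain Java (which Processing builds on).  It is not the code from my sketch: the class names, fields and sample values are hypothetical stand-ins, and it assumes each record holds a project name, a week number and an effort figure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical record type: one row of effort data for one project in one week.
class EffortEntry {
    final String project;
    final int week;      // week number within the year (assumed field)
    final double hours;  // effort logged that week (assumed field)

    EffortEntry(String project, int week, double hours) {
        this.project = project;
        this.week = week;
        this.hours = hours;
    }

    // Comparators live on the class so callers can pick a sort order by name.
    static final Comparator<EffortEntry> BY_WEEK =
        Comparator.comparingInt((EffortEntry e) -> e.week);
    static final Comparator<EffortEntry> BY_PROJECT =
        Comparator.comparing((EffortEntry e) -> e.project);
}

class EffortData {
    // Builds a sub-set of the main list: only one project's entries, sorted by week.
    static List<EffortEntry> forProject(List<EffortEntry> all, String project) {
        List<EffortEntry> subset = new ArrayList<>();
        for (EffortEntry e : all) {
            if (e.project.equals(project)) {
                subset.add(e);
            }
        }
        subset.sort(EffortEntry.BY_WEEK);
        return subset;
    }

    public static void main(String[] args) {
        // Made-up sample rows, purely for illustration.
        List<EffortEntry> all = Arrays.asList(
            new EffortEntry("A", 2, 10),
            new EffortEntry("B", 1, 4),
            new EffortEntry("A", 1, 6));
        for (EffortEntry e : forProject(all, "A")) {
            System.out.println(e.project + " week " + e.week + ": " + e.hours);
        }
    }
}
```

The appeal of this shape is that a schema change touches one class rather than a handful of parallel arrays.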

Rendering Attempt 2

Second attempt – I tried a different approach to grouping the data and then calculating graph points.  Things worked a little better in that now, data points were no longer floating around meaninglessly, drifting off in senseless directions.  In fact, it started to look a little like a good old western blot like I used to do in the old biochemistry lab.  But once again, I felt like this was pretty meaningless, and I was pretty sure that the marks were not actually representative of what I was intending.  So I tried again.

Rendering Attempt 3

This is around the time that I really started to grasp the value of my typed data array, and I set up a few tricks that helped me to better sort my data when I was ready to render it.  I thought about how I wanted the data to expose itself.  First, I wanted to show the total effort week by week, which would look a little like a simple line chart.  So the week was my first sorting variable.  I then wanted to slice out each project, so that was my second sorting variable.  Rendering that, I was starting to get something that looked vaguely representative of the data… progress was being made!
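
The two-level sort (week first, then project) can be sketched roughly as follows.  Again, this is plain Java under assumed names rather than my actual code: `EffortRow`, `WeeklyTotals` and the sample numbers are all hypothetical, and nested sorted maps are just one way of getting the "week, then project" ordering.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical row: effort for one project in one week.
class EffortRow {
    final int week;
    final String project;
    final double hours;

    EffortRow(int week, String project, double hours) {
        this.week = week;
        this.project = project;
        this.hours = hours;
    }
}

class WeeklyTotals {
    // Groups rows by week, then by project, summing hours.
    // TreeMap keeps both levels in sorted key order.
    static Map<Integer, Map<String, Double>> group(List<EffortRow> rows) {
        Map<Integer, Map<String, Double>> byWeek = new TreeMap<>();
        for (EffortRow r : rows) {
            byWeek.computeIfAbsent(r.week, w -> new TreeMap<>())
                  .merge(r.project, r.hours, Double::sum);
        }
        return byWeek;
    }

    // Total effort for one week: the height of the simple line chart at that week.
    static double weekTotal(Map<String, Double> byProject) {
        double sum = 0;
        for (double h : byProject.values()) {
            sum += h;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up sample rows, purely for illustration.
        List<EffortRow> rows = Arrays.asList(
            new EffortRow(1, "A", 5), new EffortRow(1, "B", 3),
            new EffortRow(1, "A", 2), new EffortRow(2, "A", 4));
        for (Map.Entry<Integer, Map<String, Double>> week : group(rows).entrySet()) {
            System.out.println("week " + week.getKey() + " total "
                + weekTotal(week.getValue()) + " " + week.getValue());
        }
    }
}
```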

Rendering Attempt 4

I was getting where I wanted to go, but what I really wanted to do was to show the persistence of each project across each weekly period – to give a volume to each project that represented the continuity of effort involved. This meant that I would have to draw areas across the X-axis, rather than plotting individual points.  This is where the value of the sub-sets of arrays really paid dividends.  Each of my sorting variables allowed me to treat each variable in the proper sequence… first the big array of projects per week, then cutting that set up into individual projects… doing everything in order meant that project A in week 1 could be related easily to project A in weeks 2 or 3 and so on, and I could render areas instead of lines. As you can see, it was working… sort of.
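
One common way to turn per-week values into an area is to stack each project on top of the ones before it and walk the outline of its band: along the top edge from left to right, then back along the bottom edge from right to left.  The sketch below shows that idea in plain Java.  It is an illustration under assumptions, not my actual drawing code: the `hours[p][w]` layout and the `StackedBand` name are hypothetical, and x is simply the week index with no scaling to screen coordinates.

```java
class StackedBand {
    // hours[p][w] = effort for project p in week w (hypothetical layout).
    // Returns {x, y} vertices outlining project p's band:
    // top edge left-to-right, then bottom edge right-to-left.
    static double[][] outline(double[][] hours, int p) {
        int weeks = hours[p].length;
        double[][] pts = new double[2 * weeks][2];
        for (int w = 0; w < weeks; w++) {
            // Bottom of this band = sum of all projects stacked beneath it.
            double lower = 0;
            for (int q = 0; q < p; q++) {
                lower += hours[q][w];
            }
            double upper = lower + hours[p][w];

            pts[w][0] = w;                  // top edge, moving right
            pts[w][1] = upper;
            pts[2 * weeks - 1 - w][0] = w;  // bottom edge, filled in reverse
            pts[2 * weeks - 1 - w][1] = lower;
        }
        return pts;
    }

    public static void main(String[] args) {
        // Made-up values: two projects over two weeks.
        double[][] hours = { {1, 2}, {3, 4} };
        for (double[] pt : outline(hours, 1)) {
            System.out.println(pt[0] + ", " + pt[1]);
        }
    }
}
```

In a Processing sketch, vertices like these could be passed to `vertex()` between `beginShape()` and `endShape(CLOSE)` once they have been scaled to the canvas.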

Rendering Attempt 5

Out of frustration, I put the project aside for a week and went out to enjoy running, playing with my cats, and visiting with friends.  Too much time in front of a computer is not often the correct solution to breaking through mental problems!  When I returned, I quickly identified why my objects were not closing properly and looked like they were being drawn by a meth-head from the TRON universe.

It’s funny sometimes how very minor tweaks can make the difference between total catastrophe and successful joy!

Rendering Attempt 6

For the sake of saving processing cycles, I had been working with a subset of the total data available – about a quarter.  Now I was feeling confident that I was close to where I wanted to be, and I loaded the whole set.  I tweaked some colours and changed the scales and held onto my butt…  Success.  I had got the effect that I wanted… a visualization that showed me the contribution of each project to the total workforce capacity, week by week over the whole year.

Final Rendering

What the visualization gives me (over a simple pie chart) is a sense of just how busy the entire group was, week over week, on project work.  It shows when times were busy and when times were slow.  It shows which projects had priority week by week, and it shows when they started and ended… I had achieved informational density!  After some data cleansing and labelling, I had achieved my goal.

Like I said at the beginning, I think that the value of Processing is that it’s such a great tool for experimenting with visualizing data.  It’s quick and easy to pick up if you have any experience with Java or JavaScript (object-oriented programming techniques clearly are a benefit) and it allows you to very quickly sculpt the data that you process in different ways.  It’s obvious to me that with time and experience you develop ways of approaching a problem, and code to solve it, that can be reused to decrease development time.  But even for a newcomer, the tool is fantastic for converting data into compelling visual stories.  Next on the list of things to do is to learn how to make these renderings interactive!

Processing Tweets

There’s a great open source tool for data visualization called Processing that has been around for years (available at http://www.processing.org).  It’s dead simple to set up and start working with, and with some fairly basic programming skills in Java or JavaScript, the results can be immediate and fairly gratifying.

I have an application that I built last year to import and normalize RSS feeds from the likes of Twitter and last.fm, but I had some trouble rolling my own data visualization framework to render the data efficiently.  Enter Processing… within a day, I had imported my data and was able to parse, display and animate it with some really simple programming.  Once the basics were in place, I added some code to compare date-times, and some rudimentary logic to highlight tweets about “cat”s and “superman” (fairly common tweeting subject matter for me), and voilà!  My first animated data visualization!

My Tweets from 2008 to the Present

Here we see the tweets I’ve made from my personal twitter account from 2008 to the present, with the tweets spaced apart in proportion to the time that elapsed between them.  Tweets about cats are in orange, and the tweets about Superman are in red.  As you can clearly see – I tweet a lot more about cats than Superman.  And the display is so hideous as to be completely unpresentable.  However, the main point for me is that with a great tool and some fairly basic programming, we can get a visualization that is far easier to comprehend and is far more descriptive than a mere list of 900 tweets.
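
The spacing and colouring logic described above boils down to two small pieces, sketched here in plain Java.  This is a reconstruction of the idea rather than my original code: the `TweetLayout` name, the use of epoch seconds for timestamps, the plain substring matching and the grey default colour are all assumptions.

```java
import java.util.Locale;

class TweetLayout {
    // Maps sorted timestamps (assumed: epoch seconds) onto 0..width so that the gap
    // between neighbouring tweets is proportional to the time elapsed between them.
    static double[] positions(long[] times, double width) {
        double[] x = new double[times.length];
        if (times.length == 0) {
            return x;
        }
        long first = times[0];
        long span = times[times.length - 1] - first;
        for (int i = 0; i < times.length; i++) {
            // Guard against a zero span (a single tweet, or identical timestamps).
            x[i] = (span == 0) ? 0 : (times[i] - first) * width / span;
        }
        return x;
    }

    // Rudimentary keyword highlighting. Plain substring matching, so "category"
    // would also count as a cat tweet.
    static String colourFor(String text) {
        String t = text.toLowerCase(Locale.ROOT);
        if (t.contains("superman")) {
            return "red";
        }
        if (t.contains("cat")) {
            return "orange";
        }
        return "grey"; // assumed default for everything else
    }

    public static void main(String[] args) {
        // Made-up timestamps and text, purely for illustration.
        long[] times = {0, 10, 40};
        for (double x : positions(times, 400)) {
            System.out.println(x);
        }
        System.out.println(colourFor("My CAT is asleep on the keyboard"));
    }
}
```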

For bonus fun-good-times, Processing also outputs to MOV so that you can see the rendering in real-time!  Like I’ve said before, I’m going to love working with this tool!

Timeline of Tweets (.mov)