Search This Blog

Thursday, 22 September 2016

Really proud of my students -- machine learning on corporate websites

Amanda wanted to know what factors influence certain web sites that people visit. But her exploratory analysis found out the secrets of big software companies.

She used a sitemap crawler to look at the documentation of the usual names: Microsoft, Symantec, Dropbox, Intuit, Atlassian, CA, Trello, Github, Adobe, Autodesk, Oracle and some others. She looked at the number of words on each page, the number of links and various other measures and then clustered the results.

Like the best of all data science projects, the results are obvious, but only in retrospect.

Microsoft, Symantec and Dropbox are all companies whose primary focus is on serving non-technical end-users who aren’t particularly interested in IT or computers. They clustered into a group with similar kinds of documentation.

CA, Trello and Github primarily focus on technical end-users: programmers, sysadmins, software project managers. Their documentation cluster together in similarity. Intuit and Atlassian were similar; Adobe and Oracle clustered together.

Really interestingly, it’s possible to derive measures of the structural complexity of the company. Microsoft is a large organisation with silos inside silos. It can take an enormous number of clicks to get from the default documentation landing page until you get to a “typical” target page. Atlassian prides itself on its tight teamwork and its ability to bring people together from all parts of the organisation. They had the shortest path of any documentation site.

But this wasn’t what Amanda was really after: she wanted to know whether she could predict engagement: whether people would read a page on a documentation site or just skim over it. She took each website that she could get data for and deduced how long it would take to read the page (based on the number of words), how long people were actually spending on the page, the number of in-links and a few other useful categories (e.g. what sort of document it was). She created a decision tree model and was able to explain 86% of the variance in engagement.

Interesting result: there was little relationship between the number of hyperlinks that linked to a site and how much traffic it received. Since the number of links strongly influence a site’s pagerank in Google’s search algorithms, this is deeply surprising.

There was more to her project (some of which can’t be shared because it is company confidential), but just taking what I’ve described above, there are numerous useful applications:
  • Do you need to analyse your competition’s internal organisational structure? Or see how much has changed in your organisation in the months after an internal reorg?
  • Is your company’s website odd compared to other websites in your industry?
  • We can use Google Analytics and see what pages people spend time on, and which links they click on, but do you want to know why they are doing that? You know the search terms your visitors used, but what is it that they are interested in finding?