Thursday, 22 September 2016

Really proud of my students -- machine learning on corporate websites

Amanda wanted to know what influences which web sites people visit. But her exploratory analysis ended up uncovering the secrets of big software companies.

She used a sitemap crawler to look at the documentation of the usual names: Microsoft, Symantec, Dropbox, Intuit, Atlassian, CA, Trello, Github, Adobe, Autodesk, Oracle and some others. She looked at the number of words on each page, the number of links and various other measures and then clustered the results.

Like the best of all data science projects, the results are obvious, but only in retrospect.

Microsoft, Symantec and Dropbox are all companies whose primary focus is on serving non-technical end-users who aren’t particularly interested in IT or computers. They clustered into a group with similar kinds of documentation.

CA, Trello and Github primarily focus on technical end-users: programmers, sysadmins, software project managers. Their documentation clustered together. Intuit and Atlassian were similar; Adobe and Oracle clustered together.
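
To give a feel for what "clustered the results" involves, here is a minimal sketch. I'm assuming k-means here (I haven't said which algorithm Amanda used), and the site names and feature values are made up purely for illustration.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means. The first k points seed the centroids so runs are repeatable."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Hypothetical per-site features: (mean words per page, mean links per page).
sites = {
    "site_a": (420.0, 12.0),
    "site_b": (450.0, 15.0),
    "site_c": (1500.0, 60.0),
    "site_d": (1450.0, 55.0),
}
names = list(sites)
labels = kmeans([sites[name] for name in names], k=2)
clusters = dict(zip(names, labels))
```

With real measurements you would want to scale the features first, otherwise whichever measure has the biggest numbers (here, word count) dominates the distance.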

Really interestingly, it’s possible to derive measures of the structural complexity of the company. Microsoft is a large organisation with silos inside silos. It can take an enormous number of clicks to get from the default documentation landing page until you get to a “typical” target page. Atlassian prides itself on its tight teamwork and its ability to bring people together from all parts of the organisation. They had the shortest path of any documentation site.
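
The "number of clicks" measure is just a shortest path through the link graph. A rough sketch, assuming the crawl has already been turned into a dictionary of page -> linked pages (the URLs below are invented):

```python
from collections import deque

def click_depths(links, start):
    """Breadth-first search: fewest clicks from `start` to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: page -> pages it links to.
links = {
    "/docs": ["/docs/products", "/docs/search"],
    "/docs/products": ["/docs/products/backup"],
    "/docs/products/backup": ["/docs/products/backup/install"],
}
depths = click_depths(links, "/docs")
mean_depth = sum(depths.values()) / len(depths)
```

Comparing the mean (or median) depth between sites gives one number per company for how deeply buried a typical page is.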

But this wasn’t what Amanda was really after: she wanted to know whether she could predict engagement: whether people would read a page on a documentation site or just skim over it. She took each website that she could get data for and deduced how long it would take to read the page (based on the number of words), how long people were actually spending on the page, the number of in-links and a few other useful categories (e.g. what sort of document it was). She created a decision tree model and was able to explain 86% of the variance in engagement.
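
Two of the calculations in that paragraph are easy to sketch. The reading speed of 200 words per minute is my assumption, not Amanda's figure, and the function names are only illustrative. "Explaining 86% of the variance" refers to the R² measure computed at the bottom.

```python
def expected_reading_seconds(word_count, words_per_minute=200):
    """Time needed to read the page properly, at an assumed reading speed."""
    return word_count / words_per_minute * 60

def engagement_ratio(word_count, seconds_on_page, words_per_minute=200):
    """Roughly 1.0 means the page was read; much lower means it was skimmed."""
    expected = expected_reading_seconds(word_count, words_per_minute)
    return seconds_on_page / expected if expected else 0.0

def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions account for."""
    mean = sum(actual) / len(actual)
    total = sum((a - mean) ** 2 for a in actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - residual / total
```

A model that only ever predicts the average scores 0, and a perfect one scores 1, so 0.86 is a long way towards perfect.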

Interesting result: there was little relationship between the number of hyperlinks that linked to a site and how much traffic it received. Since the number of links strongly influences a site’s PageRank in Google’s search algorithms, this is deeply surprising.

There was more to her project (some of which can’t be shared because it is company confidential), but just taking what I’ve described above, there are numerous useful applications:
  • Do you need to analyse your competition’s internal organisational structure? Or see how much has changed in your organisation in the months after an internal reorg?
  • Is your company’s website odd compared to other websites in your industry?
  • We can use Google Analytics and see what pages people spend time on, and which links they click on, but do you want to know why they are doing that? You know the search terms your visitors used, but what is it that they are interested in finding?

Wednesday, 14 September 2016

Really proud of my students -- AI analysis of reviews

Sam wants to know what movies are worth watching, so he analysed 25,000 movie reviews. This is a tough natural language processing problem, because each movie only has a small number of reviews (fewer than 30). It’s nowhere near enough for a deep learning approach to work, so he had to identify and synthesise features himself.

He used BeautifulSoup to pull out some of the HTML structure from the reviews, and then made extensive use of the Python NLTK library.

The bag-of-words model (ignoring grammar, structure or position) worked reasonably well. A naive Bayesian model performed quite well -- as would be expected -- as did a decision tree model, but there was enough noise that a logistic regression won out, getting the review sentiment right 85% of the time. He evaluated all of his models with F1, AUC and precision-recall. He used this to tweak the model a little and just nudge it a little higher.

A logistic regression over a bag-of-words essentially means that we are assigning a score to each word in the English language (which might be a positive number, a negative number or even zero), and then adding up the scores for each word when it appears. If overall it adds up to a positive number, we count the review as positive; if negative, the reviewer didn’t like the movie.
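
In code, the scoring step looks something like this. The word scores are numbers I made up for illustration, not values from Sam's model, and the tokeniser is deliberately crude.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into words, ignoring order and grammar."""
    return re.findall(r"[a-z']+", text.lower())

def score_review(text, word_scores, bias=0.0):
    """Add up the score of every word occurrence; unknown words count as zero."""
    return bias + sum(word_scores.get(word, 0.0) for word in tokenize(text))

def is_positive(text, word_scores, bias=0.0):
    return score_review(text, word_scores, bias) > 0

# Made-up scores for a handful of words.
word_scores = {"excellent": 1.5, "superb": 1.25, "worst": -2.0, "waste": -1.5}

good = score_review("An excellent cast and a superb script", word_scores)
bad = score_review("The worst film, a waste of an excellent cast", word_scores)
```

A real logistic regression also has a constant term (the `bias` here) and pushes the total through a sigmoid to get a probability, but a total above zero is the same thing as a probability above 50%.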

He used the Python scikit-learn library (as do most of my students) to calculate the optimal score to assign to each English language word. Since the vocabulary he was working with was around 75,000 words (he didn’t do any stemming or synonym-based simplification) this ran for around 2 days on his laptop before coming up with an answer.
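
Sam let scikit-learn do the fitting. Purely to show what "calculating the optimal score" means, here is a hand-rolled stand-in that nudges each word's score by gradient descent. It is not his code, and the four training reviews are invented.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word counts for one review, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def train(reviews, labels, epochs=200, learning_rate=0.1):
    """Fit one weight per word by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    bags = [bag_of_words(review) for review in reviews]
    for _ in range(epochs):
        for bag, label in zip(bags, labels):
            z = bias + sum(weights.get(word, 0.0) * n for word, n in bag.items())
            predicted = 1.0 / (1.0 + math.exp(-z))  # probability the review is positive
            error = label - predicted               # push towards the true label
            bias += learning_rate * error
            for word, n in bag.items():
                weights[word] = weights.get(word, 0.0) + learning_rate * error * n
    return weights, bias

# Invented training data: 1 = liked the movie, 0 = didn't.
reviews = [
    "excellent and refreshing",
    "superb acting, excellent plot",
    "the worst plot",
    "a disappointing waste of time",
]
labels = [1, 1, 0, 0]
weights, bias = train(reviews, labels)
```

Words that only ever show up in positive reviews drift upwards, words that only show up in negative ones drift downwards, and a word like "plot" that appears in both stays close to zero.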

Interestingly, the word “good” is useless as a predictor of whether a movie was good or not! It probably needs more investigation, but perhaps a smarter word grouping that picked up “not good” would help. Or maybe it fails to predict much because of reviews that say things like “while the acting was good, the plot was terrible”.

Sam found plenty of other words that weren’t very good predictors: movie, film, like, just and really. So he turned these into stopwords.
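
Turning words into stopwords just means dropping them before anything gets counted. A minimal sketch using the five words above (the tokeniser is a simplification of mine):

```python
import re

# The uninformative words listed above.
STOPWORDS = {"movie", "film", "like", "just", "really"}

def content_words(text, stopwords=STOPWORDS):
    """Tokenise a review and drop the stopwords before counting anything."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
```

For example, `content_words("I really like this movie, just superb")` leaves `["i", "this", "superb"]`.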

There are other natural language processing techniques that often produce good results, like simply measuring the length of the review, or measuring the lexical dispersion (the richness of vocabulary used). However, these were also ineffective.
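
Both of those features are one-liners. For the richness of vocabulary I'm assuming the simplest possible measure here, the ratio of distinct words to total words, which may not be exactly what Sam computed.

```python
import re

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def review_length(text):
    """Number of words in the review."""
    return len(words(text))

def lexical_diversity(text):
    """Distinct words divided by total words: 1.0 means no word is repeated."""
    tokens = words(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```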

What Sam found was a selection of words that, if they are present in a review, indicate that the movie was good. These are “excellent”, “perfect”, “superb”, “funniest” and interestingly: “refreshing”. And conversely, give a movie a miss if people are talking about “worst”, “waste”, “disappointment” and “disappointing”.

What else could this kind of analysis be applied to?
  • Do you want to know whether customers will return happier, or go elsewhere looking for something better? This is the kind of analysis that you can apply to your communications from customers (email, phone conversations, twitter comments) if you have sales information in your database.
  • Do you want to know what aspects of your products your customers value? If you can get them to write reviews of your products, you can do this kind of natural language processing on them and you will see what your customers talk about when they like your products.

Friday, 9 September 2016

Really proud of my students -- data science for cats

Ngaire had the opportunity to put a cat picture into a data science project legitimately, which could be a worthy blog post in itself. She did an analysis on how well animal shelters (e.g. the council pound) are able to place animals.

She had two data sources. The first was Holroyd council’s annual report (showing that the pound there euthanised 59% of animals they received), but of course any animal that had been microchipped would have been otherwise handled, so the overall percentage is much lower than this in reality. Still, Australia is far behind the USA (based on her second data source, which was a Kaggle-supplied dataset of ~25,000 outcomes).

She put together a decision tree regressor which correctly predicted what would happen to an animal in a shelter around 80% of the time. She also had a logistic regression model with a similar success rate.

The key factors which determined the fate of a lost kitten or puppy were a little surprising. Desexed animals do better in the USA. The very young (a few weeks) do have very good outcomes -- they are very likely to be adopted or transferred. Black fur was a bad sign, although the messiness of the data meant that she couldn’t explore colours all that much further: at a guess, she suggested that being hard to photograph is a problem, so perhaps there is a cut-off level of darkness where there will be a sudden drop in survival rates.

Where else could this kind of analysis be applied? She could take her model and apply it when you want to know an outcome in advance. Questions might include:
  • How will a trial subscription work out for a potential customer?
  • What is going to happen to our stock? Will the goods be sold, used, sent to another store…?

Ngaire is open to job opportunities, particularly if you are looking for a data scientist with a very broad range of other career experience in the arts; I can put you in touch if you are interested.

Sunday, 28 August 2016

Really proud of my students -- final projects

Over the next few weeks I'm going to do some short blog posts about each of the final projects my students did in their data science course.

One of the reasons this blog has been a bit quieter than usual these last few months is that I was teaching a Data Science class at General Assembly, which was rewarding but rather exhausting.

Some observations:
  • GA is busy and dynamic. I remember back in the late 1990s at HP when every company was deploying SAP on HP-UX to avoid Y2K problems: there were classes constantly; you might discover that the class you were teaching was going to be held in the boardroom using some workstations borrowed from another city. GA was like that: every room packed from early morning until late at night.
  • No-one in the class had a job as a data scientist at the beginning of the course, but there was a lot of movement within 10 weeks: job changes, promotions, new career directions. The only other time in my teaching career when I saw the same wow-this-person-is-trained-now-let's-poach-them was in the early days of the Peregrine -> Service Manager transition.
  • The course is mainly about machine learning but there is flexibility for the instructor to add in a few other relevant topics based on what the students want. Right now, Natural Language Processing is white-hot. Several students did some serious NLP / NLU projects. The opportunities for people who have skills in this area are very, very good.
  • Computer vision is an area where there is a lot of interest as well.
I'll be teaching the first part of the Data Science immersive (a full-time course instead of a night-time part-time one) starting in September; please sign up with GA if you are interested.

I suspect by the time I've finished blogging about my past students' projects that there will be a new round of student projects to cover, so this might become a bit more of a feature on my blog.

Tuesday, 23 August 2016

Automate the installation of a Windows DataProtector client

A client today wanted to push the DataProtector agent from SCCM / System Center 2012 instead of from Data Protector. It's not that difficult, but I couldn't find the command-line setup documented anywhere.

You will need to run (as an administrator):

  net use r: \\installserver\Omniback
  cd /d r:\x8664
  msiexec /i "Data Protector A.09.00.msi" /passive INSTALLATIONTYPE=Client ADDLOCAL=core,da,autodr
  net use /delete r:

Obviously, substitute installserver with the name of your install server, and if R: is already allocated, use something else instead.

Then, trigger the following command on your cell manager:

 omnicc -import_host clientname

Replace clientname with the name of the client.

Script this as appropriate (e.g. after the operating system has booted) in order to have an unattended installation.
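
If you would rather drive this from Python than from a batch file, a rough sketch follows. The function names are mine, it calls the MSI by its full path instead of changing directory first, and it only uses the commands and options shown above. By default it prints the commands without running them.

```python
import subprocess

MSI_NAME = "Data Protector A.09.00.msi"

def build_client_install_commands(install_server, drive="r:"):
    """Commands to run on the new client, as an administrator."""
    share = "\\\\" + install_server + "\\Omniback"
    msi_path = "\\".join([drive, "x8664", MSI_NAME])
    return [
        ["net", "use", drive, share],
        ["msiexec", "/i", msi_path, "/passive",
         "INSTALLATIONTYPE=Client", "ADDLOCAL=core,da,autodr"],
        ["net", "use", drive, "/delete"],
    ]

def build_import_command(client_name):
    """Command to run on the cell manager once the client is installed."""
    return ["omnicc", "-import_host", client_name]

def run_commands(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Call `run_commands(build_client_install_commands("installserver"), dry_run=False)` on the client, then run the import command on the cell manager.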

Friday, 5 August 2016

Data Protector CRS operation cannot be performed in full-screen mode

Today's head-scratcher: after upgrading to 9.07 on a Windows cell manager, the CRS service won't start.

Eventvwr says something even weirder:

The Data Protector CRS service terminated with service-specific error The requested operation cannot be performed in full-screen mode.. 

I was in full screen mode at the time, but it still wouldn't start even when I minimised my RDP session. For my own sanity, I was glad of this.

Trawling through Daniel Braun's blog, I saw some comments there that it could be related to anti-virus software. Nope, not that either.

The debug.log said something a little bit more believable:

[SmCreateTable] MapViewOfFile(size:17505216) failed, error=[5] Access is denied.

I discovered that I could reliably get that message added every time I tried to start the CRS. But what is actually being denied?

So I ran omnisv start -debug 1-500 crm-vexatious.txt

I then had a 160KB file created in C:\programdata\omniback\tmp that began with OB2DBG, ended with crm-vexatious.txt and had CRM in the filename. Good: at least it gets far enough that it can create debug messages.
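
If you end up doing this more than once, finding the file and reading the end of it can be scripted. A small sketch: the glob pattern is only my reading of the filename described above, and the helper names are my own.

```python
from pathlib import Path

def find_debug_logs(tmp_dir, suffix="crm-vexatious.txt"):
    """Debug files that start with OB2DBG, mention CRM and end with the suffix."""
    return sorted(Path(tmp_dir).glob(f"OB2DBG*CRM*{suffix}"))

def tail(path, lines=20):
    """The last few lines of a debug file, which is where the failure shows up."""
    text = Path(path).read_text(errors="replace")
    return text.splitlines()[-lines:]
```

Point `find_debug_logs` at `C:\programdata\omniback\tmp` and print `tail()` of whatever it returns.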

Scrolling right to the bottom of it, there it was:

Code is:1007  SystemErr: [5] Access is denied
************************   DEFAULT ERROR REPORT   ***************
[Critical] From [email protected] "" Time: 5/8/2016 1:00:33PM
Unable to allocate shared memory: Unknown internal error.

Internally, the function to return a shared memory segment presumably encodes something as 1007; CRS then exits with that code (which is the standard Windows error code for "can't be performed in full-screen mode").

There aren't many reasons for a shared memory allocation to fail. In fact, the only one I can think of that could be relevant here is if the segment already exists. I thought about figuring out what the equivalent to ipcrm is on Windows, gave up and rebooted the box.

And it came up perfectly. Funnily enough, if I had had no idea what I was doing, I would have just bounced the box to see if it would have fixed it, and saved myself a headache and some stress wondering what was going on. Ignorance would have been bliss.

Saturday, 9 July 2016

[Politics] The Rise of the Technologist Parties

What’s the most important resource? What is it, that if you control it, gives you power?
Here are four of the most common answers you will hear:
  • The most important resource is land. Without land we have no food (or anything else for that matter).
  • The most important resource is the labour of massed workers. Without anyone to do the work, nothing will get done and we will have nothing.
  • The most important resource is the environment. Without air to breathe or water to drink, there is no economy.
  • The most important resource is the capital that dictates what gets done. Money is power: we should try to remove aberrations that send capital into unnecessary and pointless directions.
Most people can align with one of these viewpoints. In fact, in Australia these viewpoints are so strongly held that we even have political parties to represent those who hold those views (in order: the Nationals, Labor, the Greens, the Liberals).
As far as I can tell, in the USA the middle two align with the Democrats and the outer two with the Republican party. In some states in Australia, a similar merge has happened with the Nationals and Liberal party merging.
We look at these answers as if they have been around forever and that there can be no other significant factor, ignoring the fact that “the labour of workers” as a significant asset was a rarely-expressed thought prior to 1850, nor was there much coherency to the green movement before “Silent Spring” in the 1960s.
But something has just changed. We’re seeing it first in Australia because of our preferential voting and large numbers of micro-parties. In this week’s elections, the vote for “other” parties grew. About 1 in every 4 Australians did not vote for any of the major parties, but instead voted for one of about 50 “other” parties.
There’s a good chance that “other” parties will end up holding the balance of power in the lower house and even with desperate changes to the voting rules enacted by the previous parliament, there are likely to be numerous “other” parties in the upper house.
And I think it’s an acknowledgement that there are other answers to my first question.
Let me add three answers, which statistically (according to the election results) must be viewpoints held by at least 250,000 adults in Australia:
  • The most important resource in a society is the quality and depth of the religious faith of the members of that society.
  • There is no important resource that is worth getting worked up about. Let’s all have sex and smoke dope.
  • The most important input in the 21st century is the accessible and useable corpus of science, technology and engineering.
The religious faith answer is interesting in itself and maybe one day I’ll write an article on it. There are lots of different threads to that one.
The Sex Party and HEMP party alliance together polled nearly as well as the Christian parties. If we add in the Drug Reform party, they were well ahead. I’m not sure what to make of this. Does this show that we are a very mature country, well up on Maslow’s hierarchy of needs, or does it show the opposite?
But for now, let’s look at what happened with science, technology and engineering.
Have a look through the candidates for the Science party (which surely is the party closest to this last answer), and you will find a list of bright folks: a PhD in biochemistry here; a technology startup founder there; a professional scientist.
It’s a funny co-incidence, but those are all job titles of future equity lords. If you are a wealthy founder of a high-tech company, you probably at one point had a job title like one of those.
Let’s rewind. If you wanted to get wealthy twenty years ago, you went into finance, did some deals, took a cut and everyone came out smiling because there was always margin to be made. Play things right and you could make a few million just getting the right people and the right money together. You really only needed some capital.
If you wanted to get wealthy fifty years ago, you would have started a factory, employed lots of workers to churn out goods and made a profitable living on the marginal value that each worker could produce. You needed a supply of trainable workers and a bit of capital to get going.
If you wanted to get wealthy one hundred years ago, you needed to own land. The more of it you had, the more you could grow on that land. Come harvest time you would employ as much temporary labour as you could acquire and sell the goods produced. You needed land, a supply of semi-trained workers and a lot of capital.
Today, if you want to become wealthy, you need a skillset that lets you automate something so that you can leverage your own brainpower to do the same work as a hundred people without that skill set. Here are some examples:
  • A biotech startup that works out how to get bacteria to synthesise some useful industrial chemical.
  • The machine learning / artificial intelligence startup that works out how to automate a white collar (or blue collar) job.
  • The medtech startup that has some new process for treating or identifying a disease.
  • The software company that creates a viral product that everybody wants.
You need very modest amounts of capital (the most expensive of these would probably be the medtech startup which would probably need to raise $5m). Since any of these occupation titles (computer scientist, biotech developer, medical device engineer) can generate very valuable intellectual property in a very short time, there’s a good chance that you would maintain a significant equity stake in your business after all the capital raising has been done and the company that you form goes on to become worth tens or hundreds of millions of dollars. There’s a name for the people whose lives have these trajectories: “equity lords”.
Emphatically, to get there, equity lords don’t need:
  • A large workforce of unskilled or semi-skilled labour. Unlike manufacturing, doubling revenue does not require doubling the workforce. So right-wing parties trying to rail on the side of big business against labour unions are unlikely to be saying anything of importance. Let’s just put this into perspective: I overheard a salary negotiation for a potential new employee the other day. The employer had offered $140-$150k plus equity. The candidate replied that this was way too low, doubled the equity component and asked for $20k extra. The employer happily agreed saying “well if it’s only an extra $2,000 per month…” This company has fewer than 5 employees, but would have wages approaching $1m / year.
  • Hundreds of millions of dollars of capital. We have moved into a world where the financiers are desperately trying to find returns and the only place they can find them is in the leftovers of startups. Financiers are trying to figure out how they can make cost-effective smaller investments because there just isn’t the call for big rounds of capital raising any more. So right-wing parties in the pockets of Wall Street (and its international equivalents) aren’t particularly relevant here either.
  • Special government programs. (In fact, there is clear evidence from the QUT CAUSEE study that some of these can be actively damaging.) So leftist parties aren’t going to be terribly interesting to the equity class.
  • Natural resources, land area, or access to particular places. I’ve seen (and worked with) hyper-growth firms that have operated out of heritage listed buildings, garages, beach fronts, dedicated incubators and top-floor city offices. I’ve had meetings with people running significant startup companies where we brought our children to the park and they worked sitting on a picnic rug. High population densities (to bring together the skills and resources required) do seem to be important, so both the parties supporting farmers and parties trying to keep the natural environment preserved are irrelevant and in some cases actively antagonistic. (Try suggesting “no genetically modified organisms” to a biotechnologist and see what happens.)
The story of the Australian (and the world) economy over the next 20–50 years is going to be the rise of this equity lord class. In the same way that the landed gentry gained wealth and then used that to leverage political power in the past, the equity lords will grow in wealth and in numbers and in their desire to be represented politically.
Who is going to represent them? Based on the dot points above it doesn’t look like any existing major party is well positioned for it. But there is at least one minor party that looks very well aligned. Looking at the parties at the last Federal election, it’s obvious who will be representing the experts who will be running the artificial intelligences and nanotech factories that will be pervasive in our lives mid-century.
So, while it would be easy to dismiss the Science Party / Cyclists coalition as just another silly minor party (the one who polled lowest outside of their alliance), I’m predicting a steady growth over the next decades in both its size and its support. Paul Graham has written about the possible political implications of startups, and in Australia we’re seeing that play out starting right now. Don’t dismiss the possibility of Prime Minister Meow-Meow Ludo Meow defending the seat of Grayndler in the 2036 election.