
Thursday 29 September 2016

DataProtector technical workshop in New Zealand


Email brandon.voight@hpe.com if you are interested. It's on next week.

Date: Wednesday, 5 October 2016
Time: 8.30am to 1.00pm

Location:  Hewlett Packard Enterprise Office
Level 4, 22 Viaduct Harbour Avenue, Auckland 1011

Brandon and Paul Carapetis will be talking about:
  • New features from the latest versions of Data Protector
  • Integration with VMware, Hyper-V, and HPE hardware: 3PAR and StoreOnce
  • Road map for the year ahead
  • Introduction to new related products: Backup Navigator, Connected, Storage Optimizer, and VM Explorer

Thursday 22 September 2016

Really proud of my students -- machine learning on corporate websites

Amanda wanted to know what factors influence how people use certain web sites. But her exploratory analysis also uncovered some of the secrets of big software companies.

She used a sitemap crawler to look at the documentation of the usual names: Microsoft, Symantec, Dropbox, Intuit, Atlassian, CA, Trello, Github, Adobe, Autodesk, Oracle and some others. She looked at the number of words on each page, the number of links and various other measures and then clustered the results.
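
I don't have Amanda's code to share, but the shape of that pipeline looks something like the sketch below: pull the page URLs out of a sitemap, compute a couple of crude per-page measures, and cluster them. The sitemap URL, the features and the number of clusters here are all placeholders, not her actual choices.

```python
import numpy as np
import requests
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans

def pages_from_sitemap(sitemap_url):
    """Return the page URLs listed in a sitemap.xml file."""
    xml = requests.get(sitemap_url, timeout=30).text
    soup = BeautifulSoup(xml, "xml")          # the "xml" parser needs lxml installed
    return [loc.text for loc in soup.find_all("loc")]

def page_features(url):
    """Crude per-page measures: word count and link count."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [len(soup.get_text().split()), len(soup.find_all("a"))]

urls = pages_from_sitemap("https://docs.example.com/sitemap.xml")   # hypothetical URL
X = np.array([page_features(u) for u in urls])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```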

Like the best of all data science projects, the results are obvious, but only in retrospect.

Microsoft, Symantec and Dropbox are all companies whose primary focus is on serving non-technical end-users who aren’t particularly interested in IT or computers. They clustered into a group with similar kinds of documentation.

CA, Trello and Github primarily focus on technical end-users: programmers, sysadmins, software project managers. Their documentation clustered together. Intuit and Atlassian were similar to each other; Adobe and Oracle clustered together.

Really interestingly, it’s possible to derive measures of the structural complexity of the company. Microsoft is a large organisation with silos inside silos. It can take an enormous number of clicks to get from the default documentation landing page to a “typical” target page. Atlassian prides itself on its tight teamwork and its ability to bring people together from all parts of the organisation. They had the shortest path of any documentation site.
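
As a rough illustration of the “number of clicks” measure (my own framing, not necessarily Amanda's method): once a documentation site has been crawled into a list of internal links, the click distance from the landing page to every other page is just a shortest-path calculation. The edge list below is invented.

```python
import statistics
import networkx as nx

# Invented internal links; in practice these come from the crawl.
edges = [
    ("landing", "products"), ("products", "backup"),
    ("backup", "install-guide"), ("landing", "faq"),
]
site = nx.DiGraph(edges)

# Clicks needed to reach each page from the documentation landing page.
clicks = nx.single_source_shortest_path_length(site, "landing")
print(statistics.median(clicks.values()))   # one crude structural-complexity score
```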

But this wasn’t what Amanda was really after: she wanted to know whether she could predict engagement, i.e. whether people would read a page on a documentation site or just skim over it. For each website she could get data for, she estimated how long each page would take to read (based on the number of words), measured how long people actually spent on the page, counted the number of in-links, and recorded a few other useful features (e.g. what sort of document it was). She created a decision tree model and was able to explain 86% of the variance in engagement.
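
The modelling step itself is only a few lines of scikit-learn. This is a toy reconstruction rather than Amanda's model: the column names and numbers are invented, and the 86% figure refers to her real dataset, not this one.

```python
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

pages = pd.DataFrame({
    "reading_time_est": [120, 30, 300, 45, 200, 90, 60, 150],
    "time_on_page":     [100, 10, 280, 20, 150, 80, 30, 120],
    "in_links":         [5, 1, 12, 2, 7, 4, 1, 6],
    "doc_type":         [0, 1, 0, 2, 1, 0, 2, 1],     # encoded document category
    "engagement":       [0.8, 0.2, 0.9, 0.3, 0.7, 0.75, 0.25, 0.65],
})
X, y = pages.drop(columns="engagement"), pages["engagement"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))   # "variance explained" on held-out pages
```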

Interesting result: there was little relationship between the number of hyperlinks pointing to a page and how much traffic it received. Since the number of links strongly influences a site’s PageRank in Google’s search algorithms, this is deeply surprising.

There was more to her project (some of which can’t be shared because it is company confidential), but just taking what I’ve described above, there are numerous useful applications:
  • Do you need to analyse your competition’s internal organisational structure? Or see how much has changed in your organisation in the months after an internal reorg?
  • Is your company’s website odd compared to other websites in your industry?
  • We can use Google Analytics and see what pages people spend time on, and which links they click on, but do you want to know why they are doing that? You know the search terms your visitors used, but what is it that they are interested in finding?

Wednesday 14 September 2016

Really proud of my students - AI analysis of reviews

Sam wants to know what movies are worth watching, so he analysed 25,000 movie reviews. This is a tough natural language processing problem, because each movie only has a small number of reviews (less than 30). It’s nowhere near enough for a deep learning approach to work, so he had to identify and synthesise features himself.

He used BeautifulSoup to pull out some of the HTML structure from the reviews, and then made extensive use of the Python NLTK library.
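
The extraction step is straightforward if you have the review HTML; roughly like this (the review text here is made up):

```python
import nltk
from bs4 import BeautifulSoup

nltk.download("punkt", quiet=True)   # tokeniser models; newer NLTK versions may want "punkt_tab"

review_html = "<div class='review'><p>Superb acting, but the plot was not good.</p></div>"
text = BeautifulSoup(review_html, "html.parser").get_text()
tokens = nltk.word_tokenize(text.lower())
print(tokens)
```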

The bag-of-words model (ignoring grammar, structure or position) worked reasonably well. A naive Bayesian model performed quite well -- as would be expected -- as did a decision tree model, but there was enough noise that a logistic regression won out, getting the review sentiment right 85% of the time. He evaluated all of his models with F1, AUC and precision-recall, and used these metrics to tweak the winning model and nudge its accuracy a little higher.
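
Here is a sketch of what that comparison looks like in scikit-learn, with a handful of made-up reviews standing in for the real 25,000:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

reviews = ["superb and refreshing", "excellent film", "perfect, the funniest comedy",
           "one of the best I have seen", "worst movie ever", "a complete waste",
           "such a disappointment", "disappointing from start to finish"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]      # 1 = positive review

X = CountVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

for model in (MultinomialNB(), DecisionTreeClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__,
          f1_score(y_test, model.predict(X_test)),
          roc_auc_score(y_test, scores))
```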

A logistic regression over a bag-of-words essentially means that we are assigning a score to each word in the English language (which might be a positive number, a negative number or even zero), and then adding up the scores for each word as it appears. If overall the total is positive, we count the review as positive; if negative, the reviewer didn’t like the movie.

He used the Python scikit-learn library (as do most of my students) to calculate the optimal score to assign to each English-language word. Since the vocabulary he was working with was around 75,000 words (he didn’t do any stemming or synonym-based simplification) this ran for around 2 days on his laptop before coming up with an answer.
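
The “score for every word” is just the coefficient vector of the fitted model, so once training has finished you can read off the most positive and most negative words directly. A minimal sketch, again on invented reviews:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["an excellent, refreshing film", "superb, the funniest in years",
           "the worst movie, a waste of time", "a real disappointment"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(reviews)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

words = np.array(vec.get_feature_names_out())   # needs scikit-learn >= 1.0
order = np.argsort(clf.coef_[0])                # one weight per vocabulary word
print("most negative words:", words[order[:3]])
print("most positive words:", words[order[-3:]])
```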

Interestingly, the word “good” is useless as a predictor of whether a movie was good or not! It probably needs more investigation, but perhaps a smarter word grouping that picked up “not good” would help. Or maybe it fails to predict much because of reviews that say things like “while the acting was good, the plot was terrible”.

Sam found plenty of other words that weren’t very good predictors: movie, film, like, just and really. So he turned these into stopwords.
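
In scikit-learn, adding your own stopwords to the standard English list is a one-line change to the vectoriser (this uses scikit-learn's built-in list; NLTK ships its own):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

custom_stop_words = list(ENGLISH_STOP_WORDS) + ["movie", "film", "like", "just", "really"]
vec = CountVectorizer(stop_words=custom_stop_words)
```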

There are other natural language processing techniques that often produce good results, like simply measuring the length of the review, or measuring the lexical diversity (the richness of the vocabulary used). However, in this case these were also ineffective.
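
Part of the appeal of those two features is how easy they are to compute; a quick sketch:

```python
def review_length(text):
    """Length of the review in words."""
    return len(text.split())

def lexical_diversity(text):
    """Distinct words divided by total words: the richness of the vocabulary."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

example = "while the acting was good, the plot was terrible"
print(review_length(example), lexical_diversity(example))
```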

What Sam found was a selection of words that, if they are present in a review, indicate that the movie was good. These are “excellent”, “perfect”, “superb”, “funniest” and interestingly: “refreshing”. And conversely, give a movie a miss if people are talking about “worst”, “waste”, “disappointment” and “disappointing”.

What else could this kind of analysis be applied to?
  • Do you want to know whether customers will return happier, or go elsewhere looking for something better? This is the kind of analysis that you can apply to your communications from customers (email, phone conversations, twitter comments) if you have sales information in your database.
  • Do you want to know what aspects of your products your customers value? If you can get them to write reviews of your products, you can do this kind of natural language processing on them and you will see what your customers talk about when they like your products.

Friday 9 September 2016

Really proud of my students -- data science for cats

Ngaire had the opportunity to put a cat picture into a data science project legitimately, which could be a worthy blog post in itself. She did an analysis of how well animal shelters (e.g. the council pound) are able to place animals.

She had two data sources. The first was Holroyd council’s annual report (showing that the pound there euthanised 59% of the animals it received), but of course any animal that had been microchipped would have been handled differently, so the overall percentage is much lower than this in reality. Still, Australia is far behind the USA (based on her second data source, which was a Kaggle-supplied dataset of ~25,000 outcomes).

She put together a decision tree model which correctly predicted what would happen to an animal in a shelter around 80% of the time. She also built a logistic regression model with a similar success rate.
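
I can’t share Ngaire’s code, but the modelling step looks roughly like this: a decision tree over a handful of animal attributes, predicting the outcome category. Everything below (column names, rows) is invented; the real Kaggle dataset has around 25,000 records.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

animals = pd.DataFrame({
    "age_weeks": [3, 150, 52, 8, 300, 12, 40, 200],
    "desexed":   [0, 1, 1, 0, 0, 1, 0, 1],
    "black_fur": [0, 0, 1, 0, 1, 0, 1, 0],
    "outcome":   ["adopted", "adopted", "euthanised", "transferred",
                  "euthanised", "adopted", "euthanised", "adopted"],
})
X, y = animals.drop(columns="outcome"), animals["outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))                        # share of outcomes predicted correctly
print(dict(zip(X.columns, model.feature_importances_)))   # which factors mattered most
```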

The key factors which determined the fate of a lost kitten or puppy were a little surprising. Desexed animals do better in the USA. The very young (a few weeks) do have very good outcomes -- they are very likely to be adopted or transferred. Black fur was a bad sign, although the messiness of the data meant that she couldn’t explore colours all that much further: at a guess, she suggested that being hard to photograph is a problem, so perhaps there is a cut-off level of darkness where there will be a sudden drop in survival rates.

Where else could this kind of analysis be applied? She could take her model and apply it whenever you want to know an outcome in advance. Questions might include:
  • How will a trial subscription work out for a potential customer?
  • What is going to happen to our stock? Will the goods be sold, used, sent to another store…?

Ngaire is open to job opportunities, particularly if you are looking for a data scientist with a very broad range of other career experience in the arts; I can put you in touch if you are interested.