Thursday, 29 September 2016

DataProtector technical workshop in New Zealand


Email [email protected] if you are interested. It's on next week.

Date: Wednesday, 5 October 2016
Time: 8.30am to 1.00pm

Location:  Hewlett Packard Enterprise Office
Level 4, 22 Viaduct Harbour Avenue, Auckland 1011

Brandon and Paul Carapetis will be talking about:
  • New features from the latest versions of Data Protector
  • Integration with VMware, Hyper-V, and HPE hardware: 3PAR and StoreOnce
  • Road map for the year ahead
  • Introduction to New Related Products: Backup Navigator, Connected, Storage Optimizer, and VM Explorer

Thursday, 22 September 2016

Really proud of my students -- machine learning on corporate websites

Amanda wanted to know what factors influence which web sites people visit. But her exploratory analysis uncovered the secrets of big software companies.

She used a sitemap crawler to look at the documentation of the usual names: Microsoft, Symantec, Dropbox, Intuit, Atlassian, CA, Trello, Github, Adobe, Autodesk, Oracle and some others. She looked at the number of words on each page, the number of links and various other measures and then clustered the results.

Like the best of all data science projects, the results are obvious, but only in retrospect.

Microsoft, Symantec and Dropbox are all companies whose primary focus is on serving non-technical end-users who aren’t particularly interested in IT or computers. They clustered into a group with similar kinds of documentation.

CA, Trello and Github primarily focus on technical end-users: programmers, sysadmins, software project managers. Their documentation clustered together. Intuit and Atlassian were similar; Adobe and Oracle clustered together.

Really interestingly, it’s possible to derive measures of the structural complexity of the company. Microsoft is a large organisation with silos inside silos. It can take an enormous number of clicks to get from the default documentation landing page until you get to a “typical” target page. Atlassian prides itself on its tight teamwork and its ability to bring people together from all parts of the organisation. They had the shortest path of any documentation site.
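A minimal sketch of how a "number of clicks from the landing page" measure could be computed, assuming the crawl has been reduced to a dictionary mapping each page to the pages it links to. This is not necessarily how Amanda measured it; the page names and the choice of the median as the "typical" depth are assumptions for illustration.

```python
from collections import deque
from statistics import median


def click_depths(links, start):
    """Breadth-first search: minimum number of clicks from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def typical_depth(links, start):
    """Median click depth over every reachable page except the landing page itself."""
    depths = [d for page, d in click_depths(links, start).items() if page != start]
    return median(depths) if depths else 0


# Hypothetical link graph for a tiny documentation site.
site = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/1"],
    "/docs/b": [],
    "/docs/a/1": ["/docs/a/1/x"],
}
```

Calling typical_depth(site, "/docs") on two different documentation sites gives numbers that can be compared directly: the deeper the typical page, the more silos a visitor has to click through.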

But this wasn’t what Amanda was really after: she wanted to know whether she could predict engagement: whether people would read a page on a documentation site or just skim over it. She took each website that she could get data for and deduced how long it would take to read the page (based on the number of words), how long people were actually spending on the page, the number of in-links and a few other useful categories (e.g. what sort of document it was). She created a decision tree model and was able to explain 86% of the variance in engagement.
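The comparison between how long a page should take to read and how long people actually spend on it can be sketched as a simple ratio. The reading speed of 200 words per minute is an assumption, not a figure from the project, and the function names are made up for this example.

```python
WORDS_PER_MINUTE = 200  # assumed average reading speed; not taken from the project


def expected_reading_seconds(word_count, words_per_minute=WORDS_PER_MINUTE):
    """Estimate how long a page takes to read, based only on its word count."""
    return word_count / words_per_minute * 60


def engagement_ratio(word_count, seconds_on_page):
    """Actual time on page divided by expected reading time.

    Around 1.0 suggests the page was read; well below 1.0 suggests it was skimmed.
    """
    expected = expected_reading_seconds(word_count)
    if expected == 0:
        return 0.0
    return seconds_on_page / expected
```

A ratio like this is the sort of target a decision tree could be trained to predict from features such as in-links and document type.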

Interesting result: there was little relationship between the number of hyperlinks that linked to a site and how much traffic it received. Since the number of links strongly influences a site’s PageRank in Google’s search algorithms, this is deeply surprising.

There was more to her project (some of which can’t be shared because it is company confidential), but just taking what I’ve described above, there are numerous useful applications:
  • Do you need to analyse your competition’s internal organisational structure? Or see how much has changed in your organisation in the months after an internal reorg?
  • Is your company’s website odd compared to other websites in your industry?
  • We can use Google Analytics and see what pages people spend time on, and which links they click on, but do you want to know why they are doing that? You know the search terms your visitors used, but what is it that they are interested in finding?

Wednesday, 14 September 2016

Really proud of my students - AI analysis of reviews

Sam wants to know what movies are worth watching, so he analysed 25,000 movie reviews. This is a tough natural language processing problem, because each movie only has a small number of reviews (fewer than 30). It’s nowhere near enough for a deep learning approach to work, so he had to identify and synthesise features himself.

He used BeautifulSoup to pull out some of the HTML structure from the reviews, and then made extensive use of the Python NLTK library.

The bag-of-words model (ignoring grammar, structure or position) worked reasonably well. A naive Bayesian model performed quite well -- as would be expected -- as did a decision tree model, but there was enough noise that a logistic regression won out, getting the review sentiment right 85% of the time. He evaluated all of his models with F1, AUC and precision-recall. He used this to tweak the model a little and just nudge it a little higher.

A logistic regression over a bag-of-words essentially means that we are assigning a score to each word in the English language (which might be a positive number, a negative number or even zero), and then adding up the scores for each word when it appears. If overall it adds up to a positive number, we count the review as positive; if negative, the reviewer didn’t like the movie.
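To make the "score per word, then add them up" idea concrete, here is a minimal sketch in plain Python. The weights below are invented for illustration; they are not the coefficients Sam's model learned, and a real model would have a score for every word in the vocabulary plus a fitted intercept.

```python
import re

# Hypothetical per-word scores; a trained logistic regression would supply these.
WORD_SCORES = {
    "excellent": 1.5,
    "superb": 1.25,
    "refreshing": 0.75,
    "worst": -2.0,
    "waste": -1.5,
    "disappointing": -1.25,
}
INTERCEPT = 0.0  # assumed to be zero to keep the example simple


def tokenize(text):
    """Lower-case the text and split it into words, ignoring order and grammar."""
    return re.findall(r"[a-z']+", text.lower())


def review_score(text):
    """Add up the score of every word occurrence; unknown words score zero."""
    return INTERCEPT + sum(WORD_SCORES.get(word, 0.0) for word in tokenize(text))


def is_positive(text):
    """A positive total means the review is counted as positive."""
    return review_score(text) > 0
```

For example, review_score("An excellent, refreshing film") adds 1.5 and 0.75, while "film" and "an" contribute nothing because they have no score.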

He used the Python scikit-learn library (as do most of my students) to calculate the optimal score to assign to each English language word. Since the vocabulary he was working with was around 75,000 words (he didn’t do any stemming or synonym-based simplification) this ran for around 2 days on his laptop before coming up with an answer.

Interestingly, the word “good” is useless as a predictor of whether a movie was good or not! It probably needs more investigation, but perhaps a smarter word grouping that picked up “not good” would help. Or maybe it fails to predict much because of reviews that say things like “while the acting was good, the plot was terrible”.

Sam found plenty of other words that weren’t very good predictors: movie, film, like, just and really. So he turned these into stopwords.

There are other natural language processing techniques that often produce good results, like simply measuring the length of the review, or measuring the lexical diversity (the richness of vocabulary used). However, these were also ineffective.
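For reference, the two simple features mentioned above can be sketched in a few lines. Vocabulary richness is computed here as the ratio of distinct words to total words (the type-token ratio); that Sam used exactly this formula is an assumption.

```python
import re


def words_in(text):
    """Lower-case the review and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def review_length(text):
    """Number of words in the review."""
    return len(words_in(text))


def vocabulary_richness(text):
    """Distinct words divided by total words; 1.0 means no word is repeated."""
    words = words_in(text)
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```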

What Sam found was a selection of words that, if they are present in a review, indicate that the movie was good. These are “excellent”, “perfect”, “superb”, “funniest” and interestingly: “refreshing”. And conversely, give a movie a miss if people are talking about “worst”, “waste”, “disappointment” and “disappointing”.

What else could this kind of analysis be applied to?
  • Do you want to know whether customers will return happier, or go elsewhere looking for something better? This is the kind of analysis that you can apply to your communications from customers (email, phone conversations, Twitter comments) if you have sales information in your database.
  • Do you want to know what aspects of your products your customers value? If you can get them to write reviews of your products, you can do this kind of natural language processing on them and you will see what your customers talk about when they like your products.

Friday, 9 September 2016

Really proud of my students -- data science for cats

Ngaire had the opportunity to put a cat picture into a data science project legitimately, which could be a worthy blog post in itself. She did an analysis on how well animal shelters (e.g. the council pound) are able to place animals.

She had two data sources. The first was Holroyd council’s annual report (showing that the pound there euthanised 59% of animals they received), but of course any animal that had been microchipped would have been otherwise handled, so the overall percentage is much lower than this in reality. Still, Australia is far behind the USA (based on her second data source, which was a Kaggle-supplied dataset of ~25,000 outcomes).

She put together a decision tree regressor which correctly predicted what would happen to an animal in a shelter around 80% of the time. She also had a logistic regression model with a similar success rate.

The key factors which determined the fate of a lost kitten or puppy were a little surprising. Desexed animals do better in the USA. The very young (a few weeks) do have very good outcomes -- they are very likely to be adopted or transferred. Black fur was a bad sign, although the messiness of the data meant that she couldn’t explore colours all that much further: at a guess, she suggested that being hard to photograph is a problem, so perhaps there is a cut-off level of darkness where there will be a sudden drop in survival rates.

Where else could this kind of analysis be applied? She could take her model and apply it when you want to know an outcome in advance. Questions might include:
  • How will a trial subscription work out for a potential customer?
  • What is going to happen to our stock? Will the goods be sold, used, sent to another store…?

Ngaire is open to job opportunities, particularly if you are looking for a data scientist with a very broad range of other career experience in the arts; I can put you in touch if you are interested.


Sunday, 28 August 2016

Really proud of my students -- final projects

Over the next few weeks I'm going to do some short blog posts about each of the final projects my students did in their data science course.

One of the reasons this blog has been a bit quieter than usual these last few months is that I was teaching a Data Science class at General Assembly, which was rewarding but rather exhausting.

Some observations:
  • GA is busy and dynamic. I remember back in the late 1990s at HP when every company was deploying SAP on HP-UX to avoid Y2K problems: there were classes constantly; you might discover that the class you were teaching was going to be held in the boardroom using some workstations borrowed from another city. GA was like that: every room packed from early morning until late at night.
  • No-one in the class had a job as a data scientist at the beginning of the course, but there was a lot of movement within 10 weeks: job changes, promotions, new career directions. The only time in my teaching career where I saw the same wow-this-person-is-trained-now-let's-poach-them was in the early days of the Peregrine -> Service Manager transition.
  • The course is mainly about machine learning but there is flexibility for the instructor to add in a few other relevant topics based on what the students want. Right now, Natural Language Processing is white-hot. Several students did some serious NLP / NLU projects. The opportunities for people who have skills in this area are very, very good.
  • Computer vision is an area where there is a lot of interest as well.
I'll be teaching the first part of the Data Science immersive (a full-time course instead of a night-time part-time one) starting in September; please sign up with GA if you are interested.

I suspect that by the time I've finished blogging about my past students' projects there will be a new round of student projects to cover, so this might become a bit more of a feature on my blog.

Tuesday, 23 August 2016

Automate the installation of a Windows DataProtector client

A client today wanted to push the DataProtector agent from SCCM / System Center 2012 instead of from Data Protector. It's not that difficult, but I couldn't find the command-line setup documented anywhere.

You will need to run (as an administrator):

  net use r: \\installserver.ifost.org.au\Omniback
  r:
  cd \x8664
  msiexec /i "Data Protector A.09.00.msi" /passive INSTALLATIONTYPE=Client ADDLOCAL=core,da,autodr
  net use /delete r:

Obviously, substitute installserver.ifost.org.au with your install server, and if R: is already allocated, use something else instead.

Then, trigger the following command on your cell manager:

 omnicc -import_host clientname

Replace clientname with the name of the client.

Script this as appropriate (e.g. after the operating system has booted) in order to have an unattended installation.
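If your deployment tool can run Python rather than a batch file, the client-side steps above could be wrapped like this. It is only a sketch: the function names are made up, it passes the full path to the MSI instead of changing drive and directory first, and it uses nothing beyond the commands and options shown above. The omnicc step still has to be run on the cell manager separately.

```python
import subprocess


def build_install_commands(install_server, drive="r:"):
    """Return the client-side commands as argument lists, in the order they should run."""
    share = rf"\\{install_server}\Omniback"
    msi = rf"{drive}\x8664\Data Protector A.09.00.msi"
    return [
        ["net", "use", drive, share],        # map the install server share
        ["msiexec", "/i", msi, "/passive",   # unattended install of the client
         "INSTALLATIONTYPE=Client", "ADDLOCAL=core,da,autodr"],
        ["net", "use", "/delete", drive],    # unmap the share afterwards
    ]


def run_install(install_server, drive="r:"):
    """Run each command in turn (as an administrator), stopping at the first failure."""
    for command in build_install_commands(install_server, drive):
        subprocess.run(command, check=True)
```

Calling run_install("installserver.ifost.org.au") on the client would then perform the mapping, the install and the clean-up in one go; pass a different drive letter if R: is already allocated.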

Friday, 5 August 2016

Data Protector CRS operation cannot be performed in full-screen mode

Today's head-scratcher: after upgrading to 9.07 on a Windows cell manager, the CRS service won't start.

Eventvwr says something even weirder:

The Data Protector CRS service terminated with service-specific error The requested operation cannot be performed in full-screen mode.. 

I was in full screen mode at the time, but it still wouldn't start even when I minimised my RDP session. For my own sanity, I was glad of this.

Trawling through Daniel Braun's http://www.data-protector.org blog, I saw some comments there that it could be related to anti-virus software. Nope, not that either.

The debug.log said something a little bit more believable:

[SmCreateTable] MapViewOfFile(size:17505216) failed, error=[5] Access is denied.

I discovered that I could reliably get that message added every time I tried to start the CRS. But what is actually being denied?

So I ran omnisv start -debug 1-500 crm-vexatious.txt

I then had a 160KB file created in C:\programdata\omniback\tmp that began with OB2DBG, ended with crm-vexatious.txt and had CRM in the filename. Good: at least it gets far enough that it can create debug messages.

Scrolling right to the bottom of it, there it was:

Code is:1007  SystemErr: [5] Access is denied
************************   DEFAULT ERROR REPORT   ***************
[Critical] From [email protected] "" Time: 5/8/2016 1:00:33PM
Unable to allocate shared memory: Unknown internal error.

Internally, the function to return a shared memory segment presumably encodes something as 1007; CRS then exits with that code (which is the standard Windows error code for "can't be performed in full-screen mode").

There aren't many reasons for a shared memory allocation to fail. In fact, the only one I can think of that could be relevant here is if the segment already exists. I thought about figuring out what the equivalent to ipcrm is on Windows, gave up and rebooted the box.

And it came up perfectly. Funnily enough, if I had had no idea what I was doing, I would have just bounced the box to see if it would have fixed it, and saved myself a headache and some stress wondering what was going on. Ignorance would have been bliss.