Data illusions

Yesterday was an interesting day for a few reasons; one of the primary reasons was an opinion piece in the Guardian by Jay Watts (@Shrink_at_Large). Like many article I considered to be in opposition, yet when I reread it, this piece has all kinds of hidden gems and I had to ponder a few items for an hour or so. I love that! Any piece, article or opinion that makes me rethink my position is a piece well worth reading. So this piece called ‘Supermarkets spy on them now‘ (at https://www.theguardian.com/commentisfree/2018/may/31/benefits-claimants-fear-supermarkets-spy-poor-disabled) has several sides that require us to think and rethink issues. As we see a quote like “some are happy to brush this off as no big deal” we identify with too many parts; to me and to many it is just that, no big deal, but behind the issues are secondary issues that are ignored by the masses (en mass as we might giggle), yet the truth is far from nice.

So what do we see in the first as primary and what is behind it as secondary? In the first we see the premise “if a patient with a diagnosis of paranoid schizophrenia told you that they were being watched by the Department for Work and Pensions (DWP), most mental health practitioners would presume this to be a sign of illness. This is not the case today.” It is not whether this is true or not, it is not a case of watching, being a watcher or even watching the watcher. It is what happens behind it all. So, when we recollect that dead dropped donkey called Cambridge Analytics, which was all based on interacting and engaging on fear. Consider what IBM and Google are able to do now through machine learning. This we see in an addition to a book from O’Reilly called ‘The Evolution of Analytics‘ by Patrick Hall, Wen Phan, and Katie Whitson. Here we see the direct impact of programs like SAS (Statistical Analysis System) in the application of machine learning, we see this on page 3 of Machine Learning in the Analytic Landscape (not a page 3 of the Sun by the way). Here we see for the government “Pattern recognition in images and videos enhance security and threat detection while the examination of transactions can spot healthcare fraud“, you might think it is no big deal. Yet you are forgetting that it is more than the so called implied ‘healthcare fraud‘. It is the abused setting of fraud in general and the eagerly awaited setting for ‘miscommunication’ whilst the people en mass are now set in a wrongly categorised world, a world where assumption takes control and scores of people are now pushed into the defence of their actions, an optional change towards ‘guilty until proven innocent’ whilst those making assumptions are clueless on many occasions, now are in an additional setting where they believe that they know exactly what they are doing. We have seen these kinds of bungles that impacted thousands of people in the UK and Australia. It seems that Canada has a better system where every letter with the content: ‘I am sorry to inform you, but it seems that your system made an error‘ tends to overthrow such assumptions (Yay for Canada today). So when we are confronted with: “The level of scrutiny all benefits claimants feel under is so brutal that it is no surprise that supermarket giant Sainsbury’s has a policy to share CCTV “where we are asked to do so by a public or regulatory authority such as the police or the Department for Work and Pensions”“, it is not merely the policy of Sainsbury, it is what places like the Department for Work and Pensions are going to do with machine learning and their version of classifications, whilst the foundation of true fraud is often not clear to them, so you want to set a system without clarity and hope that the machine will constitute learning through machine learning? It can never work, that evidence is seen as the initial classification of any person in a fluidic setting is altering on the best of conditions. Such systems are not able to deal with the chaotic life of any person not in a clear lifestyle cycle and people on pensions (trying to merely get by) as well as those who are physically or mentally unhealthy. These are merely three categories where all kind of cycles of chaos tend to intervene with their daily life. Those are now shown to be optionally targeted with not just a flawed system, but with a system where the transient workforce using those methods are unclear on what needs to be done as the need changes with every political administration. A system under such levels of basic change is too dangerous to get linked to any kind of machine learning. I believe that Jay Watts is not misinforming us; I feel that even the writer here has not yet touched on many unspoken dangers. There is no fault here by the one who gave us the opinion piece, I personally believe that the quote “they become imprisoned in their homes or in a mental state wherein they feel they are constantly being accused of being fraudulent or worthless” is incomplete, yet the setting I refer to is mentioned at the very end. You see, I believe that such systems will push suicide rates to an all-time high. I do not agree with “be too kind a phrase to describe what the Tories have done and are doing to claimants. It is worse than that: it is the post-apocalyptic bleakness of poverty combined with the persecution and terror of constantly feeling watched and accused“. I believe it to be wrong because this is a flaw on both sides of the political aisle. Their state of inaction for decades forced the issue out and as the NHS is out of money and is not getting any money the current administration is trying to find cash in any way that they can, because the coffers are empty, which now gets us to a BBC article from last year.

At http://www.bbc.com/news/election-2017-39980793, we saw “A survey in 2013 by Ipsos Mori suggested people believed that £24 out of every £100 spent on benefits was fraudulently claimed. What do you think – too high, too low?
Want to know the real answer? It is £1.10 for every £100
“. That is the dangerous political setting as we should see it; the assumption and believe that 24% is set to fraud when it is more realistic that 1% might be the actual figure. Let’s not be coy about it, because out of £172.3bn a 1% amount still remains a serious amount of cash, yet when you set it against the percentage of the UK population the amount becomes a mere £25 per person, it merely takes one prescription to get to that amount, one missed on the government side and one wrongly entered on the patients side and we are there. Yet in all that, how many prescriptions did you the reader require in the last year alone? When we get to that nitty gritty level we are confronted with the task where machine learning will not offer anything but additional resources to double check every claimant and offense. Now, we should all agree that machine learning and analyses will help in many ways, yet when it comes to ‘Claimants often feel unable to go out, attempt voluntary work or enjoy time with family for fear this will be used against them‘ we are confronted with a new level of data and when we merely look at the fear of voluntary work or being with family we need to consider what we have become. So in all this we see a rightful investment into a system that in the long run will help automate all kinds of things and help us to see where governments failed their social systems, we see a system that costs hundreds of millions, to look into an optional 1% loss, which at 10% of the losses might make perfect sense. Yet these systems are flawed from the very moment they are implemented because the setting is not rational, not realistic and in the end will bring more costs than any have considered from day one. So in the setting of finding ways to justify a 2015 ‘The Tories’ £12bn of welfare cuts could come back to haunt them‘, will not merely fail, it will add a £1 billion in costs of hardware, software and resources, whilst not getting the £12 billion in workable cutbacks, where exactly was the logic in that?

So when we are looking at the George Orwell edition of edition of ‘Twenty Eighteen‘, we all laugh and think it is no great deal, but the danger is actually two fold. The first I used and taught to students which gets us the loss of choice.

The setting is that a supermarket needs to satisfy the need of the customers and the survey they have they will keep items in a category (lollies for example) that are rated ‘fantastic value for money‘ and ‘great value for money‘, or the top 25th percentile of the products, whatever is the largest. So in the setting with 5,000 responses, the issue was that the 25th percentile now also included ‘decent value for money‘. So we get a setting where an additional 35 articles were kept in stock for the lollies category. This was the setting where I showed the value of what is known as User Missing Values. There were 423 people who had no opinion on lollies, who for whatever reason never bought those articles, This led to removing them from consideration, a choice merely based on actual responses; now the same situation gave us the 4,577 people gave us that the top 25th percentile only had ‘fantastic value for money‘ and ‘great value for money‘ and within that setting 35 articles were removed from that supermarket. Here we see the danger! What about those people who really loved one of those 35 articles, yet were not interviewed? The average supermarket does not have 5,000 visitors, it has depending on the location up to a thousand a day, more important, when we add a few elements and it is no longer about supermarkets, but government institutions and in addition it is not about lollies but Fraud classification? When we are set in a category of ‘Most likely to commit Fraud‘ and ‘Very likely to commit Fraud‘, whilst those people with a job and bankers are not included into the equation? So we get a diminished setting of Fraud from the very beginning.

Hold Stop!

What did I just say? Well, there is method to my madness. Two sources, the first called Slashdot.org (no idea who they were), gave us a reference to a 2009 book called ‘Insidious: How Trusted Employees Steal Millions and Why It’s So Hard for Banks to Stop Them‘ by B. C. Krishna and Shirley Inscoe (ISBN-13: 978-0982527207). Here we see “The financial crisis appears to be exacerbating fraud by bank employees: a new survey found that 72 percent of financial institutions say that in the last 12 months they have experienced a case of data theft by one of their workers“. Now, it is important to realise that I have no idea how reliable these numbers are, yet the book was published, so there will be a political player using this at some stage. This already tumbles to academic reliability of Fraud in general, now for an actual reliable source we see KPMG, who gave us last year “KPMG survey reveals surge in fraud in Australia“, with “For the period April 2016 to September 2016, the total value of frauds rose by 16 percent to a total of $442m, from $381m in the previous six month period” we see number, yet it is based on a survey and how reliable were those giving their view? How much was assumption, unrecognised numbers and based on ‘forecasted increases‘ that were not met? That issue was clearly brought to light by the Sydney Morning Herald in 2011 (at https://www.smh.com.au/technology/piracy-are-we-being-conned-20110322-1c4cs.html), where we see: “the Australian Content Industry Group (ACIG), released new statistics to The Age, which claimed piracy was costing Australian content industries $900 million a year and 8000 jobs“, yet the issue is not merely the numbers given, the larger issue is “the report, which is just 12 pages long, is fundamentally flawed. It takes a model provided by an earlier European piracy study (which itself has been thoroughly debunked) and attempts to shoe-horn in extrapolated Australian figures that are at best highly questionable and at worst just made up“, so the claim “4.7 million Australian internet users engaged in illegal downloading and this was set to increase to 8 million by 2016. By that time, the claimed losses to piracy would jump to $5.2 billion a year and 40,000 jobs” was a joke to say the least. There we see the issue of Fraud in another light, based on a different setting, the same model was used, and that is whilst I am more and more convinced that the European model was likely to be flawed as well (a small reference to the Dutch Buma/Stemra setting of 2007-2010). So not only are the models wrong, the entire exercise gives us something that was never going to be reliable in any way shape or form (personal speculation), so in this we now have the entire Machine learning, the political setting of Fraud as well as the speculated numbers involved, and what is ‘disregarded’ as Fraud. We will end up with a scenario where we get 70% false positives (a pure rough assumption on my side) in a collective where checking those numbers will never be realistic, and the moment the parameters are ‘leaked’ the actual fraudulent people will change their settings making detection of Fraud less and less likely.

How will this fix anything other than the revenue need of those selling machine learning? So when we look back at the chapter of Modern Applications of Machine Learning we see “Deploying machine learning models in real-time opens up opportunities to tackle safety issues, security threats, and financial risk immediately. Making these decisions usually involves embedding trained machine learning models into a streaming engine“, that is actually true, yet when we also consider “review some of the key organizational, data, infrastructure, modelling, and operational and production challenges that organizations must address to successfully incorporate machine learning into their analytic strategy“, the element of data and data quality is overlooked on several levels, making the entire setting, especially in light of the piece by Jay Watts a very dangerous one. So the full title, which is intentionally did not use in the beginning ‘No wonder people on benefits live in fear. Supermarkets spy on them now‘, is set wholly on the known and almost guaranteed premise that data quality and knowing that the players in this field are slightly too happy to generalise and trivialise the issue of data quality. The moment that comes to light and the implementers are held accountable for data quality is when all those now hyping machine learning, will change their tune instantly and give us all kinds of ‘party line‘ issues that they are not responsible for. Issues that I personally expect they did not really highlight when they were all about selling that system.

Until data cleaning and data vetting gets a much higher position in the analyses ladder, we are confronted with aggregated, weighted and ‘expected likelihood‘ generalisations and those who are ‘flagged’ via such systems will live in constant fear that their shallow way of life stops because a too high paid analyst stuffed up a weighting factor, condemning a few thousand people set to be tagged for all kind of reasons, not merely because they could be optionally part of a 1% that the government is trying to clamp down on, or was that 24%? We can believe the BBC, but can we believe their sources?

And if there is even a partial doubt on the BBC data, how unreliable are the aggregated government numbers?

Did I oversimplify the issue a little?

 

 

Advertisements

Leave a comment

Filed under Finance, IT, Media, Politics, Science

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.