Data Thing Part 2: Lakes and Warehouses and Lakehouses...
Now that we've started to work out what we don't know, we can get to work on knowing things!
Spoiler alert: I end up even more confused.
Bringing our datasets together
I start learning about Data Lakes and Warehouses and Lakehouses. It all sounds very big and technical and expensive.
A warehouse sounds like a MEGADATABASE where you copy your other structured database data to for analysis. A lake is the same thing but you can also include unstructured data! I am picturing a huge heap of flytipped CSV files and SQL databases and PDFs. This sounds right up our street.
Then we have on-premise vs cloud. It seems counterproductive to bring everything into an on-premise MEGADATABASE when we are meant to be going cloud first. But how do I turn this into something I can try out without running up a fortune in cloud computing costs?
Using what we already have
We have servers with SQL and Postgres databases in our data centre. I start looking into open source things like Apache Airflow and DBT Core which are snazzy. I soon realise I could spend many months going down a rabbit hole with this stuff because I find it interesting, but do we have the knowledge and resources to support this setup as a team?
I need a thing where we can spin up a working example easily...
Looking at this problem from the other end - we have users writing reports in software bundled with applications, and a couple of users starting to get interested in Power BI. We've just upgraded our Microsoft licences to E5, so perhaps it is time to talk to Microsoft about what they offer? I start looking into Fabric and Azure storage and realise I need someone with a PhD in Microsoft, so through our account manager I book an introductory meeting with an Azure specialist.
We start thinking about the balance between cost, ease of use, maintenance and familiarity.
We need a clearer goal
It needs to be tangible and ideally not include the terms "data driven", "unlocking our data" or any vague references to insights, because by this point I'm overhearing them everywhere! Don't get me started on "chat to your data"... 😣
It needs to be specific and focused enough that we can actually deliver it and it doesn't become another service plan mega rollover.
Let's come up with some outcomes:
- Deliver a working example of data from two or more separate back office sources, consolidated and processed into one source that an end user can create reports from. It needs to be a solution that can scale to draw in more of our hodgepodge of datasets, and it needs to stay up to date with little or no ICT intervention.
- We need to be able to estimate the long term costs from our prototype, so we can get approval for funding.
- Work with users to develop the first reports or use cases. We'll use this experience to develop any end user training we need. We also need to evidence a tangible benefit to whatever we create, to help offset the cost. It's a waste of time to tech-ify something that people already do for little time or effort, just to say it's been transformed.
- Work with users on some general training or education on how to identify use cases for data projects, how to ensure they deliver real benefits and how to evidence the benefits.
Using a demo project to build momentum
At the start of the year I was fiddling with the OpenAI API and wrote a script to automate the summarisation of comments on planning applications. It's still going through rounds of changes and tweaking. It currently has to be run manually by me from some CSV extracts.
Sounds like I've got a good area to focus on:
- Colleagues who are interested and happy to help with testing and prototyping
- Data in a mishmash of systems that is brought together manually
- Processing that needs to be run manually by ICT
- An end result that will save a huge amount of time by cutting out manual processing. Right now officers read through the comments and pick out the key topics manually. There can be hundreds on a big application 🤯
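The actual summarisation script isn't shown here, but the shape of it is roughly: load the CSV extracts, group comments by planning application, then run each batch through a summariser. A minimal sketch of that pipeline, with hypothetical column names (`application_ref`, `comment`) and the summariser passed in as a function so the LLM call stays swappable:

```python
import csv
from collections import defaultdict
from typing import Callable


def load_comments(path: str) -> dict[str, list[str]]:
    """Group comment text by planning application reference.

    Assumes the extract has 'application_ref' and 'comment' columns --
    the real export's column names will differ.
    """
    grouped: defaultdict[str, list[str]] = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            grouped[row["application_ref"]].append(row["comment"])
    return dict(grouped)


def summarise_applications(
    grouped: dict[str, list[str]],
    summarise: Callable[[str], str],
) -> dict[str, str]:
    """Run a summariser over each application's combined comments."""
    return {
        ref: summarise("\n\n".join(comments))
        for ref, comments in grouped.items()
    }
```

In the real script `summarise` would wrap a call to the OpenAI API with a prompt asking for the key topics; keeping it as a plain function makes the CSV plumbing testable without touching the API.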