Sharing Public Data using Onelake, Delta and Cloudflare R2 

The use case is very simple, sharing public data continuously and as cheaply as possible, especially if the consumers are in a different geographic region. 

Note: This is not an officially supported solution, and the data can be inconsistent when copying to R2 , but it is good enough for public data

How to 

1- The Data is prepared and cleaned using Fabric and saved in Onelake 

2-  Copy the data to cloudflare R2 using code, as of today Shortcuts to S3 does not support  write operation, although I did not test it, Dataflow Gen2 (data pipeline) support S3 as a destination, I used code as I had it already from a previous project, you pay egress fees for this operation and storage in R2 with transaction cost

3- Provide access token to users or make it public, you don’t pay for egress fees from Cloudflare to end users, but the throughput is not guaranteed. 

Today, Fabric shortcuts requires list buckets permission, please vote for this idea to remove this requirement,  

For example, I am writing public data in Fabric US-West and consuming it in Fabric Melbourne  

make sure you turn on cache for Onelake, it helps a lot of performance.

You can try it  

You can try it yourself using this credential, they are temporary, and I may delete them anytime. 

Access Key ID 

3a3d5b5ce8c296e41a6de910d30e7fb6 

Secret 

9a080220941f3ff0f22ac93c7d2f5ec1d73a77cd3a141416b30c1239efc50777 

Endpoint 

https://c261c23c6a526f1de4652183768d7019.r2.cloudflarestorage.com

Quickly view Delta Table stats

With the recent release of Deltalake  Python,we can write to Fabric Onelake using a local Path, using this new functionality, I updated a notebook I had built previously to show quick stats for all the tables in a particular lakehouse, it is using pure Python, so not only it works in Fabric but offline too in your  local machine.

You can download the notebook here

All you have to do is to import the notebook and attach the lakehouse you want to analyze.

You can use append to keep the history.

It is using two packages

Delta Lake Python to get the delta stats

DuckDB to get the Parquet stats ( number of row groups)

And a SQL Query to combine the results from the two previous packages 🙂 

The notebook is very simple and show only the major metrics for a Table, total rows, number of files, number of row groups and average row per row group, and if V-Order is applied

If you want more details, you can use the excellent delta analyser  

Why you should care about Table stats

Fabric Direct Lake mode has some guardrails as of today for example, the maximum number of row groups in a table for F SKU  less than F64 is 1000, which is reasonably a very big number but if you do frequent small insert without Table maintenance you may end up quickly generate a lot of files ( and row groups), so it is important to be aware of the table layout,  especially when using Lakehouse, DWH do support automatic Table maintenance though.

Parting Thoughts 

Hopefully in the near future, Lakehouse will expose the basic information about Tables in the UI, in the meantime, you can use code as a workaround. 

Building a cost effective solution using Fabric

I have being playing with Electricity Market Data recently and now I have a #Fabric solution that I think is easy enough to use but more importantly cost effective that it works just fine using F2 SKU, which cost $156 /Month

https://github.com/djouallah/aemo_fabric

a Quick Video

Howto

0- Create a Fabric Workspace

1- Create a lakehouse

2-Download the notebooks from Github and import it to Fabric Workspace

3-Open a notebook, attached it to the Lakehouse

4-Run the notebook in sequence just to have the initial load

5-Build your semantic Model either using Direct Lake for better “latency” , or use the attached template for import mode if you are happy with 8 refreshes per day( for people with a pro license), all you need is to input the Lakehouse SQL Endpoint, my initial plan was to read Delta Table directly from OneLake but currently filter pushdown works only at the partition level ( but apparently further stuff are coming)

6- Use a scheduler to run the jobs, 5 minutes and 24 Hours

Show me the Money

For developpement use Starter Pool, but for scheduling use Small Single Node, it is good enough, medium is faster but will consume more capacity Unit, still within F2 limits.

This is the actual usage of capacity unit for this scenario, yesterday I was messing around trying to further optimize which end up consuming more 🙂

Spark Dataframe API is amazing

When reading the csv files, I need to convert a lot of columns to double, using SQL you have to manually type all the fields names, in Spark , you can just do it in a loop, that was the exact moment, I *understood*  why people like Spark  API !!!

df_cols = list(set(df.columns) -{'SETTLEMENTDATE','DUID','file','UNIT','transactionId','PRIORITY'})
for col_name in df_cols:
df = df.withColumn(col_name, f.col(col_name).cast('double'))

Source Data around 1 Billion of dirty csv, data added every 5 minutes and backfilled every 24 hours, all data is saved in OneLake using Spark Delta Table.

not everything is csv though, there is some data in Json and Excel of course 🙂

Although the main code is using PySpark , I used Pandas and DuckDB too, no philosophical reasons, basically I use whatever Stackoverflow give me first 🙂 and then I just copy the result to a Spark dataframe and save it as Delta Tables, you don’t want to miss on VOrder

For example this is the code to generate a calendar Table using DuckDB SQL, the unnest syntax is very elegant !!!

df=duckdb.sql(""" SELECT cast(unnest(generate_series(cast ('2018-04-01' as date), cast('2024-12-31' as date), interval 1 day)) as date) as date,
EXTRACT(year from date) as year,
EXTRACT(month from date) as month
""").df()

x=spark.createDataFrame(df)
x.write.mode("overwrite").format("delta").saveAsTable("Calendar")

The Semantic Model

The Model is not too complex but not trivial, 2 facts tables, 4 dimensions, biggest table 240 M rows.

Stuff that needs improvements

1- have an option to download the semantic Model from the service using pbit, it has to be one file for ease of distribution.

2- More options for the scheduler, something like run this job only between 8 AM to 6 PM every 5 minutes.

Parting Thoughts

Notebook + OneLake + Direct Lake is emerging as a very Powerful Pattern , specially when the data is big and freshness is important, but somehow users needs to learn how to write Python code, I am afraid that’s not obvious at all for the average data analyst, maybe this AI thing can do something about ? that’s the big question.

Does it mean, Notebook is the best solution for all use cases ? I don’t know, and I don’t think there is such a thing as a universal best practice in ETL, as far as I am concerned, there is  only one hard rule.

Save raw data when the source system is volatile.

anything else is a matter of taste 🙂

Use Fabric Notebook code based orchestration tool to avoid concurrent write conflicts.

I had a simple data ingestion use case, Notebook A inserts data to a Delta Table every 5 minutes and Notebook B backfills the same table with new fields but only at 4 am.

Initially I just scheduled Notebook A to run every 5 minutes and Notebook B to run at 4 AM , did not work as I got a write conflict, basically Notebook B take longer time to process the data, when it is ready to update the table, it is a bit too late as it was already modified by Notebook A and you get this error

Workarounds

Solution 1 :  Schedule Notebook A to run every 5 minutes except from 4 AM to 4:15 AM, today it is not supported in Fabric scheduler ( although it works fine in Azure Data Factory).

Solution 2 : Partition by Date to avoid Spark writing to the same file at the same time, which is fine for my table as it is big enough around 230 Millions spread over 6 years, generating 2000 files is not the end of the world, but the same approach does not work for another table which is substantially smaller around 3 millions

Solution : Turn out, there is a code base orchestration tool in Fabric Notebook 

I knew about MSSparkUtils mainly because Sandeep Pawar can’t stop talking  about it 🙂 but I did not know that it does orchestration too,  in my case  the solution was trivial.

Add a check in notebook A if there is a new file to backfill ; if yes call Notebook B

if len(files_to_upload_full_Path) > 0 :

 mssparkutils.notebook.run("Transform_Backfill_Previous_Day")

And it did work beautifully ( I know the feeling, it is easy when you know it)

Notice that the second Notebook runs using the same Runtime, so it is faster and maybe even cheaper.

Ok there is More 

Conditionally running a notebook based on a new file arrival is a simple use case, but you can do more, for example you can run multiple notebooks in parallel or even define complex relationships between Notebooks using a DAG with just Python code !!!!

Take Away

 This is just a personal observation , because Fabric was released with all the Engines at the Same time, a lot of very powerful features and patterns did not get a chance to be fully exposed and appreciated, and based on some anecdotal evidence on twitter , it seems I am not the only one who never heard about Fabric Notebook code orchestration. 

For PowerBI people Starting with Fabric, Python is just too Powerful. Yes, we did fine without it all these years, but  if you have any complex data transformation scenarios, Python is just too important to ignore. 

Thanks Jene Zhang for answering my silly questions.