Import SharePoint data to Fabric
Before you continue reading the article, I would like to point out that it already assumes certain knowledge of APIs, Python, SharePoint, Microsoft Fabric
Let’s have a SharePoint list as a data source and a Fabric Lakehouse as a storage for analytics data. But how do we get the data from point A to point B? Microsoft Fabric offers the following options:
- Copy activity in pipelines
- Dataflows Gen v2
- Apache Spark Notebook / Job definition
In this article, I will explain the last option, which is to import using Apache Spark Notebook, specifically using Python. I won’t go into comparing the other options. However, I can recommend you this article from friend of mine 🙂
|SharePoint REST API||Microsoft Graph API|
|older, used by the SharePoint Online||more modern, unified interface, wrapper for SharePoint REST API|
|exposes many endpoints||exposes only a small part of the endpoints, still being developed|
|supports expanding complex columns (e.g. Person, Lookup)||does not support expanding complex columns|
|uses internal column names||uses display column names|
|complicated authentication settings||easier to set up authentication|
Before we start implementing data import, we need to decide which API to use. I decided to use the SharePoint REST API.
The first reason is that it works with internal column names, which ensures that the import mechanism doesn’t break the moment someone changes the display name of a column.
The second reason is the support for column expansion, this benefit is important because of Person columns. SharePoint stores the user ID in the Person column, but it is not an global ID from Azure Active Directory, it is an integer ID specific to SharePoint only. In order to relate user data from SharePoint with other systems, we need a better identifier, e.g. email, which we can get just by expanding the field. In the case of the Graph API it would be more complicated, to get the user’s email from their ID we would have to look at a special hidden list in SharePoint called the User Information List. Which makes the whole process more complicated and also slows it down! 😕
So it’s decided to import using Python and use SharePoint REST API for that. What next? 🤔
Now we need to setup authentication. Since the data import takes place without the user’s interaction, we want to use the Client Credentials Flow. If we use the Microsoft Graph API, we would only need to create a new application object in Azure AD and set the required scopes. For the SharePoint REST API, the process is different.
The application object is created directly in SharePoint using the following link:
A form is displayed, which you fill out as follows. The client id and the client secret are set after selecting the generate buttons and the title field is completely up to you.
❗It is important to securely save the client id and secret❗
In case you are unable to create the application object, make sure that you are logged in as SharePoint Admin and also that it is allowed to create application objects in your tenant. Alternatively, the tenant settings can be changed using following PowerShell script.
Once the application object registration is complete, you must grant it permissions at the link:
In the form, just fill in the app id (client id from previous step) and after pressing the lookup button, the details of the object will be filled in. In the last field, just add the following definition that will grant full control to the client app. Setting up restricted access is documented on this link.
After pressing the create button, you must confirm the permissions in the dialog using the trust it button.
Now we have created the application object and we can move to Fabric, where we need a workspace in which the Lakehouse and Notebook are created and linked.
In order to query the SharePoint API we will need an access token. We can get it using this custom
get_access_token function, which sends POST request to OAuth endpoint with client credentials in the body and returns access token.
Test the access token retrieval. If you receive the error message “Token type is not allowed”, you must enable Custom app authentication using PowerShell as described when creating the application object.
By the way, I strongly recommend saving the client id and secret in Azure Key Vault and retrieving them in the notebook directly from Key Vault as follows.
Now we can finally retrieve data from a SharePoint list using the SharePoint REST API. To do this, we will need information about the list being loaded, specifically list ID and the site name. Both values can be read from the URL of the SharePoint list’s settings page, which is as follows:
Once we have these values, we can implement
First of all, I would like to mention pagination. The endpoint for reading list items returns a maximum of 5000 items. If there are more records in the list, the returned response contains odata.nextLink with a link to another page. Thus, the
get_all_list_items function keeps calling the endpoint while more pages are available and then combines the results into a single list which it returns.
There is also a select and expand parameter. In select we specify which columns should be returned. Since we are querying for a nested value in a complex column (Author/Email ), it is necessary to expand the complex column. Lookup columns work the same way.
Result looks like this:
In order to save the data to Lakehouse, we need to create a dataframe with a schema. The schema is important so we can later flatten the data structure (get rid of the nested Author/EMail ).
Finally, now we just have to flatten the structure using the select method and then save the data into a table in Lakehouse.
And here goes the result! 🥳
To sum it up 😵💫 We were able to retrieve data from a SharePoint list without user interaction with respect to pagination, plus we were able to expand complex columns, flatten the data structure and save it as a table in Fabric Lakehouse. Now you just need to set up a schedule ⌚ or plug the notebook into the pipeline and you can start building your ELT/ETL process!
Finally, I would like to add that in the case of implementation for many data sources, a lot of redundant source code can be generated. I address this issue in my repository fabric-sharepoint-import.