
Azure Data Architect | DBA

Create Data “Thumbnail” in Azure Data Lake Store


Recently, a colleague could not create a model in Power BI Desktop because the dataset files were too large; the smallest file was 45 GB. He asked if we could rerun the Azure Data Lake Analytics (U-SQL) job to create a smaller dataset. We could, but one of the jobs that creates the datasets runs for about four hours. The solution I came up with was to create a dataset “thumbnail”: a small subset of the data that would work in his environment, an i5 laptop with 4 GB of RAM and a 250 GB hard drive. I used PowerShell to create small files containing the first 500 rows. The number of rows was arbitrary and could easily be 10,000 if needed. Once the report and dataset are published, the dataset source is changed to point to the full dataset.

 

Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionId "53l3Ct-your-OWn-5U85CrIPtiOn"
# set up some variables
$filePath = "/adls/path/to/large/data/file.txt"
$newPath = "/adls/path/to/small/data/file.txt"
$accountName = "adlsName"

# get the first 500 lines from the full file
$content = Get-AzureRmDataLakeStoreItemContent -AccountName $accountName -Path $filePath -Head 500

# create thumbnail file
New-AzureRmDataLakeStoreItem -AccountName $accountName -Path $newPath -Force

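# append a line feed (LF) to each line; start from an empty string
# so a leftover $thumbnail from an earlier run is not carried over
$thumbnail = ""
foreach ($line in $content) {
    $thumbnail += "$line`n"
}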

# populate the thumbnail with data
Add-AzureRmDataLakeStoreItemContent -Account $accountName -Path $newPath -Value "$thumbnail"

# preview the thumbnail
Get-AzureRmDataLakeStoreItemContent -AccountName $accountName -Path $newPath -Head 10

# clean up
Clear-Variable thumbnail
Clear-Variable content
Clear-Variable line
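
Since the row count is arbitrary, the same steps can be wrapped in a function that takes it as a parameter. This is only a minimal sketch: the function name New-AdlsThumbnail and its parameter names are hypothetical (they are not part of the AzureRM module), and it assumes you are already logged in and that -Head returns the rows as a collection of strings, as the script above does.

```powershell
# Hypothetical helper: New-AdlsThumbnail is NOT an AzureRM cmdlet.
# It reuses the same three AzureRM cmdlets as the script above.
function New-AdlsThumbnail {
    param(
        [Parameter(Mandatory)] [string] $AccountName,
        [Parameter(Mandatory)] [string] $SourcePath,
        [Parameter(Mandatory)] [string] $ThumbnailPath,
        [int] $RowCount = 500
    )

    # read the first $RowCount rows of the full file
    $rows = Get-AzureRmDataLakeStoreItemContent -AccountName $AccountName -Path $SourcePath -Head $RowCount

    # join the rows with line feeds and terminate the last row
    $text = ($rows -join "`n") + "`n"

    # create (or overwrite) the empty thumbnail file, then write the rows into it
    New-AzureRmDataLakeStoreItem -AccountName $AccountName -Path $ThumbnailPath -Force | Out-Null
    Add-AzureRmDataLakeStoreItemContent -Account $AccountName -Path $ThumbnailPath -Value $text
}

# usage: build a 10,000-row thumbnail with the variables defined earlier
New-AdlsThumbnail -AccountName $accountName -SourcePath $filePath -ThumbnailPath $newPath -RowCount 10000
```

Using -join here instead of the foreach loop avoids rebuilding the string on every iteration, which starts to matter once the row count goes well past 500.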
