woman-laptop-code.jpg

Migrating a Document Library and Retaining Authorship Information

Lately, I’ve had the opportunity to work on a couple of on-premises SharePoint 2010 to SharePoint 2013 migrations which proved to be both fun and challenging.  Our team chose to leverage PowerShell and CSOM for handling all the heavy lifting when it came to provisioning the new site and migrating data.  This, in turn, lead to one of the more interesting things I got to do, which was write a PowerShell script that would migrate the contents of a document library from the old 2010 site to a document library in the new 2013 site, while at the same time retaining authorship information.

As I researched how to do this, I found that there wasn’t any one source that explained all the steps needed in a nice concise manner. I found dribs and drabs here and there, usually around one step or another, or found that the code provided was using server side code which wasn’t help in my case.

So, I decided it would be worthwhile to pull all the parts into one document, both for my own reference as well as for others. And yes, I admit it. I shamelessly pilfered from some of my co-workers’ fine scripts to piece this together. Thanks, team, for being so awesome!

Here is an outline of the high-level steps needed to move files between sites. You can find the full script at the bottom.

Basic steps

  • Get a context reference for each of the sites (source and target). This will allow you to work on both sites at the same time.
  • Load both the source and target libraries into their appropriate contexts. Load the RootFolder for the target library as well. This will be needed both when saving the document and when updating the metadata.
  • Have a valid user id available to use in place of any invalid users found in the metadata being copied. Ensure the ‘missing user’ user id.
  • Query the source library using a CAML query to get a list of all items to be processed. You can apply filters here to limit results as needed. (Attached code has parameters for specifying start and end item ids).
  • Loop through the items returned by the CAML query
    • Get a reference to the source item using the id
    • Get a reference to the actual file from the source item
    • Load the source file by calling ‘OpenBinaryDirect’ (uses the file ServerRelativeUrl value)
    • Write it back out to the target library by calling ‘SaveBinaryDirect’ on the just loaded file stream
    • Copy over the metadata:
      • Get a reference to the new item just added in the target library
      • Populate the target item metadata fields using values from the source item
      • Update the item
    • Loop

That’s it, in a nutshell. There are all sorts of other things you can do to pretty it up, but I thought I would keep this simple as a quick reference both for myself and others. Be sure to check the top of the script below for other notes about what it does and does not do.

<#
.SYNOPSIS
Copies documents from one library into another across sites.
This was tested using SharePoint 2010 as the source and both SharePoint 2010 and SharePoint 2013 as the target.

Notes:
* Parameters can either be passed in on the command line or pre-populated in the script
* Example for calling from command line:
./copy-documents.ps1 "http://testsite1/sites/testlib1/" "domain\user" "password" "source lib display name" "http://testsite2/sites/testlib2/" "domain\user" "password" "target lib display name" "domain\user" 1 100

Features:
* Can cross site collections and even SP versions (e.g. SP2010 to SP2013)
* Allows you to specify both the source and target document library to use
* Can retain created, created by, modified, modified by and other metadata of the original file
* Can specify a range of files to copy by providing a starting id and ending id
* When copying metadata such as created by, will populate any invalid users with the provided 'missing user' value
* Uses a cache for user data so it doesn't have to run EnsureUser over and over for the same person

Limitations:
* Does not currently traverse folders within a document library.
* This only copies.  It does not remove the file from the source library when done.

#>

[CmdletBinding()]
param(
	[Parameter(Mandatory=$false)]
	[string]$sourceSiteUrl = "",
    [Parameter(Mandatory=$false)]
	[string]$sourceUser = "",
	[Parameter(Mandatory=$false)]
	[string]$sourcePwd = "",
	[Parameter(Mandatory=$false)]
    [string]$sourceLibrary = "",
	[Parameter(Mandatory=$false)]
	[string]$targetSiteUrl = "",
    [Parameter(Mandatory=$false)]
	[string]$targetUser = "",
	[Parameter(Mandatory=$false)]
	[string]$targetPwd = "",
    [Parameter(Mandatory=$false)]
    [string]$targetLibrary = "",
    [Parameter(Mandatory=$false)]
    [string]$missingUser = "",
	[Parameter(Mandatory=$false)]
    [int]$startingId = -1,
    [Parameter(Mandatory=$false)]
    [int]$endingId = -1
)

## Load the libraries needed for CSOM
## Replace with the appropriate path to the libs in your environment
Add-Type -Path ("c:\dev\libs\Microsoft.SharePoint.Client.dll")
Add-Type -Path ("c:\dev\libs\Microsoft.SharePoint.Client.Runtime.dll")

function Main {
	[CmdletBinding()]
	param()
	
	Write-Host "[$(Get-Date -format G)] copy-documents.ps1: library $($sourceLibrary) from $($sourceSiteUrl) to $($targetSiteUrl)"
	
    # Get the context to the source and target sites
	$sourceCtx = GetContext $sourceSiteUrl $sourceUser $sourcePwd
	$targetCtx = GetContext $targetSiteUrl $targetUser $targetPwd

    # Ensure the "missing user" in the target environment
    $missingUserObject = $targetCtx.Web.EnsureUser($missingUser)
    $targetCtx.Load($missingUserObject)
	
	## Moved the try/catch for ExecuteQuery to a function so that we can exit gracefully if needed
	ExecuteQueryFailOnError $targetCtx "EnsureMissingUser"

	## Start the copy process
	CopyDocuments $sourceCtx $targetCtx $sourceLibrary $targetLibrary $startingId $endingId $missingUserObject
}

function CopyDocuments {
	[CmdletBinding()]
	param($sourceCtx, $targetCtx, $sourceLibrary, $targetLibrary, $startingId, $endingId, $missingUserObject)

    $copyStartDate = Get-Date

    # Get the source library
    $sourceLibrary = $sourceCtx.Web.Lists.GetByTitle($sourceLibrary)
    $sourceCtx.Load($sourceLibrary)
	ExecuteQueryFailOnError $sourceCtx "GetSourceLibrary"

    # Get the target library
    $targetLibrary = $targetCtx.Web.Lists.GetByTitle($targetLibrary)
    $targetCtx.Load($targetLibrary)
	## RootFolder is used later both when copying the file and updating the metadata.
    $targetCtx.Load($targetLibrary.RootFolder)
	ExecuteQueryFailOnError $targetCtx "GetTargetLibrary"

    # Query source list to retrieve the items to be copied
    Write-Host "Querying source library starting at ID $($startingId) [Ending ID: $($endingId)]"
    $sourceItems = @(QueryList $sourceCtx $sourceLibrary $startingId $endingId) # Making sure it returns an array
    Write-Host "Found $($sourceItems.Count) items"

    # Loop through the source items and copy
    $totalCopied = 0
    $userCache = @{}
    foreach ($sourceItemFromQuery in $sourceItems) {

        $totalCount = $($sourceItems.Count)

        if ($sourceItemFromQuery.FileSystemObjectType -eq "Folder") {
            Write-Host "skipping folder '$($sourceItemFromQuery['FileLeafRef'])'"
            continue
        }
		Write-Host "--------------------------------------------------------------------------------------"
        Write-Host "[$(Get-Date -format G)] Copying ID $($sourceItemFromQuery.ID) ($($totalCopied + 1) of $($totalCount)) - file '$($sourceItemFromQuery['FileLeafRef'])'"

        # Get the source item which returns all the metadata fields
        $sourceItem = $sourceLibrary.GetItemById($sourceItemFromQuery.ID)
        # Load the file itself into context
		$sourceFile = $sourceItem.File
        $sourceCtx.Load($sourceItem)
        $sourceCtx.Load($sourceFile)
		ExecuteQueryFailOnError $sourceCtx "GetSourceItemById"

		## Call the function used to run the copy
        $targetId = CopyDocument $sourceCtx $sourceItem $sourceFile $sourceItemFromQuery $targetCtx $targetLibrary $userCache $missingUserObject
        
		$totalCopied++
    }

    # Done - let's dump some stats
    $copyEndDate = Get-Date
    $duration = $copyEndDate - $copyStartDate
    $minutes = "{0:F2}" -f $duration.TotalMinutes
    $secondsPerItem = "{0:F2}" -f ($duration.TotalSeconds/$totalCopied)
    $itemsPerMinute = "{0:F2}" -f ($totalCopied/$duration.TotalMinutes)
	Write-Host "--------------------------------------------------------------------------------------"
    Write-Host "[$(Get-Date -format G)] DONE - Copied $($totalCopied) items. ($($minutes) minutes, $($secondsPerItem) seconds/item, $($itemsPerMinute) items/minute)"
}

### Function used to copy a file from one place to another, with metadata
function CopyDocument {
    [CmdletBinding()]
    param($sourceCtx, $sourceItem, $sourceFile, $sourceItemFromQuery, $targetCtx, $targetLibrary, $userCache, $missingUserObject)

    ## Validate the Created By and Modified By users on the source file
    $authorValueString = GetUserLookupString $userCache $sourceCtx $sourceItem["Author"] $targetCtx $missingUserObject
    $editorValueString = GetUserLookupString $userCache $sourceCtx $sourceItem["Editor"] $targetCtx $missingUserObject

    ## Grab some important bits of info
	$sourceFileRef = $sourceFile.ServerRelativeUrl
    $targetFilePath = "$($targetLibrary.RootFolder.ServerRelativeUrl)/$($sourceFile.Name)"

    ## Load the file from source
    $fileInfo = [Microsoft.SharePoint.Client.File]::OpenBinaryDirect($sourceCtx, $sourceFileRef)
    ## Write file to the destination
    [Microsoft.SharePoint.Client.File]::SaveBinaryDirect($targetCtx, $targetFilePath, $fileInfo.Stream, $true)

    ## Now get the newly added item so we can update the metadata
    $item = GetFileItem $targetCtx $targetLibrary $sourceFile.Name $targetLibrary.RootFolder.ServerRelativeUrl

    ## Replace the metadata with values from the source item
    $item["Author"] = $authorValueString
    $item["Created"] = $sourceItem["Created"]
    $item["Editor"] = $editorValueString
    $item["Modified"] = $sourceItem["Modified"]

	## Update the item
    $item.Update()
    ExecuteQueryFailOnError $targetCtx "UpdateItemMetadata"

    Write-Host "[$(Get-Date -format G)] Successfully copied file '$($sourceFile.Name)'"

}

## Get a reference to the list item for the file.
function GetFileItem {
	[CmdletBinding()]
	param($ctx, $list, $fileName, $folderServerRelativeUrl)

	$camlQuery = New-Object Microsoft.SharePoint.Client.CamlQuery
	if ($folderServerRelativeUrl -ne $null -and $folderServerRelativeUrl.Length -gt 0) {
		$camlQuery.FolderServerRelativeUrl = $folderServerRelativeUrl
	}
	$camlQuery.ViewXml = @"
<View>
	<Query>
   		<Where>
      		<Eq>
        		<FieldRef Name='FileLeafRef' />
        		<Value Type='File'>$($fileName)</Value>
      		</Eq>
		</Where>
	</Query>
</View>
"@

	$items = $list.GetItems($camlQuery)
	$ctx.Load($items)
	$ctx.ExecuteQuery()
	
	if ($items -ne $null -and $items.Count -gt 0){
		$item = $items[0]
	}
	else{
		$item = $null
	}
	
	return $item
}

## Validate and ensure the user
function GetUserLookupString{
	[CmdletBinding()]
	param($userCache, $sourceCtx, $sourceUserField, $targetCtx, $missingUserObject)

    $userLookupString = $null
    if ($sourceUserField -ne $null) {
        if ($userCache.ContainsKey($sourceUserField.LookupId)) {
            $userLookupString = $userCache[$sourceUserField.LookupId]
        }
        else {
            try {
                # First get the user login name from the source
                $sourceUser = $sourceCtx.Web.EnsureUser($sourceUserField.LookupValue)
                $sourceCtx.Load($sourceUser)
                $sourceCtx.ExecuteQuery()
            }
            catch {
                Write-Host "Unable to ensure source user '$($sourceUserField.LookupValue)'."  
            }

            try {
                # Now try to find that user in the target
                $targetUser = $targetCtx.Web.EnsureUser($sourceUser.LoginName)
                $targetCtx.Load($targetUser)
                $targetCtx.ExecuteQuery()
                
                # The "proper" way would seem to be to set the user field to the user value object
                # but that does not work, so we use the formatted user lookup string instead
                #$userValue = New-Object Microsoft.SharePoint.Client.FieldUserValue
                #$userValue.LookupId = $user.Id
                $userLookupString = "{0};#{1}" -f $targetUser.Id, $targetUser.LoginName
            }
            catch {
                Write-Host "Unable to ensure target user '$($sourceUser.LoginName)'."
            }
            if ($userLookupString -eq $null) {
                Write-Host "Using missing user '$($missingUserObject.LoginName)'."
                $userLookupString = "{0};#{1}" -f $missingUserObject.Id, $missingUserObject.LoginName
            }
            $userCache.Add($sourceUserField.LookupId, $userLookupString)
        }
    }

	return $userLookupString
}

## Pull ids for the source items to copy
function QueryList {
    [CmdletBinding()]
    param($ctx, $list, $startingId, $endingId)

    $camlQuery = New-Object Microsoft.SharePoint.Client.CamlQuery
    $camlText = @"
<View>
    <Query>
        <Where>
            {0}
        </Where>
        <OrderBy>
            <FieldRef Name='ID' Ascending='True' />
        </OrderBy>
    </Query>
    <ViewFields>
        <FieldRef Name='ID' />
        {1}
    </ViewFields>
    <QueryOptions />
</View>
"@

    if ($endingId -eq -1) {
        $camlQuery.ViewXml = [System.String]::Format($camlText, "<Geq><FieldRef Name='ID' /><Value Type='Counter'>$($startingId)</Value></Geq>", "")
    }
    else {
        $camlQuery.ViewXml = [System.String]::Format($camlText, "<And><Geq><FieldRef Name='ID' /><Value Type='Counter'>$($startingId)</Value></Geq><Leq><FieldRef Name='ID' /><Value Type='Counter'>$($endingId)</Value></Leq></And>", "")
    }

    $items = $list.GetItems($camlQuery)
    $ctx.Load($items)
	ExecuteQueryFailOnError $ctx "QueryList"

    return $items
}

function GetContext {
	[CmdletBinding()]
	param($siteUrl, $user, $pwd)
	
	# Get the client context to SharePoint
	$ctx = New-Object Microsoft.SharePoint.Client.ClientContext($siteUrl)
	$securePwd = ConvertTo-SecureString $pwd -AsPlainText -Force
	$cred = New-Object PSCredential($user, $securePwd)
	$ctx.Credentials = $cred
	
	return $ctx
}

function ExecuteQueryFailOnError {
	[CmdletBinding()]
	param($ctx, $action)
	
	try {
		$ctx.ExecuteQuery()
	}
	catch {
		Write-Error "$($action) failed with $($_.Exception.Message).  Exiting."
		exit 1
	}
}

### Start the process
Main
Caroline SosebeeMigrating a Document Library and Retaining Authorship Information

4 comments

Join the conversation
  • Kirk Liemohn - December 8, 2016 reply

    Great post!

  • Eric Bowden - December 9, 2016 reply

    Nice pattern with ExecuteQueryFailOnError as a central method for all calls to ExecuteQuery. I happened to have been reading about throttling in SharePoint online recently and how retries are the expected behavior for when user code is throttled. This is more easily accomplished when all of the calls to ExecuteQuery flow through one method, as you have.

  • Tommy Ryan - December 9, 2016 reply

    Caroline, did not know you were so technical ;).

    Caroline Sosebee - December 12, 2016

    Uh … thanks … I think. LOL

Join the conversation

This site uses Akismet to reduce spam. Learn how your comment data is processed.