A program to download PBase images

I’ve been using PBase.com to store my images for more than 10 years. For various reasons, I’m considering moving from PBase to my own server. Recreating the 70+ galleries, containing several thousand images, from my original files (which now number in the high tens of thousands) would be a daunting task.

I started to look for a program to download my PBase library of images and galleries, but couldn’t find any existing tool, nor a feature in PBase itself, that would let me download all my images. So, I decided to write one in C# using WinForms.

The source code and a pre-built executable are available on GitHub at https://github.com/MarioGiannini/MGPBaseDownloader

The user interface is pretty simple: just a text box for the URL of the PBase user’s root gallery, and a folder where the downloaded files will be stored. I’ll be honest and further state that there is basically zero error handling for things like an invalid (or even blank) URL or destination folder.

The majority of the work is done by the PBaseDownloader class, and specifically by the ProcessUrlEx function. ProcessUrlEx is a recursive function that loads a web page using WebClient and then parses it using HtmlAgilityPack.HtmlDocument methods. When it identifies a thumbnail, it determines whether the thumbnail leads to another gallery or to an image display page, and handles it accordingly. When the thumbnail refers to another gallery, ProcessUrlEx calls itself recursively to process that sub-gallery.
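
In outline, the recursion looks something like this. This is a simplified sketch, not the exact code from the repository; IsGalleryLink and DownloadImagePage are hypothetical helpers standing in for logic the real ProcessUrlEx performs inline:

void ProcessUrlEx(String url, String saveFolder)
{
   // ...download the page and select the thumbnail nodes (shown in the next snippet)...
   foreach (HtmlNode Node in ItemNodes)
   {
      String ItemLink = ExtractString(Node.InnerHtml, "href=");
      if (IsGalleryLink(ItemLink))                  // hypothetical helper: does this link to a gallery?
         ProcessUrlEx(ItemLink, subGalleryFolder);  // recurse into the sub-gallery
      else
         DownloadImagePage(ItemLink, saveFolder);   // hypothetical helper: save the image and its .txt file
   }
}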

HtmlAgilityPack was added using the NuGet package manager.
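
If you’re rebuilding the project yourself, the package can be installed from the Visual Studio Package Manager Console:

   Install-Package HtmlAgilityPack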

Below is some pseudo code that demonstrates downloading the page and parsing it:

using (WebClient Client = new WebClient())
{
   String Page = Client.DownloadString(url);  // Download the web page's HTML as a string
   HtmlAgilityPack.HtmlDocument Doc = new HtmlAgilityPack.HtmlDocument();
   Doc.LoadHtml(Page);

   // Select every table cell that holds a thumbnail
   HtmlNodeCollection ItemNodes = Doc.DocumentNode.SelectNodes("//td[@class='thumbnail']");
   if (ItemNodes == null)
      Results.AppendText(">> No images or folders\r\n");
   else
   {
      foreach (HtmlNode Node in ItemNodes)
      {
         // Each thumbnail's anchor points to a sub-gallery or an image display page
         String ItemLink = ExtractString(Node.InnerHtml, "href=");
      }
   }
}

The example above uses the SelectNodes method to find the td elements with a class of ‘thumbnail’; each of these is a thumbnail that links either to another gallery or to an image display page.

Note: The ExtractString method used in the sample code belongs to the PBaseDownloader class in the source code. It simply extracts the quoted portion of a string following an attribute like “href=”.
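
For reference, a helper along those lines could look like the sketch below. The actual implementation lives in the repository; this is just an illustration of the idea:

// Minimal sketch of an ExtractString-style helper: given HTML such as
// <a href="/gallery/123"> and the attribute name "href=", return the
// quoted value following it (e.g. "/gallery/123").
static String ExtractString(String html, String attribute)
{
   int pos = html.IndexOf(attribute);
   if (pos < 0)
      return "";
   int start = html.IndexOf('"', pos + attribute.Length);  // opening quote
   if (start < 0)
      return "";
   int end = html.IndexOf('"', start + 1);                 // closing quote
   if (end < 0)
      return "";
   return html.Substring(start + 1, end - start - 1);
}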

The pseudo code below demonstrates how to identify a JPG link (from PBase formatting) and download the file:

// Locate the image container div on a PBase image display page
HtmlNode jpgnode = ImageDoc.DocumentNode.SelectSingleNode("//div[@id='imgdiv']");
if (jpgnode != null)
{
   // Pull the image URL from the img tag's src attribute and save the file
   String jpgLink = ExtractString(jpgnode.InnerHtml, "src=");
   Client.DownloadFile(jpgLink, SaveFile);
}

When iterating through the elements of a page, each element (an image page or a sub-gallery) is assigned a simple sequential index, and that index becomes part of the folder or image name. This is done to help maintain the order of the items as they appear on PBase. When the folders and filenames are created, a zero-padded prefix is used to ensure identical ordering (even in Explorer, where folders and files are separated into groups). For example:

001-Cross Country
    001.jpg
    001.txt
    002.jpg
    002.txt
002-People
    001.jpg
    001.txt
    002.jpg
    002.txt
    003.jpg
    003.txt
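
Names like these can be generated with a simple format string. A minimal sketch, where index and galleryTitle are illustrative variable names, not taken from the source:

// Zero-pad the sequential index so alphabetical and numeric ordering agree.
// 'index' and 'galleryTitle' are illustrative names, not from the repo.
String folderName = String.Format("{0:D3}-{1}", index, galleryTitle);  // e.g. "001-Cross Country"
String imageName = String.Format("{0:D3}.jpg", index);                 // e.g. "002.jpg"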

The .txt files corresponding to the JPG files contain the title and EXIF data (if available) that were also parsed from the HTML page. In addition, each gallery folder contains the thumbnail of the folder (Thumbnail.jpg) and the description that the PBase user added for the gallery (Description.txt).
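
Writing each sidecar file is straightforward once the title and EXIF text have been parsed from the page. A rough sketch, where title and exifText are assumed variable names:

// Illustrative only: save the parsed title and EXIF text next to the image,
// so "002.jpg" gets a matching "002.txt". 'title' and 'exifText' are assumed names.
System.IO.File.WriteAllText(System.IO.Path.ChangeExtension(SaveFile, ".txt"),
   title + "\r\n" + exifText);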

I have not tested it extensively, but it’s working today on the few PBase accounts I’ve tried it with. In time, PBase may change its formatting, which could break the HTML parsing logic; this is a common downside of web-scraping programs.

In general, I frown on web scraping, but sometimes, when you want a convenient way to gather your own information from a site, it may be the only way to do it. Obviously, such a program could be used to ‘steal’ data that’s already public, and I hope that nobody plans on that. I just thought this was an interesting demonstration of some basic web-scraping operations in C#.