Have you ever wanted to create an automated way to load, manipulate, and then act upon a web page?
Using CEFSharp (and some strategic JavaScript), you can create headless (no GUI) interfaces of Chrome’s parent browser, Chromium, and then instruct them to do pretty much anything a web browser can do.
This is a tutorial about using CEFSharp to accomplish some basic web functions with simple examples. We’ll create three automated bots that can simulate user web interaction and programmatically react to browser events using CEF and the CEFSharp library. You can follow along by copying the code provided or by downloading the repo here.
Background
First off, what is CEFSharp?
CEFSharp is a .NET binding of the Chromium Embedded Framework, or CEF, an open-source framework developed by Marshall Greenblatt that allows you to embed an instance of the Chromium web browser into your own application. It’s so well suited to creating desktop applications with web-technology-based components that you’ve probably used it without even realizing it. Here is its website.
There are a lot of other ways to pull data off the internet. Web scrapers are pretty common, but they are not always well suited to scraping Single-Page Apps or pages with more complicated authentication schemes.
That’s where CEFSharp’s “Offscreen” library comes in. Using it, you can simulate user web interaction and programmatically control, track, and react to browser events that users don’t generally have access to.
Setup
Let’s keep things simple for our demo and start with a basic .NET Console Application.
Next we’ll grab the CefSharp.OffScreen NuGet package.
Now, go to your project properties, create an x64 configuration and set your Platform Target to x64. (x86 also works if that’s your preferred flavor).
From here, we’ll be using a modified version of the CefSharp.MinimalExample.OffScreen demo, so feel free to check it out or start your own project off of it (Note: you will have to add System.Drawing to your project references to get the bitmap functionality working).
Here’s the basic Program.cs that we will be starting with:
using System;
using System.IO;
using CefSharp;
using CefSharp.OffScreen;

namespace CEFSharpDemo
{
    class CEFSharpDemo
    {
        private static ChromiumWebBrowser browser;
        private const string initialUrl = "https://www.google.com/";

        public static void Main(string[] args)
        {
            CefSharpSettings.SubprocessExitIfParentProcessClosed = true;

            var settings = new CefSettings()
            {
                // verbatim string (@) so the backslash is not read as an escape sequence
                CachePath = Path.Combine(
                    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
                    @"CefSharp\Cache")
            };
            Cef.Initialize(settings, performDependencyCheck: true, browserProcessHandler: null);

            browser = new ChromiumWebBrowser(initialUrl);
            browser.LoadingStateChanged += BrowserLoadingStateChanged;

            // keep the process alive until a key is pressed
            Console.ReadKey();
            Cef.Shutdown();
        }

        private static void BrowserLoadingStateChanged(object sender, LoadingStateChangedEventArgs e)
        {
            if (!e.IsLoading)
            {
                Console.WriteLine($"page {initialUrl} loaded!");
            }
            else
            {
                Console.WriteLine($"page {initialUrl} is loading!");
            }
        }
    }
}
You may notice from this snippet that we’re handling the browser’s LoadingStateChanged event in BrowserLoadingStateChanged.
Any work that we want to occur after the page has loaded will live here, and should be within the !e.IsLoading block.
Alright, let’s create some web bots!
BOT 1
Bot 1 will log into a page, grab some information, and spit it out into a file.
We will be using a scraper testing webpage as our test URL, since we can treat it like a sandbox and nobody will mind.
First, let’s change the initialUrl string to match the address we want to visit first:
const string initialUrl = "http://testing-ground.scraping.pro/login";
Now, let’s check out the webpage in the developer console (F12 in most browsers), grab the HTML element IDs for the username and password fields, and then set up a button click.
document.querySelector('#usr').value = 'admin';
document.querySelector('#pwd').value = '12345';
document.querySelector('input[type=submit]').click();
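The snippet above assumes all three selectors match. If one of them doesn’t, querySelector returns null and the script throws. Here is a minimal sketch of a more defensive variant that reports whether it could fill the form. The fillLoginForm and makeFakeDocument names are hypothetical, made up for this example, and the fake document only stands in for the browser DOM so the snippet runs under Node; in a real page you would pass the page’s own document.

```javascript
// Hypothetical helper: fills the login form and reports whether it could.
function fillLoginForm(doc, username, password) {
  var usr = doc.querySelector('#usr');
  var pwd = doc.querySelector('#pwd');
  var submit = doc.querySelector('input[type=submit]');
  // querySelector returns null when nothing matches, so check before use
  if (!usr || !pwd || !submit) {
    return false;
  }
  usr.value = username;
  pwd.value = password;
  submit.click();
  return true;
}

// Stand-in for the browser DOM (assumption: only these three selectors exist).
function makeFakeDocument() {
  var elements = {
    '#usr': { value: '' },
    '#pwd': { value: '' },
    'input[type=submit]': {
      clicked: false,
      click: function () { this.clicked = true; }
    }
  };
  return {
    elements: elements,
    querySelector: function (selector) {
      return Object.prototype.hasOwnProperty.call(elements, selector)
        ? elements[selector]
        : null;
    }
  };
}

var loginDoc = makeFakeDocument();
var loginOk = fillLoginForm(loginDoc, 'admin', '12345');
console.log(loginOk, loginDoc.elements['#usr'].value);
```

Returning true or false gives the calling code something to inspect, instead of leaving it to guess whether the click ever happened.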
Now that we’re able to log in, let’s come up with a JavaScript function to grab what we want from the page. I’ve decided I want to pull the text out of all of the “li” elements on the screen and save them in a file on my computer.
I recommend using self-executing functions (note how the function is wrapped in parentheses: (function(){})()) to encapsulate and simplify script injection.
(function(){
    var lis = document.querySelectorAll('li');
    var result = [];
    result.push(document.querySelector('h3.success').innerText);
    for(var i = 0; i < lis.length; i++) {
        result.push(lis[i].innerText)
    }
    return result;
})()
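To see why the wrapper matters, here is a small sketch you can run under Node. A self-executing function is a single expression, so its return value becomes the value of the whole script, while a bare return at the top level of a script is a syntax error. The stubDocument object and the evaluateScript helper are hypothetical stand-ins for the page’s DOM and for the host’s script evaluation; they are not part of CEFSharp.

```javascript
// Stand-in for the page's DOM (hypothetical data, so the sketch runs under Node).
var stubDocument = {
  querySelector: function (selector) {
    return selector === 'h3.success' ? { innerText: 'WELCOME' } : null;
  },
  querySelectorAll: function (selector) {
    return selector === 'li'
      ? [{ innerText: 'first item' }, { innerText: 'second item' }]
      : [];
  }
};

// The same script as above, held as a string the way a host application would inject it.
var pageQueryScript =
  "(function(){" +
  "  var lis = document.querySelectorAll('li');" +
  "  var result = [];" +
  "  result.push(document.querySelector('h3.success').innerText);" +
  "  for (var i = 0; i < lis.length; i++) { result.push(lis[i].innerText); }" +
  "  return result;" +
  "})()";

// Stand-in for the host's script evaluation: direct eval can see the `document` parameter.
function evaluateScript(script, document) {
  return eval(script);
}

// The value of the function call expression is what the evaluation hands back.
var queryResult = evaluateScript(pageQueryScript, stubDocument);
console.log(queryResult);

// Without the wrapper, a top-level `return` does not even parse.
var bareReturnFails = false;
try {
  evaluateScript("return 1;", stubDocument);
} catch (err) {
  bareReturnFails = err instanceof SyntaxError;
}
console.log(bareReturnFails);
```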
Now that we have all of the JavaScript necessary to interact with the webpage, we can move on to coding the C# backend.
Almost all of the changes will be made to the BrowserLoadingStateChanged function, as shown below.
// also requires: using System.Collections.Generic; using System.Linq;
static bool isLoggedIn = false;

private static void BrowserLoadingStateChanged(object sender, LoadingStateChangedEventArgs e)
{
    if (!e.IsLoading)
    {
        if (!isLoggedIn)
        {
            // fill the username and password fields with their respective values, then click the submit button
            var loginScript = @"document.querySelector('#usr').value = 'admin';
                document.querySelector('#pwd').value = '12345';
                document.querySelector('input[type=submit]').click();";

            browser.EvaluateScriptAsync(loginScript).ContinueWith(u =>
            {
                isLoggedIn = true;
                Console.WriteLine("User Logged in.\n");
            });
        }
        else
        {
            // push the "success" field and the text from all 'li' elements into an array
            var pageQueryScript = @"
                (function(){
                    var lis = document.querySelectorAll('li');
                    var result = [];
                    result.push(document.querySelector('h3.success').innerText);
                    for(var i=0; i < lis.length; i++) { result.push(lis[i].innerText) }
                    return result;
                })()";

            var scriptTask = browser.EvaluateScriptAsync(pageQueryScript);
            scriptTask.ContinueWith(u =>
            {
                if (u.Result.Success && u.Result.Result != null)
                {
                    Console.WriteLine("Bot output received!\n\n");
                    var filePath = "output.txt";
                    var response = (List<dynamic>)u.Result.Result;
                    foreach (string v in response)
                    {
                        Console.WriteLine(v);
                    }
                    File.WriteAllLines(filePath, response.Select(v => (string)v).ToArray());
                    Console.WriteLine($"\n\nBot output saved to {filePath}");
                    Console.WriteLine("\n\nPress any key to close.");
                }
            });
        }
    }
}
The main functionality of Bot 1 is based around the EvaluateScriptAsync function call. Since this function returns a Task, we can run more code as soon as it finishes by chaining a ContinueWith call with an anonymous function.
Another important thing to point out is that the e.IsLoading check is being used along with the isLoggedIn variable to control when each script gets activated.
While executing, the bot pushes this info out into the console.
╔══════════════════════════════════════════════════╗
║            CEFSHARP-BOT #1 ACTIVATED             ║
╚══════════════════════════════════════════════════╝
User Logged in.

Bot output received!

WELCOME
Send user credentials via POST method
Receive, Keep and Return a session cookie
Process HTTP redirect (302)
Enter admin and 12345 in the form below and press Login
If you see WELCOME then the user credentials were sent, the cookie was passed and HTTP redirect was processed
If you see ACCESS DENIED! then either you entered wrong credentials or they were not sent to the server properly
If you see THE SESSION COOKIE IS MISSING OR HAS A WRONG VALUE! then the user credentials were properly sent but the session cookie was not properly stored or passed
If you see REDIRECTING... then the user credentials were properly sent but HTTP redirection was not processed
Click GO BACK to start again

Bot output saved to output.txt

Press any key to close.
When Bot 1 finishes, the messages from “WELCOME” through “Click GO BACK to start again” are stored in output.txt in the application’s working directory.
BOT 2
Our second bot will work like a classic web crawler with a simplified scope.
It will be configured to run a Google search and click the first link returned, while also saving the source code and a screenshot of each page it visits.
First off, let’s set our initial URL to Google.
const string initialUrl = "https://www.google.com/";
After that, we’re going to need to handle some additional events so we can more easily tell when the page is loaded enough to grab the source and a screenshot.
Let’s subscribe to these two events right after we create our browser object.
browser = new ChromiumWebBrowser(initialUrl);
browser.FrameLoadStart += Browser_FrameLoadStart;
browser.FrameLoadEnd += Browser_FrameLoadEnd;
browser.LoadingStateChanged += BrowserLoadingStateChanged;
Let’s go ahead and declare these event handler functions; I went ahead and let the autocomplete name them.
private static void Browser_FrameLoadStart(object sender, FrameLoadStartEventArgs e)
{
    Console.WriteLine($"{e.Url} loading!");
}
FrameLoadStart happens before anything on the page is loaded, so it’s not very useful for grabbing a screenshot but should be useful for debugging info.
static int pageCount = 0;

private static void Browser_FrameLoadEnd(object sender, FrameLoadEndEventArgs e)
{
    // capture the page number now; the continuation below may run after pageCount has changed
    var page = pageCount++;

    browser.GetSourceAsync().ContinueWith(v =>
    {
        File.WriteAllText($"{page}_page.html", v.Result);
    });

    // ScreenshotOrNull returns null when no screenshot is available, so check before saving
    var screenshot = browser.ScreenshotOrNull();
    if (screenshot != null)
    {
        screenshot.Save($"{page}_screenShot.png");
        screenshot.Dispose();
    }

    Console.WriteLine($"Saving {page}_screenShot and {page}_page for {e.Url}...");
}
Bot 2’s functionality is driven by the GetSourceAsync and ScreenshotOrNull function calls.
GetSourceAsync very simply returns the page’s source, and we dump it to a file named after the number of the page we are visiting.
ScreenshotOrNull returns a System.Drawing.Bitmap object (or null if no screenshot is available), which is then saved. We call Dispose afterwards to release it from memory (since Bitmap implements IDisposable, you could also dispose of it automatically with a using(){} statement).
Bot 2 outputs something like the following into the console window while running:
╔══════════════════════════════════════════════════╗
║            CEFSHARP-BOT #2 ACTIVATED             ║
╚══════════════════════════════════════════════════╝
https://www.google.com/ loading!
Saving 0_screenShot and 0_page for https://www.google.com/...
https://www.google.com/search?source=hp&ei=vAphXK2pDsOSjwT0ga_gAg&q=CEFSharp loading!
Saving 1_screenShot and 1_page for https://www.google.com/search?source=hp&ei=vAphXK2pDsOSjwT0ga_gAg&q=CEFSharp...
https://github.com/cefsharp/CefSharp loading!
Saving 2_screenShot and 2_page for https://github.com/cefsharp/CefSharp...
Bot 2’s execution ends with the specified screenshots and html pages (#_screenshot.png and #_page.html) being output into the application’s working directory.
This specific call to Google can generate between two and four screenshots depending on which execution path Google uses to return our search information or link click. All visited page source will be saved, but some screens (especially about:blank, if it’s loaded) may not register a screenshot.
BOT 3
Bot 3 will initially load Google, browse to another page, and then attach to a web page’s debug console output and show that information in the console window. I chose a very straightforward console game to demonstrate this concept.
Since we’re going to be loading a new page instead of staying on the initial page, we’ll need a way to easily track our current URL.
We can attach to the AddressChanged event to facilitate this.
browser = new ChromiumWebBrowser(initialUrl);
browser.AddressChanged += Browser_AddressChanged;
browser.LoadingStateChanged += BrowserLoadingStateChanged;
Now the implementation.
static string currentUrl = "";

private static void Browser_AddressChanged(object sender, AddressChangedEventArgs e)
{
    currentUrl = e.Address;
}
Now that we can easily track what page we’re on, let’s get to the main event.
// also requires: using System.Threading; (for Thread.Sleep)
static string fallingUrl = "http://rikukissa.github.io/falling/";
static string clickScript = "document.querySelector('#start').click()";

private static void BrowserLoadingStateChanged(object sender, LoadingStateChangedEventArgs e)
{
    if (!e.IsLoading)
    {
        if (currentUrl != fallingUrl)
        {
            browser.Load(fallingUrl);
        }
        else
        {
            browser.ConsoleMessage += Browser_ConsoleMessage;
            // give the page a moment to finish loading its javascript
            Thread.Sleep(500);
            browser.ExecuteScriptAsync(clickScript);
        }
    }
}
The current page is checked, and if it’s not where we want to be, we load the desired URL string fallingUrl.
After the fallingUrl page is loaded, we link up the ConsoleMessage event handler function and then go ahead and activate our page’s click action.
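A fixed Thread.Sleep is a guess about how long the page needs. One alternative (not what the bot above does) is to let the injected JavaScript poll for the button and click it once it exists. Here is a minimal sketch of that idea. The clickWhenReady function, its parameters, and the lateDocument stub are hypothetical names made up for this example, and the stub stands in for the browser DOM so the snippet runs under Node.

```javascript
// Hypothetical helper: retries until `selector` matches, then clicks the element.
// `schedule` defaults to setTimeout; it is a parameter only so the sketch can be
// driven synchronously in the demo below.
function clickWhenReady(doc, selector, attemptsLeft, delayMs, schedule) {
  schedule = schedule || setTimeout;
  var element = doc.querySelector(selector);
  if (element) {
    element.click();
    return;
  }
  // give up quietly once the attempts are used up
  if (attemptsLeft > 1) {
    schedule(function () {
      clickWhenReady(doc, selector, attemptsLeft - 1, delayMs, schedule);
    }, delayMs);
  }
}

// Stand-in DOM: pretend '#start' only shows up on the third lookup.
var queries = 0;
var clicks = 0;
var lateDocument = {
  querySelector: function (selector) {
    queries++;
    if (selector === '#start' && queries >= 3) {
      return { click: function () { clicks++; } };
    }
    return null;
  }
};

// Run retries immediately instead of waiting, so the demo finishes synchronously.
function runImmediately(callback) {
  callback();
}

clickWhenReady(lateDocument, '#start', 10, 50, runImmediately);
console.log(queries, clicks);
```

In the page itself, you would inject the function definition followed by a call such as clickWhenReady(document, '#start', 20, 50). Keep in mind that finding the element only shows it is in the DOM; it does not guarantee the game’s click handler is attached yet.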
The ConsoleMessage handler lives just down the street and looks like this.
// assumption: playAgain is declared with the other static fields and checked elsewhere to decide when to exit
static bool playAgain = true;

private static void Browser_ConsoleMessage(object sender, ConsoleMessageEventArgs e)
{
    Console.WriteLine(e.Message);
    if (e.Message.Contains("How long it took for you to crash"))
    {
        Console.WriteLine("\nWould you like to play again? (y/n)");
        var result = Console.ReadKey();
        if (result.Key != ConsoleKey.Y)
        {
            playAgain = false;
        }
        else
        {
            browser.ExecuteScriptAsync(clickScript);
        }
    }
}
The most important thing here is that the handler’s event arguments expose the e.Message property, which contains the content of the console message.
We shoot that out onto the screen and add a check that allows us to play the game again if we would like.
Bot 3’s console output starts like this.
╔══════════════════════════════════════════════════╗
║            CEFSHARP-BOT #3 ACTIVATED             ║
╚══════════════════════════════════════════════════╝
Hello there!
#-------------------------------------------------O---------------------------------------------------#
##------------------------------------------------O--------------------------------------------------##
###-----------------------------------------------O-------------------------------------------------###
####----------------------------------------------O------------------------------------------------####
########------------------------------------------O-----------------------------------------------#####
########------------------------------------------O----------------------------------------------######
########------------------------------------------O--------------------------------------------########
and ends something like this
##########################------------------------O------------------------------------------------####
###########################-----------------------O-------------------------------------------------###
############################----------------------O--------------------------------------------------##
#############################---------------------O---------------------------------------------------#
##############################--------------------O--------------------------------------------------##
###############################-------------------O-------------------------------------------------###
################################------------------O------------------------------------------------####
#################################-----------------O-----------------------------------------------#####
##################################----------------O------------------------------------------------####
###################################---------------O-------------------------------------------------###
####################################--------------O--------------------------------------------------##
#####################################-------------O---------------------------------------------------#
######################################------------O--------------------------------------------------##
#######################################-----------O-------------------------------------------------###
########################################----------O------------------------------------------------####
#########################################---------O-----------------------------------------------#####
##########################################--------O----------------------------------------------######
###########################################-------O---------------------------------------------#######
############################################------O--------------------------------------------########
#############################################-----O-------------------------------------------#########
##############################################----O------------------------------------------##########
###############################################---O-----------------------------------------###########
################################################--O----------------------------------------############
#################################################-O---------------------------------------#############
##################################################O--------------------------------------##############
You crashed with 132 points!
How long it took for you to crash: 11564.510986328125ms

Would you like to play again? (y/n)
You can, of course, hit ‘y’ to play again.
Conclusion
There are a lot of really useful ways to use CEFSharp.OffScreen and especially CEF in general. Hopefully these simple web bots display the versatility of CEF and the CEFSharp library.
Once again, the code for this project can be found on my Bitbucket.
Thank you for reading, and happy coding!