
Windows Phone LINQ to SQL and Data Virtualization

Andy Wigley

How to work with large tables in LINQ to SQL on Windows Phone

Introduction to Data Virtualization

Data Virtualization is a technique for exposing a subset of a large data collection so that only the data you actually need is fetched into memory. To implement data virtualization, you take your data collection – which could contain any number of rows, even a million or more! – and you create a custom class for your virtualized data collection that implements IList&lt;T&gt; and acts as a facade to the underlying data collection.

IList&lt;T&gt; has a great many members, but fortunately the only ones you have to implement are:

  • public int Count

  • T IList<T>.this[int index]


The Count property just returns the number of items in the underlying data collection, while the second is the indexer and allows you to retrieve the ‘nth’ item from the underlying collection. The rest of the members of IList&lt;T&gt; must of course be implemented in your virtualized data collection class, but you can just throw NotImplementedException – in most usages, they will not be called. The only other method of IList&lt;T&gt; that you might choose to implement is public int IndexOf(T item), which returns the position of an item in the underlying collection. This becomes useful when you want to do lookups in the collection.
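Putting that together, the shape of a virtualized data collection can be sketched like this. The three abstract methods are placeholders for your own data access layer, not part of any real framework API:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Minimal sketch of a virtualized data collection facade.
public abstract class VirtualizedCollection<T> : IList<T>
{
    // Implement these against your actual data store
    protected abstract int QueryItemCount();
    protected abstract T FetchItemAt(int index);
    protected abstract int FindPositionOf(T item);

    public int Count
    {
        get { return QueryItemCount(); }
    }

    public T this[int index]
    {
        get { return FetchItemAt(index); }
        set { throw new NotImplementedException(); }
    }

    // Optional - only needed if you use methods such as ListBox.ScrollIntoView
    public int IndexOf(T item)
    {
        return FindPositionOf(item);
    }

    public bool IsReadOnly { get { return true; } }

    // The ListBox never calls the remaining members on a read-only,
    // data-bound collection, so they can just throw.
    public void Add(T item) { throw new NotImplementedException(); }
    public void Clear() { throw new NotImplementedException(); }
    public bool Contains(T item) { throw new NotImplementedException(); }
    public void CopyTo(T[] array, int arrayIndex) { throw new NotImplementedException(); }
    public void Insert(int index, T item) { throw new NotImplementedException(); }
    public bool Remove(T item) { throw new NotImplementedException(); }
    public void RemoveAt(int index) { throw new NotImplementedException(); }
    public IEnumerator<T> GetEnumerator() { throw new NotImplementedException(); }
    IEnumerator IEnumerable.GetEnumerator() { throw new NotImplementedException(); }
}
```

The real work all lives in the count, the indexer and (optionally) IndexOf – exactly the members we'll be implementing in the samples below.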

In Windows Phone apps, this is beneficial when you want to display a long list of items in a ListBox. Just set your ListBox ItemsSource property to an instance of your virtualized data collection class, and the ListBox will only fetch the visible items and a few either side (to assist with scrolling up and down the list). You can read more about this technique in Shawn Oster’s blog: Improving ListBox Performance in Silverlight for Windows Phone 7: Data Virtualization or in Peter Torr’s blog: Virtualizing Data in Windows Phone 7 Silverlight Applications.

Data Virtualization with LINQ to SQL

So, on to the purpose of this post. I wondered how to implement data virtualization using LINQ to SQL in Windows Phone Mango. For my test database, I created a database with a single table in it which could be used in a dictionary application. It has 200000 entries, and the Word table looks like this:

Table design

Ignore the Sequence column for now – we’ll come back to that at the end. The WordTxt column is up to 128 characters in length and is the Primary Key, while the WordDescription column is up to 1024 characters and holds the description of the word. I’ve filled the table with randomly generated data (using a contorted process – see the footnote) and now we can try to bind the data in this table to a ListBox.

You can download the sample project here (it’s nearly 50MB since it contains a database that is 107MB when decompressed). Compile the project and deploy it onto your target emulator or device.

IMPORTANT: The database is included in the project, but its Build Action is set to ‘None’ and Copy to Output Directory set to ‘Do not copy’ – so you’ll have to copy the database onto the target yourself. I’ve done it this way because if you include the database as Content it takes around 5 minutes to build and deploy the app, which is tedious when you’re playing with the code. So to deploy the database, either change its Build Action back to Content and then go away and have a cup of tea while it’s deploying, or use the Isolated Storage Explorer Tool to copy it to the target, or better still the nice tools at http://wptools.codeplex.com.

In the project, in the App.xaml.cs I initialize the data context appropriately for a read only database:

public DictionaryDatabaseContext DbDC;

// Code to execute when the application is launching (eg, from Start)
// This code will not execute when the application is reactivated
private void Application_Launching(object sender, LaunchingEventArgs e)
{
    DbDC = new DictionaryDatabaseContext(
        "Data Source=isostore:/DictionaryDatabase.sdf;Max Database Size=180;File Mode=Read Only;");
    DbDC.ObjectTrackingEnabled = false;
}

// Code to execute when the application is activated (brought to foreground)
// This code will not execute when the application is first launched
private void Application_Activated(object sender, ActivatedEventArgs e)
{
    if (!e.IsApplicationInstancePreserved)
    {
        DbDC = new DictionaryDatabaseContext(
            "Data Source=isostore:/DictionaryDatabase.sdf;Max Database Size=180;File Mode=Read Only;");
        DbDC.ObjectTrackingEnabled = false;
    }
}

 

The main page just has four buttons on it which navigate to four different pages which use differing ways of binding to the data:

L2S Main page

Every page has a ListBox on it which uses the following DataTemplate to format the display of items in the list:

        <DataTemplate x:Key="WordTemplate">
            <StackPanel>
                <TextBlock Text="{Binding WordTxt}" 
                           Style="{StaticResource PhoneTextAccentStyle}"/>
                <TextBlock Text="{Binding WordDescription}" 
                           Style="{StaticResource PhoneTextSmallStyle}" 
                           Width="436" TextWrapping="Wrap" Margin="12,0,12,12" Height="75"/>
            </StackPanel>
        </DataTemplate>

Option 1: Bind directly to the database table

This is just to prove the point. The ‘Load All’ button navigates to a page that just binds the ListBox directly to the database table:

using Microsoft.Phone.Controls;

namespace LargeDBAppSample.Views
{
    public partial class LoadAllPage : PhoneApplicationPage
    {
        public LoadAllPage()
        {
            InitializeComponent();
            this.Loaded += ((s, e) =>
            {
                DictionaryDatabaseContext dc = ((App)App.Current).DbDC;
                listBox1.ItemsSource = dc.Words;
            }
           );
        }
    }
}

When you try this option, the result is predictable. I’m using Peter Torr’s very helpful MemoryDiagnosticsHelper which writes the memory usage alongside the Framerate counters. This throws a debug assert when memory exceeds 90MB, and in this case we hit 104MB:

L2S LoadAllMemoryExceeded

 

Option 2: Data Virtualization – Bad Implementation

The second button on the main page takes you to a page that uses data virtualization. In this case, we also have a textbox where the user can enter a search string – because let’s face it, a ListBox that contains 200000 items is not going to give a good user experience. Scroll down to the bottom to find the ‘Y’s? I don’t think so.

So the code for the page here looks like this:

public partial class BadDataVirtualizationPage : PhoneApplicationPage
{
    public BadDataVirtualizationPage()
    {
        InitializeComponent();

        this.Loaded += ((s, e) =>
            {
                listBox1.ItemsSource = new BadVirtualizedDataModel();
            }
        );
    }

    private void txtSearch_TextChanged(object sender, TextChangedEventArgs e)
    {
        DictionaryDatabaseContext dc = ((App)App.Current).DbDC;

        var words = from w in dc.Words
                    where w.WordTxt.StartsWith(txtSearch.Text)
                    select w;

        if (words.Count() == 0)
            MessageBox.Show("No matches");
        else
            listBox1.ScrollIntoView(words.First<Word>());
    }
}


As you can see, the ListBox is bound to an instance of BadVirtualizedDataModel. Why is it bad? I’ll explain shortly….

The class implements IList<Word> and as you can see below, it returns the count (which is cached since for some reason the ListBox asks for it multiple times). The indexer property uses some LINQ to SQL to fetch the item in the table at a particular index, and I write the time taken to the Debug Output window. The whole class looks like this:

using System;
using System.Linq;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

namespace LargeDBAppSample.Model
{
    public class BadVirtualizedDataModel : IList<Word>
    {
        private DictionaryDatabaseContext dc = ((App)App.Current).DbDC;
        int? cachedCount = null;

        public int Count
        {
            get {
                if (!cachedCount.HasValue)
                {
                    var wordcount = dc.Words.Count();
                    cachedCount = wordcount;
                }
                return cachedCount.Value; 
            }
        }

        Word IList<Word>.this[int index]
        {
            get
            {
                Stopwatch sw = Stopwatch.StartNew();

                var word = (from w in dc.Words
                            orderby w.WordTxt
                            select w).Skip(index).Take(1).First();
                
                Debug.WriteLine("Fetched {0} in {1}ms", new object[] { index, sw.ElapsedMilliseconds });
                return word;
            }
            set
            {
                throw new NotImplementedException();
            }
        }

        public int IndexOf(Word item)
        {
            Stopwatch sw = Stopwatch.StartNew();

            var matches = (from w in dc.Words select w).AsEnumerable()
                .Select((matchItem, index) => new { Item = matchItem, Position = index })
                            .Where(x => x.Item.WordTxt == item.WordTxt);

            int position = matches.First().Position;

            Debug.WriteLine("Found position of {0} in {1}ms", new object[] { item.WordTxt, sw.ElapsedMilliseconds });

            return position;
        }

        #region NotRequired

        public int Add(object value)
        {
            throw new NotImplementedException();
        }

        public void Clear()
        {
            throw new NotImplementedException();
        }

        public bool Contains(object value)
        {
            throw new NotImplementedException();
        }

        public void Insert(int index, object value)
        {
            throw new NotImplementedException();
        }

        public bool IsFixedSize
        {
            get { throw new NotImplementedException(); }
        }

        public bool IsReadOnly
        {
            get { throw new NotImplementedException(); }
        }

        public void RemoveAt(int index)
        {
            throw new NotImplementedException();
        }

        public bool IsSynchronized
        {
            get { throw new NotImplementedException(); }
        }

        public object SyncRoot
        {
            get { throw new NotImplementedException(); }
        }

        public void Insert(int index, Word item)
        {
            throw new NotImplementedException();
        }

        public void Add(Word item)
        {
            throw new NotImplementedException();
        }

        public bool Contains(Word item)
        {
            throw new NotImplementedException();
        }

        public void CopyTo(Word[] array, int arrayIndex)
        {
            throw new NotImplementedException();
        }

        public bool Remove(Word item)
        {
            throw new NotImplementedException();
        }

        public IEnumerator GetEnumerator()
        {
            throw new NotImplementedException();
        }

        IEnumerator<Word> IEnumerable<Word>.GetEnumerator()
        {
            throw new NotImplementedException();
        }

        #endregion
    }
}

So when you run this, first impressions are very favorable. The list appears in no time at all, scrolling up and down the list is really fast, and memory usage is only about 24MB. Looks great!

L2S Bad

However, everything is not as good as it seems. This is illustrated if you type a letter from later in the alphabet, such as ‘t’ into the Search box. Look at what is reported in the output window now:

L2S Bad Search

There are two lines starting ‘Found Position…’, each taking over 4.3 seconds, and further down, the ‘Fetched…’ lines show that getting an item from the collection at a particular index now takes over 370ms each time. From this point on, scroll performance of the ListBox is terrible. What is going on?

The code that executes here is, first, the TextChanged event handler, which finds the first word in the table that begins with the letters entered into the Search textbox:

private void txtSearch_TextChanged(object sender, TextChangedEventArgs e)
{
    DictionaryDatabaseContext dc = ((App)App.Current).DbDC;

    var words = from w in dc.Words
                where w.WordTxt.StartsWith(txtSearch.Text)
                select w;

    if (words.Count() == 0)
        MessageBox.Show("No matches");
    else
        listBox1.ScrollIntoView(words.First<Word>());
}

The lookup is really fast because it is doing a lookup on the WordTxt field which is the primary key.

However, once we have found our word, we call the ListBox.ScrollIntoView(object) method to cause the ListBox to scroll to display the first matching word. The way the ListBox does this is to first call the IndexOf(Word item) method of the data collection to find the index in the collection of the desired item (which it does twice for some reason), and then to call the indexer as normal to pull the required items out of the collection. It is the IndexOf() method that is reporting the 4.3 second search times and the indexer that is reporting times of over 370ms.

The problem here is that an index in a SQL Server Compact Edition database (the underlying database for LINQ to SQL on Windows Phone) is not optimised for random access. There is no efficient way of getting the 109,456th item out of an index. The code used in the indexer is actually an adaptation of what is in the MSDN Windows Phone documentation, topic Local Database Best Practices for Windows Phone, which states:

Use of Data Virtualization and Skip/Take

With ListBox controls, you can use the Skip and Take methods to enable fetching of the appropriate batches of items as the user scrolls through the list. For more information about enabling data virtualization on data-bound ListBox controls, see Improving ListBox Performance in Silverlight for Windows Phone 7: Data Virtualization.

Using the Skip and Take methods ensures that data will not be loaded from the database into memory until it is needed for display in the ListBox control. For example, the following code shows how to retrieve records 501 to 550 from the database.

C#

return (from f in App.FeedsDB.Feeds
        select f).Skip(500).Take(50);

 

Because there is cost to performing this load operation as the user scrolls through the list, this technique should be limited to scenarios where the data set is large (greater than 150 items). For smaller data sets, loading the entire collection into memory and binding to it there is likely to yield better performance.

In actual fact, this best practice advice should be qualified: it should not be used for collections of more than perhaps 1000 items. What is actually happening here is that, internally, the database engine has to walk the record collection to implement the Skip(n) part of the query. So while it is true that the skipped records are not loaded from the database into memory, the database engine is still doing a great deal of work to identify the record(s) it does need to materialize into memory.
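As a rough analogy – this is LINQ to Objects rather than the SQL Compact engine, but the walking behaviour is similar – you can demonstrate that Skip(n) still enumerates every skipped element even though only one item is materialized:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SkipCostDemo
{
    static int itemsTouched = 0;

    // A lazy sequence that counts every element it yields
    static IEnumerable<int> CountingSequence(int length)
    {
        for (int i = 0; i < length; i++)
        {
            itemsTouched++;
            yield return i;
        }
    }

    static void Main()
    {
        // Fetch the 501st 'record' using Skip/Take
        int item = CountingSequence(200000).Skip(500).Take(1).First();

        // Only one item was returned, but 501 were enumerated to get there
        Console.WriteLine("Got {0}, touched {1} items", item, itemsTouched);
        // prints: Got 500, touched 501 items
    }
}
```

The further down the list you go, the more elements have to be walked – which is exactly the behaviour we see in the timings below.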

Our indexer looks like this:

var word = (from w in dc.Words orderby w.WordTxt select w).Skip(index).Take(1).First();

However, the real damage is being done by the IndexOf() method implementation:

public int IndexOf(Word item)
{
    Stopwatch sw = Stopwatch.StartNew();

    var matches = (from w in dc.Words select w).AsEnumerable()
        .Select((matchItem, index) => new { Item = matchItem, Position = index })
                    .Where(x => x.Item.WordTxt == item.WordTxt);

    int position = matches.First().Position;

    Debug.WriteLine("Found position of {0} in {1}ms", new object[] { item.WordTxt, sw.ElapsedMilliseconds });

    return position;
}

 

This code was adapted from a number of posts found in searches on the internet, and while it is functionally correct, it causes the database engine to work very hard to give us our result. The only good news is that memory utilisation remains low at 24MB, but that’s no substitute for the terrible performance, which makes this solution unusable.

Option 3: Data Virtualization – Better Implementation

The third and fourth buttons on the main page take you to two different pages that use a workable implementation of data virtualization. Each shows a different solution to this problem of needing a random access index into a database table. The Good virtualization button takes you to a page that is the same as the one we have just used in the previous example, with the exception that when the page loads, it calls the BeginInitialize() method on our GoodVirtualizedDataModel class, which builds an in-memory lookup table to allow us to easily find the index of a word in the collection. The page initialization logic looks like this:

public partial class GoodDataVirtualizationPage : PhoneApplicationPage
{
    public GoodDataVirtualizationPage()
    {
        InitializeComponent();

        this.Loaded += ((s, e) =>
        {
            listBox1.ItemsSource = null;

            var vDM = new GoodVirtualizedDataModel();
            vDM.InitializeCompleted += (ss, ee) =>
                {
                    listBox1.ItemsSource = vDM;
                    SystemTray.ProgressIndicator.IsVisible = false;
                };

            vDM.BeginInitialize();
            var pi = new ProgressIndicator();
            pi.IsIndeterminate = true;
            pi.Text = "Initializing..";
            pi.IsVisible = true;
            SystemTray.ProgressIndicator = pi;
        }
        );
    }

The GoodVirtualizedDataModel class has a private List<string> field called inMemoryLookup and this is what is initialized by the BeginInitialize() method. That uses a BackgroundWorker to do the work on a background thread, but all it is doing is extracting just the WordTxt field (which is the primary key field) from every Word object in the database and inserting it into the List<string>:

inMemoryLookup = (from w in dc.Words
                   orderby w.WordTxt
                   select w.WordTxt).ToList<string>();

 

Then in the indexer, instead of the costly .Skip(n).Take(1) operation, the code simply uses the index that is passed in to fetch the string at that position in the inMemoryLookup collection. That string is the primary key (the WordTxt), which is then used to do an efficient primary key lookup for the required word:

Word IList<Word>.this[int index]
{
    get
    {
        var word = from w in dc.Words
                   where w.WordTxt == inMemoryLookup[index]
                   select w;
        return word.First<Word>();
    }
    set
    {
        throw new NotImplementedException();
    }
}

 

In actual fact, as a further optimisation, instead of executing the same query every time the indexer getter is called, the implementation in the sample uses a Compiled Query to avoid the overhead of the query processor performing the same query analysis on every call. This is the full class:

using System;
using System.Linq;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.ComponentModel;

namespace LargeDBAppSample.Model
{
    public class GoodVirtualizedDataModel : IList<Word>
    {
        private DictionaryDatabaseContext dc = ((App)App.Current).DbDC;
        int? cachedCount = null;
        private List<string> inMemoryLookup;
        private Dictionary<string, int> cachedPositions = new Dictionary<string,int>();

        public event EventHandler InitializeCompleted;

        public void BeginInitialize()
        {
            BackgroundWorker worker = new BackgroundWorker();
            worker.DoWork += (s, e) =>
            {
                inMemoryLookup = (from w in dc.Words
                                  orderby w.WordTxt
                                  select w.WordTxt).ToList<string>();

            };
            worker.RunWorkerCompleted += (s, e) =>
            {
                if (InitializeCompleted != null)
                {
                    InitializeCompleted(this, new EventArgs());
                }
            };
            worker.RunWorkerAsync();
        }

        public int Count
        {
            get {
                if (!cachedCount.HasValue)
                {
                    var wordcount = dc.Words.Count();
                    cachedCount = wordcount;
                }
                return cachedCount.Value; 
            }
        }

        // Prepare a compiled query for optimum performance
        public static Func<DictionaryDatabaseContext, string, IQueryable<Word>> SelectedByIndexWordResult =
            System.Data.Linq.CompiledQuery.Compile(
            (DictionaryDatabaseContext dc, string theWord) =>
                 from w in dc.Words
                 where w.WordTxt == theWord
                 select w
                 );

        Word IList<Word>.this[int index]
        {
            get
            {
                //var word = from w in dc.Words
                //           where w.WordTxt == inMemoryLookup[index]
                //           select w;
                //return word.First<Word>();

                Stopwatch sw = Stopwatch.StartNew();
                var word = SelectedByIndexWordResult(dc, inMemoryLookup[index]).First<Word>();
                Debug.WriteLine("Fetched {0} in {1}ms", new object[] { index, sw.ElapsedMilliseconds });

                return word;
            }
            set
            {
                throw new NotImplementedException();
            }
        }

        public int IndexOf(Word item)
        {
            Stopwatch sw = Stopwatch.StartNew();
            int position;

            if (cachedPositions.ContainsKey(item.WordTxt))
            {
                position = cachedPositions[item.WordTxt];
            }
            else
            {
                var matches = inMemoryLookup
                    .Select((matchItem, index) => new { theWord = matchItem, Position = index })
                                .Where(x => x.theWord == item.WordTxt);

                position = matches.First().Position;

                cachedPositions[item.WordTxt] = position;
            }

            Debug.WriteLine("Found position of {0} in {1}ms", new object[] { item.WordTxt, sw.ElapsedMilliseconds });

            return position;
        }

        #region NotRequired

        ...
        #endregion
    }
}

 

Using this in-memory lookup table, the results are impressive:

L2S Good InMemory

Of course, everything comes at a price. Building the in-memory lookup takes 10-15 seconds on a device, so you’ll have to show some ‘Please wait’ UI (I use a progress indicator) or do it in the background while your user is still in the opening screens of your app. The lookup table also takes quite a lot of memory – up to 54MB with this sample set, which is not so bad, but in some earlier tests I did with 500000 items, my in-memory lookup was causing me to exceed the 90MB memory threshold. On the plus side, because the in-memory lookup is – well – in memory, it’s easily updated if your app allows items to be added, edited or deleted from the main table.

Option 4: Data Virtualization – Best Implementation

So what do you do if you don’t want to have your users wait while you build the in-memory lookup index? Or you can’t afford the memory that the lookup table requires?

Well, one solution is to let the database do the work for you, and this is what the final option demonstrates. Remember that third column, Sequence, that I said I would come back to? That performs the same role as the in-memory lookup, but instead it’s stored in the database. I wrote a desktop utility program that prepared this database by fetching all the records ordered by the primary key – in other words, giving me the Words in alphabetical order – and then set the Sequence field of each to its position in the collection. I then created a secondary index on the Sequence field (again, before including the database in this project), so now I have a fast lookup of words both by WordTxt (the primary key) and by Sequence (the secondary index).
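That preparation utility isn't in the sample download, but the core of it could be as simple as this sketch (desktop LINQ to SQL against the same model; the helper name is my own, not part of the sample):

```csharp
using System.Linq;

// Hypothetical preparation step: assign each Word its alphabetical
// position so the Sequence column can act as a random access index.
// DictionaryDatabaseContext and Word mirror the phone project's model,
// with object tracking enabled so updates can be written back.
static void AssignSequenceNumbers(DictionaryDatabaseContext dc)
{
    int sequence = 0;
    foreach (var word in dc.Words.OrderBy(w => w.WordTxt))
    {
        word.Sequence = sequence++;
    }
    dc.SubmitChanges();   // persist the new Sequence values
}
```

Run once against the desktop copy of the database before creating the secondary index and shipping it in the project.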

Using this, the final version of my virtualized data model looks like this:

using System;
using System.Linq;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

namespace LargeDBAppSample.Model
{
    public class BestVirtualizedDataModel : IList<Word>
    {
        private DictionaryDatabaseContext dc = ((App)App.Current).DbDC;
        int? cachedCount = null;
        private Dictionary<string, int> cachedPositions = new Dictionary<string,int>();

        public int Count
        {
            get {
                if (!cachedCount.HasValue)
                {
                    var wordcount = dc.Words.Count();
                    cachedCount = wordcount;
                }
                return cachedCount.Value; 
            }
        }

        // Prepare a compiled query for optimum performance
        public static Func<DictionaryDatabaseContext, int, IQueryable<Word>> SelectedBySecondaryIndexWordResult =
            System.Data.Linq.CompiledQuery.Compile(
            (DictionaryDatabaseContext dc, int theIndex) =>
                 from w in dc.Words
                           where w.Sequence == theIndex
                           select w
                 );

        Word IList<Word>.this[int index]
        {
            get
            {
                Stopwatch sw = Stopwatch.StartNew();

                //var word = from w in dc.Words
                //           where w.Sequence == index
                //           select w;
                //return word.First<Word>();

                var word = SelectedBySecondaryIndexWordResult(dc, index).First<Word>();
                Debug.WriteLine("Fetched {0} in {1}ms", new object[] { index, sw.ElapsedMilliseconds });

                return word;
            }
            set
            {
                throw new NotImplementedException();
            }
        }

        public int IndexOf(Word item)
        {
            Stopwatch sw = Stopwatch.StartNew();
            int position;

            if (cachedPositions.ContainsKey(item.WordTxt))
            {
                position = cachedPositions[item.WordTxt];
            }
            else
            {
                position = (from w in dc.Words
                           where w.WordTxt == item.WordTxt
                           select w.Sequence).First();

                cachedPositions[item.WordTxt] = position;
            }

            Debug.WriteLine("Found position of {0} in {1}ms", new object[] { item.WordTxt, sw.ElapsedMilliseconds });

            return position;
        }

        #region NotRequired

        ...
        #endregion
    }
}

 

The results speak for themselves:

L2S Best

Memory usage here is around 23MB.

So when not to use this option? Well, this only really works with read-only collections where you can get the Sequence numbers correct in the table before transferring to the device. It won’t work if you start inserting or removing records or changing the ordering – not without having to do massive data updates that would negate the benefit of the approach.

Summary

These examples have shown how to bind a very large collection of records from a local database table to a ListBox. This is rarely a great idea from a user experience point of view, unless you implement some kind of search as I have done in these examples. [An aside: my original intention had been to show a sample using the LongListSelector from the Silverlight Toolkit, configured as a JumpList, but that control does not support data virtualization, despite claims to the contrary in a few blogs from the team.] Nonetheless, a virtualized data source can be used in other situations where you want to avoid materializing all the records from a large table into memory.

So if you are going to implement a virtualized data source, beware of the .Skip(n).Take(m) syntax recommended by the MSDN docs – it does not work with large collections, because it makes the database engine struggle as you get towards the bottom of the list.

Instead, implement your own way of efficiently determining the position of an item in the list: either an in-memory lookup collection, or, if your data is read-only, positional information embedded in a column of the database table and configured as a secondary index.

Download the sample project (around 50MB)

 

Footnote:

I built this sample database using some excellent tools that deserve calling out. Firstly, I created a ‘big SQL’ database in SQL Express and designed my Word table. I then generated the random data using Red Gate’s excellent SQL Data Generator, part of their SQL Toolbelt product set. They have a 14-day free trial, which was just long enough while I was working on this post!

I then scripted all the data and schema suitable for loading into a SQL Server Compact Edition 3.5 database using the free SQL export and script utility called Export2SQLCE, available here on Codeplex. Thanks Erik for these amazing (and free!) tools. Then it was back to SQL Server Management Studio to create an empty SQL Server Compact Edition 3.5 database and to execute all the scripts that Export2SQLCE had created for me. After this I had a database I could use on Windows Phone. I ran a small console app to set the sequence number in the table.

Last part of the jigsaw was to generate the DataContext. I could have used SQLMetal utility for that, but I prefer to use another of Erik’s free tools, this time the SQL Server Compact Toolbox which allows you to create a data context (amongst many other things).

Thanks guys for these tools!


Posted Dec 14 2011, 12:05 PM by Andy Wigley