LINQ Stinks - code smells in your LINQ
Yesterday I had the great privilege of speaking at the first ever Techorama Netherlands event (which was brilliant by the way - I highly recommend you visit if you get the chance). There were loads of great speakers covering the latest patterns, practices and tools for .NET, Azure and more.
But my talk actually covered a relatively old technology - LINQ, which was introduced in 2007, and has only had minor updates since then. However, it remains very relevant to day to day development in .NET which is why I think its still worth spending time educating developers in how to get the best out of it.
In the talk I went through several recommendations and best practices, many of which are featured in my More Effective LINQ Pluralsight Course.
But I also introduced a new term - "LINQ stinks" (or should that be 'stinqs') - which are code smells or anti-patterns in LINQ code.
A code smell is when there is nothing wrong with the functionality of the code, but there is an issue with the maintainability. The way we've written the code makes it hard to understand or extend. There are also several anti-patterns relating to poor performance whether you're using LINQ to objects, or using an ORM like Entity Framework to query the database with LINQ.
LINQ Stink #1 - Complex Pipelines
LINQ has the power to make your code much more expressive and succinct, and can enable you to solve tricky problems (such as my Lunchtime LINQ challenges) with a single pipeline of chained LINQ extension methods.
However, that is not an excuse to write unreadable code. Once a pipeline reaches more than 3 or 4 chained methods it becomes hard for someone reading the code to fully comprehend what the pipeline does.
In my talk I discussed a number of ways of solving this.
First of all, just because lambda expressions are an easy way of passing method calls into LINQ methods, doesn't mean you always need to use them. If a lambda expression starts to become a bit complicated, there is no reason why you can't refactor it into a well-named method that more clearly expresses the intent.
For example, this LINQ pipeline is very straightforward to understand, and we can always dive into the details of how the customers are getting filtered and projected by navigating into the methods.
customers
.Where(IsEligibleForDiscount)
.Select(GetDiscountPercentage)
(Important note - you can't use this technique with Entity Framework as it won't know how to translate your custom methods into SQL).
Another way in which LINQ pipelines can get over-complicated is when LINQ is missing a method you need. For example, LINQ doesn't have a Batch
operator, but this rather hacky technique achieves batching with the standard operators. Although it works, it obfuscates our intent.
orders
.Select((item, index) => new { item, index })
.GroupBy(x => x.index / 10)
.Select(g => g.Select(x => x.item).ToArray())
Instead, use a library like MoreLINQ, or even create your own LINQ extension methods to enable you to write more declarative code that expresses your intent more clearly. With the MoreLINQ Batch method, we can simply write the following:
orders.Batch(10)
LINQ Stink #2 - Reading too much
The next problem we discussed was reading more than we need from an IEnumerable<T>
. This is especially important if that sequence is lazily evaluated, and may need to do a non-trivial amount of work to produce the items in the sequence.
Consider the following example.
var blobs = GetBlobs("MyContainer").ToList();
var blob = blobs.First(b => b.Name.EndsWith(".txt"));
Console.WriteLine(blob.Name);
IEnumerable<Blob> GetBlobs(string containerName)
{
// ???
}
We're calling the GetBlobs
method which goes off to an Azure blob storage container and downloads information about all the blobs in the container. There could be thousands of these, but notice that the code that uses them only requires the first blob with a .txt
extension.
However, because we've used ToList
on the return from GetBlobs
, we will always download information about all the blobs in the container, even if the very first one was a text file.
So in this case, we should remove ToList
as it may cause us to do more work than we need. However, there are times we do want to use ToList
, which brings us onto our next LINQ stink...
LINQ Stink #3 - Multiple Enumeration
A related problem when dealing with lazily evaluated IEnumerable<T>
sequences is that if we enumerate them more than once, we end up doing the work to produce all the items in that sequence more than once, which is wasted time.
Consider the following example where the GetBlobs
method is once again a lazily evaluated sequence:
var blobs = GetBlobs("MyContainer");
var biggest = blobs.MaxBy(b => b.Size);
var smallest = blobs.MinBy(b => b.Size);
We're using the MoreLINQ MaxBy and MinBy extension methods to determine which the largest and smallest blobs in the container are. This does require us to download information about all blobs in the container, but this implementation will cause us to download them twice.
In this case, it would be better to perform ToList
on the return sequence from GetBlobs
, to store them all in memory to remove the performance penalty of enumerating twice.
Note: Tools like ReSharper are very good at warning you when you enumerate an IEnumerable<T>
more than once, and will allow you to perform a "quick fix" by adding a ToList
. This may be the appropriate solution, but often, the code that produced the IEnumerable<T>
is your own code, and so if you know for sure that you are always going to pass an in-memory collection like a list or an array, then it might just be better to pass an ICollection<T>
instead, making it clear that it is safe to enumerate through again.
LINQ Stink #4 - Inefficient SQL
ORMs with LINQ Providers like Entity Framework generally do a great job of converting your LINQ expressions into SQL, often producing better SQL than you might write yourself.
For example, using the MVC Music Store database, I can perform the following query which includes a join, sort, take and projection to anonymous object...
Albums
.Where(a => a.Artist.Name.StartsWith("A"))
.OrderBy(a => a.Artist.Name)
.Select(a => new { Artist = a.Artist.Name, Album = a.Title })
.Take(5)
...and the SQL we see generated is perfectly reasonable:
DECLARE @p0 NVarChar(1000) = 'A%'
SELECT TOP (5) [t1].[Name] AS [Artist], [t0].[Title] AS [Album]
FROM [Album] AS [t0]
INNER JOIN [Artist] AS [t1] ON [t1].[ArtistId] = [t0].[ArtistId]
WHERE [t1].[Name] LIKE @p0
ORDER BY [t1].[Name]
However, occasionally a programmer will put something into a LINQ expression that the LINQ provider doesn't know how to turn into SQL. In this case, we get a run-time exception.
The right thing to do in this situation is to rework the query to do as much as possible of the work in SQL (maybe even by creating a stored procedure) taking care to be as efficient as possible and minimising the data downloaded, and only then doing anything that can only be performed client side.
Unfortunately, inexperienced developers often "fix" the exception by inserting a ToList
or AsEnumerable
, which appears at first glance to work, but produces horribly inefficient SQL.
This next code example will not only download the entire Albums table to memory, but for every album in the entire table will perform an additional query to get hold of the artist name, resulting in what's known as a "Select N+1" anti-pattern.
Albums
.ToList()
.Where(a => a.Artist.Name.Count(c => c == ' ') > 2)
.OrderBy(a => a.Artist.Name)
.Select(a => new { Artist = a.Artist.Name, Album = a.Title })
.Take(5)
What's the solution to this issue? Well, obviously a bit of experience helps - knowing what sort of SQL a given query is likely to turn into. But you don't have to guess.
Both Entity Framework and Entity Framework Core allow you to inject a logger that will emit the raw SQL statements that are being executed. And of course you can always use a profiler to view the SQL your application is generating.
So I think as part of any code review of LINQ to database code, you should be asking the questions "what SQL does this query get turned into?" and "is this the most efficient query to retrieve the necessary data?"
LINQ Stink #5 - Side Effects
The final LINQ stink I discussed at Techorama was introducing side effects into your pipelines. One of the great things about LINQ is that it brings many of the ideas of functional programming into C#.
It helps you write more declarative code, use pipelines, higher-order functions and even encourages "immutability" because the Select
method doesn't take an Action
so you are encouraged to return new objects as they flow through your pipeline rather than mutating existing ones.
But another key functional programming concept is to always prefer "pure" functions wherever possible rather than functions that produce "side-effects". A pure function depends only on its inputs and the same inputs always produce the same output. The function is not allowed to perform any IO, such as making a network call or talking to the database.
Pure functions have the advantage of being inherently testable, make your code much easier to reason about, and are thread safe.
Now I don't think there is no place for side-effects or non-pure methods in a LINQ pipeline, but you should certainly be aware of the pitfalls that await if you do use them.
In my talk I showed this particularly egregious example of unhelpful use of side-effects.
var client = new HttpClient();
var urls = new [] { "https://www.alvinashcraft.com/",
"http://blog.cwa.me.uk/",
"https://codeopinion.com/"}; // imagine this list has 100s of URLs
var regex = @"<a\s+href=(?:""([^""]+)""|'([^']+)').*?>(.*?)</a>";
urls.Select(u => client.GetStringAsync(u).Result)
.SelectMany(h => Regex.Matches(h, regex).Cast<Match>())
.Where(m => m.Value.Contains("markheath.net"))
.Select(m => m.Value)
.ToList()
.ForEach(l => Console.WriteLine(l));
This LINQ pipeline is attempting to see who is linking to me on their blog. It does so by downloading the HTML for several websites, then using regex to find the links, and then filtering them out to just the ones of interest to me.
At the end of the pipeline, because I want to print them out to the console, and LINQ doesn't have a built-in ForEach
that operates on an IEnumerable<T>
(although MoreLINQ does), ToList
has been called first. I see developers doing this a lot as it allows them to perform side effects using a fluent syntax rather than the foreach
keyword.
There are numerous problems with the above code sample, not least the ugly way were trying to call an asynchronous method in the middle of a pipeline, with the .Result
property which is a recipe for deadlocks (sadly there is no standard 'asynchronous' version of LINQ at the moment, although there are promising signs that we might see that in C# 8).
But another big problem here is that downloading web pages is something that is particularly susceptible to transient errors. Suppose my urls
array contained 100 websites, and they all successfully downloaded except the last one. With the pipeline shown above, no output whatsoever would be written to the console, as one failure will cause the ToList
call to fail before we even get to the ForEach
. Yet for the purposes of this code, we probably would like to see the results from the sites we could download, even if some had failed.
Another issue with this code is that downloading web pages is inherently parallelizable - it would make sense to download more than one at a time, but it is hard to force that behaviour into a LINQ pipeline like this.
So this is an example of a LINQ pipeline that would probably be better implemented without leaning so heavily on LINQ. LINQ is great, but it is not a one-size-fits-all solution.
Anyway, I hope you found this brief tour of "LINQ stinks" helpful. The slides from my Techorama talk are available here, and I'd be happy to bring this talk to your local user group (assuming I can get there) or conference if you'd like to hear more about this.
Comments
Thanks for a good article. I've also really enjoyed your MoreLINQ video series (especially as a fellow gooner!).
Dan CarterI have a comment on #3. You mentioned returning an ICollection<t> from a method to indicate that it's safe to enumerate a collection multiple times. I tend to use an IReadOnlyList<t> to ensure that the collection is not modified by the caller. Basically introducing immutability as much as possible, because most of the time when you fetch some sort of collection you don't need to modify it afterwards. In the unlikely situation where I need to combine multiple sequences, I'll usually create a List<t> and call .AddRange(), or use LINQ's .Concat() method.
that's a good point, and I wondered if someone would pick up on that. Yes, IReadOnlyList would be even better, although obviously it can't magically make the items in the list immutable
Mark HeathYes, that's true. In on project I worked on I created a ReadOnlyEnumerable, only to discover later that IReadOnlyList existed!
Dan CarterI've often wondered about making a .Execute() extension method on IQueryable<t> and a .Evaluate() extension method on IEnumerable<t>. Both would call .ToList() underneath but it expresses the extent more clearly. As in, when you call ToList() you mostly aren't interested in the fact it returns a list; you're more interested in executing the query or enumerating the items in the collection.
MoreLINQ has a "Consume" which does what I think you mean by Evaluate (returns void). But I'd be concerned that if you needed it, you were putting a side-effect in earlier in the pipeline that might be better brought down into a ForEach (the lazily evaluated one from MoreLINQ)
Mark HeathMy idea was actually that .Evaluate() returned an IReadOnlyList<t> as well. It would be useful in situations where you wanted to build up a complicated in-memory LINQ query, and so you end up with an IEnumerable<t>, but you want to enumerate through the collection multiple times, so you want to only evaluate the query once.
Dan CarterAwesome! I really enjoyed all the thorough explanations.
James HickeyYou did say that you can't pass a function delegate to anEF LINQ clause, but you can pass a named Expression delegate to get the same usage:
Expression<func<somemodel, bool="">> OnlyActive = model=> model.Active;
var result = efContext.Where(OnlyActive).ToList();
Thanks man!
Ah ok I see. I'd probably call it ToReadOnlyList, but yes that would be useful
Mark HeathYes that's true, I almost mentioned that, but I've found that the code can end up getting quite complex when you do this, so the jury's out on whether is want to make a habit of this
Mark HeathReally enjoyed your talk at Techorama! Now discovered your blog and Pluralsight course, which is great while commuting to work! Really seeing more Linq possibilities and now see why Resharper is making noise :-)
Simonethanks Simone, glad to hear you found it helpful
Mark HeathYou can also use an extension method to hide the functionality. Something like:
Νίκος Πολυδερόπουλοςpublic static IQueryable<account> OnlyActive(this IQueryable<account> query)
{
return query.Where(account=> account.IsActive);
}
and use like db.Account.OnlyActive();
What do you think of this result ?
hi Nikos, that is a technique I mention in my Pluralsight course. It's especially useful if you want to take a few related stages from a long pipeline and give them a more descriptive name. I probably wouldn't bother just for a single Where clause like you show in this example though.
Mark HeathTrue that. My example is more of showing what i mean not doing the real thing. There are many different ways of fixing those problems from readability point of view. Thing is that most of the times developers (including me sometimes) think that its clear enough by leaving it as is.
Νίκος Πολυδερόπουλοςdisqus_3Y2yeE71OM wrote
reaperellagingerweisstebeschI tend to agree and in most cases I prefer writing SQL, but sometimes it is just more convenient to use LINQ. I believe in the right tool, to the job.
Janus Knudsen