[MS] How various git diff viewers represent file encoding changes in pull requests - devamazonaws.blogspot.com

In addition to the git command line tool, there are other tools or services that let you view changes in git history. The most interesting cases are those which present changes as part of a pull request, since those are changes you are reviewing and approving. But a common problem is that what they show you might not be what actually changed.

I'll limit my discussion to services and tools I have experience with, which means that it's the git command line, Azure DevOps, GitHub, and Visual Studio. You are welcome to share details for other services that you use, particularly those used for code reviews.

First, let's consider a commit that changes the encoding of a file. For concreteness, let's say that the file is this:

I just checked.
It costs A31.

where A3 represents a single byte with hex value 0xA3. This is the representation of £ in the Windows 1252 code page.

Suppose you change the encoding of this file to UTF-8:

It costs C2A31.
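
As a quick sanity check on those byte values, here is a minimal Python sketch that encodes the same line both ways. The variable names are just for illustration, and `cp1252` is the name Python uses for the Windows 1252 code page.

```python
# Encode the same text in Windows-1252 and in UTF-8 and compare the bytes.
text = "It costs £1."

cp1252_bytes = text.encode("cp1252")  # £ becomes the single byte A3
utf8_bytes = text.encode("utf-8")     # £ becomes the two bytes C2 A3

print(cp1252_bytes.hex(" "))
print(utf8_bytes.hex(" "))
```

The two byte strings differ only at the £, where the UTF-8 version is one byte longer.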

If you view this in the command line with git show you get

  I just checked.
- It costs <A3>1.
+ It costs £1.

The command line version shows you that there used to be a byte 0xA3 but now there is a £ character.

Next up is GitHub. Its diff says

  I just checked.
- It costs �1.
+ It costs £1.

GitHub assumes that all files are in UTF-8, so it interprets the A3 as an illegal UTF-8 code unit sequence and represents it with U+FFFD REPLACEMENT CHARACTER.
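
You can reproduce the same substitution with any UTF-8 decoder that replaces invalid input. The sketch below uses Python's decoder. It illustrates the general behavior and is not a claim about how GitHub implements it. The helper name `is_valid_utf8` is hypothetical.

```python
# Decode the Windows-1252 bytes as if they were UTF-8.
old_bytes = b"It costs \xa31."

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A lone A3 byte is a continuation byte with no lead byte, so it is invalid.
# With errors="replace", the decoder substitutes U+FFFD for it.
as_utf8 = old_bytes.decode("utf-8", errors="replace")

print(is_valid_utf8(old_bytes))
print(as_utf8)
```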

Next up is Team Foundation Services... Visual Studio Online... Visual Studio Team Services... Azure DevOps. Azure DevOps. That's the name. Azure DevOps.

Here's what Azure DevOps shows:

⚠ The file differs only in whitespace.

And if you expand the file and enable "Show whitespace changes", it shows you no changes, not even whitespace changes!

I just checked.
It costs £1.

This is quite concerning, because it means that if you made a change to the text of a file and also changed the encoding, Azure DevOps highlights the text changes, but does not give any indication that the encoding changed!
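
One plausible explanation, and this is an assumption on my part rather than documented behavior, is that the viewer decodes each version of the file with its own guessed encoding and then diffs the resulting text. The Python sketch below shows that under that assumption the bytes differ but the decoded text is identical, so a text-level diff has nothing to report.

```python
import difflib

# Old revision in Windows-1252, new revision in UTF-8.
old_bytes = b"I just checked.\nIt costs \xa31.\n"
new_bytes = b"I just checked.\nIt costs \xc2\xa31.\n"

# Assumption: each side is decoded with the encoding the viewer guessed for it.
old_text = old_bytes.decode("cp1252")
new_text = new_bytes.decode("utf-8")

bytes_differ = old_bytes != new_bytes
text_diff = list(difflib.unified_diff(old_text.splitlines(),
                                      new_text.splitlines(),
                                      lineterm=""))

print(bytes_differ)  # the stored bytes are different
print(text_diff)     # but the text-level diff is empty
```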

For example, maybe somebody changed the first line of text and accidentally changed the encoding from 1252 to UTF-8. Azure DevOps shows this as

- I just checked.
+ I just looked.
  It costs £1.

It happily shows you the text change, but completely ignores the encoding change.

That encoding change might have caused you to inadvertently change a bunch of strings in a Resource Script, resulting in mojibake.
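
To make the mojibake concrete, here is a small Python sketch of what happens when a tool that still expects Windows 1252 reads the newly UTF-8 bytes. The variable names are illustrative.

```python
# The file was converted to UTF-8...
utf8_bytes = "It costs £1.".encode("utf-8")

# ...but a consumer still decodes it as Windows-1252.
# The two bytes C2 A3 are read as two separate characters.
misread = utf8_bytes.decode("cp1252")

print(misread)
```

The byte C2 is Â in code page 1252, so the £ turns into Â£.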

If you ask Visual Studio to view the diff, it indicates that the file has been modified (M), but when you ask to see the diff, it says "0 changes", and nothing is highlighted.

Now let's consider a commit that inserted a UTF-8 BOM at the start of a file.

From the command line with git, you get this:

- I just checked.
+  I just checked.

The BOM displays as a space. Not great, but at least there is a +/− to show you that something changed, and if the first line is not otherwise blank, the shifted contents tell you that something got inserted at the start of the file.
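
If you want to check for the BOM yourself instead of squinting at a diff, a sketch like the following works on the raw bytes. It uses Python's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec. The helper name `has_utf8_bom` is hypothetical.

```python
import codecs

without_bom = b"I just checked.\n"
with_bom = codecs.BOM_UTF8 + without_bom  # BOM_UTF8 is the bytes EF BB BF

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)

# Plain "utf-8" keeps the BOM as an invisible U+FEFF at the start of the text.
kept = with_bom.decode("utf-8")
# "utf-8-sig" strips a leading BOM if one is present.
stripped = with_bom.decode("utf-8-sig")

print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
print(repr(kept[0]), repr(stripped[0]))
```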

For GitHub, the diff shows up like this:

- I just checked.
+ I just checked.

The highlights tell you that something changed on that line, but squint all you want, you don't see any change. The change must be invisible, but at least you're told that there's a change somewhere on that line; you just can't see it.

And finally, we have Azure DevOps:

⚠ The file differs only in whitespace.

As before, even if you expand the file and enable "Show whitespace changes", you get no changes.

I just checked.
It costs £1.

So Azure DevOps tells you that the file changed in whitespace, but when you ask to see it, you are shown no changes.

If you ask Visual Studio to view the diff, it once again indicates that the file has been modified (M), but when you ask to see the diff, it says "0 changes", and nothing is highlighted.

I suspect that in the cases where GitHub, Azure DevOps, or Visual Studio show no visible changes, most users will just conclude, "Must be a bug," and not realize that no really, there's a change in there that you can't see.

So let's summarize these results in a table.

                  git command line  GitHub         Azure DevOps     Visual Studio
Code page         UTF-8             UTF-8          Guess            Guess
Encoding changes  Shown in diff     Shown in diff  No change shown  No change shown
BOM change        Shown as space    Invisible      No change shown  No change shown

My take-away from this table is that if you do your work with any of these systems, you need to pay close attention when dealing with files that contain characters outside the 7-bit ASCII set. Changes to the encoding or to the presence of a BOM can be hard to spot, or even outright invisible, even though they drastically change what the contents of the file mean.
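
One way to pay that attention is to compare the raw bytes of the two revisions yourself. The Python sketch below is a rough illustration, not a robust detector. The function names are hypothetical, and the guessing rule (BOM first, then strict UTF-8, then Windows 1252) is my own simplification. A pure-ASCII file is reported as UTF-8, since its bytes are the same in both encodings. The byte strings could come from something like `git show <rev>:<path>`.

```python
import codecs
from typing import Optional, Tuple

def decode_guess(data: bytes) -> Tuple[str, str]:
    """Guess an encoding: BOM first, then strict UTF-8, then Windows-1252."""
    if data.startswith(codecs.BOM_UTF8):
        body = data[len(codecs.BOM_UTF8):]
        return body.decode("utf-8", errors="replace"), "utf-8-bom"
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # A few bytes are undefined in cp1252, so replace rather than fail.
        return data.decode("cp1252", errors="replace"), "cp1252"

def hidden_encoding_change(old: bytes, new: bytes) -> Optional[str]:
    """Describe an encoding or BOM change between two revisions, if any."""
    if old == new:
        return None
    _, old_enc = decode_guess(old)
    _, new_enc = decode_guess(new)
    if old_enc != new_enc:
        return f"{old_enc} -> {new_enc}"
    return None

# The two scenarios from this article.
print(hidden_encoding_change(b"It costs \xa31.", b"It costs \xc2\xa31."))
print(hidden_encoding_change(b"I just checked.",
                             codecs.BOM_UTF8 + b"I just checked."))
```

An ordinary text edit that keeps the encoding returns `None`, so only the changes that the diff viewers tend to hide get flagged.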


Post Updated on December 30, 2024 at 03:00PM