Python Woes – Seeking

One thing you learn early on when programming is the idea of high-level languages. In theory, a high-level language abstracts some implementation details from you, so you can express in your code what you are trying to do, and not worry too much about the how. The same idea goes for libraries. And then there is Python, which I increasingly feel is a poseur language: it pretends to be high-level and then just… isn’t.

Today’s gripe is about a very simple task: seeing if a text file ends with a newline. Logically, you would want to do the following:

  • Open the file for reading
  • Seek to the end – 1
  • Read the last character

So first we want to open the file. Easy enough: you call open, and the defaults are reading, which is what we want, and text mode, which makes sense since we are looking at text files.

Now we want to seek to the end of the file. Here is the help page on seek:

seek(cookie, whence=0, /) method of _io.TextIOWrapper instance
    Change stream position.
    
    Change the stream position to the given byte offset. The offset is
    interpreted relative to the position indicated by whence.  Values
    for whence are:
    
    * 0 -- start of stream (the default); offset should be zero or positive
    * 1 -- current stream position; offset may be negative
    * 2 -- end of stream; offset is usually negative
    
    Return the new absolute position.

We want to seek relative to the end of the file, so we choose mode 2. There are symbolic constants defined for these values, in our case os.SEEK_END, but of course such details are not mentioned in the help text. Who wants to use symbolic constants in their code? Magic values are much better.

handle.seek(-1, whence=os.SEEK_END)
→ TypeError: seek() takes no keyword arguments

It turns out you can forbid the use of keyword arguments in Python by adding a slash at the end of the parameter list; that’s a bit of Python syntax I was unaware of. This clearly makes sense for a function like sum, but why one would do this in this particular case is beyond me. But OK, let’s try:
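For readers who have not met the slash either, here is a minimal sketch of the syntax (Python 3.8 or later). The function seek_like is made up for illustration; it only mimics the shape of the seek signature and is not part of Python.

```python
def seek_like(offset, whence=0, /):
    """Hypothetical stand-in: everything before the slash is positional-only."""
    return (offset, whence)


# Positional arguments are accepted.
print(seek_like(-1, 2))

# Passing the same parameter by keyword is rejected with a TypeError.
try:
    seek_like(-1, whence=2)
except TypeError as exc:
    print("TypeError:", exc)
```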

handle.seek(-1, os.SEEK_END)
→ UnsupportedOperation: can't do nonzero end-relative seeks

I literally did what the documentation says is the usual case for mode 2! First, it would be helpful for the error message to say why this is unsupported, as there is no hint about this in the documentation. Is it because of the operating system? Or because the file is in text mode? Turns out it is the second, which is not a good reason.

The file’s encoding is UTF-8, and you can back up in a UTF-8 stream. Looking at a given byte, the high bits tell you where you are in a UTF-8 sequence: if you are on a continuation byte (10xxxxxx) rather than on the head byte, you step back, at most three bytes, until you reach the head byte. So seeking backwards is just a matter of inspecting bytes on the way. Not super efficient, but it works.
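The backing-up step can be sketched in a few lines. The helper name utf8_char_start is made up for illustration, and the sketch assumes the bytes are valid UTF-8.

```python
def utf8_char_start(data: bytes, pos: int) -> int:
    """Return the index of the head byte of the UTF-8 sequence covering data[pos].

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    # Continuation bytes have the bit pattern 10xxxxxx; step back past them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


raw = "Les éléphants sont élégants".encode("utf-8")
start = utf8_char_start(raw, 5)  # byte 5 is 0xa9, the second byte of the first "é"
print(start, raw[start:].decode("utf-8"))
```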

So now I’m wondering: can Python actually seek forward in a UTF-8 file? Let’s try with the following text, Les éléphants sont élégants, and see how this works:

handle.seek(5) 
handle.read() 
→ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte

Now remember, we don’t actually know what 5 is: the documentation says cookie, but then explains things in terms of a byte offset. Turns out it’s bytes, and we get an exception. What is the point of pretending you have a seek method on text files? Basically, forward seeking only works if your file is ASCII, and backward seeking does not work. Why not be honest in your error messages?

The simple solution is to stop pretending that Python can handle non-ASCII text files in 2023 and work in binary mode, or do it the Python way: load the whole thing into a string in memory and access the last character. This is of course quite wasteful if the file is big, but who cares?
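For what it’s worth, here is a minimal sketch of the binary-mode route. The function name ends_with_newline is made up for illustration, and it assumes an ASCII-compatible encoding such as UTF-8, where the byte 0x0A can only ever be a line feed.

```python
import os


def ends_with_newline(path):
    """Return True if the file at `path` ends with a line feed byte.

    Hypothetical helper; assumes an ASCII-compatible encoding such as UTF-8.
    """
    with open(path, "rb") as handle:
        # Seeking to the end returns the file size in bytes.
        if handle.seek(0, os.SEEK_END) == 0:
            return False  # empty file: there is no last byte to look at
        # End-relative seeks with a negative offset are allowed in binary mode.
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```

Only one byte is read, regardless of how big the file is.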

The thing is, open does not really return a file object; you get a text wrapper around a buffer. So you can call seek on the underlying buffer. What happens if we do?

handle.buffer.seek(-1, os.SEEK_END)
handle.read()
→ '\n'

Which gives us the last character of the file. Why can’t the TextIOWrapper call the seek method on the underlying buffer? It would be broken, sure, but exactly as broken as forward seeking. In fact, you can do seeks relative to the end; you just need to do it manually:

handle.seek(handle.seek(0, os.SEEK_END) - 1) 
handle.read()
→ '\n'

Actually, does it really need to be broken? The read code could just realign itself with the underlying encoding (say UTF-8), so if you seek mid code-point, the read call returns you the next structurally valid code-point?
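To make the idea concrete, here is a rough sketch of such a realigning read, done by hand in binary mode. The helper read_text_from is hypothetical; it assumes valid UTF-8, does no newline translation, and is not how TextIOWrapper behaves.

```python
import os


def read_text_from(path, offset):
    """Read UTF-8 text starting at the first code point at or after `offset`.

    Hypothetical helper; assumes the file contains valid UTF-8.
    """
    with open(path, "rb") as raw:
        raw.seek(offset)
        # Skip continuation bytes (10xxxxxx) until a head byte or end of file.
        while True:
            byte = raw.read(1)
            if not byte or (byte[0] & 0xC0) != 0x80:
                break
        if byte:
            # Step back so the head byte we just consumed is decoded too.
            raw.seek(-1, os.SEEK_CUR)
        return raw.read().decode("utf-8")
```

With the elephants text, an offset of 5 lands in the middle of the first é, and this sketch would resume at the l that follows it instead of raising an exception.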

Now you could argue that this would yield complexity and unclear semantics, as different seek offsets would yield the same result. It turns out that this is exactly what TextIOWrapper does, but only for carriage returns…

Consider the following file

first line␍␊
second line␍␊
third line␍␊

Saved with carriage returns + line-feeds, i.e. old school DOS ASCII.

handle.seek(0xa) 
a = handle.read()
→ '\nsecond line\nthird line\n'
…
handle.seek(0xb) 
b = handle.read()
→ '\nsecond line\nthird line\n'
a == b
→ True

So you have a file which, if you read it, contains 34 characters, but you can seek to position 36.

handle.seek(0)
len(handle.read())  
→ 34
handle.tell() 
→ 37
handle.buffer.seek(-1, 2)
handle.tell()
→ 36
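The mismatch between the two counts can be reproduced with a small self-contained sketch. The temporary file name is hypothetical, and passing newline="" to open is my assumption about the simplest way to switch off the newline translation for comparison.

```python
import os
import tempfile

content = b"first line\r\nsecond line\r\nthird line\r\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dos.txt")  # hypothetical file name
    with open(path, "wb") as out:
        out.write(content)

    # Default text mode: each "\r\n" is translated to a single "\n" on read.
    with open(path, encoding="ascii") as handle:
        translated = handle.read()

    # newline="" keeps the line endings exactly as they are stored.
    with open(path, encoding="ascii", newline="") as handle:
        untranslated = handle.read()

    size = os.path.getsize(path)

print(len(translated), len(untranslated), size)  # 34 37 37
```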

So there is black magic going on, but only for ASCII, because this is what text mode is about…

Edit: I filed a bug, which led to some interesting follow-ups.
