I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

Does anybody know how to convert the bytes value back to string? I mean, using the "batteries" instead of doing it manually. And I'd like it to be ok with Python 3.



You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'
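
As a minimal sketch of why the choice of encoding matters (the sample bytes and variable names here are made up for illustration), the same bytes can decode to the right text, to garbled text, or fail outright depending on the codec:

```python
# Hypothetical sample: the word "café" encoded as UTF-8.
raw = "café".encode("utf-8")          # b'caf\xc3\xa9'

# Correct encoding: decoding gives back the original text.
text_ok = raw.decode("utf-8")

# Wrong single-byte encoding: no error, but the text is garbled.
text_wrong = raw.decode("latin-1")

# Wrong strict encoding: decoding raises instead of guessing.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Only the first call recovers the original text, which is why the encoding has to match the data rather than habit.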
This 'solution' was particularly hard to find (for me at least) considering it is such a simple problem ... I'd love to put a line somewhere in the subprocess docs about this since I bet a good portion of newbies like me will hit this snag when using subprocess. Anybody know about contributing to the python docs? – mathtick
Yes, but given that this is the output from a windows command, shouldn't it instead be using ".decode('windows-1252')" ? – mcherm
Using "windows-1252" is not reliable either (e.g., for other language versions of Windows), wouldn't it be best to use sys.stdout.encoding? – nikow
This is the second time I forgot about this and it’s still nowhere to be found in the documentation, not even in the unicode section. What a shame. – Profpatsch
Maybe this will help somebody further: Sometimes you use a byte array for, e.g., TCP communication. If you want to convert a byte array to a string and cut off trailing '\x00' characters, this answer is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') then. – Wookie88
I've filed a bug about documenting it at bugs.python.org/issue17860 - feel free to propose a patch. If it is hard to contribute, comments on how to improve that are welcome. – anatoly techtonik
what other decoding options does the binary object possess? – CMCDragonkai
Python 2.7.6 doesn't handle b"\x80\x02\x03".decode("utf-8"): it raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte. – martineau
If the content is random binary values, the utf-8 conversion is likely to fail. Instead see @techtonik answer (below) //allinonescript.com/a/27527728/198536 – wallyk
@AaronMaenpaa : This won’t work on an array like it worked in python2. – user2284570
@Profpatsch: it's kinda hidden. See answer below for a reference to documentation. It's also in the bytes-docstring (help(command_stdout)). – serv-inc
@nikow: small update on using sys.stdout.encoding - this is allowed to be None which will cause encode() to fail. – Kevin Shea
I have some code for a networking program: [def dataReceived(self, data): print(f"Received quote: {data}")]. It prints "Received quote: b'\x00&C:\\Users\\.pycharm2016.3\\config\x00&C:\\users\\pycharm\\system\x00\x03--'". How would I change my code to fix this? Writing print(f"Received quote: {data}".decode('utf-8')) does not do the trick. – Jessica Warren

You need to decode the byte string and turn it into a character (unicode) string.

b'hello'.decode(encoding)

or

str(b'hello', encoding)
Note that the str function in Python 2 (at least 2.7.5 I'm running) doesn't support the second encoding parameter, so it's better to go with the decode method if you want your code to work on Python 2 and 3. – metakermit
@dF. : This doesn’t work with python3. – user2284570
@user2284570 str(s, 'utf-8') worked for me in Python3 – Kat

I think what you actually want is this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know WHICH encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ascii) characters in your content, but then it will make a difference.

By the way, the fact that it DOES matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation.

open() function for text streams or Popen() if you pass it universal_newlines=True do magically decide character encoding for you (locale.getpreferredencoding(False) in Python 3.3+). – jfs
'latin-1' is a verbatim encoding with all code points set, so you can use that to effectively read a byte string into whichever type of string your Python supports (so verbatim on Python 2, into Unicode for Python 3). – tripleee

I think this way is easy:

bytes = [112, 52, 52]
"".join(map(chr, bytes))
>> p44
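
For comparison, a short sketch (variable names are illustrative) suggesting that joining chr() over the byte values is the same as decoding with latin-1, which maps each byte value 0-255 to the code point with the same number:

```python
values = [112, 52, 52]          # the integers from the answer above

# Approach from the answer: one chr() call per byte value.
joined = "".join(map(chr, values))

# Equivalent decode: build a bytes object, then decode it as latin-1.
decoded = bytes(values).decode("latin-1")
```

Both give the same string, so an encoding is still being chosen; it is just implicit in the first form.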
Thank you, your method worked for me when none other did. I had a non-encoded byte array that I needed turned into a string. Was trying to find a way to re-encode it so I could decode it into a string. This method works perfectly! – leetNightshade
@leetNightshade: yet it is terribly inefficient. If you have a byte array you only need to decode. – Martijn Pieters
@Martijn Pieters I just did a simple benchmark with these other answers, running multiple 10,000 runs //allinonescript.com/a/3646405/353094 And the above solution was actually much faster every single time. For 10,000 runs in Python 2.7.7 it takes 8ms, versus the others at 12ms and 18ms. Granted there could be some variation depending on input, Python version, etc. Doesn't seem too slow to me. – leetNightshade
@leetNightshade: yet the OP here is using Python 3. – Martijn Pieters
@Martijn Pieters Fair enough. In Python 3.4.1 x86 this method takes 17.01ms, the others 24.02ms, and 11.51ms for the bytearray to string cast. So it's not the fastest in that case. – leetNightshade
@leetNightshade: you also appear to be talking about integers and bytearrays, not a bytes value (as returned by Popen.communicate()). – Martijn Pieters
@Martijn Pieters Yes. So with that point, this isn't the best answer for the body of the question that was asked. And the title is misleading, isn't it? He/she wants to convert a byte string to a regular string, not a byte array to a string. This answer works okay for the title of the question that was asked. – leetNightshade
@leetNightshade: the title can indeed be misleading, I'll edit. – Martijn Pieters
It can convert bytes read from a file with "rb" to a string, and it's handy when you don't know the encoding – Sasszem
@Sasszem: this method is a perverted way to express: a.decode('latin-1') where a = bytearray([112, 52, 52]) ("There Ain't No Such Thing as Plain Text". If you've managed to convert bytes into a text string then you used some encoding—latin-1 in this case) – jfs
For python 3 this should be equivalent to bytes([112, 52, 52]) - btw bytes is a bad name for a local variable exactly because it's a p3 builtin – Mr_and_Mrs_D

From http://docs.python.org/3/library/sys.html,

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

The pipe to the subprocess is already a binary buffer. Your answer fails to address how to get a string value from the resulting bytes value. – Martijn Pieters

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
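
A hedged sketch of the same idea with an explicit codec instead of the locale default, assuming Python 3.6+ (where subprocess accepts an encoding argument). The child command is a stand-in that runs the current interpreter so the example does not depend on ls being installed:

```python
import subprocess
import sys

# Portable stand-in for `ls -l`: the child just prints two ASCII lines.
cmd = [sys.executable, "-c", "print('file1'); print('file2')"]

# encoding=... opens the pipe as a text stream decoded with that codec,
# so result.stdout is already a str rather than bytes.
result = subprocess.run(cmd, stdout=subprocess.PIPE, encoding="utf-8")
lines = result.stdout.splitlines()
```

Passing the encoding explicitly avoids depending on the user's locale settings, at the cost of having to know what the child actually emits.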
I've been using this method and it works. Although, it's just guessing at the encoding based on user preferences on your system, so it's not as robust as some other options. This is what it's doing, referencing docs.python.org/3.4/library/subprocess.html: "If universal_newlines is True, [stdin, stdout and stderr] will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False)." – twasbrillig

If you don't know the encoding, then to read binary input into a string in a way that works on both Python 3 and Python 2, use the ancient MS-DOS cp437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because the encoding is unknown, expect non-English symbols to be translated to characters of cp437 (English characters are not affected, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

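latin-1 behaves differently: Python's latin-1 codec maps every byte value 0-255 to a code point, so decoding never raises, but bytes from another encoding silently turn into the wrong characters. The infamous "ordinal not in range" error comes from the ascii codec, which was the default in Python 2.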

UPDATE 20150604: Python 3 has the surrogateescape error handler (PEP 383) for carrying undecodable bytes through a string and back into binary data without data loss or crashes, but it needs conversion tests [binary] -> [str] -> [binary] to validate both performance and reliability.
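
A minimal round-trip sketch of the [binary] -> [str] -> [binary] check described above, assuming Python 3. It only checks losslessness on one sample input and says nothing about performance:

```python
# Every possible byte value, including ones that are invalid in UTF-8.
raw = bytes(range(256))

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into the bytes.
restored = text.encode("utf-8", errors="surrogateescape")
```

Note that the intermediate string may contain lone surrogates, so writing it to a strict UTF-8 stream without the same error handler can still fail.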

UPDATE 20170116: Thanks to comment by Nearoo - there is also a possibility to slash escape all unknown bytes with backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.
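
As a small sketch of how the built-in error handlers differ on the same input (assuming Python 3.5+, where backslashreplace also works for decoding; the variable names are illustrative):

```python
raw = b"\x80abc"   # 0x80 is not a valid UTF-8 start byte

# Keep the offending byte visible as a literal \xNN escape in the text.
escaped = raw.decode("utf-8", errors="backslashreplace")

# Substitute the Unicode replacement character U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# Drop the offending byte entirely.
ignored = raw.decode("utf-8", errors="ignore")
```

Only backslashreplace keeps enough information to see which bytes were bad; replace and ignore lose it.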

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    thebytes = err.object[err.start:err.end]
    # bytearray yields integers on both Python 2 and 3; escape each byte
    # separately because the undecodable span may be longer than one byte.
    repl = u''.join(u'\\x%02x' % b for b in bytearray(thebytes))
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
I really feel like Python should provide a mechanism to replace missing symbols and continue. – anatoly techtonik
Brilliant! This is much faster than @Sisso's method for a 256 MB file! – wallyk
@techtonik : This won’t work on an array like it worked in python2. – user2284570
@user2284570 do you mean list? And why it should work on arrays? Especially arrays of floats.. – anatoly techtonik
You can also just ignore unicode errors with b'\x00\x01\xffsd'.decode('utf-8', 'ignore') in python 3. – Antonis Kalou
@anatolytechtonik There is the possibility to leave the escape sequence in the string and move on: b'\x80abc'.decode("utf-8", "backslashreplace") will result in '\\x80abc'. This information was taken from the unicode documentation page which seems to have been updated since the writing of this answer. – Nearoo
@Nearoo updated the answer. Unfortunately it doesn't work with Python 2 - see //allinonescript.com/questions/25442954/… – anatoly techtonik
Underrated answer here – Wlliam

While @Aaron Maenpaa's answer just works, a user recently asked

Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use

command_stdout.decode()

decode() takes the encoding as an optional argument that defaults to 'utf-8':

codecs.decode(obj, encoding='utf-8', errors='strict')

I made a function to clean a list

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista
You can actually chain all of the .strip, .replace, .encode, etc calls in one list comprehension and only iterate over the list once instead of iterating over it five times. – Taylor Edmiston
@TaylorEdmiston Maybe it saves on allocation but the number of operations would remain the same. – JulienD

In Python 3 you can use directly:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

Here the default encoding is "utf-8"; you can check it with:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted but your program remains unaware that a failure has occurred.

In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.
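
When the encoding cannot be communicated out-of-band, one common fallback is to try a short list of likely candidates in order. The sketch below is only an illustration of that idea using the standard library; the function name and the candidate list are hypothetical, and the result is still a guess, not a detection:

```python
def decode_with_fallbacks(data, encodings=("utf-8", "cp1252"), last_resort="latin-1"):
    """Return (text, encoding_used), trying each candidate strictly in order."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this never raises (but may be wrong).
    return data.decode(last_resort), last_resort
```

Putting utf-8 first is a deliberate choice in this sketch: it is strict enough that text in another single-byte encoding usually fails to decode, whereas a permissive codec placed first would accept almost anything and hide the mismatch.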


The output of ls can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
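
A small sketch of that round trip, assuming a Unix system where the filesystem codec uses surrogateescape (the helper name is hypothetical):

```python
import os

def roundtrip_filename_bytes(raw):
    """Decode filename bytes to str and back using the filesystem codec."""
    name = os.fsdecode(raw)      # str, possibly containing lone surrogates
    return name, os.fsencode(name)
```

For example, roundtrip_filename_bytes(b'file-\xff.txt') would be expected to return a str together with the original bytes, even though 0xff is not valid UTF-8.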

If you pass the universal_newlines=True parameter, subprocess uses locale.getpreferredencoding(False) to decode the bytes; e.g., it can be cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() can be used.
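
A minimal sketch of wrapping a subprocess pipe in io.TextIOWrapper(), assuming the child is known to emit UTF-8. The child command is a stand-in that runs the current interpreter, and the variable names are illustrative:

```python
import io
import subprocess
import sys

# Portable stand-in for a real command: the child writes UTF-8 bytes directly.
cmd = [sys.executable, "-c",
       "import sys; sys.stdout.buffer.write('caf\\u00e9\\nna\\u00efve\\n'.encode('utf-8'))"]

decoded_lines = []
with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
    # Wrap the binary pipe so each line is decoded as it arrives.
    with io.TextIOWrapper(proc.stdout, encoding="utf-8") as reader:
        for line in reader:
            decoded_lines.append(line.rstrip("\n"))
```

Compared with communicate(), this processes output line by line instead of buffering everything before decoding.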

Different commands may use different character encodings for their output e.g., dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from os.listdir() (which uses the Windows Unicode API), e.g., '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ascii unicode characters into a python string".

For Python 3, this is a safer and more Pythonic approach to converting bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): #check if its in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

If you get the following when calling decode():

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding type straight in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
