Strings from `raw_input()` in memory

I've known for a while that Python likes to reuse strings in memory instead of creating duplicates:

>>> a = "test"
>>> id(a)
36910184L
>>> b = "test"
>>> id(b)
36910184L


However, I recently discovered that the string returned from `raw_input()` does not follow this typical optimization pattern:

>>> a = "test"
>>> id(a)
36910184L
>>> c = raw_input()
test
>>> id(c)
45582816L


Why is this so? Is there a technical reason?



3 answers


It looks to me like Python is interning string literals, but strings created through some other process are not interned:

>>> s = 'ab'
>>> id(s)
952080
>>> g = 'a' if True else 'c'
>>> g += 'b'
>>> g
'ab'
>>> id(g)
951336




Of course `raw_input` creates new strings without using string literals, so it is entirely reasonable to expect that it will not have the same `id`. There are (at least) two reasons why CPython interns strings: memory (you can save a lot of it if you don't keep a whole bunch of copies of the same string) and resolution of hash collisions. If two strings hash to the same value (in a dictionary lookup, for example), Python has to check whether they are equivalent. It can do a character-by-character comparison if they are not interned, but if they are interned it only needs to do a pointer comparison, which is far more efficient.



The compiler cannot intern strings unless they are present in actual source code (i.e., string literals). Additionally, `raw_input` also strips newlines.
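A hedged illustration of that point (CPython 2 session; exactly which literals get interned is an implementation detail):

>>> code = compile("s = 'test'", '<demo>', 'exec')
>>> code.co_consts               # the literal became a compile-time constant
('test', None)
>>> code.co_consts[0] is 'test'  # identifier-like literals are interned
True
>>> raw_input() is 'test'        # a string built at runtime never is
test
False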





[update] To answer the question, we need to know why, how, and when Python reuses strings.

Let's start with how: Python uses "interned" strings. From Wikipedia:

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

Why? Saving memory doesn't seem to be the main goal here, just a nice side effect.

String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, strings read from the network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern the names of tags and attributes to save memory.
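To make the comparison-cost claim concrete, here is a rough benchmark sketch (Python 2; the absolute numbers are machine-dependent, but the identity-shortcut case should be dramatically faster):

import timeit

# Two equal million-character strings at different addresses, plus an
# alias that shares identity with the first.
setup = "a = 'x' * 10**6; b = 'x' * 10**6; c = a"

print timeit.timeit('a == b', setup=setup, number=1000)  # full O(n) compare
print timeit.timeit('a == c', setup=setup, number=1000)  # identity shortcut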

Now "when": cpython "puts" a string in the following situations:

  • when you use the `intern()` non-essential built-in function (which moved to `sys.intern` in Python 3)
  • small strings (0 or 1 byte) - this very informative article by Laurent Luce explains the implementation
  • names used in Python programs are automatically interned
  • dictionaries used to store attributes of a module, class, or instance have interned keys

In other situations, every implementation seems to vary a lot as to when strings are automatically interned.
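A short CPython 2.7 session illustrating those cases (hedged: the auto-interning rules are implementation details and differ across versions):

>>> a = 'python'     # identifier-like literal: auto-interned
>>> b = 'python'
>>> a is b
True
>>> c = 'python!'    # contains punctuation: typically not auto-interned
>>> d = 'python!'
>>> c is d
False
>>> e = ''           # 0- and 1-character strings are shared
>>> e is str()
True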

I couldn't put it better than Alex Martelli did in this answer (no wonder the guy has a 245k reputation):

Each Python implementation is free to make its own tradeoffs when allocating immutable objects (such as strings): making a new one, or finding an existing equal one and using one more reference to it, are both fine from the language's point of view. In practice, of course, real-world implementations strike reasonable compromises: one more reference to a suitable existing object when locating such an object is cheap and easy, and just making a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to the same object" strategy, because when building that function's constants pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff between compilation time (slowed down by searching for identical existing constants) and memory consumption (increased if new copies of constants keep being made).

I am not aware of any implementation of Python (or, for that matter, of other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file: it just doesn't seem to be a promising tradeoff (and here you would be paying at runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application-level considerations) that such immutable objects are large and quite prone to much duplication, you can implement your own constant-pool strategy easily enough (interning can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
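Here is a minimal sketch of the application-level constant-pool strategy Martelli describes; the `_pool`/`pooled` names are my own illustration, not a standard API:

_pool = {}

def pooled(obj):
    """Return one canonical shared instance per distinct immutable value."""
    return _pool.setdefault(obj, obj)

# Example: tuples built at runtime are normally distinct objects, but the
# pool collapses equal values into a single shared instance.
methods = ['GET', 'GET', 'POST']
paths = ['/index', '/index', '/form']
rows = [pooled((m, p)) for m, p in zip(methods, paths)]
print rows[0] is rows[1]   # True: both references point at one tuple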


[initial answer]

This is more of a comment than an answer, but the comment system is not suitable for posting code:

def main():
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))

if __name__ == '__main__':
    main()


Running this gives me:

"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688


In my experience, at least on 2.7, there is some optimization going on even for `raw_input()`.

If the implementation uses hash tables, I guess there is more than one. Going to dive into the source now.

[first update]

It looks like my experiment was flawed:

def main():
    storage = []
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))
        # keep a reference so the string is never freed and its memory
        # address (and hence its id) can never be reused by a new string
        storage.append(s)

if __name__ == '__main__':
    main()


Result:

"test".id = 43274944
"test".id = 43277104
"test".id = 43487408
"test".id = 43487504
"test".id = 43487552
"test".id = 43487600
"test".id = 43487648
"test".id = 43487744
"test".id = 43487864
"test".id = 43487936
"test".id = 43487984
"test".id = 43488032


In his answer to another question, user tzot warns about object lifetimes:

Sidenote: it is very important to know the lifetime of objects in Python. Note the following session:

Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10) 
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False


Your assumption that printing the IDs of two separate expressions and noting "they are equal, ergo the two expressions must be equal/equivalent/the same" is faulty. A single line of output does not necessarily imply that all of its contents were created and/or coexisted at the same single moment in time.

If you want to know whether two objects are the same object, ask Python directly (using the `is` operator).
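Applying that advice to the original question (Python 2 session; "test" is typed at the prompt):

>>> a = "test"
>>> c = raw_input()
test
>>> a == c    # same characters
True
>>> a is c    # but raw_input returns a fresh, un-interned object
False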







