This post is part 2 of a series on Postgres internals β for part 1, see How Postgres stores data on disk β this oneβs a page turner .
Where we left off last time, we had a table stored as heap segments, with each segment containing a bunch of pages, each of size 8 KiB. This was working fantastically for our table of countries, but you might be wondering β if each page is 8 KiB in size and a single row canβt span multiple pages, what happens if weβve got a single value which is bigger than 8 KiB?
Letβs find out.
# Recap
(If youβve just come from part 1 and still have the database setup, feel free to skip over this bit .)
Last time we spin up a fresh Postgres instance and inserted a load of countries in from CSV. You can re-create this by running:
curl 'https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv' \
--output ./pg-data/countries.csv
docker run -d --rm -v ./pg-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=password postgres:16
pg_container_id = $( docker ps --filter expose= 5432 --format "{{.ID}}" )
docker exec $pg_container_id psql -U postgres -c 'create database blogdb; create extension pageinspect;'
Now the database is set up, you can start a psql session with:
docker exec -it $pg_container_id psql -U postgres blogdb
Next create the schema and load the data in with:
id integer primary key generated always as identity ,
name text not null unique ,
alpha_2 char ( 2 ) not null ,
alpha_3 char ( 3 ) not null ,
numeric_3 char ( 3 ) not null ,
iso_3166_2 text not null ,
intermediate_region text ,
intermediate_region_code char ( 3 )
from '/var/lib/postgresql/data/countries.csv'
From this we explained how Postgres stores tables in separate 1 GiB files called segments, with each segment consisting of a series of pages of 8 KiB each. Each page has a number of tuples within it which represents a snapshot view of a row at a particular time. Updating or deleting a row does not remove the tuple but instead creates a new tuple at the end of the last page. The old or βdeadβ tuple is only cleared out once Postgres runs a vacuum.
The filename of the first segment is stored in base/{database oid}/{filenode}
where OID = Object IDentifier and filenode is an integer that starts off equal to table OID but diverges over time as the database runs vacuums and other operations. You can find your databaseβs OID and tableβs filenode by running:
select oid from pg_database where datname = 'blogdb' ;
select relfilenode from pg_class where relname = 'countries' ;
Within each row there is a secret column not returned from select *
called ctid
that refers to the tupleβs physical location on disk. It looks like (page index, tuple index within page)
, e.g. (3, 7)
refers to tuple #7 within page #3. You can return this secret column by simply doing select ctid, *
instead of select *
.
You can use the built-in pageinspect
extension to examine page headers and their raw data using page_header()
and heap_page_items()
functions alongside get_raw_page()
.
# Letβs create another table
Last time we made a table of countries from the ISO 3611-1 spec. Letβs say now that we want to add some of the greatest creative works that these countries have to offer (that also happen to have their copyright expired).
To prepare for this, weβll create a new table for these culturally important creative works.
If you didnβt go through the recap steps above, make sure your database is running:
docker run -d --rm -v ./pg-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=password postgres:16
pg_container_id = $( docker ps --filter expose= 5432 --format "{{.ID}}" )
Reminder: To open an interactive psql session, run:
docker exec -it $pg_container_id psql -U postgres blogdb
Hereβs our table schema:
create table creative_works (
id integer primary key generated always as identity ,
-- In real life this would probably be a many-to-many relationship instead of
-- one-to-many, just take a look at:
-- https://en.wikipedia.org/wiki/Wikipedia:Lamest_edit_wars/Ethnic_feuds#People
country_id integer references countries (id),
Iβve prepared three poems of varying length and formatted it as CSV so that you can quickly load it into your database:
curl 'https://drew.silcock.dev/data/poems.csv' --output ./pg-data/poems.csv
Next letβs copy the CSV into our new table β weβll use a temporary table to resolve the country code to country ID.
create temporary table creative_works_temp (
copy creative_works_temp (title, authors, country_code, content)
from '/var/lib/postgresql/data/poems.csv'
insert into creative_works
select cw . title , cw . authors , c . id , cw . content
from creative_works_temp cw
left join countries c on c . alpha_2 = cw . country_code ;
Just like before, weβre going to use the pageinspect
functions to explore what the raw data looks like1 :
blogdb = # select id, title, authors, country_id, length (content) from creative_works;
id | title | authors | country_id | length
----+----------------------+-------------------------------------------------------------------------------------+------------+--------
12 | Ozymandias | [{"name": "Percy Bysshe Shelley", "birth_year": 1792, "death_year": 1822}] | 235 | 631
13 | Ode on a Grecian Urn | [{"name": "Keats, John", "birth_year": 1795, "death_year": 1821}] | 235 | 2442
14 | The Waste Land | [{"name": "Eliot, T. S. (Thomas Stearns)", "birth_year": 1888, "death_year": 1965}] | 236 | 19950
blogdb = # select lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid, length (t_data)
blogdb - # from heap_page_items(get_raw_page( 'creative_works' , 0 ));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | length
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
1 | 7408 | 1 | 781 | 1074 | 0 | 8 | ( 0 , 1 ) | 5 | 2818 | 24 | | | 757
2 | 5536 | 1 | 1868 | 1074 | 0 | 8 | ( 0 , 2 ) | 5 | 2818 | 24 | | | 1844
3 | 5360 | 1 | 174 | 1074 | 0 | 8 | ( 0 , 3 ) | 5 | 2822 | 24 | | | 150
We can see that we have added three classic poems to our table of increasing length:
Ozymandias by Percy Bysshe Shelley β a short but intense exploration of the futility of hubris and how time washes away even the greatest of empires. Clocking in at 631 characters, this is the shortest poem of the lot. The size of the whole row is 757 bytes which makes sense β 631 for the actual poem and 126 for the title, authors and country ID2 .
Ode on a Grecian Urn by John Keats β a slightly longer 2,442 character ode praising an ancient Greek urn and the scenes depicted on it. This poem is 2,442 characters long, and yet the whole row is only 1,844 bytes π€
The Waste Land by T. S. Eliot β the 434 lines of this 1922 poem are split between 5 sections and flows between different styles, times, places, narrators and themes. This is by far the longest at 19,950 characters yet the tuple in the heap table is only 150 bytes!
# Show me the data
We can figure out whatβs going on by looking at the raw data for each row. Letβs write a little helper Bash function for this:
function run-and-decode {
# You can replace `hexyl` for `xxd` if you don't have hexyl installed.
docker exec $pg_container_id psql -U postgres blogdb --tuples-only -c " $1 " | cut -c4- | xxd -r -p | hexyl
from creative_works where title = ' $1 '
(select (ctid::text::point)[0]::bigint from cw)
where t_ctid = (select ctid from cw)
order by lp desc limit 1"
Letβs start with Ozymandias:
$ print-cw-data 'Ozymandias'
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 01 00 00 00 17 4f 7a 79 β 6d 61 6e 64 69 61 73 c3 ββ’ββββ’OzyβmandiasΓβ
β00000010β 01 00 00 40 58 00 00 d0 β 03 00 00 20 04 00 00 80 ββ’ββ@XββΓββ’ββ β’ββΓβ
β00000020β 0a 00 00 00 0a 00 00 00 β 14 00 00 00 08 00 00 10 β_βββ_βββββ’ββββ’βββ’β
β00000030β 08 00 00 10 6e 61 6d 65 β 62 69 72 74 68 5f 79 65 ββ’βββ’nameβbirth_yeβ
β00000040β 61 72 64 65 61 74 68 5f β 79 65 61 72 50 65 72 63 βardeath_βyearPercβ
β00000050β 79 20 42 79 73 73 68 65 β 20 53 68 65 6c 6c 65 79 βy Byssheβ Shelleyβ
β00000060β 20 00 00 00 00 80 00 07 β 20 00 00 00 00 80 1e 07 β ββββΓββ’β ββββΓβ’β’β
β00000070β eb 00 00 00 04 0a 00 00 β 49 20 6d 65 74 20 61 20 βΓββββ’_βββI met a β
β00000080β 74 72 61 76 65 6c 6c 65 β 72 20 66 72 6f 6d 20 61 βtravelleβr from aβ
β00000090β 6e 20 61 6e 74 69 71 75 β 65 20 6c 61 6e 64 2c 0a βn antiquβe land,_β
β000000a0β 57 68 6f 20 73 61 69 64 β e2 80 94 e2 80 9c 54 77 βWho saidβΓΓΓΓΓΓTwβ
β000000b0β 6f 20 76 61 73 74 20 61 β 6e 64 20 74 72 75 6e 6b βo vast aβnd trunkβ
β000000c0β 6c 65 73 73 20 6c 65 67 β 73 20 6f 66 20 73 74 6f βless legβs of stoβ
β000000d0β 6e 65 0a 53 74 61 6e 64 β 20 69 6e 20 74 68 65 20 βne_Standβ in the β
β000000e0β 64 65 73 65 72 74 2e 20 β 2e 20 2e 20 2e 20 4e 65 βdesert. β. . . Neβ
β000000f0β 61 72 20 74 68 65 6d 2c β 20 6f 6e 20 74 68 65 20 βar them,β on the β
β00000100β 73 61 6e 64 2c 0a 48 61 β 6c 66 20 73 75 6e 6b 20 βsand,_Haβlf sunk β
β00000110β 61 20 73 68 61 74 74 65 β 72 65 64 20 76 69 73 61 βa shatteβred visaβ
β00000120β 67 65 20 6c 69 65 73 2c β 20 77 68 6f 73 65 20 66 βge lies,β whose fβ
β00000130β 72 6f 77 6e 2c 0a 41 6e β 64 20 77 72 69 6e 6b 6c βrown,_Anβd wrinklβ
β00000140β 65 64 20 6c 69 70 2c 20 β 61 6e 64 20 73 6e 65 65 βed lip, βand sneeβ
β00000150β 72 20 6f 66 20 63 6f 6c β 64 20 63 6f 6d 6d 61 6e βr of colβd commanβ
β00000160β 64 2c 0a 54 65 6c 6c 20 β 74 68 61 74 20 69 74 73 βd,_Tell βthat itsβ
β00000170β 20 73 63 75 6c 70 74 6f β 72 20 77 65 6c 6c 20 74 β sculptoβr well tβ
β00000180β 68 6f 73 65 20 70 61 73 β 73 69 6f 6e 73 20 72 65 βhose pasβsions reβ
β00000190β 61 64 0a 57 68 69 63 68 β 20 79 65 74 20 73 75 72 βad_Whichβ yet surβ
β000001a0β 76 69 76 65 2c 20 73 74 β 61 6d 70 65 64 20 6f 6e βvive, stβamped onβ
β000001b0β 20 74 68 65 73 65 20 6c β 69 66 65 6c 65 73 73 20 β these lβifeless β
β000001c0β 74 68 69 6e 67 73 2c 0a β 54 68 65 20 68 61 6e 64 βthings,_βThe handβ
β000001d0β 20 74 68 61 74 20 6d 6f β 63 6b 65 64 20 74 68 65 β that moβcked theβ
β000001e0β 6d 2c 20 61 6e 64 20 74 β 68 65 20 68 65 61 72 74 βm, and tβhe heartβ
β000001f0β 20 74 68 61 74 20 66 65 β 64 3b 0a 41 6e 64 20 6f β that feβd ; _And oβ
β00000200β 6e 20 74 68 65 20 70 65 β 64 65 73 74 61 6c 2c 20 βn the peβdestal, β
β00000210β 74 68 65 73 65 20 77 6f β 72 64 73 20 61 70 70 65 βthese woβrds appeβ
β00000220β 61 72 3a 0a 4d 79 20 6e β 61 6d 65 20 69 73 20 4f βar:_My nβame is Oβ
β00000230β 7a 79 6d 61 6e 64 69 61 β 73 2c 20 4b 69 6e 67 20 βzymandiaβs, King β
β00000240β 6f 66 20 4b 69 6e 67 73 β 3b 0a 4c 6f 6f 6b 20 6f βof Kingsβ ; _Look oβ
β00000250β 6e 20 6d 79 20 57 6f 72 β 6b 73 2c 20 79 65 20 4d βn my Worβks, ye Mβ
β00000260β 69 67 68 74 79 2c 20 61 β 6e 64 20 64 65 73 70 61 βighty, aβnd despaβ
β00000270β 69 72 21 0a 4e 6f 74 68 β 69 6e 67 20 62 65 73 69 βir!_Nothβing besiβ
β00000280β 64 65 20 72 65 6d 61 69 β 6e 73 2e 20 52 6f 75 6e βde remaiβns. Rounβ
β00000290β 64 20 74 68 65 20 64 65 β 63 61 79 0a 4f 66 20 74 βd the deβcay_Of tβ
β000002a0β 68 61 74 20 63 6f 6c 6f β 73 73 61 6c 20 57 72 65 βhat coloβssal Wreβ
β000002b0β 63 6b 2c 20 62 6f 75 6e β 64 6c 65 73 73 20 61 6e βck, bounβdless anβ
β000002c0β 64 20 62 61 72 65 0a 54 β 68 65 20 6c 6f 6e 65 20 βd bare_Tβhe lone β
β000002d0β 61 6e 64 20 6c 65 76 65 β 6c 20 73 61 6e 64 73 20 βand leveβl sands β
β000002e0β 73 74 72 65 74 63 68 20 β 66 61 72 20 61 77 61 79 βstretch βfar awayβ
β000002f0β 2e e2 80 9d 0a β β.ΓΓΓ_ β β
ββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββ΄βββββββββ
We can see here the first 4 bytes are the ID, then the title, then a bunch of bytes for the jsonb blob containing the authors (Postgres does not store jsonb values as plain strings but thatβs a story for another post), then we can see eb 00 00 00
β this is the country ID (in my database, the UK has ID 235 = 0xeb) β then we have the 4 bytes 04 0a 00 00
and finally the poem itself, in full. We mentioned those pesky 4 bytes in the last blog post that holds the varlena metadata β weβre going to talk about them again a little bit later on in this post.
Letβs take a look at Keats:
$ print-cw-data 'Ode on a Grecian Urn' | head -n 30
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 02 00 00 00 2b 4f 64 65 β 20 6f 6e 20 61 20 47 72 ββ’βββ+Odeβ on a Grβ
β00000010β 65 63 69 61 6e 20 55 72 β 6e b3 01 00 00 40 50 00 βecian UrβnΓβ’ββ@Pββ
β00000020β 00 d0 03 00 00 20 04 00 β 00 80 0a 00 00 00 0a 00 ββΓβ’ββ β’βββΓ_βββ_ββ
β00000030β 00 00 0b 00 00 00 09 00 β 00 10 08 00 00 10 6e 61 ββββ’βββ_ββββ’β’βββ’naβ
β00000040β 6d 65 62 69 72 74 68 5f β 79 65 61 72 64 65 61 74 βmebirth_βyeardeatβ
β00000050β 68 5f 79 65 61 72 4b 65 β 61 74 73 2c 20 4a 6f 68 βh_yearKeβats, Johβ
β00000060β 6e 00 20 00 00 00 00 80 β 03 07 20 00 00 00 00 80 βnβ ββββΓββ’β’ ββββΓβ
β00000070β 1d 07 00 00 eb 00 00 00 β f2 1a 00 00 8c 09 00 00 ββ’β’ββΓββββΓβ’ββΓ_βββ
β00000080β 00 54 68 6f 75 20 73 74 β 69 00 6c 6c 20 75 6e 72 ββThou stβiβll unrβ
β00000090β 61 76 00 69 73 68 27 64 β 20 62 72 00 69 64 65 20 βavβish'dβ brβide β
β000000a0β 6f 66 20 71 00 75 69 65 β 74 6e 65 73 73 18 2c 0a βof qβuieβtnessβ’,_β
β000000b0β 20 03 01 02 31 66 6f 73 β 00 74 65 72 2d 63 68 69 β β’β’β’1fosββter-chiβ
β000000c0β 6c 02 64 01 27 73 69 6c β 65 6e 63 00 65 20 61 6e βlβ’dβ’'silβencβe anβ
β000000d0β 64 20 73 6c 00 6f 77 20 β 74 69 6d 65 2c 00 0a 53 βd slβow βtime,β_Sβ
β000000e0β 79 6c 76 61 6e 20 00 68 β 69 73 74 6f 72 69 61 00 βylvan βhβistoriaββ
β000000f0β 6e 2c 20 77 68 6f 20 63 β 00 61 6e 73 74 20 74 68 βn, who cββanst thβ
β00000100β 75 00 73 20 65 78 70 72 β 65 73 02 73 05 5c 41 20 βuβs exprβesβ’sβ’ \A β
β00000110β 66 6c 6f 77 00 65 72 79 β 20 74 61 6c 65 00 20 6d βflowβeryβ taleβ mβ
β00000120β 6f 72 65 20 73 77 00 65 β 65 74 6c 79 20 74 68 00 βore swβeβetly thββ
β00000130β 61 6e 20 6f 75 72 20 72 β 00 68 79 6d 65 3a 0a 57 βan our rββhyme:_Wβ
β00000140β 68 00 61 74 20 6c 65 61 β 66 2d 00 66 72 69 6e 67 βhβat leaβf-βfringβ
β00000150β 27 64 20 00 6c 65 67 65 β 6e 64 20 68 00 61 75 6e β'd βlegeβnd hβaunβ
β00000160β 74 73 20 61 62 04 6f 75 β 01 66 79 20 73 68 61 04 βts abβ’ouββ’fy shaβ’β
β00000170β 70 65 05 63 4f 66 20 64 β 65 00 69 74 69 65 73 20 βpeβ’cOf dβeβities β
β00000180β 6f 72 41 01 62 74 61 6c β 73 2c 01 0c 6f c0 66 20 βorAβ’btalβs,β’_oΓf β
β00000190β 62 6f 74 68 06 e9 05 01 β 00 49 6e 20 54 65 6d 70 βbothβ’Γβ’β’ββIn Tempβ
β000001a0β 65 01 01 24 74 68 65 20 β 64 61 6c 01 01 3d 66 20 βeβ’β’$the βdalβ’β’=f β
β000001b0β 41 72 63 61 64 8c 79 3f β 05 30 02 91 6d 65 6e 01 βArcadΓy?ββ’0β’Γmenβ’β
β000001c0β 28 00 67 6f 64 73 20 61 β 72 65 11 01 31 73 65 3f β(βgods aβreβ’β’1se?β
Iβve cut it off at 30 lines because we donβt really need to see the whole thing. Again, we see 02 00 00 00
for the row ID, the title as a plain string followed by the jsonb blob then the country ID β again eb 00 00 00
β next we have the 4-byte varlena metadata 2f 1a 00 00
before we get into the data.
Youβll notice that the data looks a little bit different this time β it starts off looking normal but over time more and more of the text turns into gibberish! This is because Postgres has decided that the string is sufficiently long that it needs compressing to fit it into the page. What youβre looking at here is the compressed version of the poem. Compression is a fascinating topic which I could do a whole blog series on in itself. Without going into too much detail, Postgres uses the pglz
compression algorithm which is an implementation of the LZ compression algorithm . This uses a simple history table to refer back to previously seen values instead of repeating them. For instance, the poem starts Thou still...
on the first line and Thou foster-child...
on the second line, but you can see that the second Thou
has been replaced with 03 01 02 31
which encodes the previously seen instance of Thou
, thereby saving 1 byte. (Okay, not that impressive in this case, but it obviously reduces space over the course of the whole poem a lot!)
This explains why the row was only 1,844 bytes long when the poem itself is 2,442 characters.
Finally, letβs look at The Waste Land:
$ print-cw-data 'The Waste Land'
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 03 00 00 00 1f 54 68 65 β 20 57 61 73 74 65 20 4c ββ’ββββ’Theβ Waste Lβ
β00000010β 61 6e 64 db 01 00 00 40 β 64 00 00 d0 03 00 00 20 βandΓβ’ββ@βdββΓβ’ββ β
β00000020β 04 00 00 80 0a 00 00 00 β 0a 00 00 00 1d 00 00 00 ββ’ββΓ_ββββ_ββββ’ββββ
β00000030β 0b 00 00 10 08 00 00 10 β 6e 61 6d 65 62 69 72 74 ββ’βββ’β’βββ’βnamebirtβ
β00000040β 68 5f 79 65 61 72 64 65 β 61 74 68 5f 79 65 61 72 βh_yeardeβath_yearβ
β00000050β 45 6c 69 6f 74 2c 20 54 β 2e 20 53 2e 20 28 54 68 βEliot, Tβ. S. (Thβ
β00000060β 6f 6d 61 73 20 53 74 65 β 61 72 6e 73 29 00 00 00 βomas Steβarns )ββββ
β00000070β 20 00 00 00 00 80 60 07 β 20 00 00 00 00 80 ad 07 β ββββΓ` β’β ββββΓΓβ’β
β00000080β ec 00 00 00 01 12 ed 4e β 00 00 24 2d 00 00 ff 66 βΓββββ’β’ΓNβββ $- ββΓfβ
β00000090β 00 00 d5 61 00 00 β βββΓaββ β β
ββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββ΄βββββββββ
This row is puny by comparison. Itβs got the usual ID, title, authors and country ID (this time ec 00 00 00
= 236 = USA), then 01 12 ed 4e
for the varlena metadata3 , but then instead of the poem itβs just got 14 bytes of gibbberish β whereβs it goneβ½
# Letβs talk about TOAST
Iβve managed to go all the main thus far without actually explaining what anything has got to do with TOAST or even what TOAST is, but now I canβt avoid it.
TOAST = T he O versized-A ttribute S torage T echnique
This is the name that Postgres gives its technique for handling these pesky values that just wonβt fit into their page like a good value. Weβve seen it in action already, we just didnβt know that thatβs what it was called!
These three poems have been carefully selected[^carefully-selected] to showcase the two TOAST tools that Postgres uses: compression and out-of-lining. These tools can be used on many variable length field, not just strings like our poems. For instance, bytea and jsonb values are commonly βTOAST-edβ as well.
What is this βout-of-liningβ, I hear you say? Itβs precisely what it sounds like: Postgres takes the value out of the line that the rest of the row sits in, and puts it a separate relation where it can really think about its actions until its small enough to come back inside the main heap file.
Itβs still being stored somewhere, so how do we track down where itβs snuck off to? This is where we turn to our old friend pg_class
:
blogdb = # select oid , relname, relfilenode, reltoastrelid
blogdb - # from pg_class where relname = 'creative_works' ;
oid | relname | relfilenode | reltoastrelid | relnamespace
-------+----------------+-------------+---------------+--------------
25042 | creative_works | 26735 | 25045 | 2200
Ahah, thereβs a column called reltoastrelid
that points from the main relation to the toast relation β letβs follow it.
blogdb = # select oid , relname, relfilenode, reltoastrelid, relnamespace
blogdb - # from pg_class where oid = (
blogdb - # select reltoastrelid from pg_class where relname = 'creative_works'
oid | relname | relfilenode | reltoastrelid | relnamespace
-------+----------------+-------------+---------------+--------------
25045 | pg_toast_25042 | 26738 | 0 | 99
You can see that the name of the toast table is just pg_toast_{main table oid}
.
You might be thinking βIβve never seen this pg_toast_25042
table before in my schema, where is it hiding? Notice that the relnamespace
for the main table and the toast table is different β this means that the TOAST table is in a different schema. We find can out the name of the schema by looking at pg_catalog.pg_namespace
:
blogdb = # select * from pg_catalog . pg_namespace ;
oid | nspname | nspowner | nspacl
-------+--------------------+----------+---------------------------------------------------------------
11 | pg_catalog | 10 | {postgres = UC / postgres, = U / postgres}
2200 | public | 6171 | {pg_database_owner = UC / pg_database_owner, = U / pg_database_owner}
13212 | information_schema | 10 | {postgres = UC / postgres, = U / postgres}
26360 | pg_toast_temp_3 | 10 |
So the main table is in the default public
schema (2200) while the pg_toast_25042
table is in the pg_toast
schema (99). Now that we know that, we can go back to our friend pageinspect
to dig into the toast table:
blogdb = # select lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid
blogdb - # from heap_page_items(get_raw_page( 'pg_toast.pg_toast_25042' , 0 ));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 6160 | 1 | 2032 | 1136 | 0 | 5 | ( 0 , 1 ) | 3 | 2050 | 24 | |
2 | 4128 | 1 | 2032 | 1136 | 0 | 5 | ( 0 , 2 ) | 3 | 2050 | 24 | |
3 | 2096 | 1 | 2032 | 1136 | 0 | 5 | ( 0 , 3 ) | 3 | 2050 | 24 | |
4 | 64 | 1 | 2032 | 1136 | 0 | 5 | ( 0 , 4 ) | 3 | 2050 | 24 | |
blogdb = # select lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid
blogdb - # from heap_page_items(get_raw_page( 'pg_toast.pg_toast_25042' , 1 ));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 6160 | 1 | 2032 | 1136 | 0 | 5 | ( 1 , 1 ) | 3 | 2050 | 24 | |
2 | 4544 | 1 | 1612 | 1136 | 0 | 5 | ( 1 , 2 ) | 3 | 2050 | 24 | |
Not only has our poem been split into multiple pages, itβs been split into multiple tuples within each page!
In our case, weβve only got one out-of-lined value, so linking the attribute to the toast tuples is fairly simple, but in general this isnβt the case. So how do we actually link the data in the main table tuple and the out-of-lined data in the toast table?
# Use the source, Luke
When you get to this level of detail, itβs easier to just read the source code rather than trying to cobble together bits of information from other sources, so letβs take a quick peek at the Postgres source code for this stuff.
PostgreSQL source code as reproduced here is copyrighted by PostgreSQL Global Development Group under the PostgreSQL License .
The function that βde-TOASTsβ data is detoast_attr()
in the src/backend/access/common/detoast.c
file. Itβs not that long, but weβre only thinking about the first if block at the moment:
/*-------------------------------------------------------------------------
* Retrieve compressed or external variable size attributes.
* Copyright (c) 2000-2024, PostgreSQL Global Development Group
* src/backend/access/common/detoast.c
*-------------------------------------------------------------------------
#include "access/detoast.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "common/pg_lzcompress.h"
#include "utils/expandeddatum.h"
static struct varlena * toast_fetch_datum ( struct varlena * attr );
static struct varlena * toast_fetch_datum_slice ( struct varlena * attr ,
static struct varlena * toast_decompress_datum ( struct varlena * attr );
static struct varlena * toast_decompress_datum_slice ( struct varlena * attr , int32 slicelength );
* detoast_external_attr -
* Public entry point to get back a toasted value from
* external source (possibly still in compressed format).
* This will return a datum that contains all the data internally, ie, not
* relying on external storage or memory, but it can still be compressed or
* have a short header. Note some callers assume that if the input is an
* EXTERNAL datum, the result will be a pfree'able chunk.
detoast_external_attr ( struct varlena * attr )
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
* This is an external stored plain value
result = toast_fetch_datum (attr);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
* This is an indirect pointer --- dereference it
struct varatt_indirect redirect;
VARATT_EXTERNAL_GET_POINTER (redirect, attr);
attr = ( struct varlena * ) redirect.pointer;
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (attr));
/* recurse if value is still external in some other way */
if ( VARATT_IS_EXTERNAL (attr))
return detoast_external_attr (attr);
* Copy into the caller's memory context, in case caller tries to
result = ( struct varlena * ) palloc ( VARSIZE_ANY (attr));
memcpy (result, attr, VARSIZE_ANY (attr));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
* This is an expanded-object pointer --- get flat format
ExpandedObjectHeader * eoh;
eoh = DatumGetEOHP ( PointerGetDatum (attr));
resultsize = EOH_get_flat_size (eoh);
result = ( struct varlena * ) palloc (resultsize);
EOH_flatten_into (eoh, ( void * ) result, resultsize);
* This is a plain value inside of the main tuple - why am I called?
* Public entry point to get back a toasted value from compression
* or external storage. The result is always non-extended varlena form.
* Note some callers assume that if the input is an EXTERNAL or COMPRESSED
* datum, the result will be a pfree'able chunk.
detoast_attr ( struct varlena * attr )
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
* This is an externally stored datum --- fetch it back from there
attr = toast_fetch_datum (attr);
/* If it's compressed, decompress it */
if ( VARATT_IS_COMPRESSED (attr))
struct varlena * tmp = attr;
attr = toast_decompress_datum (tmp);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
* This is an indirect pointer --- dereference it
struct varatt_indirect redirect;
VARATT_EXTERNAL_GET_POINTER (redirect, attr);
attr = ( struct varlena * ) redirect.pointer;
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (attr));
/* recurse in case value is still extended in some other way */
attr = detoast_attr (attr);
/* if it isn't, we'd better copy it */
if (attr == ( struct varlena * ) redirect.pointer)
result = ( struct varlena * ) palloc ( VARSIZE_ANY (attr));
memcpy (result, attr, VARSIZE_ANY (attr));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
* This is an expanded-object pointer --- get flat format
attr = detoast_external_attr (attr);
/* flatteners are not allowed to produce compressed/short output */
Assert ( ! VARATT_IS_EXTENDED (attr));
else if ( VARATT_IS_COMPRESSED (attr))
* This is a compressed value inside of the main tuple
attr = toast_decompress_datum (attr);
else if ( VARATT_IS_SHORT (attr))
* This is a short-header varlena --- convert to 4-byte header format
Size data_size = VARSIZE_SHORT (attr) - VARHDRSZ_SHORT;
Size new_size = data_size + VARHDRSZ;
struct varlena * new_attr;
new_attr = ( struct varlena * ) palloc (new_size);
SET_VARSIZE (new_attr, new_size);
memcpy ( VARDATA (new_attr), VARDATA_SHORT (attr), data_size);
* Public entry point to get back part of a toasted value
* from compression or external storage.
* sliceoffset is where to start (zero or more)
* If slicelength < 0, return everything beyond sliceoffset
detoast_attr_slice ( struct varlena * attr ,
int32 sliceoffset , int32 slicelength )
struct varlena * preslice;
elog (ERROR, "invalid sliceoffset: %d " , sliceoffset);
* Compute slicelimit = offset + length, or -1 if we must fetch all of the
* value. In case of integer overflow, we must fetch all.
else if ( pg_add_s32_overflow (sliceoffset, slicelength, & slicelimit))
slicelength = slicelimit = - 1 ;
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
struct varatt_external toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
/* fast path for non-compressed external datums */
if ( ! VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer))
return toast_fetch_datum_slice (attr, sliceoffset, slicelength);
* For compressed values, we need to fetch enough slices to decompress
* at least the requested part (when a prefix is requested).
* Otherwise, just fetch all slices.
int32 max_size = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
* Determine maximum amount of compressed data needed for a prefix
* of a given length (after decompression).
* At least for now, if it's LZ4 data, we'll have to fetch the
* whole thing, because there doesn't seem to be an API call to
* determine how much compressed data we need to be sure of being
* able to decompress the required slice.
if ( VARATT_EXTERNAL_GET_COMPRESS_METHOD (toast_pointer) ==
TOAST_PGLZ_COMPRESSION_ID)
max_size = pglz_maximum_compressed_size (slicelimit, max_size);
* Fetch enough compressed slices (compressed marker will get set
preslice = toast_fetch_datum_slice (attr, 0 , max_size);
preslice = toast_fetch_datum (attr);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
struct varatt_indirect redirect;
VARATT_EXTERNAL_GET_POINTER (redirect, attr);
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (redirect.pointer));
return detoast_attr_slice (redirect.pointer,
sliceoffset, slicelength);
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
/* pass it off to detoast_external_attr to flatten */
preslice = detoast_external_attr (attr);
Assert ( ! VARATT_IS_EXTERNAL (preslice));
if ( VARATT_IS_COMPRESSED (preslice))
struct varlena * tmp = preslice;
/* Decompress enough to encompass the slice and the offset */
preslice = toast_decompress_datum_slice (tmp, slicelimit);
preslice = toast_decompress_datum (tmp);
if ( VARATT_IS_SHORT (preslice))
attrdata = VARDATA_SHORT (preslice);
attrsize = VARSIZE_SHORT (preslice) - VARHDRSZ_SHORT;
attrdata = VARDATA (preslice);
attrsize = VARSIZE (preslice) - VARHDRSZ;
/* slicing of datum for compressed cases and plain value */
if (sliceoffset >= attrsize)
else if (slicelength < 0 || slicelimit > attrsize)
slicelength = attrsize - sliceoffset;
result = ( struct varlena * ) palloc (slicelength + VARHDRSZ);
SET_VARSIZE (result, slicelength + VARHDRSZ);
memcpy ( VARDATA (result), attrdata + sliceoffset, slicelength);
* Reconstruct an in memory Datum from the chunks saved
toast_fetch_datum ( struct varlena * attr )
struct varatt_external toast_pointer;
if ( ! VARATT_IS_EXTERNAL_ONDISK (attr))
elog (ERROR, "toast_fetch_datum shouldn't be called for non-ondisk datums" );
/* Must copy to access aligned fields */
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
attrsize = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
result = ( struct varlena * ) palloc (attrsize + VARHDRSZ);
if ( VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer))
SET_VARSIZE_COMPRESSED (result, attrsize + VARHDRSZ);
SET_VARSIZE (result, attrsize + VARHDRSZ);
return result; /* Probably shouldn't happen, but just in
* Open the toast relation and its indexes
toastrel = table_open (toast_pointer.va_toastrelid, AccessShareLock);
table_relation_fetch_toast_slice (toastrel, toast_pointer.va_valueid,
attrsize, 0 , attrsize, result);
table_close (toastrel, AccessShareLock);
* toast_fetch_datum_slice -
* Reconstruct a segment of a Datum from the chunks saved
* Note that this function supports non-compressed external datums
* and compressed external datums (in which case the requested slice
* has to be a prefix, i.e. sliceoffset has to be 0).
toast_fetch_datum_slice ( struct varlena * attr , int32 sliceoffset ,
struct varatt_external toast_pointer;
if ( ! VARATT_IS_EXTERNAL_ONDISK (attr))
elog (ERROR, "toast_fetch_datum_slice shouldn't be called for non-ondisk datums" );
/* Must copy to access aligned fields */
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
* It's nonsense to fetch slices of a compressed datum unless when it's a
* prefix -- this isn't lo_* we can't return a compressed datum which is
* meaningful to toast later.
Assert ( ! VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer) || 0 == sliceoffset);
attrsize = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
if (sliceoffset >= attrsize)
* When fetching a prefix of a compressed external datum, account for the
* space required by va_tcinfo, which is stored at the beginning as an
if ( VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer) && slicelength > 0 )
slicelength = slicelength + sizeof (int32);
* Adjust length request if needed. (Note: our sole caller,
* detoast_attr_slice, protects us against sliceoffset + slicelength
if (((sliceoffset + slicelength) > attrsize) || slicelength < 0 )
slicelength = attrsize - sliceoffset;
result = ( struct varlena * ) palloc (slicelength + VARHDRSZ);
if ( VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer))
SET_VARSIZE_COMPRESSED (result, slicelength + VARHDRSZ);
SET_VARSIZE (result, slicelength + VARHDRSZ);
return result; /* Can save a lot of work at this point! */
/* Open the toast relation */
toastrel = table_open (toast_pointer.va_toastrelid, AccessShareLock);
table_relation_fetch_toast_slice (toastrel, toast_pointer.va_valueid,
attrsize, sliceoffset, slicelength,
table_close (toastrel, AccessShareLock);
* toast_decompress_datum -
* Decompress a compressed version of a varlena datum
toast_decompress_datum ( struct varlena * attr )
Assert ( VARATT_IS_COMPRESSED (attr));
* Fetch the compression method id stored in the compression header and
* decompress the data using the appropriate decompression routine.
cmid = TOAST_COMPRESS_METHOD (attr);
case TOAST_PGLZ_COMPRESSION_ID:
return pglz_decompress_datum (attr);
case TOAST_LZ4_COMPRESSION_ID:
return lz4_decompress_datum (attr);
elog (ERROR, "invalid compression method id %d " , cmid);
return NULL ; /* keep compiler quiet */
* toast_decompress_datum_slice -
* Decompress the front of a compressed version of a varlena datum.
* offset handling happens in detoast_attr_slice.
* Here we just decompress a slice from the front.
toast_decompress_datum_slice ( struct varlena * attr , int32 slicelength )
Assert ( VARATT_IS_COMPRESSED (attr));
* Some callers may pass a slicelength that's more than the actual
* decompressed size. If so, just decompress normally. This avoids
* possibly allocating a larger-than-necessary result object, and may be
* faster and/or more robust as well. Notably, some versions of liblz4
* have been seen to give wrong results if passed an output size that is
* more than the data's true decompressed size.
if ((uint32) slicelength >= TOAST_COMPRESS_EXTSIZE (attr))
return toast_decompress_datum (attr);
* Fetch the compression method id stored in the compression header and
* decompress the data slice using the appropriate decompression routine.
cmid = TOAST_COMPRESS_METHOD (attr);
case TOAST_PGLZ_COMPRESSION_ID:
return pglz_decompress_datum_slice (attr, slicelength);
case TOAST_LZ4_COMPRESSION_ID:
return lz4_decompress_datum_slice (attr, slicelength);
elog (ERROR, "invalid compression method id %d " , cmid);
return NULL ; /* keep compiler quiet */
* Return the raw (detoasted) size of a varlena datum
* (including the VARHDRSZ header)
toast_raw_datum_size (Datum value )
struct varlena * attr = ( struct varlena * ) DatumGetPointer (value);
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
/* va_rawsize is the size of the original datum -- including header */
struct varatt_external toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
result = toast_pointer.va_rawsize;
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
struct varatt_indirect toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (toast_pointer.pointer));
return toast_raw_datum_size ( PointerGetDatum (toast_pointer.pointer));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
result = EOH_get_flat_size ( DatumGetEOHP (value));
else if ( VARATT_IS_COMPRESSED (attr))
/* here, va_rawsize is just the payload size */
result = VARDATA_COMPRESSED_GET_EXTSIZE (attr) + VARHDRSZ;
else if ( VARATT_IS_SHORT (attr))
* we have to normalize the header length to VARHDRSZ or else the
* callers of this function will be confused.
result = VARSIZE_SHORT (attr) - VARHDRSZ_SHORT + VARHDRSZ;
/* plain untoasted datum */
* Return the physical storage size (possibly compressed) of a varlena datum
toast_datum_size (Datum value )
struct varlena * attr = ( struct varlena * ) DatumGetPointer (value);
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
* Attribute is stored externally - return the extsize whether
* compressed or not. We do not count the size of the toast pointer
struct varatt_external toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
result = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
struct varatt_indirect toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (attr));
return toast_datum_size ( PointerGetDatum (toast_pointer.pointer));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
result = EOH_get_flat_size ( DatumGetEOHP (value));
else if ( VARATT_IS_SHORT (attr))
result = VARSIZE_SHORT (attr);
* Attribute is stored inline either compressed or not, just calculate
* the size of the datum in either case.
The rest of the function goes on to check the other TOAST cases which arenβt particularly relevant to us right now. The important thing is that VARATT_IS_EXTERNAL_ONDISK()
is a macro that tells you whether a varlena struct refers to an out-of-line (a.k.a. external) value stored on disk (i.e. in a TOAST table). You can also see that the out-of-line attributes can also be compressed within the external TOAST table.
Remember that varlena
is the struct that holds the main table data for a variable length attribute β in our case thatβs the 18 bytes of gibberish we saw earlier. To understand these weird 18 bytes and to understand varlena more broadly, we need to look at the src/include/varatt.h
file in the Postgres codebase β Iβve highlighted the most important bits and collapsed the bits we donβt really care about:
/*-------------------------------------------------------------------------
* variable-length datatypes (TOAST support)
* Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
* Portions Copyright (c) 1995, Regents of the University of California
*-------------------------------------------------------------------------
* struct varatt_external is a traditional "TOAST pointer", that is, the
* information needed to fetch a Datum stored out-of-line in a TOAST table.
* The data is compressed if and only if the external size stored in
* va_extinfo is less than va_rawsize - VARHDRSZ.
* This struct must not contain any padding, because we sometimes compare
* these pointers using memcmp.
* Note that this information is stored unaligned within actual tuples, so
* you need to memcpy from the tuple into a local struct variable before
* you can look at these fields! (The reason we use memcmp is to avoid
* having to do that just to detect equality of two TOAST pointers...)
typedef struct varatt_external
int32 va_rawsize; /* Original data size (includes header) */
uint32 va_extinfo; /* External saved size (without header) and
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */
* These macros define the "saved size" portion of va_extinfo. Its remaining
* two high-order bits identify the compression method.
#define VARLENA_EXTSIZE_BITS 30
#define VARLENA_EXTSIZE_MASK (( 1 U << VARLENA_EXTSIZE_BITS) - 1 )
* struct varatt_indirect is a "TOAST pointer" representing an out-of-line
* Datum that's stored in memory, not in an external toast relation.
* The creator of such a Datum is entirely responsible that the referenced
* storage survives for as long as referencing pointer Datums can exist.
* Note that just as for struct varatt_external, this struct is stored
* unaligned within any containing tuple.
typedef struct varatt_indirect
struct varlena * pointer; /* Pointer to in-memory varlena */
* struct varatt_expanded is a "TOAST pointer" representing an out-of-line
* Datum that is stored in memory, in some type-specific, not necessarily
* physically contiguous format that is convenient for computation not
* storage. APIs for this, in particular the definition of struct
* ExpandedObjectHeader, are in src/include/utils/expandeddatum.h.
* Note that just as for struct varatt_external, this struct is stored
* unaligned within any containing tuple.
typedef struct ExpandedObjectHeader ExpandedObjectHeader;
typedef struct varatt_expanded
ExpandedObjectHeader * eohptr;
* Type tag for the various sorts of "TOAST pointer" datums. The peculiar
* value for VARTAG_ONDISK comes from a requirement for on-disk compatibility
* with a previous notion that the tag field was the pointer datum's length.
typedef enum vartag_external
/* this test relies on the specific tag values above */
#define VARTAG_IS_EXPANDED ( tag ) \
(((tag) & ~ 1 ) == VARTAG_EXPANDED_RO)
#define VARTAG_SIZE ( tag ) \
((tag) == VARTAG_INDIRECT ? sizeof (varatt_indirect) : \
VARTAG_IS_EXPANDED (tag) ? sizeof (varatt_expanded) : \
(tag) == VARTAG_ONDISK ? sizeof (varatt_external) : \
* These structs describe the header of a varlena object that may have been
* TOASTed. Generally, don't reference these structs directly, but use the
* We use separate structs for the aligned and unaligned cases because the
* compiler might otherwise think it could generate code that assumes
* alignment while touching fields of a 1-byte-header varlena.
struct /* Normal varlena (4-byte length) */
char va_data [FLEXIBLE_ARRAY_MEMBER];
struct /* Compressed-in-line format */
uint32 va_tcinfo; /* Original data size (excludes header) and
* compression method; see va_extinfo */
char va_data [FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
char va_data [FLEXIBLE_ARRAY_MEMBER]; /* Data begins here */
/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
uint8 va_header; /* Always 0x80 or 0x01 */
uint8 va_tag; /* Type of datum */
char va_data [FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
* Bit layouts for varlena headers on big-endian machines:
* 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
* 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)
* 10000000 1-byte length word, unaligned, TOAST pointer
* 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)
* Bit layouts for varlena headers on little-endian machines:
* xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)
* xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)
* 00000001 1-byte length word, unaligned, TOAST pointer
* xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
* The "xxx" bits are the length field (which includes itself in all cases).
* In the big-endian case we mask to extract the length, in the little-endian
* case we shift. Note that in both cases the flag bits are in the physically
* first byte. Also, it is not possible for a 1-byte length word to be zero;
* this lets us disambiguate alignment padding bytes from the start of an
* unaligned datum. (We now *require* pad bytes to be filled with zero!)
* In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
* the specific type and length of the pointer datum.
* Endian-dependent macros. These are considered internal --- use the
* external macros below instead of using these directly.
* Note: IS_1B is true for external toast records but VARSIZE_1B will return 0
* for such records. Hence you should usually check for IS_EXTERNAL before
#define VARATT_IS_4B ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x 80 ) == 0x 00 )
#define VARATT_IS_4B_U ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x C0 ) == 0x 00 )
#define VARATT_IS_4B_C ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x C0 ) == 0x 40 )
#define VARATT_IS_1B ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x 80 ) == 0x 80 )
#define VARATT_IS_1B_E ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header) == 0x 80 )
#define VARATT_NOT_PAD_BYTE ( PTR ) \
( * ((uint8 * ) (PTR)) != 0 )
/* VARSIZE_4B() should only be used on known-aligned data */
#define VARSIZE_4B ( PTR ) \
(((varattrib_4b * ) (PTR))->va_4byte.va_header & 0x 3FFFFFFF )
#define VARSIZE_1B ( PTR ) \
(((varattrib_1b * ) (PTR))->va_header & 0x 7F )
#define VARTAG_1B_E ( PTR ) \
(((varattrib_1b_e * ) (PTR))->va_tag)
#define SET_VARSIZE_4B ( PTR , len ) \
(((varattrib_4b * ) (PTR))->va_4byte.va_header = (len) & 0x 3FFFFFFF )
#define SET_VARSIZE_4B_C ( PTR , len ) \
(((varattrib_4b * ) (PTR))->va_4byte.va_header = ((len) & 0x 3FFFFFFF ) | 0x 40000000 )
#define SET_VARSIZE_1B ( PTR , len ) \
(((varattrib_1b * ) (PTR))->va_header = (len) | 0x 80 )
#define SET_VARTAG_1B_E ( PTR , tag ) \
(((varattrib_1b_e * ) (PTR))->va_header = 0x 80 , \
((varattrib_1b_e * ) (PTR))->va_tag = (tag))
#else /* !WORDS_BIGENDIAN */
#define VARATT_IS_4B ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x 01 ) == 0x 00 )
#define VARATT_IS_4B_U ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x 03 ) == 0x 00 )
#define VARATT_IS_4B_C ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x 03 ) == 0x 02 )
#define VARATT_IS_1B ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header & 0x 01 ) == 0x 01 )
#define VARATT_IS_1B_E ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header) == 0x 01 )
#define VARATT_NOT_PAD_BYTE ( PTR ) \
( * ((uint8 * ) (PTR)) != 0 )
/* VARSIZE_4B() should only be used on known-aligned data */
#define VARSIZE_4B ( PTR ) \
((((varattrib_4b * ) (PTR))->va_4byte.va_header >> 2 ) & 0x 3FFFFFFF )
#define VARSIZE_1B ( PTR ) \
((((varattrib_1b * ) (PTR))->va_header >> 1 ) & 0x 7F )
#define VARTAG_1B_E ( PTR ) \
(((varattrib_1b_e * ) (PTR))->va_tag)
#define SET_VARSIZE_4B ( PTR , len ) \
(((varattrib_4b * ) (PTR))->va_4byte.va_header = (((uint32) (len)) << 2 ))
#define SET_VARSIZE_4B_C ( PTR , len ) \
(((varattrib_4b * ) (PTR))->va_4byte.va_header = (((uint32) (len)) << 2 ) | 0x 02 )
#define SET_VARSIZE_1B ( PTR , len ) \
(((varattrib_1b * ) (PTR))->va_header = (((uint8) (len)) << 1 ) | 0x 01 )
#define SET_VARTAG_1B_E ( PTR , tag ) \
(((varattrib_1b_e * ) (PTR))->va_header = 0x 01 , \
((varattrib_1b_e * ) (PTR))->va_tag = (tag))
#endif /* WORDS_BIGENDIAN */
#define VARDATA_4B ( PTR ) (((varattrib_4b * ) (PTR))->va_4byte.va_data)
#define VARDATA_4B_C ( PTR ) (((varattrib_4b * ) (PTR))->va_compressed.va_data)
#define VARDATA_1B ( PTR ) (((varattrib_1b * ) (PTR))->va_data)
#define VARDATA_1B_E ( PTR ) (((varattrib_1b_e * ) (PTR))->va_data)
* Externally visible TOAST macros begin here.
#define VARHDRSZ_EXTERNAL offsetof (varattrib_1b_e, va_data)
#define VARHDRSZ_COMPRESSED offsetof (varattrib_4b, va_compressed.va_data)
#define VARHDRSZ_SHORT offsetof (varattrib_1b, va_data)
#define VARATT_SHORT_MAX 0x 7F
#define VARATT_CAN_MAKE_SHORT ( PTR ) \
( VARATT_IS_4B_U (PTR) && \
( VARSIZE (PTR) - VARHDRSZ + VARHDRSZ_SHORT) <= VARATT_SHORT_MAX)
#define VARATT_CONVERTED_SHORT_SIZE ( PTR ) \
( VARSIZE (PTR) - VARHDRSZ + VARHDRSZ_SHORT)
* In consumers oblivious to data alignment, call PG_DETOAST_DATUM_PACKED(),
* VARDATA_ANY(), VARSIZE_ANY() and VARSIZE_ANY_EXHDR(). Elsewhere, call
* PG_DETOAST_DATUM(), VARDATA() and VARSIZE(). Directly fetching an int16,
* int32 or wider field in the struct representing the datum layout requires
* aligned data. memcpy() is alignment-oblivious, as are most operations on
* datatypes, such as text, whose layout struct contains only char fields.
* Code assembling a new datum should call VARDATA() and SET_VARSIZE().
* (Datums begin life untoasted.)
* Other macros here should usually be used only by tuple assembly/disassembly
* code and code that specifically wants to work with still-toasted Datums.
#define VARDATA ( PTR ) VARDATA_4B (PTR)
#define VARSIZE ( PTR ) VARSIZE_4B (PTR)
#define VARSIZE_SHORT ( PTR ) VARSIZE_1B (PTR)
#define VARDATA_SHORT ( PTR ) VARDATA_1B (PTR)
#define VARTAG_EXTERNAL ( PTR ) VARTAG_1B_E (PTR)
#define VARSIZE_EXTERNAL ( PTR ) (VARHDRSZ_EXTERNAL + VARTAG_SIZE ( VARTAG_EXTERNAL (PTR)))
#define VARDATA_EXTERNAL ( PTR ) VARDATA_1B_E (PTR)
#define VARATT_IS_COMPRESSED ( PTR ) VARATT_IS_4B_C (PTR)
#define VARATT_IS_EXTERNAL ( PTR ) VARATT_IS_1B_E (PTR)
#define VARATT_IS_EXTERNAL_ONDISK ( PTR ) \
( VARATT_IS_EXTERNAL (PTR) && VARTAG_EXTERNAL (PTR) == VARTAG_ONDISK)
#define VARATT_IS_EXTERNAL_INDIRECT ( PTR ) \
( VARATT_IS_EXTERNAL (PTR) && VARTAG_EXTERNAL (PTR) == VARTAG_INDIRECT)
#define VARATT_IS_EXTERNAL_EXPANDED_RO ( PTR ) \
( VARATT_IS_EXTERNAL (PTR) && VARTAG_EXTERNAL (PTR) == VARTAG_EXPANDED_RO)
#define VARATT_IS_EXTERNAL_EXPANDED_RW ( PTR ) \
( VARATT_IS_EXTERNAL (PTR) && VARTAG_EXTERNAL (PTR) == VARTAG_EXPANDED_RW)
#define VARATT_IS_EXTERNAL_EXPANDED ( PTR ) \
( VARATT_IS_EXTERNAL (PTR) && VARTAG_IS_EXPANDED ( VARTAG_EXTERNAL (PTR)))
#define VARATT_IS_EXTERNAL_NON_EXPANDED ( PTR ) \
( VARATT_IS_EXTERNAL (PTR) && ! VARTAG_IS_EXPANDED ( VARTAG_EXTERNAL (PTR)))
#define VARATT_IS_SHORT ( PTR ) VARATT_IS_1B (PTR)
#define VARATT_IS_EXTENDED ( PTR ) ( ! VARATT_IS_4B_U (PTR))
#define SET_VARSIZE ( PTR , len ) SET_VARSIZE_4B (PTR, len)
#define SET_VARSIZE_SHORT ( PTR , len ) SET_VARSIZE_1B (PTR, len)
#define SET_VARSIZE_COMPRESSED ( PTR , len ) SET_VARSIZE_4B_C (PTR, len)
#define SET_VARTAG_EXTERNAL ( PTR , tag ) SET_VARTAG_1B_E (PTR, tag)
#define VARSIZE_ANY ( PTR ) \
( VARATT_IS_1B_E (PTR) ? VARSIZE_EXTERNAL (PTR) : \
( VARATT_IS_1B (PTR) ? VARSIZE_1B (PTR) : \
/* Size of a varlena data, excluding header */
#define VARSIZE_ANY_EXHDR ( PTR ) \
( VARATT_IS_1B_E (PTR) ? VARSIZE_EXTERNAL (PTR) - VARHDRSZ_EXTERNAL : \
( VARATT_IS_1B (PTR) ? VARSIZE_1B (PTR) - VARHDRSZ_SHORT : \
VARSIZE_4B (PTR) - VARHDRSZ))
/* caution: this will not work on an external or compressed-in-line Datum */
/* caution: this will return a possibly unaligned pointer */
#define VARDATA_ANY ( PTR ) \
( VARATT_IS_1B (PTR) ? VARDATA_1B (PTR) : VARDATA_4B (PTR))
/* Decompressed size and compression method of a compressed-in-line Datum */
#define VARDATA_COMPRESSED_GET_EXTSIZE ( PTR ) \
(((varattrib_4b * ) (PTR))->va_compressed.va_tcinfo & VARLENA_EXTSIZE_MASK)
#define VARDATA_COMPRESSED_GET_COMPRESS_METHOD ( PTR ) \
(((varattrib_4b * ) (PTR))->va_compressed.va_tcinfo >> VARLENA_EXTSIZE_BITS)
/* Same for external Datums; but note argument is a struct varatt_external */
#define VARATT_EXTERNAL_GET_EXTSIZE ( toast_pointer ) \
((toast_pointer).va_extinfo & VARLENA_EXTSIZE_MASK)
#define VARATT_EXTERNAL_GET_COMPRESS_METHOD ( toast_pointer ) \
((toast_pointer).va_extinfo >> VARLENA_EXTSIZE_BITS)
#define VARATT_EXTERNAL_SET_SIZE_AND_COMPRESS_METHOD ( toast_pointer , len , cm ) \
Assert ((cm) == TOAST_PGLZ_COMPRESSION_ID || \
(cm) == TOAST_LZ4_COMPRESSION_ID); \
((toast_pointer).va_extinfo = \
(len) | ((uint32) (cm) << VARLENA_EXTSIZE_BITS)); \
* Testing whether an externally-stored value is compressed now requires
* comparing size stored in va_extinfo (the actual length of the external data)
* to rawsize (the original uncompressed datum's size). The latter includes
* VARHDRSZ overhead, the former doesn't. We never use compression unless it
* actually saves space, so we expect either equality or less-than.
#define VARATT_EXTERNAL_IS_COMPRESSED ( toast_pointer ) \
( VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer) < \
(toast_pointer).va_rawsize - VARHDRSZ)
Thereβs quite a lot going on here, especially if youβre not too familiar with C, so letβs break it down a bit.
First, letβs take a look at the different structs weβve got here. There are essentially 4 different possible structures that the varlena can have:
# 1. varattrib_4b.va_4byte
This is the normal 4-byte header thatβs used when the value is not long enough to be compressed or out-of-lined. The header itself contains the length of the data within the 4-byte header, but as weβll see in varattrib_4b.va_compressed
and varattrib_1b
, the first bit of the varlena header is used to hold a flag indicating whether the data uses the varattrib_1b
or varattrib_1b_e
formats and the second bit is used to hold a flag indicating whether the data is compressed, so the actual length is stored in the other 30 bits.
This means that the maximum length we can have is 2 30 β 1 = ( 1 βͺ 30 ) β 1 = 0 x 3 f f f f f f = 1 , 073 , 741 , 823 2^{30} - 1 = (1 \ll 30) - 1 = \mathtt{0x3ffffff} = 1,073,741,823 2 30 β 1 = ( 1 βͺ 30 ) β 1 = 0x3ffffff = 1 , 073 , 741 , 823 which is probably big enough given the default page size is 8,192 bytes.
# 2. varattrib_4b.va_compressed
This is an 8-byte header used when the value is big enough to need compressing, but not big enough to need out-of-lining. This struct is the same as the 4-byte header but adds an extra uint32 to store information about the compression.
Note: for people unfamiliar with union
in C, this is basically saying that the struct varattrib_4b
can refer to either the struct va_4byte
or va_compressed
within the same bit of memory. Itβs not immediately clear to me why these two structs are in a union but the other two arenβt. Based on the comment above these structs, I presume itβs because the varattrib_4b
union members are aligned as they consist of uint32
fields, whereas the varattrib_1b
and varattrib_1b_e
structs are not aligned because they have one or two uint8
fields inside them. Iβm not entirely certain on this, though, so if any C experts want to chime in, feel free.
Also, while weβre at it, it seems silly that va_compressed
is part of varattrib_4b
when the header is 8 bytes long, not 4 bytes as the name implies, but I guess Iβve never written a database used by millions so I canβt really criticise.
# 3. varattrib_1b
This is referred to as a βshort-headerβ varlena. Itβs essentially the same as the va_4byte
variant but it is packed into a single byte instead of 4 bytes.
The first bit is used to indicate whether it is external or not while the other 7 bits contain the length of the data. This means that the varattrib_1b
non-external struct can only be used to store values less than 0x7f = 127 bytes in length.
For an example of this, we can look at the bytes for the title of the poem βOzymandiasβ, which looks like 17 4f 7a 79 6d 61 6e 64 69 61 73
= \x17 O z y m a n d i a s
. In this case, the header is 0x17 but remember that the first bit is used for the βexternalβ flag, so we need to shift it down one bit to get the actual size value: 0x17 << 1 = 0xb
= 11. The title of the poem, βOzymandiasβ, is 10 characters long so the length here is 11 including the 1-byte header itself.
# 4. varattrib_1b_e
This extends the varattrib_1b
struct to add an additional byte, called the vartag
, to indicate what type of TOASTed data we are looking at. You can tell whether a varlena uses varattrib_1b
or varattrib_1b_e
by checking whether the first byte is exactly equal to 0x01
, meaning the βlengthβ part of the first byte is zero. This would be 0x80
on big-endian systems and 0x01
on little-endian systems.
The enum for this data type, called vartag_external
, has 4 options:
indirect,
expanded (read-only),
expanded (read-write) and
on-disk.
Weβre talking about the βon-diskβ variant while the other types of external TOAST data refer to where the TOAST pointer refers to a value in the memory of the server process. For more details on indirect and expanded, check out PostgreSQL Documentation: 73.2.2. Out-of-Line, In-Memory TOAST Storage .
To help visualise these, hereβs a handy flow diagram:
#
Now that we understand what structs the variable length attributes can correspond to, we can reinterpret the headers for our poems:
# βOzymandiasβ header β 04 0a 00 00
First things first, we can see that the first byte is 0x04
= 0b00000100
so neither of the first two bits are set. This means that the value is in-line and uncompressed which means weβre looking at the varattrib_4b.va_4byte
struct.
We can then get the length of the poem in bytes by doing (header >> 2) - header size = (0x0a04 >> 2) - 4 = 0x281 - 4 = 641 - 4 = 637
.
If youβve got a good memory and eye for numbers, you might be thinking βwhy is this giving 637 when the length earlier gave 631β? The answer is that length()
does not give you the size of the string in bytes but the length of the string in characters. In our poem we have three non-ASCII characters β U+2014 β
(UTF-8 e2 80 94
), U+201C β
(UTF-8 value e2 80 9c
) and U+201D β
(UTF-8 value e2 80 9d
) which count as 3 characters but total 9 bytes, making up the 6 byte difference.
#
Letβs look at the first byte again: 0xf2
= 0b11110010
. This has the first bit unset meaning it is not using the 1b format but the second bit is set, meaning the value is compressed. Taking (0x1af2 >> 2) - 8 = 0x6B4 = 1716
gives the size in bytes.
The second 4-byte value 0x098c
gives us the decompressed size and compression method. The first 30 bits are used for the uncompressed size while the last 2 bits are used for the compression method. To get the uncompressed size, we need to bitwise-and 0x098c
with a the bitmask (1 << 30) - 1
, i.e. a 32-bit number with 1 in the first 30 places and 0 in the last 2 places: 0x098c & ((1 << 30) - 1) = 0x098c & 0x3fffffff = 0x98c
so the uncompressed size is 2,444 bytes. The poem is 2,442 characters long so that makes sense β thereβs probably an em dash in there somewhere4 .
There is also a single null byte before the actual poem starts. Iβm not exactly sure what this is but I presume itβs an artifact of the compression method being used.
# βThe Waste Landβ header - 01 12
This one has a short header. The first byte is simply 0x01
which tells you that this is an external value. Looking at the next byte, the vartag, you see 0x12
= 18 which tells you that this is an external on-disk TOAST pointer.
Weβve made great progress in understanding how Postgres stores variable length attributes, but we still havenβt tracked down the actual TOASTed values from the main-table tuple data. So whatβs the next step?
# Following the white rabbit
Letβs look at the whole 18 bytes for the content
column value in the main table data for βThe Waste Landβ:
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 01 12 ed 4e 00 00 24 2d β 00 00 ff 66 00 00 d5 61 ββ’β’ΓNββ $- βββΓfββΓaβ
β00000010β 00 00 β βββ β β
ββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββ΄βββββββββ
We understand the first 2 bytes now, but thereβs still 16 bytes left that we havenβt explained.
Looking back at the detoast_attr()
function, we can see that the function calls attr = toast_fetch_datum(attr)
to retrieve the data from the external store. The answer must be in there somewhere.
/*-------------------------------------------------------------------------
* Retrieve compressed or external variable size attributes.
* Copyright (c) 2000-2024, PostgreSQL Global Development Group
* src/backend/access/common/detoast.c
*-------------------------------------------------------------------------
#include "access/detoast.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/toast_internals.h"
#include "common/pg_lzcompress.h"
#include "utils/expandeddatum.h"
static struct varlena * toast_fetch_datum ( struct varlena * attr );
static struct varlena * toast_fetch_datum_slice ( struct varlena * attr ,
static struct varlena * toast_decompress_datum ( struct varlena * attr );
static struct varlena * toast_decompress_datum_slice ( struct varlena * attr , int32 slicelength );
* detoast_external_attr -
* Public entry point to get back a toasted value from
* external source (possibly still in compressed format).
* This will return a datum that contains all the data internally, ie, not
* relying on external storage or memory, but it can still be compressed or
* have a short header. Note some callers assume that if the input is an
* EXTERNAL datum, the result will be a pfree'able chunk.
detoast_external_attr ( struct varlena * attr )
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
* This is an external stored plain value
result = toast_fetch_datum (attr);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
* This is an indirect pointer --- dereference it
struct varatt_indirect redirect;
VARATT_EXTERNAL_GET_POINTER (redirect, attr);
attr = ( struct varlena * ) redirect.pointer;
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (attr));
/* recurse if value is still external in some other way */
if ( VARATT_IS_EXTERNAL (attr))
return detoast_external_attr (attr);
* Copy into the caller's memory context, in case caller tries to
result = ( struct varlena * ) palloc ( VARSIZE_ANY (attr));
memcpy (result, attr, VARSIZE_ANY (attr));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
* This is an expanded-object pointer --- get flat format
ExpandedObjectHeader * eoh;
eoh = DatumGetEOHP ( PointerGetDatum (attr));
resultsize = EOH_get_flat_size (eoh);
result = ( struct varlena * ) palloc (resultsize);
EOH_flatten_into (eoh, ( void * ) result, resultsize);
* This is a plain value inside of the main tuple - why am I called?
* Public entry point to get back a toasted value from compression
* or external storage. The result is always non-extended varlena form.
* Note some callers assume that if the input is an EXTERNAL or COMPRESSED
* datum, the result will be a pfree'able chunk.
detoast_attr ( struct varlena * attr )
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
* This is an externally stored datum --- fetch it back from there
attr = toast_fetch_datum (attr);
/* If it's compressed, decompress it */
if ( VARATT_IS_COMPRESSED (attr))
struct varlena * tmp = attr;
attr = toast_decompress_datum (tmp);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
* This is an indirect pointer --- dereference it
struct varatt_indirect redirect;
VARATT_EXTERNAL_GET_POINTER (redirect, attr);
attr = ( struct varlena * ) redirect.pointer;
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (attr));
/* recurse in case value is still extended in some other way */
attr = detoast_attr (attr);
/* if it isn't, we'd better copy it */
if (attr == ( struct varlena * ) redirect.pointer)
result = ( struct varlena * ) palloc ( VARSIZE_ANY (attr));
memcpy (result, attr, VARSIZE_ANY (attr));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
* This is an expanded-object pointer --- get flat format
attr = detoast_external_attr (attr);
/* flatteners are not allowed to produce compressed/short output */
Assert ( ! VARATT_IS_EXTENDED (attr));
else if ( VARATT_IS_COMPRESSED (attr))
* This is a compressed value inside of the main tuple
attr = toast_decompress_datum (attr);
else if ( VARATT_IS_SHORT (attr))
* This is a short-header varlena --- convert to 4-byte header format
Size data_size = VARSIZE_SHORT (attr) - VARHDRSZ_SHORT;
Size new_size = data_size + VARHDRSZ;
struct varlena * new_attr;
new_attr = ( struct varlena * ) palloc (new_size);
SET_VARSIZE (new_attr, new_size);
memcpy ( VARDATA (new_attr), VARDATA_SHORT (attr), data_size);
* Public entry point to get back part of a toasted value
* from compression or external storage.
* sliceoffset is where to start (zero or more)
* If slicelength < 0, return everything beyond sliceoffset
detoast_attr_slice ( struct varlena * attr ,
int32 sliceoffset , int32 slicelength )
struct varlena * preslice;
elog (ERROR, "invalid sliceoffset: %d " , sliceoffset);
* Compute slicelimit = offset + length, or -1 if we must fetch all of the
* value. In case of integer overflow, we must fetch all.
else if ( pg_add_s32_overflow (sliceoffset, slicelength, & slicelimit))
slicelength = slicelimit = - 1 ;
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
struct varatt_external toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
/* fast path for non-compressed external datums */
if ( ! VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer))
return toast_fetch_datum_slice (attr, sliceoffset, slicelength);
* For compressed values, we need to fetch enough slices to decompress
* at least the requested part (when a prefix is requested).
* Otherwise, just fetch all slices.
int32 max_size = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
* Determine maximum amount of compressed data needed for a prefix
* of a given length (after decompression).
* At least for now, if it's LZ4 data, we'll have to fetch the
* whole thing, because there doesn't seem to be an API call to
* determine how much compressed data we need to be sure of being
* able to decompress the required slice.
if ( VARATT_EXTERNAL_GET_COMPRESS_METHOD (toast_pointer) ==
TOAST_PGLZ_COMPRESSION_ID)
max_size = pglz_maximum_compressed_size (slicelimit, max_size);
* Fetch enough compressed slices (compressed marker will get set
preslice = toast_fetch_datum_slice (attr, 0 , max_size);
preslice = toast_fetch_datum (attr);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
struct varatt_indirect redirect;
VARATT_EXTERNAL_GET_POINTER (redirect, attr);
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (redirect.pointer));
return detoast_attr_slice (redirect.pointer,
sliceoffset, slicelength);
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
/* pass it off to detoast_external_attr to flatten */
preslice = detoast_external_attr (attr);
Assert ( ! VARATT_IS_EXTERNAL (preslice));
if ( VARATT_IS_COMPRESSED (preslice))
struct varlena * tmp = preslice;
/* Decompress enough to encompass the slice and the offset */
preslice = toast_decompress_datum_slice (tmp, slicelimit);
preslice = toast_decompress_datum (tmp);
if ( VARATT_IS_SHORT (preslice))
attrdata = VARDATA_SHORT (preslice);
attrsize = VARSIZE_SHORT (preslice) - VARHDRSZ_SHORT;
attrdata = VARDATA (preslice);
attrsize = VARSIZE (preslice) - VARHDRSZ;
/* slicing of datum for compressed cases and plain value */
if (sliceoffset >= attrsize)
else if (slicelength < 0 || slicelimit > attrsize)
slicelength = attrsize - sliceoffset;
result = ( struct varlena * ) palloc (slicelength + VARHDRSZ);
SET_VARSIZE (result, slicelength + VARHDRSZ);
memcpy ( VARDATA (result), attrdata + sliceoffset, slicelength);
* Reconstruct an in memory Datum from the chunks saved
toast_fetch_datum ( struct varlena * attr )
struct varatt_external toast_pointer;
if ( ! VARATT_IS_EXTERNAL_ONDISK (attr))
elog (ERROR, "toast_fetch_datum shouldn't be called for non-ondisk datums" );
/* Must copy to access aligned fields */
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
attrsize = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
result = ( struct varlena * ) palloc (attrsize + VARHDRSZ);
if ( VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer))
SET_VARSIZE_COMPRESSED (result, attrsize + VARHDRSZ);
SET_VARSIZE (result, attrsize + VARHDRSZ);
return result; /* Probably shouldn't happen, but just in
* Open the toast relation and its indexes
toastrel = table_open (toast_pointer.va_toastrelid, AccessShareLock);
table_relation_fetch_toast_slice (toastrel, toast_pointer.va_valueid,
attrsize, 0 , attrsize, result);
table_close (toastrel, AccessShareLock);
* toast_fetch_datum_slice -
* Reconstruct a segment of a Datum from the chunks saved
* Note that this function supports non-compressed external datums
* and compressed external datums (in which case the requested slice
* has to be a prefix, i.e. sliceoffset has to be 0).
toast_fetch_datum_slice ( struct varlena * attr , int32 sliceoffset ,
struct varatt_external toast_pointer;
if ( ! VARATT_IS_EXTERNAL_ONDISK (attr))
elog (ERROR, "toast_fetch_datum_slice shouldn't be called for non-ondisk datums" );
/* Must copy to access aligned fields */
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
* It's nonsense to fetch slices of a compressed datum unless when it's a
* prefix -- this isn't lo_* we can't return a compressed datum which is
* meaningful to toast later.
Assert ( ! VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer) || 0 == sliceoffset);
attrsize = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
if (sliceoffset >= attrsize)
* When fetching a prefix of a compressed external datum, account for the
* space required by va_tcinfo, which is stored at the beginning as an
if ( VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer) && slicelength > 0 )
slicelength = slicelength + sizeof (int32);
* Adjust length request if needed. (Note: our sole caller,
* detoast_attr_slice, protects us against sliceoffset + slicelength
if (((sliceoffset + slicelength) > attrsize) || slicelength < 0 )
slicelength = attrsize - sliceoffset;
result = ( struct varlena * ) palloc (slicelength + VARHDRSZ);
if ( VARATT_EXTERNAL_IS_COMPRESSED (toast_pointer))
SET_VARSIZE_COMPRESSED (result, slicelength + VARHDRSZ);
SET_VARSIZE (result, slicelength + VARHDRSZ);
return result; /* Can save a lot of work at this point! */
/* Open the toast relation */
toastrel = table_open (toast_pointer.va_toastrelid, AccessShareLock);
table_relation_fetch_toast_slice (toastrel, toast_pointer.va_valueid,
attrsize, sliceoffset, slicelength,
table_close (toastrel, AccessShareLock);
* toast_decompress_datum -
* Decompress a compressed version of a varlena datum
toast_decompress_datum ( struct varlena * attr )
Assert ( VARATT_IS_COMPRESSED (attr));
* Fetch the compression method id stored in the compression header and
* decompress the data using the appropriate decompression routine.
cmid = TOAST_COMPRESS_METHOD (attr);
case TOAST_PGLZ_COMPRESSION_ID:
return pglz_decompress_datum (attr);
case TOAST_LZ4_COMPRESSION_ID:
return lz4_decompress_datum (attr);
elog (ERROR, "invalid compression method id %d " , cmid);
return NULL ; /* keep compiler quiet */
* toast_decompress_datum_slice -
* Decompress the front of a compressed version of a varlena datum.
* offset handling happens in detoast_attr_slice.
* Here we just decompress a slice from the front.
toast_decompress_datum_slice ( struct varlena * attr , int32 slicelength )
Assert ( VARATT_IS_COMPRESSED (attr));
* Some callers may pass a slicelength that's more than the actual
* decompressed size. If so, just decompress normally. This avoids
* possibly allocating a larger-than-necessary result object, and may be
* faster and/or more robust as well. Notably, some versions of liblz4
* have been seen to give wrong results if passed an output size that is
* more than the data's true decompressed size.
if ((uint32) slicelength >= TOAST_COMPRESS_EXTSIZE (attr))
return toast_decompress_datum (attr);
* Fetch the compression method id stored in the compression header and
* decompress the data slice using the appropriate decompression routine.
cmid = TOAST_COMPRESS_METHOD (attr);
case TOAST_PGLZ_COMPRESSION_ID:
return pglz_decompress_datum_slice (attr, slicelength);
case TOAST_LZ4_COMPRESSION_ID:
return lz4_decompress_datum_slice (attr, slicelength);
elog (ERROR, "invalid compression method id %d " , cmid);
return NULL ; /* keep compiler quiet */
* Return the raw (detoasted) size of a varlena datum
* (including the VARHDRSZ header)
toast_raw_datum_size (Datum value )
struct varlena * attr = ( struct varlena * ) DatumGetPointer (value);
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
/* va_rawsize is the size of the original datum -- including header */
struct varatt_external toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
result = toast_pointer.va_rawsize;
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
struct varatt_indirect toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (toast_pointer.pointer));
return toast_raw_datum_size ( PointerGetDatum (toast_pointer.pointer));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
result = EOH_get_flat_size ( DatumGetEOHP (value));
else if ( VARATT_IS_COMPRESSED (attr))
/* here, va_rawsize is just the payload size */
result = VARDATA_COMPRESSED_GET_EXTSIZE (attr) + VARHDRSZ;
else if ( VARATT_IS_SHORT (attr))
* we have to normalize the header length to VARHDRSZ or else the
* callers of this function will be confused.
result = VARSIZE_SHORT (attr) - VARHDRSZ_SHORT + VARHDRSZ;
/* plain untoasted datum */
* Return the physical storage size (possibly compressed) of a varlena datum
toast_datum_size (Datum value )
struct varlena * attr = ( struct varlena * ) DatumGetPointer (value);
if ( VARATT_IS_EXTERNAL_ONDISK (attr))
* Attribute is stored externally - return the extsize whether
* compressed or not. We do not count the size of the toast pointer
struct varatt_external toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
result = VARATT_EXTERNAL_GET_EXTSIZE (toast_pointer);
else if ( VARATT_IS_EXTERNAL_INDIRECT (attr))
struct varatt_indirect toast_pointer;
VARATT_EXTERNAL_GET_POINTER (toast_pointer, attr);
/* nested indirect Datums aren't allowed */
Assert ( ! VARATT_IS_EXTERNAL_INDIRECT (attr));
return toast_datum_size ( PointerGetDatum (toast_pointer.pointer));
else if ( VARATT_IS_EXTERNAL_EXPANDED (attr))
result = EOH_get_flat_size ( DatumGetEOHP (value));
else if ( VARATT_IS_SHORT (attr))
result = VARSIZE_SHORT (attr);
* Attribute is stored inline either compressed or not, just calculate
* the size of the datum in either case.
Here we see how Postgres is linking the main table data back to the external TOAST table. Itβs casting the 16 bytes to the struct varatt_external()
and using the fields varatt_external.va_toastrelid
and varatt_external.va_valueid
to pull in the values from the TOAST table.
Weβve already seen this struct, in fact β itβs in the src/include/varatt.h
file we saw earlier. Itβs got 4 fields in total, each one 4 bytes long:
typedef struct varatt_external
int32 va_rawsize; /* Original data size (includes header) */
uint32 va_extinfo; /* External saved size (without header) and
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */
Each of these 4 struct fields is 4 bytes each, making up the 16 bytes left. The va_rawsize
and va_extinfo
arenβt particularly interesting for us at the moment but the final last fields give us what we need. We can see that the ID of the value within the toast table, va_valueid
, is ff 66 00 00
= 26367 while the toast relation ID, va_toastrelid
, is d5 61 00 00
= 25045 β this matches the OID of the pg_toast_25042
we saw earlier5 .
We can follow the va_valueid
through to the TOAST table like so6 :
blogdb = # select ctid, chunk_id, chunk_seq
blogdb - # from pg_toast . pg_toast_25042
blogdb - # where chunk_id = 26367 ;
ctid | chunk_id | chunk_seq
-------+----------+-----------
So then we can see that all the detoast_attr()
function is doing is pulling these chunks together in order according to the chunk_seq
values to get the full poem back out. Letβs write a quick helper to look at this data:
function print-toast-data {
run-and-decode "select chunk_data from pg_toast.pg_toast_25042 where chunk_id = $1 and chunk_seq = $2 "
We can run through the 6 chunks to see the bits and pieces of the poem, still compressed but recognisable:
$ print-toast-data 26367 0 | head -n 20
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β e9 4e 00 00 00 20 20 e2 β 80 9c 4e 61 6d 00 20 53 βΓNβββ ΓβΓΓNamβ Sβ
β00000010β 69 62 79 6c 6c 61 00 6d β 20 71 75 69 64 65 6d 00 βibyllaβmβ quidemββ
β00000020β 20 43 75 6d 69 73 20 65 β 00 67 6f 20 69 70 73 65 β Cumis eββgo ipseβ
β00000030β 20 00 6f 63 75 6c 69 73 β 20 6d 00 65 69 73 0a 20 β βoculisβ mβeis_ β
β00000040β 20 76 69 00 64 69 20 69 β 6e 20 61 6d 00 70 75 6c β viβdi iβn amβpulβ
β00000050β 6c 61 20 70 65 00 6e 64 β 65 72 65 2c 20 65 00 74 βla peβndβere, eβtβ
β00000060β 20 63 75 6d 20 69 6c 00 β 6c 69 20 70 75 65 72 69 β cum ilββli pueriβ
β00000070β 00 20 64 69 63 65 72 65 β 6e 00 74 3a 0a 20 20 ce ββ dicereβnβt:_ Γβ
β00000080β a3 e1 00 bd b7 ce b2 cf β 85 ce bb 00 ce bb ce b1 βΓΓβΓΓΓΓΓβΓΓΓβΓΓΓΓβ
β00000090β 20 cf 84 e1 00 bd b7 20 β ce b8 e1 bd b3 00 ce bb β ΓΓΓβΓΓ βΓΓΓΓΓβΓΓβ
β000000a0β ce b5 ce b9 cf 82 00 3b β 20 72 65 73 70 6f 6e 20 βΓΓΓΓΓΓβ ; β respon β
β000000b0β 64 65 62 61 74 01 48 61 β 3a 00 20 e1 bc 80 cf 80 βdebatβ’Haβ:β ΓΓΓΓΓβ
β000000c0β ce bf 00 ce b8 ce b1 ce β bd ce b5 20 e1 bf 96 ce βΓΓβΓΓΓΓΓβΓΓΓ ΓΓΓΓβ
β000000d0β bd 05 36 cf 89 80 2e e2 β 80 9d 0a 0a 20 05 01 00 βΓβ’6ΓΓΓ.ΓβΓΓ__ β’β’ββ
β000000e0β 5f 46 6f 72 20 45 7a 72 β 80 61 20 50 6f 75 6e 64 β_For EzrβΓa Poundβ
β000000f0β 07 19 00 69 6c 20 6d 69 β 67 6c 69 00 6f 72 20 66 ββ’β’βil miβgliβor fβ
β00000100β 61 62 62 72 18 6f 5f 0a β 01 01 03 1d 49 2e 20 00 βabbrβ’o__ββ’β’β’β’I. ββ
β00000110β 54 48 45 20 42 55 52 49 β 20 41 4c 20 4f 46 02 0e βTHE BURIβ AL OFβ’β’β
β00000120β 44 45 04 41 44 02 22 41 β 70 72 69 6c 00 20 69 73 βDEβ’ADβ’"Aβprilβ isβ
$ print-toast-data 26367 5 | head -n 20
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 10 68 6f 72 64 31 c8 77 β 61 72 06 6d a4 71 21 47 ββ’hord1Γwβarβ’mΓq!Gβ
β00000010β 65 6e 64 6c 65 04 73 73 β 71 5a 69 6e 73 2c 20 cc βendleβ’ssβqZins, Γβ
β00000020β 73 74 a1 fb 62 df 6e 20 β 35 d2 c1 3a 85 81 f1 52 βstΓΓbΓn β5ΓΓ:ΓΓΓRβ
β00000030β b1 3f 64 20 62 79 c4 93 β 84 61 74 01 57 69 7a 6f βΓ?d byΓΓβΓatβ’Wizoβ
β00000040β 6e 13 ce 9d 0f 01 16 33 β f2 d0 07 ed 31 07 74 79 βnβ’ΓΓβ’β’β’3βΓΓβ’Γ1β’tyβ
β00000050β 23 d8 cd 4d 8e 43 01 84 β d3 70 72 65 e1 ae 03 0c β#ΓΓMΓCβ’ΓβΓpreΓΓβ’_β
β00000060β 20 62 75 72 73 74 26 df β 76 69 8c 6f 6c 61 00 12 β burst&ΓβviΓolaββ’β
β00000070β 20 46 61 6c 02 c1 01 e6 β 21 4a 65 72 75 73 61 6c β Falβ’Γβ’Γβ!Jerusalβ
β00000080β 10 65 6d 20 41 b1 96 73 β 20 41 80 6c 65 78 61 6e ββ’em AΓΓsβ AΓlexanβ
β00000090β 64 72 f2 9f 00 56 69 65 β 6e 6e 61 20 4c 08 6f 6e βdrΓΓβVieβnna Lβ’onβ
β000000a0β 64 12 3e 55 6e 72 65 0c β 61 6c 82 dc 13 c8 20 64 βdβ’>Unre_βalΓΓβ’Γ dβ
β000000b0β 72 65 12 77 31 5d 20 6c β 31 ed 62 6c 61 80 63 6b βreβ’w1] lβ1ΓblaΓckβ
β000000c0β 20 68 61 69 72 b1 9b 0c β 20 74 81 f2 34 c0 66 69 β hairΓΓ_β tΓΓ4Γfiβ
β000000d0β 64 64 03 e1 fe a4 65 20 β 6d 75 73 69 63 73 13 f2 βddβ’ΓΓΓe βmusicsβ’Γβ
β000000e0β 11 8a 73 74 41 3d 82 e1 β 01 c2 61 c4 74 73 c3 ac ββ’ΓstA=ΓΓββ’ΓaΓtsΓΓβ
β000000f0β 62 61 62 94 26 0c d0 83 β 92 4e 73 75 73 74 6c 65 βbabΓ&_ΓΓβΓNsustleβ
β00000100β 64 b3 46 c2 62 61 fb 74 β 68 65 69 51 e8 31 e4 49 βdΓFΓbaΓtβheiQΓ1ΓIβ
β00000110β 0f 01 17 33 38 42 c6 6e β 64 11 da 77 5f 01 aa 32 ββ’β’β’38BΓnβdβ’Γw_β’Γ2β
β00000120β 3e f1 7c a1 51 f3 85 61 β 03 e2 65 ca 6e 01 c9 61 β>Γ|ΓQΓΓaββ’ΓeΓnβ’Γaβ
You can see that as this goes on, the compression takes over more and more and the actual data becomes less and less intelligible. From this we can ascertain that the compression is done before the data is split into chunks, which makes sense because it means you can take advantage of the larger history table for the compression. This fits with the detoast_attr()
function we saw earlier, which retrieved the full TOASTed value before decompressing it.
# TOAST-ing strategies
Now that weβve got a good understanding of the different TOAST techniques that Postgres utilises, is there any way we can customise Postgresβ behaviour? In fact, there is!
There are four configuration options which you can select on a per-column basis:
Strategy Explanation plain Prevents compression or out-of-lining. For non-TOAST-able types this is the only possible strategy. extended This is the default strategy that allows both compression and out-of-lining, as deemed appropriate. As we saw above, Postgres tries to compress it first and if itβs still too big, it will out-of-line it. external Allows out-of-lining but not compression. This might seem like a weird choice, but if you choose external, Postgres can actually optimise some substring operations because Postgres knows that it only needs to query specific chunk(s) of the whole TOAST slice instead of the whole thing. If your out-of-line data is compressed, you canβt do that because you need to pull in the whole TOAST slice to be able to decompress it before you can do any substring operations. main Allows compression but not out-of-lining, unless out-of-lining is absolutely necessary to be able to be able to fit the data on disk.
So how do you change this? Letβs try it out.
blogdb = # \d + creative_works
Table "public.creative_works"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
------------+---------+-----------+----------+------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | generated always as identity | plain | | |
title | text | | | | extended | | |
authors | jsonb | | | | extended | | |
country_id | integer | | | | plain | | |
content | text | | | | extended | | |
"creative_works_pkey" PRIMARY KEY , btree (id)
"creative_works_country_id_fkey" FOREIGN KEY (country_id) REFERENCES countries(id)
-- This doesn't actually change anything, it just updates the strategy for future
blogdb = # alter table creative_works alter column content set storage plain;
-- Trigger an update for the "Ode on a Grecian Urn" poem.
blogdb = # update creative_works
blogdb - # set content = concat (content, e '\nThe end' )
blogdb - # where title = 'Ode on a Grecian Urn' ;
Letβs print out the poem data again to see what it looks like:
$ print- cw -data 'Ode on a Grecian Urn' | head - n 30
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β 00000000 β 0d 00 00 00 2b 4f 64 65 β 20 6f 6e 20 61 20 47 72 β_βββ + Odeβ on a Grβ
β 00000010 β 65 63 69 61 6e 20 55 72 β 6e 01 12 5c 00 00 00 58 βecian Urβnβ’β’\βββXβ
β 00000020 β 00 00 00 5d 70 00 00 d5 β 61 00 00 00 eb 00 00 00 ββββ]pββΓβaβββΓββββ
β 00000030 β 6c 26 00 00 54 68 6f 75 β 20 73 74 69 6c 6c 20 75 βl&ββThouβ still uβ
β 00000040 β 6e 72 61 76 69 73 68 27 β 64 20 62 72 69 64 65 20 βnravish 'βd bride β
β00000050β 6f 66 20 71 75 69 65 74 β 6e 65 73 73 2c 0a 20 20 βof quietβness,_ β
β00000060β 20 20 20 20 20 54 68 6f β 75 20 66 6f 73 74 65 72 β Thoβu fosterβ
β00000070β 2d 63 68 69 6c 64 20 6f β 66 20 73 69 6c 65 6e 63 β-child oβf silencβ
β00000080β 65 20 61 6e 64 20 73 6c β 6f 77 20 74 69 6d 65 2c βe and slβow time,β
β00000090β 0a 53 79 6c 76 61 6e 20 β 68 69 73 74 6f 72 69 61 β_Sylvan βhistoriaβ
β000000a0β 6e 2c 20 77 68 6f 20 63 β 61 6e 73 74 20 74 68 75 βn, who cβanst thuβ
β000000b0β 73 20 65 78 70 72 65 73 β 73 0a 20 20 20 20 20 20 βs expresβs_ β
β000000c0β 20 41 20 66 6c 6f 77 65 β 72 79 20 74 61 6c 65 20 β A floweβry tale β
β000000d0β 6d 6f 72 65 20 73 77 65 β 65 74 6c 79 20 74 68 61 βmore sweβetly thaβ
β000000e0β 6e 20 6f 75 72 20 72 68 β 79 6d 65 3a 0a 57 68 61 βn our rhβyme:_Whaβ
β000000f0β 74 20 6c 65 61 66 2d 66 β 72 69 6e 67 27 64 20 6c βt leaf-fβring' d lβ
β 00000100 β 65 67 65 6e 64 20 68 61 β 75 6e 74 73 20 61 62 6f βegend haβunts aboβ
β 00000110 β 75 74 20 74 68 79 20 73 β 68 61 70 65 0a 20 20 20 βut thy sβhape_ β
β 00000120 β 20 20 20 20 4f 66 20 64 β 65 69 74 69 65 73 20 6f β Of dβeities oβ
β 00000130 β 72 20 6d 6f 72 74 61 6c β 73 2c 20 6f 72 20 6f 66 βr mortalβs, or ofβ
β 00000140 β 20 62 6f 74 68 2c 0a 20 β 20 20 20 20 20 20 20 20 β both ,_ β β
β 00000150 β 20 20 20 20 20 20 49 6e β 20 54 65 6d 70 65 20 6f β In β Tempe oβ
β 00000160 β 72 20 74 68 65 20 64 61 β 6c 65 73 20 6f 66 20 41 βr the daβles of Aβ
β 00000170 β 72 63 61 64 79 3f 0a 20 β 20 20 20 20 20 20 57 68 βrcady?_ β Whβ
β 00000180 β 61 74 20 6d 65 6e 20 6f β 72 20 67 6f 64 73 20 61 β at men oβr gods aβ
β 00000190 β 72 65 20 74 68 65 73 65 β 3f 20 57 68 61 74 20 6d βre theseβ? What mβ
β000001a0β 61 69 64 65 6e 73 20 6c β 6f 74 68 3f 0a 57 68 61 βaidens lβoth?_Whaβ
β000001b0β 74 20 6d 61 64 20 70 75 β 72 73 75 69 74 3f 20 57 βt mad puβrsuit? Wβ
β000001c0β 68 61 74 20 73 74 72 75 β 67 67 6c 65 20 74 6f 20 βhat struβggle to β
Itβs no longer compressed! But thereβs a problem β βThe Waste Landβ is too big to fit into the main table, but weβve disallowed the column from being out-of-lined, so what happens if we trigger an update for this poem?
ERROR: row is too big: size 20280 , maximum size 8160
When we trigger a recomputation by doing a full vacuum, Postgres gives us an error telling us that we canβt fit the full poem in a single tuple. Letβs try βexternalβ next:
blogdb = # alter table creative_works alter column content set storage external ;
Letβs take a look at the data:
$ print-cw-data 'Ode on a Grecian Urn'
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 0d 00 00 00 2b 4f 64 65 β 20 6f 6e 20 61 20 47 72 β_βββ+Odeβ on a Grβ
β00000010β 65 63 69 61 6e 20 55 72 β 6e b3 01 00 00 40 50 00 βecian UrβnΓβ’ββ@Pββ
β00000020β 00 d0 03 00 00 20 04 00 β 00 80 0a 00 00 00 0a 00 ββΓβ’ββ β’βββΓ_βββ_ββ
β00000030β 00 00 0b 00 00 00 09 00 β 00 10 08 00 00 10 6e 61 ββββ’βββ_ββββ’β’βββ’naβ
β00000040β 6d 65 62 69 72 74 68 5f β 79 65 61 72 64 65 61 74 βmebirth_βyeardeatβ
β00000050β 68 5f 79 65 61 72 4b 65 β 61 74 73 2c 20 4a 6f 68 βh_yearKeβats, Johβ
β00000060β 6e 00 20 00 00 00 00 80 β 03 07 20 00 00 00 00 80 βnβ ββββΓββ’β’ ββββΓβ
β00000070β 1d 07 00 00 eb 00 00 00 β 01 12 a3 09 00 00 9f 09 ββ’β’ββΓβββββ’β’Γ_ββΓ_β
β00000080β 00 00 54 71 00 00 d5 61 β 00 00 βββTqββΓaβββ β
ββββββββββ΄ββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββ΄βββββββββ
Our poem has now been out-of-lined! By looking at the data, we can see that the chunk ID is 54 71 00 00
= 29012:
$ print-toast-data 29012 0 | head -n 20
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 54 68 6f 75 20 73 74 69 β 6c 6c 20 75 6e 72 61 76 βThou stiβll unravβ
β00000010β 69 73 68 27 64 20 62 72 β 69 64 65 20 6f 66 20 71 βish'd brβide of qβ
β00000020β 75 69 65 74 6e 65 73 73 β 2c 0a 20 20 20 20 20 20 βuietnessβ,_ β
β00000030β 20 54 68 6f 75 20 66 6f β 73 74 65 72 2d 63 68 69 β Thou foβster-chiβ
β00000040β 6c 64 20 6f 66 20 73 69 β 6c 65 6e 63 65 20 61 6e βld of siβlence anβ
β00000050β 64 20 73 6c 6f 77 20 74 β 69 6d 65 2c 0a 53 79 6c βd slow tβime,_Sylβ
β00000060β 76 61 6e 20 68 69 73 74 β 6f 72 69 61 6e 2c 20 77 βvan histβorian, wβ
β00000070β 68 6f 20 63 61 6e 73 74 β 20 74 68 75 73 20 65 78 βho canstβ thus exβ
β00000080β 70 72 65 73 73 0a 20 20 β 20 20 20 20 20 41 20 66 βpress_ β A fβ
β00000090β 6c 6f 77 65 72 79 20 74 β 61 6c 65 20 6d 6f 72 65 βlowery tβale moreβ
β000000a0β 20 73 77 65 65 74 6c 79 β 20 74 68 61 6e 20 6f 75 β sweetlyβ than ouβ
β000000b0β 72 20 72 68 79 6d 65 3a β 0a 57 68 61 74 20 6c 65 βr rhyme:β_What leβ
β000000c0β 61 66 2d 66 72 69 6e 67 β 27 64 20 6c 65 67 65 6e βaf-fringβ'd legenβ
β000000d0β 64 20 68 61 75 6e 74 73 β 20 61 62 6f 75 74 20 74 βd hauntsβ about tβ
β000000e0β 68 79 20 73 68 61 70 65 β 0a 20 20 20 20 20 20 20 βhy shapeβ_ β
β000000f0β 4f 66 20 64 65 69 74 69 β 65 73 20 6f 72 20 6d 6f βOf deitiβes or moβ
β00000100β 72 74 61 6c 73 2c 20 6f β 72 20 6f 66 20 62 6f 74 βrtals, oβr of botβ
β00000110β 68 2c 0a 20 20 20 20 20 β 20 20 20 20 20 20 20 20 βh,_ β β
β00000120β 20 20 49 6e 20 54 65 6d β 70 65 20 6f 72 20 74 68 β In Temβpe or thβ
We can see that our out-of-line data is present in its full uncompressed form, just as weβd expected. Similarly, we can try out βmainβ:
blogdb = # alter table creative_works alter column content set storage main;
Now letβs check out the data:
$ print-cw-data 'Ode on a Grecian Urn' | head -n 20
ββββββββββ¬ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββ¬βββββββββ
β00000000β 0d 00 00 00 2b 4f 64 65 β 20 6f 6e 20 61 20 47 72 β_βββ+Odeβ on a Grβ
β00000010β 65 63 69 61 6e 20 55 72 β 6e 01 12 5c 00 00 00 58 βecian Urβnβ’β’ \β ββXβ
β00000020β 00 00 00 a4 73 00 00 d5 β 61 00 00 00 eb 00 00 00 ββββΓsββΓβaβββΓββββ
β00000030β 1e 1b 00 00 9f 09 00 00 β 00 54 68 6f 75 20 73 74 ββ’β’ββΓ_ββββThou stβ
β00000040β 69 00 6c 6c 20 75 6e 72 β 61 76 00 69 73 68 27 64 βiβll unrβavβish'dβ
β00000050β 20 62 72 00 69 64 65 20 β 6f 66 20 71 00 75 69 65 β brβide βof qβuieβ
β00000060β 74 6e 65 73 73 18 2c 0a β 20 03 01 02 31 66 6f 73 βtnessβ’,_β β’β’β’1fosβ
β00000070β 00 74 65 72 2d 63 68 69 β 6c 02 64 01 27 73 69 6c ββter-chiβlβ’dβ’'silβ
β00000080β 65 6e 63 00 65 20 61 6e β 64 20 73 6c 00 6f 77 20 βencβe anβd slβow β
β00000090β 74 69 6d 65 2c 00 0a 53 β 79 6c 76 61 6e 20 00 68 βtime,β_Sβylvan βhβ
β000000a0β 69 73 74 6f 72 69 61 00 β 6e 2c 20 77 68 6f 20 63 βistoriaββn, who cβ
β000000b0β 00 61 6e 73 74 20 74 68 β 75 00 73 20 65 78 70 72 ββanst thβuβs exprβ
β000000c0β 65 73 02 73 05 5c 41 20 β 66 6c 6f 77 00 65 72 79 βesβ’sβ’ \A βflowβeryβ
β000000d0β 20 74 61 6c 65 00 20 6d β 6f 72 65 20 73 77 00 65 β taleβ mβore swβeβ
β000000e0β 65 74 6c 79 20 74 68 00 β 61 6e 20 6f 75 72 20 72 βetly thββan our rβ
β000000f0β 00 68 79 6d 65 3a 0a 57 β 68 00 61 74 20 6c 65 61 ββhyme:_Wβ