DragonFly On-Line Manual Pages

PT-QUERY-DIGEST(1) User Contributed Perl Documentation PT-QUERY-DIGEST(1)

NAME
pt-query-digest - Analyze MySQL queries from logs, processlist, and
tcpdump.

SYNOPSIS
Usage: pt-query-digest [OPTIONS] [FILES] [DSN]

pt-query-digest analyzes MySQL queries from slow, general, and binary
log files. It can also analyze queries from "SHOW PROCESSLIST" and
MySQL protocol data from tcpdump. By default, queries are grouped by
fingerprint and reported in descending order of query time (i.e. the
slowest queries first). If no "FILES" are given, the tool reads
"STDIN". The optional "DSN" is used for certain options like "--since"
and "--until".

Report the slowest queries from "slow.log":

pt-query-digest slow.log

Report the slowest queries from the processlist on host1:

pt-query-digest --processlist h=host1

Capture MySQL protocol data with tcppdump, then report the slowest
queries:

tcpdump -s 65535 -x -nn -q -tttt -i any -c 1000 port 3306 > mysql.tcp.txt

pt-query-digest --type tcpdump mysql.tcp.txt

Save query data from "slow.log" to host2 for later review and trend
analysis:

pt-query-digest --review h=host2 --no-report slow.log

RISKS
Percona Toolkit is mature, proven in the real world, and well tested,
but all database tools can pose a risk to the system and the database
server. Before using this tool, please:

o Read the tool's documentation

o Review the tool's known "BUGS"

o Test the tool on a non-production server

o Backup your production server and verify the backups

DESCRIPTION
pt-query-digest is a sophisticated but easy to use tool for analyzing
MySQL queries. It can analyze queries from MySQL slow, general, and
binary logs. (Binary logs must first be converted to text, see
"--type"). It can also use "SHOW PROCESSLIST" and MySQL protocol data
from tcpdump. By default, the tool reports which queries are the
slowest, and therefore the most important to optimize. More complex
and custom-tailored reports can be created by using options like
"--group-by", "--filter", and "--embedded-attributes".

Query analysis is a best-practice that should be done frequently. To
make this easier, pt-query-digest has two features: query review
("--review") and query history ("--history"). When the "--review"
option is used, all unique queries are saved to a database. When the
tool is ran again with "--review", queries marked as reviewed in the
database are not printed in the report. This highlights new queries
that need to be reviewed. When the "--history" option is used, query
metrics (query time, lock time, etc.) for each unique query are saved
to database. Each time the tool is ran with "--history", the more
historical data is saved which can be used to trend and analyze query
performance over time.

ATTRIBUTES
pt-query-digest works on events, which are a collection of key-value
pairs called attributes. You'll recognize most of the attributes right
away: "Query_time", "Lock_time", and so on. You can just look at a
slow log and see them. However, there are some that don't exist in the
slow log, and slow logs may actually include different kinds of
attributes (for example, you may have a server with the Percona
patches).

See "ATTRIBUTES REFERENCE" near the end of this documentation for a
list of common and "--type" specific attributes. A familiarity with
these attributes is necessary for working with "--filter",
"--ignore-attributes", and other attribute-related options.

With creative use of "--filter", you can create new attributes derived
from existing attributes. For example, to create an attribute called
"Row_ratio" for examining the ratio of "Rows_sent" to "Rows_examined",
specify a filter like:

--filter '($event->{Row_ratio} = $event->{Rows_sent} / ($event->{Rows_examined})) && 1'

The "&& 1" trick is needed to create a valid one-line syntax that is
always true, even if the assignment happens to evaluate false. The new
attribute will automatically appears in the output:

# Row ratio 1.00 0.00 1 0.50 1 0.71 0.50

Attributes created this way can be specified for "--order-by" or any
option that requires an attribute.

OUTPUT
The default "--output" is a query analysis report. The "--[no]report"
option controls whether or not this report is printed. Sometimes you
may want to parse all the queries but suppress the report, for example
when using "--review" or "--history".

There is one paragraph for each class of query analyzed. A "class" of
queries all have the same value for the "--group-by" attribute which is
"fingerprint" by default. (See "ATTRIBUTES".) A fingerprint is an
abstracted version of the query text with literals removed, whitespace
collapsed, and so forth. The report is formatted so it's easy to paste
into emails without wrapping, and all non-query lines begin with a
comment, so you can save it to a .sql file and open it in your favorite
syntax-highlighting text editor. There is a response-time profile at
the beginning.

The output described here is controlled by "--report-format". That
option allows you to specify what to print and in what order. The
default output in the default order is described here.

The report, by default, begins with a paragraph about the entire
analysis run The information is very similar to what you'll see for
each class of queries in the log, but it doesn't have some information
that would be too expensive to keep globally for the analysis. It also
has some statistics about the code's execution itself, such as the CPU
and memory usage, the local date and time of the run, and a list of
input file read/parsed.

Following this is the response-time profile over the events. This is a
highly summarized view of the unique events in the detailed query
report that follows. It contains the following columns:

Column Meaning
============ ==========================================================
Rank The query's rank within the entire set of queries analyzed
Query ID The query's fingerprint
Response time The total response time, and percentage of overall total
Calls The number of times this query was executed
R/Call The mean response time per execution
V/M The Variance-to-mean ratio of response time
Item The distilled query

A final line whose rank is shown as MISC contains aggregate statistics
on the queries that were not included in the report, due to options
such as "--limit" and "--outliers". For details on the variance-to-
mean ratio, please see
http://en.wikipedia.org/wiki/Index_of_dispersion.

Next, the detailed query report is printed. Each query appears in a
paragraph. Here is a sample, slightly reformatted so 'perldoc' will
not wrap lines in a terminal. The following will all be one paragraph,
but we'll break it up for commentary.

# Query 2: 0.01 QPS, 0.02x conc, ID 0xFDEA8D2993C9CAF3 at byte 160665

This line identifies the sequential number of the query in the sort
order specified by "--order-by". Then there's the queries per second,
and the approximate concurrency for this query (calculated as a
function of the timespan and total Query_time). Next there's a query
ID. This ID is a hex version of the query's checksum in the database,
if you're using "--review". You can select the reviewed query's
details from the database with a query like "SELECT .... WHERE
checksum=0xFDEA8D2993C9CAF3".

If you are investigating the report and want to print out every sample
of a particular query, then the following "--filter" may be helpful:

pt-query-digest slow.log \
--no-report \
--output slowlog \
--filter '$event->{fingerprint} \
&& make_checksum($event->{fingerprint}) eq "FDEA8D2993C9CAF3"'

Notice that you must remove the "0x" prefix from the checksum.

Finally, in case you want to find a sample of the query in the log
file, there's the byte offset where you can look. (This is not always
accurate, due to some anomalies in the slow log format, but it's
usually right.) The position refers to the worst sample, which we'll
see more about below.

Next is the table of metrics about this class of queries.

# pct total min max avg 95% stddev median
# Count 0 2
# Exec time 13 1105s 552s 554s 553s 554s 2s 553s
# Lock time 0 216us 99us 117us 108us 117us 12us 108us
# Rows sent 20 6.26M 3.13M 3.13M 3.13M 3.13M 12.73 3.13M
# Rows exam 0 6.26M 3.13M 3.13M 3.13M 3.13M 12.73 3.13M

The first line is column headers for the table. The percentage is the
percent of the total for the whole analysis run, and the total is the
actual value of the specified metric. For example, in this case we can
see that the query executed 2 times, which is 13% of the total number
of queries in the file. The min, max and avg columns are self-
explanatory. The 95% column shows the 95th percentile; 95% of the
values are less than or equal to this value. The standard deviation
shows you how tightly grouped the values are. The standard deviation
and median are both calculated from the 95th percentile, discarding the
extremely large values.

The stddev, median and 95th percentile statistics are approximate.
Exact statistics require keeping every value seen, sorting, and doing
some calculations on them. This uses a lot of memory. To avoid this,
we keep 1000 buckets, each of them 5% bigger than the one before,
ranging from .000001 up to a very big number. When we see a value we
increment the bucket into which it falls. Thus we have fixed memory
per class of queries. The drawback is the imprecision, which typically
falls in the 5 percent range.

Next we have statistics on the users, databases and time range for the
query.

# Users 1 user1
# Databases 2 db1(1), db2(1)
# Time range 2008-11-26 04:55:18 to 2008-11-27 00:15:15

The users and databases are shown as a count of distinct values,
followed by the values. If there's only one, it's shown alone; if
there are many, we show each of the most frequent ones, followed by the
number of times it appears.

# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms #####
# 100ms ####################
# 1s ##########
# 10s+

The execution times show a logarithmic chart of time clustering. Each
query goes into one of the "buckets" and is counted up. The buckets
are powers of ten. The first bucket is all values in the "single
microsecond range" -- that is, less than 10us. The second is "tens of
microseconds," which is from 10us up to (but not including) 100us; and
so on. The charted attribute can be changed by specifying
"--report-histogram" but is limited to time-based attributes.

# Tables
# SHOW TABLE STATUS LIKE 'table1'\G
# SHOW CREATE TABLE `table1`\G
# EXPLAIN
SELECT * FROM table1\G

This section is a convenience: if you're trying to optimize the queries
you see in the slow log, you probably want to examine the table
structure and size. These are copy-and-paste-ready commands to do
that.

Finally, we see a sample of the queries in this class of query. This
is not a random sample. It is the query that performed the worst,
according to the sort order given by "--order-by". You will normally
see a commented "# EXPLAIN" line just before it, so you can copy-paste
the query to examine its EXPLAIN plan. But for non-SELECT queries that
isn't possible to do, so the tool tries to transform the query into a
roughly equivalent SELECT query, and adds that below.

If you want to find this sample event in the log, use the offset
mentioned above, and something like the following:

tail -c +<offset> /path/to/file | head

See also "--report-format".

QUERY REVIEW
A query "--review" is the process of storing all the query fingerprints
analyzed. This has several benefits:

o You can add metadata to classes of queries, such as marking them
for follow-up, adding notes to queries, or marking them with an
issue ID for your issue tracking system.

o You can refer to the stored values on subsequent runs so you'll
know whether you've seen a query before. This can help you cut
down on duplicated work.

o You can store historical data such as the row count, query times,
and generally anything you can see in the report.

To use this feature, you run pt-query-digest with the "--review"
option. It will store the fingerprints and other information into the
table you specify. Next time you run it with the same option, it will
do the following:

o It won't show you queries you've already reviewed. A query is
considered to be already reviewed if you've set a value for the
"reviewed_by" column. (If you want to see queries you've already
reviewed, use the "--report-all" option.)

o Queries that you've reviewed, and don't appear in the output, will
cause gaps in the query number sequence in the first line of each
paragraph. And the value you've specified for "--limit" will still
be honored. So if you've reviewed all queries in the top 10 and
you ask for the top 10, you won't see anything in the output.

o If you want to see the queries you've already reviewed, you can
specify "--report-all". Then you'll see the normal analysis
output, but you'll also see the information from the review table,
just below the execution time graph. For example,

# Review information
# comments: really bad IN() subquery, fix soon!
# first_seen: 2008-12-01 11:48:57
# jira_ticket: 1933
# last_seen: 2008-12-18 11:49:07
# priority: high
# reviewed_by: xaprb
# reviewed_on: 2008-12-18 15:03:11

This metadata is useful because, as you analyze your queries, you
get your comments integrated right into the report.

FINGERPRINTS
A query fingerprint is the abstracted form of a query, which makes it
possible to group similar queries together. Abstracting a query
removes literal values, normalizes whitespace, and so on. For example,
consider these two queries:

SELECT name, password FROM user WHERE id='12823';
select name, password from user
where id=5;

Both of those queries will fingerprint to

select name, password from user where id=?

Once the query's fingerprint is known, we can then talk about a query
as though it represents all similar queries.

What "pt-query-digest" does is analogous to a GROUP BY statement in
SQL. (But note that "multiple columns" doesn't define a multi-column
grouping; it defines multiple reports!) If your command-line looks like
this,

pt-query-digest \
--group-by fingerprint \
--order-by Query_time:sum \
--limit 10 \
slow.log

The corresponding pseudo-SQL looks like this:

SELECT WORST(query BY Query_time), SUM(Query_time), ...
FROM /path/to/slow.log
GROUP BY FINGERPRINT(query)
ORDER BY SUM(Query_time) DESC
LIMIT 10

You can also use the value "distill", which is a kind of super-
fingerprint. See "--group-by" for more.

Query fingerprinting accommodates many special cases, which have proven
necessary in the real world. For example, an "IN" list with 5 literals
is really equivalent to one with 4 literals, so lists of literals are
collapsed to a single one. If you find something that is not
fingerprinted properly, please submit a bug report with a reproducible
test case.

Here is a list of transformations during fingerprinting, which might
not be exhaustive:

o Group all SELECT queries from mysqldump together, even if they are
against different tables. The same applies to all queries from pt-
table-checksum.

o Shorten multi-value INSERT statements to a single VALUES() list.

o Strip comments.

o Abstract the databases in USE statements, so all USE statements are
grouped together.

o Replace all literals, such as quoted strings. For efficiency, the
code that replaces literal numbers is somewhat non-selective, and
might replace some things as numbers when they really are not.
Hexadecimal literals are also replaced. NULL is treated as a
literal. Numbers embedded in identifiers are also replaced, so
tables named similarly will be fingerprinted to the same values
(e.g. users_2009 and users_2010 will fingerprint identically).

o Collapse all whitespace into a single space.

o Lowercase the entire query.

o Replace all literals inside of IN() and VALUES() lists with a
single placeholder, regardless of cardinality.

o Collapse multiple identical UNION queries into a single one.

OPTIONS
This tool accepts additional command-line arguments. Refer to the
"SYNOPSIS" and usage information for details.

--ask-pass
Prompt for a password when connecting to MySQL.

--attribute-aliases
type: array; default: db|Schema

List of attribute|alias,etc.

Certain attributes have multiple names, like db and Schema. If an
event does not have the primary attribute, pt-query-digest looks
for an alias attribute. If it finds an alias, it creates the
primary attribute with the alias attribute's value and removes the
alias attribute.

If the event has the primary attribute, all alias attributes are
deleted.

This helps simplify event attributes so that, for example, there
will not be report lines for both db and Schema.

--attribute-value-limit
type: int; default: 4294967296

A sanity limit for attribute values.

This option deals with bugs in slow logging functionality that
causes large values for attributes. If the attribute's value is
bigger than this, the last-seen value for that class of query is
used instead.

--charset
short form: -A; type: string

Default character set. If the value is utf8, sets Perl's binmode
on STDOUT to utf8, passes the mysql_enable_utf8 option to
DBD::mysql, and runs SET NAMES UTF8 after connecting to MySQL. Any
other value sets binmode on STDOUT without the utf8 layer, and runs
SET NAMES after connecting to MySQL.

--config
type: Array

Read this comma-separated list of config files; if specified, this
must be the first option on the command line.

--[no]continue-on-error
default: yes

Continue parsing even if there is an error. The tool will not
continue forever: it stops once any process causes 100 errors, in
which case there is probably a bug in the tool or the input is
invalid.

--[no]create-history-table
default: yes

Create the "--history" table if it does not exist.

This option causes the table specified by "--history" to be created
with the default structure shown in the documentation for
"--history".

--[no]create-review-table
default: yes

Create the "--review" table if it does not exist.

This option causes the table specified by "--review" to be created
with the default structure shown in the documentation for
"--review".

--daemonize
Fork to the background and detach from the shell. POSIX operating
systems only.

--database
short form: -D; type: string

Connect to this database.

--defaults-file
short form: -F; type: string

Only read mysql options from the given file. You must give an
absolute pathname.

--embedded-attributes
type: array

Two Perl regex patterns to capture pseudo-attributes embedded in
queries.

Embedded attributes might be special attribute-value pairs that
you've hidden in comments. The first regex should match the entire
set of attributes (in case there are multiple). The second regex
should match and capture attribute-value pairs from the first
regex.

For example, suppose your query looks like the following:

SELECT * from users -- file: /login.php, line: 493;

You might run pt-query-digest with the following option:

pt-query-digest --embedded-attributes ' -- .*','(\w+): ([^\,]+)'

The first regular expression captures the whole comment:

" -- file: /login.php, line: 493;"

The second one splits it into attribute-value pairs and adds them
to the event:

ATTRIBUTE VALUE
========= ==========
file /login.php
line 493

NOTE: All commas in the regex patterns must be escaped with \
otherwise the pattern will break.

--expected-range
type: array; default: 5,10

Explain items when there are more or fewer than expected.

Defines the number of items expected to be seen in the report given
by "--[no]report", as controlled by "--limit" and "--outliers". If
there are more or fewer items in the report, each one will explain
why it was included.

--explain
type: DSN

Run EXPLAIN for the sample query with this DSN and print results.

This works only when "--group-by" includes fingerprint. It causes
pt-query-digest to run EXPLAIN and include the output into the
report. For safety, queries that appear to have a subquery that
EXPLAIN will execute won't be EXPLAINed. Those are typically
"derived table" queries of the form

select ... from ( select .... ) der;

The EXPLAIN results are printed as a full vertical format in the
event report, which appears at the end of each event report in
vertical style ("\G") just like MySQL prints it.

--filter
type: string

Discard events for which this Perl code doesn't return true.

This option is a string of Perl code or a file containing Perl code
that gets compiled into a subroutine with one argument: $event.
This is a hashref. If the given value is a readable file, then pt-
query-digest reads the entire file and uses its contents as the
code. The file should not contain a shebang (#!/usr/bin/perl)
line.

If the code returns true, the chain of callbacks continues;
otherwise it ends. The code is the last statement in the
subroutine other than "return $event". The subroutine template is:

sub { $event = shift; filter && return $event; }

Filters given on the command line are wrapped inside parentheses
like like "( filter )". For complex, multi-line filters, you must
put the code inside a file so it will not be wrapped inside
parentheses. Either way, the filter must produce syntactically
valid code given the template. For example, an if-else branch
given on the command line would not be valid:

--filter 'if () { } else { }' # WRONG

Since it's given on the command line, the if-else branch would be
wrapped inside parentheses which is not syntactically valid. So to
accomplish something more complex like this would require putting
the code in a file, for example filter.txt:

my $event_ok; if (...) { $event_ok=1; } else { $event_ok=0; } $event_ok

Then specify "--filter filter.txt" to read the code from
filter.txt.

If the filter code won't compile, pt-query-digest will die with an
error. If the filter code does compile, an error may still occur
at runtime if the code tries to do something wrong (like pattern
match an undefined value). pt-query-digest does not provide any
safeguards so code carefully!

An example filter that discards everything but SELECT statements:

--filter '$event->{arg} =~ m/^select/i'

This is compiled into a subroutine like the following:

sub { $event = shift; ( $event->{arg} =~ m/^select/i ) && return $event; }

It is permissible for the code to have side effects (to alter
$event).

See "ATTRIBUTES REFERENCE" for a list of common and "--type"
specific attributes.

Here are more examples of filter code:

Host/IP matches domain.com
--filter '($event->{host} || $event->{ip} || "") =~
m/domain.com/'

Sometimes MySQL logs the host where the IP is expected.
Therefore, we check both.

User matches john
--filter '($event->{user} || "") =~ m/john/'

More than 1 warning
--filter '($event->{Warning_count} || 0) > 1'

Query does full table scan or full join
--filter '(($event->{Full_scan} || "") eq "Yes") ||
(($event->{Full_join} || "") eq "Yes")'

Query was not served from query cache
--filter '($event->{QC_Hit} || "") eq "No"'

Query is 1 MB or larger
--filter '$event->{bytes} >= 1_048_576'

Since "--filter" allows you to alter $event, you can use it to do
other things, like create new attributes. See "ATTRIBUTES" for an
example.

--group-by
type: Array; default: fingerprint

Which attribute of the events to group by.

In general, you can group queries into classes based on any
attribute of the query, such as "user" or "db", which will by
default show you which users and which databases get the most
"Query_time". The default attribute, "fingerprint", groups
similar, abstracted queries into classes; see below and see also
"FINGERPRINTS".

A report is printed for each "--group-by" value (unless
"--no-report" is given). Therefore, "--group-by user,db" means
"report on queries with the same user and report on queries with
the same db"; it does not mean "report on queries with the same
user and db." See also "OUTPUT".

Every value must have a corresponding value in the same position in
"--order-by". However, adding values to "--group-by" will
automatically add values to "--order-by", for your convenience.

There are several magical values that cause some extra data mining
to happen before the grouping takes place:

fingerprint
This causes events to be fingerprinted to abstract queries into
a canonical form, which is then used to group events together
into a class. See "FINGERPRINTS" for more about
fingerprinting.

tables
This causes events to be inspected for what appear to be
tables, and then aggregated by that. Note that a query that
contains two or more tables will be counted as many times as
there are tables; so a join against two tables will count the
Query_time against both tables.

distill
This is a sort of super-fingerprint that collapses queries down
into a suggestion of what they do, such as "INSERT SELECT
table1 table2".

--help
Show help and exit.

--history
type: DSN

Save metrics for each query class in the given table. pt-query-
digest saves query metrics (query time, lock time, etc.) to this
table so you can see how query classes change over time.

The default table is "percona_schema.query_history". Specify
database (D) and table (t) DSN options to override the default.
The database and table are automatically created unless
"--no-create-history-table" is specified (see
"--[no]create-history-table").

pt-query-digest inspects the columns in the table. The table must
have at least the following columns:

CREATE TABLE query_review_history (
checksum BIGINT UNSIGNED NOT NULL,
sample TEXT NOT NULL
);

Any columns not mentioned above are inspected to see if they follow
a certain naming convention. The column is special if the name
ends with an underscore followed by any of these values:

pct|avg|cnt|sum|min|max|pct_95|stddev|median|rank

If the column ends with one of those values, then the prefix is
interpreted as the event attribute to store in that column, and the
suffix is interpreted as the metric to be stored. For example, a
column named "Query_time_min" will be used to store the minimum
"Query_time" for the class of events.

The table should also have a primary key, but that is up to you,
depending on how you want to store the historical data. We suggest
adding ts_min and ts_max columns and making them part of the
primary key along with the checksum. But you could also just add a
ts_min column and make it a DATE type, so you'd get one row per
class of queries per day.

The following table definition is used for
"--[no]create-history-table":

CREATE TABLE IF NOT EXISTS query_history (
checksum BIGINT UNSIGNED NOT NULL,
sample TEXT NOT NULL,
ts_min DATETIME,
ts_max DATETIME,
ts_cnt FLOAT,
Query_time_sum FLOAT,
Query_time_min FLOAT,
Query_time_max FLOAT,
Query_time_pct_95 FLOAT,
Query_time_stddev FLOAT,
Query_time_median FLOAT,
Lock_time_sum FLOAT,
Lock_time_min FLOAT,
Lock_time_max FLOAT,
Lock_time_pct_95 FLOAT,
Lock_time_stddev FLOAT,
Lock_time_median FLOAT,
Rows_sent_sum FLOAT,
Rows_sent_min FLOAT,
Rows_sent_max FLOAT,
Rows_sent_pct_95 FLOAT,
Rows_sent_stddev FLOAT,
Rows_sent_median FLOAT,
Rows_examined_sum FLOAT,
Rows_examined_min FLOAT,
Rows_examined_max FLOAT,
Rows_examined_pct_95 FLOAT,
Rows_examined_stddev FLOAT,
Rows_examined_median FLOAT,
-- Percona extended slowlog attributes
-- http://www.percona.com/docs/wiki/patches:slow_extended
Rows_affected_sum FLOAT,
Rows_affected_min FLOAT,
Rows_affected_max FLOAT,
Rows_affected_pct_95 FLOAT,
Rows_affected_stddev FLOAT,
Rows_affected_median FLOAT,
Rows_read_sum FLOAT,
Rows_read_min FLOAT,
Rows_read_max FLOAT,
Rows_read_pct_95 FLOAT,
Rows_read_stddev FLOAT,
Rows_read_median FLOAT,
Merge_passes_sum FLOAT,
Merge_passes_min FLOAT,
Merge_passes_max FLOAT,
Merge_passes_pct_95 FLOAT,
Merge_passes_stddev FLOAT,
Merge_passes_median FLOAT,
InnoDB_IO_r_ops_min FLOAT,
InnoDB_IO_r_ops_max FLOAT,
InnoDB_IO_r_ops_pct_95 FLOAT,
InnoDB_IO_r_ops_stddev FLOAT,
InnoDB_IO_r_ops_median FLOAT,
InnoDB_IO_r_bytes_min FLOAT,
InnoDB_IO_r_bytes_max FLOAT,
InnoDB_IO_r_bytes_pct_95 FLOAT,
InnoDB_IO_r_bytes_stddev FLOAT,
InnoDB_IO_r_bytes_median FLOAT,
InnoDB_IO_r_wait_min FLOAT,
InnoDB_IO_r_wait_max FLOAT,
InnoDB_IO_r_wait_pct_95 FLOAT,
InnoDB_IO_r_wait_stddev FLOAT,
InnoDB_IO_r_wait_median FLOAT,
InnoDB_rec_lock_wait_min FLOAT,
InnoDB_rec_lock_wait_max FLOAT,
InnoDB_rec_lock_wait_pct_95 FLOAT,
InnoDB_rec_lock_wait_stddev FLOAT,
InnoDB_rec_lock_wait_median FLOAT,
InnoDB_queue_wait_min FLOAT,
InnoDB_queue_wait_max FLOAT,
InnoDB_queue_wait_pct_95 FLOAT,
InnoDB_queue_wait_stddev FLOAT,
InnoDB_queue_wait_median FLOAT,
InnoDB_pages_distinct_min FLOAT,
InnoDB_pages_distinct_max FLOAT,
InnoDB_pages_distinct_pct_95 FLOAT,
InnoDB_pages_distinct_stddev FLOAT,
InnoDB_pages_distinct_median FLOAT,
-- Boolean (Yes/No) attributes. Only the cnt and sum are needed
-- for these. cnt is how many times is attribute was recorded,
-- and sum is how many of those times the value was Yes. So
-- sum/cnt * 100 equals the percentage of recorded times that
-- the value was Yes.
QC_Hit_cnt FLOAT,
QC_Hit_sum FLOAT,
Full_scan_cnt FLOAT,
Full_scan_sum FLOAT,
Full_join_cnt FLOAT,
Full_join_sum FLOAT,
Tmp_table_cnt FLOAT,
Tmp_table_sum FLOAT,
Tmp_table_on_disk_cnt FLOAT,
Tmp_table_on_disk_sum FLOAT,
Filesort_cnt FLOAT,
Filesort_sum FLOAT,
Filesort_on_disk_cnt FLOAT,
Filesort_on_disk_sum FLOAT,
PRIMARY KEY(checksum, ts_min, ts_max)
);

Note that we store the count (cnt) for the ts attribute only; it
will be redundant to store this for other attributes.

--host
short form: -h; type: string

Connect to host.

--ignore-attributes
type: array; default: arg, cmd, insert_id, ip, port, Thread_id,
timestamp, exptime, flags, key, res, val, server_id, offset,
end_log_pos, Xid

Do not aggregate these attributes. Some attributes are not query
metrics but metadata which doesn't need to be (or can't be)
aggregated.

--inherit-attributes
type: array; default: db,ts

If missing, inherit these attributes from the last event that had
them.

This option sets which attributes are inherited or carried forward
to events which do not have them. For example, if one event has
the db attribute equal to "foo", but the next event doesn't have
the db attribute, then it inherits "foo" for its db attribute.

--interval
type: float; default: .1

How frequently to poll the processlist, in seconds.

--iterations
type: int; default: 1

How many times to iterate through the collect-and-report cycle. If
0, iterate to infinity. Each iteration runs for "--run-time"
amount of time. An iteration is usually determined by an amount of
time and a report is printed when that amount of time elapses.
With "--run-time-mode" "interval", an interval is instead
determined by the interval time you specify with "--run-time". See
"--run-time" and "--run-time-mode" for more information.

--limit
type: Array; default: 95%:20

Limit output to the given percentage or count.

If the argument is an integer, report only the top N worst queries.
If the argument is an integer followed by the "%" sign, report that
percentage of the worst queries. If the percentage is followed by
a colon and another integer, report the top percentage or the
number specified by that integer, whichever comes first.

The value is actually a comma-separated array of values, one for
each item in "--group-by". If you don't specify a value for any of
those items, the default is the top 95%.

See also "--outliers".

--log
type: string

Print all output to this file when daemonized.

--order-by
type: Array; default: Query_time:sum

Sort events by this attribute and aggregate function.

This is a comma-separated list of order-by expressions, one for
each "--group-by" attribute. The default "Query_time:sum" is used
for "--group-by" attributes without explicitly given "--order-by"
attributes (that is, if you specify more "--group-by" attributes
than corresponding "--order-by" attributes). The syntax is
"attribute:aggregate". See "ATTRIBUTES" for valid attributes.
Valid aggregates are:

Aggregate Meaning
========= ============================
sum Sum/total attribute value
min Minimum attribute value
max Maximum attribute value
cnt Frequency/count of the query

For example, the default "Query_time:sum" means that queries in the
query analysis report will be ordered (sorted) by their total query
execution time ("Exec time"). "Query_time:max" orders the queries
by their maximum query execution time, so the query with the single
largest "Query_time" will be list first. "cnt" refers more to the
frequency of the query as a whole, how often it appears; "Count" is
its corresponding line in the query analysis report. So any
attribute and "cnt" should yield the same report wherein queries
are sorted by the number of times they appear.

When parsing general logs ("--type" "genlog"), the default
"--order-by" becomes "Query_time:cnt". General logs do not report
query times so only the "cnt" aggregate makes sense because all
query times are zero.

If you specify an attribute that doesn't exist in the events, then
pt-query-digest falls back to the default "Query_time:sum" and
prints a notice at the beginning of the report for each query
class. You can create attributes with "--filter" and order by
them; see "ATTRIBUTES" for an example.

--outliers
type: array; default: Query_time:1:10

Report outliers by attribute:percentile:count.

The syntax of this option is a comma-separated list of colon-
delimited strings. The first field is the attribute by which an
outlier is defined. The second is a number that is compared to the
attribute's 95th percentile. The third is optional, and is
compared to the attribute's cnt aggregate. Queries that pass this
specification are added to the report, regardless of any limits you
specified in "--limit".

For example, to report queries whose 95th percentile Query_time is
at least 60 seconds and which are seen at least 5 times, use the
following argument:

--outliers Query_time:60:5

You can specify an --outliers option for each value in
"--group-by".

--output
type: string; default: report

How to format and print the query analysis results. Accepted
values are:

VALUE FORMAT
======= ==============================
report Standard query analysis report
slowlog MySQL slow log
json JSON, on array per query class
json-anon JSON without example queries

The entire "report" output can be disabled by specifying
"--no-report" (see "--[no]report"), and its sections can be
disabled or rearranged by specifying "--report-format".

"json" output was introduced in 2.2.1 and is still in development,
so the data structure may change in future versions.

--password
short form: -p; type: string

Password to use when connecting.

--pid
type: string

Create the given PID file. The tool won't start if the PID file
already exists and the PID it contains is different than the
current PID. However, if the PID file exists and the PID it
contains is no longer running, the tool will overwrite the PID file
with the current PID. The PID file is removed automatically when
the tool exits.

--port
short form: -P; type: int

Port number to use for connection.

--processlist
type: DSN

Poll this DSN's processlist for queries, with "--interval" sleep
between.

If the connection fails, pt-query-digest tries to reopen it once
per second.

--progress
type: array; default: time,30

Print progress reports to STDERR. The value is a comma-separated
list with two parts. The first part can be percentage, time, or
iterations; the second part specifies how often an update should be
printed, in percentage, seconds, or number of iterations.

--read-timeout
type: time; default: 0

Wait this long for an event from the input; 0 to wait forever.

This option sets the maximum time to wait for an event from the
input. It applies to all types of input except "--processlist".
If an event is not received after the specified time, the script
stops reading the input and prints its reports. If "--iterations"
is 0 or greater than 1, the next iteration will begin, else the
script will exit.

This option requires the Perl POSIX module.

--[no]report
default: yes

Print query analysis reports for each "--group-by" attribute. This
is the standard slow log analysis functionality. See "OUTPUT" for
the description of what this does and what the results look like.

If you don't need a report (for example, when using "--review" or
"--history"), it is best to specify "--no-report" because this
allows the tool to skip some expensive operations.

--report-all
Report all queries, even ones that have been reviewed. This only
affects the "report" "--output" when using "--review". Otherwise,
all queries are always printed.

--report-format
type: Array; default:
rusage,date,hostname,files,header,profile,query_report,prepared

Print these sections of the query analysis report.

SECTION PRINTS
============ ======================================================
rusage CPU times and memory usage reported by ps
date Current local date and time
hostname Hostname of machine on which pt-query-digest was run
files Input files read/parse
header Summary of the entire analysis run
profile Compact table of queries for an overview of the report
query_report Detailed information about each unique query
prepared Prepared statements

The sections are printed in the order specified. The rusage, date,
files and header sections are grouped together if specified
together; other sections are separated by blank lines.

See "OUTPUT" for more information on the various parts of the query
report.

--report-histogram
type: string; default: Query_time

Chart the distribution of this attribute's values.

The distribution chart is limited to time-based attributes, so
charting "Rows_examined", for example, will produce a useless
chart. Charts look like:

# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms ###########################
# 100ms ########################################################
# 1s ########
# 10s+

See "OUTPUT" for more information.

--resume
type: string

If specified, the tool writes the last file offset, if there is
one, to the given filename. When ran again with the same value for
this option, the tool reads the last file offset from the file,
seeks to that position in the log, and resumes parsing events from
that point onward.

--review
type: DSN

Save query classes for later review, and don't report already
reviewed classes.

The default table is "percona_schema.query_review". Specify
database (D) and table (t) DSN options to override the default.
The database and table are automatically created unless
"--no-create-review-table" is specified (see
"--[no]create-review-table").

If the table was created manually, it must have at least the
following columns. You can add more columns for your own special
purposes, but they won't be used by pt-query-digest.

CREATE TABLE IF NOT EXISTS query_review (
checksum BIGINT UNSIGNED NOT NULL PRIMARY KEY,
fingerprint TEXT NOT NULL,
sample TEXT NOT NULL,
first_seen DATETIME,
last_seen DATETIME,
reviewed_by VARCHAR(20),
reviewed_on DATETIME,
comments TEXT
)

The columns are:

COLUMN MEANING
=========== ====================================================
checksum A 64-bit checksum of the query fingerprint
fingerprint The abstracted version of the query; its primary key
sample The query text of a sample of the class of queries
first_seen The smallest timestamp of this class of queries
last_seen The largest timestamp of this class of queries
reviewed_by Initially NULL; if set, query is skipped thereafter
reviewed_on Initially NULL; not assigned any special meaning
comments Initially NULL; not assigned any special meaning

Note that the "fingerprint" column is the true primary key for a
class of queries. The "checksum" is just a cryptographic hash of
this value, which provides a shorter value that is very likely to
also be unique.

After parsing and aggregating events, your table should contain a
row for each fingerprint. This option depends on "--group-by
fingerprint" (which is the default). It will not work otherwise.

--run-time
type: time

How long to run for each "--iterations". The default is to run
forever (you can interrupt with CTRL-C). Because "--iterations"
defaults to 1, if you only specify "--run-time", pt-query-digest
runs for that amount of time and then exits. The two options are
specified together to do collect-and-report cycles. For example,
specifying "--iterations" 4 "--run-time" "15m" with a continuous
input (like STDIN or "--processlist") will cause pt-query-digest to
run for 1 hour (15 minutes x 4), reporting four times, once at each
15 minute interval.

--run-time-mode
type: string; default: clock

Set what the value of "--run-time" operates on. Following are the
possible values for this option:

clock
"--run-time" specifies an amount of real clock time during
which the tool should run for each "--iterations".

event
"--run-time" specifies an amount of log time. Log time is
determined by timestamps in the log. The first timestamp seen
is remembered, and each timestamp after that is compared to the
first to determine how much log time has passed. For example,
if the first timestamp seen is "12:00:00" and the next is
"12:01:30", that is 1 minute and 30 seconds of log time. The
tool will read events until the log time is greater than or
equal to the specified "--run-time" value.

Since timestamps in logs are not always printed, or not always
printed frequently, this mode varies in accuracy.

interval
"--run-time" specifies interval boundaries of log time into
which events are divided and reports are generated. This mode
is different from the others because it doesn't specify how
long to run. The value of "--run-time" must be an interval
that divides evenly into minutes, hours or days. For example,
"5m" divides evenly into hours (60/5=12, so 12 5 minutes
intervals per hour) but "7m" does not (60/7=8.6).

Specifying "--run-time-mode interval --run-time 30m
--iterations 0" is similar to specifying "--run-time-mode clock
--run-time 30m --iterations 0". In the latter case, pt-query-
digest will run forever, producing reports every 30 minutes,
but this only works effectively with continuous inputs like
STDIN and the processlist. For fixed inputs, like log files,
the former example produces multiple reports by dividing the
log into 30 minutes intervals based on timestamps.

Intervals are calculated from the zeroth second/minute/hour in
which a timestamp occurs, not from whatever time it specifies.
For example, with 30 minute intervals and a timestamp of
"12:10:30", the interval is not "12:10:30" to "12:40:30", it is
"12:00:00" to "12:29:59". Or, with 1 hour intervals, it is
"12:00:00" to "12:59:59". When a new timestamp exceeds the
interval, a report is printed, and the next interval is
recalculated based on the new timestamp.

Since "--iterations" is 1 by default, you probably want to
specify a new value else pt-query-digest will only get and
report on the first interval from the log since 1 interval = 1
iteration. If you want to get and report every interval in a
log, specify "--iterations" 0.

--sample
type: int

Filter out all but the first N occurrences of each query. The
queries are filtered on the first value in "--group-by", so by
default, this will filter by query fingerprint. For example,
"--sample 2" will permit two sample queries for each fingerprint.
Useful in conjunction with "--output slowlog" to print the queries.
You probably want to set "--no-report" to avoid the overhead of
aggregating and reporting if you're just using this to print out
samples of queries. A complete example:

pt-query-digest --sample 2 --no-report --output slowlog slow.log

--set-vars
type: Array

Set the MySQL variables in this comma-separated list of
"variable=value" pairs.

By default, the tool sets:

wait_timeout=10000

Variables specified on the command line override these defaults.
For example, specifying "--set-vars wait_timeout=500" overrides the
defaultvalue of 10000.

The tool prints a warning and continues if a variable cannot be
set.

--show-all
type: Hash

Show all values for these attributes.

By default pt-query-digest only shows as many of an attribute's
value that fit on a single line. This option allows you to specify
attributes for which all values will be shown (line width is
ignored). This only works for attributes with string values like
user, host, db, etc. Multiple attributes can be specified, comma-
separated.

--since
type: string

Parse only queries newer than this value (parse queries since this
date).

This option allows you to ignore queries older than a certain value
and parse only those queries which are more recent than the value.
The value can be several types:

* Simple time value N with optional suffix: N[shmd], where
s=seconds, h=hours, m=minutes, d=days (default s if no suffix
given); this is like saying "since N[shmd] ago"
* Full date with optional hours:minutes:seconds:
YYYY-MM-DD [HH:MM::SS]
* Short, MySQL-style date:
YYMMDD [HH:MM:SS]
* Any time expression evaluated by MySQL:
CURRENT_DATE - INTERVAL 7 DAY

If you give a MySQL time expression, and you have not also
specified a DSN for "--explain", "--processlist", or "--review",
then you must specify a DSN on the command line so that pt-query-
digest can connect to MySQL to evaluate the expression.

The MySQL time expression is wrapped inside a query like "SELECT
UNIX_TIMESTAMP(<expression>)", so be sure that the expression is
valid inside this query. For example, do not use UNIX_TIMESTAMP()
because UNIX_TIMESTAMP(UNIX_TIMESTAMP()) returns 0.

Events are assumed to be in chronological: older events at the
beginning of the log and newer events at the end of the log.
"--since" is strict: it ignores all queries until one is found that
is new enough. Therefore, if the query events are not consistently
timestamped, some may be ignored which are actually new enough.

DragonFly On-Line Manual Pages

NAME

SYNOPSIS

RISKS

DESCRIPTION

ATTRIBUTES

OUTPUT

QUERY REVIEW

FINGERPRINTS

OPTIONS

DSN OPTIONS

ENVIRONMENT

SYSTEM REQUIREMENTS

BUGS

DOWNLOADING

ATTRIBUTES REFERENCE

AUTHORS

ABOUT PERCONA TOOLKIT

COPYRIGHT, LICENSE, AND WARRANTY

VERSION