In DevOps or MlOps discovering what a process is doing now can save your system from a catastrophe. But sometimes we are already in a failure. When those failures happen, the following questions appear:
-
The process it’s hanging and I don’t know why!
-
What is the cause of the problem?
-
Is it a network issue?
-
…and so on and so forth.
Being able to answer these questions faster and with a precise answer can save you or give you a promotion. I’ll show you some simple examples of how those questions can be answered.
Table of Contents
Introduction
Linux is a very transparent operating system. Transparency means that is easy to dig in the system’s behavior to understand how it works. Also, Linux is easy to control even in the low-level behaviors. But how can this help to understand a process and consequently a failure?
The first step to understanding a process is to analyze the output. But sometimes the output doesn’t give us enough information to use. Here I’ll talk about how we can extract useful information using the strace
and lsof
commands. To use these commands it is good to know two concepts: the “Everything is a file ” mantra and the system call mechanism.
Concepts
“Everything is a file.” - The UNIX mantra.
What comes to your mind when someone talks about files? Maybe a jpeg or a CSV if you work as a data scientist. But in UNIX approach to do stuff everything can be a file, even network connections. When a thing is not a file it has at least a file descriptor associated with it. Maybe you’re thinking that I’m wandering from the post. “How can this stuff help to improve our comprehension about a process or failure?” The answer is straightforward: if everything is a file, we can use the same set of tools to list, read and interact (API) with files to analyze a generic process. Here is where lsof
appears.
LSOF
lsof is an acronym for list of open files. In simple terms, Lsof is a command-line tool that can list open file descriptors in your machine. Besides that, Lsof allows using a set of different filters to give you a filtered list of opened files. Thus, we can list the open file descriptors of a user or a process.
To put all the open files in your machine use the following command:
username:/$ lsof > lsof_everything.txt
The file lsof_everything.txt
is huge and will look similar to this
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root cwd unknown /proc/1/cwd (readlink: Permission denied)
systemd 1 root rtd unknown /proc/1/root (readlink: Permission denied)
systemd 1 root txt unknown /proc/1/exe (readlink: Permission denied)
Take some time to analyze the output. The output relates to the first lines to the root
user and you don’t have permission to gain information about these files, which is good. Let’s remove this wasteful information filtering the SYS CALLs related to just your user.
username:/$ lsof -u username > lsof_my.txt
The txt
file is still big. Try to look into this file to see if you can find anything interesting, like a webpage address.
We have a lot of different columns in the file. But don’t be worried, I’ll show you the columns that I believe are the most important ones.
- COMMAND
- The command name used to initiate the process
- PID
- This is an integer number that identifies a process, Process IDentification number.
- USER
- The user to whom the process belongs.
- TYPE
- This may be one of the most important columns. It has more than 60 possible values. Such column says the type of the node associated with the file. If the file is related with connections and sockets you will see things like that: IPV4, IPV6, unix, INET, etc. If it’s an regular file (csv, jpeg, txt, etc.) you will see the REG value.
- NODE
- This helps us to identify the node associated with the file descriptor. It can be a number, a string, etc. In the case of internet protocols this column will have values like TCP, UDP.
- NAME
- The values here will change a lot. For example, sometimes it can be a web server address or just a cryptic string. Regardless of the difficulty of interpreting the values in this column you should look carrefully here.
If you want a more deep understanding of the columns just uses man lsof
.
System Calls and strace
The SYSTEM_CALL is a mechanism that allows a program to ask the kernel for some resources, like the access of data stored in the disk. Therefore, if we have a tool to intercept those calls, we can have a deep comprehension of what a program is doing or what it want’s to do in your system. A celebrated tool to intercept the system calls is the strace
.
If the strace command is not available in your system, install it. In apt-based distros just calling the following command should be enough.
$ apt install strace
You can use strace
in two different ways. The first one is to call strace followed by the strace arguments and the command to be intercepted:
$ strace ARGS command
In the second way we will replace the command by the -p
argument followed by the process identification number (PID) of the process to be intercepted:
$ strace ARGS -p PID
To discover the PID of a process, you can use the command ps aux | grep -i ‘[p]rocess_name’
.
Let’s see an example. We will ask strace
to intercept any system call performed by the ls
command and we want to record the time that the system call was performed using the -t
argument.
$ strace -t ls
The output will be something similar to this:
18:02:23 execve("/usr/bin/ls", ["ls"], 0x7fffa727a418 /* 54 vars */) = 0
18:02:23 brk(NULL) = 0x55ebef60c000
18:02:23 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
18:02:23 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...
18:02:23 execve("/usr/bin/ls", ["ls"], 0x7fffa727a418 /* 54 vars */) = 0
18:02:23 brk(NULL) = 0x55ebef60c000
18:02:23 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
18:02:23 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...
Each line of the strace
output represents a single SYSTEM CALL. It’s easy to see the following pattern:
Name of the SYS CALL(Arguments to be used in the system call) = Result of the SYS CALL
Yes, it’s hard to understand the output or impossible without using a manual. For each SYS CALL you can use the man
command to get more information about what each line represents. For example, to see what openat
does and the meaning of each argument, use the following command:
$ man 2 openat
openat
is a sys call responsible for requesting a file to be opened. The result of the last line in the output of ls commands, 3
, means a successfully SYS CALL.
A deep dive into failures
We will see here failures and issues related to files, connections, etc. But don’t be afraid to explore other problems, we are just scratching the surface of strace
and lsof
.
Network issues
First, install the following packages:
$ python -m pip install requests flask
Create a script server_mlops.py
which we will use to simulate a server with network issues
# server_mlops.py
import time
import flask
app = flask.Flask(__name__)
@app.route('/')
def hello_world():
sleep_time = flask.request.args.get('sleep', default=10, type=int)
print('sleep_time:', sleep_time)
time.sleep(sleep_time)
return 'Hello World!'
if __name__ == '__main__':
app.run()
Starts the server:
$ python server_mlops.py
GET The PID of the process:
$ ps aux | grep -i '[s]erver_mlops.py'
You will see a output similar to this:
devmess+ 19321 18.0 0.3 29716 24792 pts/5 S+ 14:27 0:00 python server_mlops.py
The number in front of the username (19321
) is the PID
of the process.
Is my server alive?
We have to use a unique set of filters to answer this question using the lsof
. Therefore, we need to use the argument -a
, which represents an AND
operator. The -i
argument asks to show just the files associated with connections and finally the argument -p INT
makes the losf
list just the files opened by the process with the PID INT
.
$ lsof -a -i -p 19321
You will have a output similar to this
COMMAND | PID | USER | FD | TYPE | DEVICE | SIZE/OFF | NODE | NAME |
---|---|---|---|---|---|---|---|---|
python | 19321 | devmessias | 4u | IPv4 | 16108218 | 0t0 | TCP | localhost:5000 (LISTEN) |
The output shows that at least our server is listening in the 5000
port. Try to remove the -a
argument and see what happens.
Is my process stuck waiting for someone? What is causing the process hanging?
This can happen in a myriad of cases. For example, in the dependency management systems like pip/conda. Thus, it is superb to know if you can answer fast if you have a problem on your side or not. Let’s create a simple simulation of this issue. To do so, create the client_mlops.py
using the following code:
#!/usr/bin/env python
#client_mlops.py
import requests
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
'--sleep', type=int, help='time to sleep', default=0)
args = parser.parse_args()
print('Ask for localhost:5000 to sleep for {} seconds'.format(args.sleep))
r = requests.get('http://localhost:5000', params={'sleep': int(args.sleep)})
print(r.text)
In the above code, we have the sleep argument. This argument will ask server_mlops
to wait for a couple of seconds before sending the answer. Let’s see how these situations appear to us at the system call level.
Start the client_mlops.py
with the strace:
$ strace -e poll,select,connect,recvfrom,sendto python client_mlops.py --sleep=20
The -e
argument followed by poll,select,connect,recvfrom,sendto
asks to filter just the sys calls related with connections issues. The output of this fake failure will be something like this
connect(4, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(4, {sa_family=AF_INET6, sin6_port=htons(5000), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
connect(4, {sa_family=AF_INET6, sin6_port=htons(5000), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 ECONNREFUSED (Connection refused)
connect(4, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "GET /?sleep=10 HTTP/1.1\r\nHost: l"..., 154, 0, NULL, 0) = 154
recvfrom(4,
In the last line we see an unfinished recvfrom
SYS_CALL . Do you want to know more about recvfrom
? Execute man 2 recvfrom
in a terminal session. But what matters here? The point is that: strace
is telling us there is no problem with our client program, something is problematic on the server side.
You can also use the lsof
to investigate this problem. Let’s simulate this scenario.
Starts the client again:
$ python client_mlops.py --sleep=100
Now, get the PID using ps aux | grep -i '[c]lient_mlops.py'
and execute the lsof
with the obtained PID
lsof -a -i -p 19321
the output will be something like this
COMMAND | PID | USER | FD | TYPE | DEVICE | SIZE/OFF | NODE | NAME |
---|---|---|---|---|---|---|---|---|
python | 31551 | devmessias | 4u | IPv4 | 16622065 | 0t0 | TCP | localhost:57314->localhost:5000 (ESTABLISHED) |
What lsof
is telling us is that: ‘’You client seems fine. At least it is connected with the server“”. What does this mean? This means the answer was not of great help.
Problems with regular files
Sometimes you can have a problem with regular files and you don’t know. Nothing was printed in the output. For example, a wrong cache being used, a program that tries to access a file that doesn’t have permission, a malicious or a bad writing process accessing/creating files in your system, etc.
Here we will simulate simple examples related to regular files. To do so, first, copy any file to the tmp
folder or just call the following command:
$ man strace > /tmp/dummy_file.txt
Which processes is this file attached to?
Being able to answer this question can be quite useful. For example, suppose that there is a huge file in your disk being created, almost filling up your system and you want to discover which process is doing this.
First, create the following script.
#!/usr/bin/env python
# file_open.py
import time
f = open('/tmp/dummy_file.txt', 'r')
input('Press Enter to continue...')
Now open two different terminal sessions and perform the following command python file_open.py
. in each one.
To find all the processes which are attached to dummy_file.txt
you just need to call lsof
like this:
$ lsof /tmp/dummy_file.txt
The output should be something similar to this:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 15411 devmessias 3r REG 8,2 0 2911031 /tmp/dummy_file.txt
python 20777 devmessias 3r REG 8,2 0 2911031 /tmp/dummy_file.txt
We have two distinct processes, two different PIDs, using the same file.
I made a mistake! Deleted an important file! How can I recover it?
Here the simulation will be more than one more process attached to a file and you accidentally perform some action that deletes the file.
Create a file accident.txt
. Open a terminal session and perform the following command. Don’t close the session!
$ python -c 'f=open("accident.txt", "r");input("...")'
In another session, perform the following commands:
$ rm accident.txt
$ ls accident.txt
And it’s gone :(
ls: cannot access 'acidente.txt': No such file or directory
Don’t worry! Linux has a lot of cool aspects and one of them will help us to recover our file.
In Linux any process has a directory associated with it, this directory will be inside of the /proc
folder. And what do these directories store? A lot of things! I assure you that you can be surprised. For example, these folders also store the file descriptors associated with any process using accident.txt
. Let’s see if there are any processes using this file.
$ lsof -u your_username | grep 'accident.txt'
In my case, I’ve obtained the following output:
python 22465 devmessias 3r REG 8,2 37599 14288174 path/accident.txt (deleted)
This is great news! We have a process PID 22465
that still has a file descriptor associated with accident.txt
. Now we can use a simple cp
command to recover the data. To do so, we need to use the file descriptor number. In my case is 3
which is the number in front of the r
character in the output above.
$ cp /proc/22465/fd/3 recovered.txt
Now just call nano recovered.txt
and testify the result . It’s not magic! It’s just how the process pseudo-filesystem works!
Silent errors associated with files and permissions
Let’s create a simple example of an undue access to a file using the following script
#!/usr/bin/env python
# file_404.py
import time
try:
f = open('/tmp/file_that_dosent_exist.csv', 'r')
except FileNotFoundError:
pass
try:
# create a file with sudo and then change the permission using chmod 700
f = open('/tmp/file_wrong_permission.csv', 'r')
except PermissionError:
pass
input('Press Enter to continue...')
Call python file_404.py
, nothing will appears.
To trace any SYS CALL related to regular files made by python file_404.py
you just need to use the -e trace=file
arg, like this:
$ strace -f -e trace=file python file_404.py
The argument -f
says to strace
to monitor any children’s process. You probably will use -f
every time when you’re dealing with strace in real case scenarios.
The output of the previous command should be like this
lstat("something/file_404.py", {st_mode=S_IFREG|0644, st_size=242, ...}) = 0
openat(AT_FDCWD, "file_404.py", O_RDONLY) = 3
openat(AT_FDCWD, "/tmp/file_that_dosent_exist.csv", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/file_wrong_permission.csv", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
Cool! With the strace
we are able to identify errors even when the programmer used a dangerous practice in the code.
Ok, but let’s improve our output. We can filter the strace output redirecting into it the awk
(or grep
) and performing a conditional check that each line should start with the open
string and have the pattern = -1
in the line. The -1
means that the openat
SYS CALL had returned an error.
$ strace -f -e trace=file python file_404.py 2>&1 | awk '/^open/ && /= -1/ {print}'
The output now will be simpler and easier to analyze:
openat(AT_FDCWD, "/home/devmessias/anaconda3/pyvenv.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/arquivo_404.csv", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/arquivo_permission.csv", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
If you are using the last versions of strace
(5.2>=) you can use a more simple command like this:
$ strace -f -e trace=openat -e status=failed python file_404.py
Is this process using a cache? Where can I find this cache? Which configs files does this process use?
It is a tedious task to search for cache or configs files used by a process and sometimes we need to delete these cache files. Another task that appears very often is to discover which files a process is using or if it is doing anything strange in your system. Maybe you want to discover if your python script is using the correct libs and files. For all these situations, you can use the following command:
$ strace -f -e trace=file command
If you want to get just the SYS CALLs that was perfomed succesfully do the next:
$ strace -f -e trace=file comando 2>&1 | awk '/^open/ && !/= -1/ {print}'
In the above command the !
is as a negation parameter for the awk
search.
Again, the new versions of strace allows the status flag:
$ strace -f -e trace=openat -e status=successful python file_404.py
Dissecting your database system
The strace
and lsof
are powerful tools to discover and solve bugs. You don’t need to believe me. Just check the number of bugs in MySQL that strace
was used to tackle
“using strace for mysql troubleshooting. Thus, it’s not a surprise that we can use strace
in our daily life dealing with databases.
Is this process well behaved? How many connections does it have?
In MlOps or DevOps we always need to deal with database connections. Sometimes we must check if a process is closing these connections or is creating more than necessary. If these connections are made using the TCP protocol you can list all established connections using the following command:
$ lsof -iTCP -sTCP:ESTABLISHED
As expected, we get a lot of unwanted information like website addresses and other application communications. If we want to list just the TCP connection in a process, we must pass the PID like this:
$ lsof -iTCP -sTCP:ESTABLISHED -p 22157
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 22157 devmessias 4u IPv4 9474517 0t0 TCP localhost:35932->localhost:mysql (ESTABLISHED)
python 22157 devmessias 5u IPv4 9474518 0t0 TCP localhost:35934->localhost:mysql (ESTABLISHED)
python 22157 devmessias 6u IPv4 9475529 0t0 TCP localhost:37048->localhost:5000 (ESTABLISHED)
As can you see we also have some connections in our server that are not related to the mysql
. If we want to investigate just the mysql
connections between all the processes in our system just do:
$ lsof -iTCP:mysql -sTCP:ESTABLISHED
python 22157 devmessias 4u IPv4 9474517 0t0 TCP localhost:35932->localhost:mysql (ESTABLISHED)
python 22157 devmessias 5u IPv4 9474518 0t0 TCP localhost:35934->localhost:mysql (ESTABLISHED)
The process identified by 22157 PID has two connections with our mysql server.
Notice we have used a pattern in the -i
argument. This pattern follows this structure:
lsof -i[protocol][@hostname|hostaddr][:service|port]
Everything is working proprely?
Let’s give you a taste of what we can extract from the mysql service using strace
in order to get a comprehension about the processes.
Get the PID of mysqld
$ps aux | grep -i '[m]ysqld'
mysql 14001 1 0 13:41 ? 00:00:18 /usr/sbin/mysqld
We will ask the strace
to increase the size of the strings up to 50 characters (-s 50
) and we will omit any SYS CALL of the type io_getevents
, nanosleep
and futex
.
$ sudo strace -s 50 -f -e trace=!io_getevents,nanosleep,futex -p 10767
Chose one of your databases and tables to do the following examples. Here, I’ve perfomed this SQL query:
SELECT * FROM product WHERE product_id = 1;
My output prompted some stuff like this
[pid 14334] recvfrom(52, ",\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 14334] recvfrom(52, "\3SELECT * FROM product WHERE product_id = 1", 44, MSG_DONTWAIT, NULL, NULL) = 44
[pid 14334] sendto(52, "\1\0\0\1\5F\0\0\2\3def\16farmers_market\7product\7product\nprodu"..., 477, MSG_DONTWAIT, NULL, 0 <unfinished ...>
[pid 14207] sched_yield( <unfinished ...>
[pid 14334] <... sendto resumed>) = 477
[pid 14207] <... sched_yield resumed>) = 0
...
We can see the SQL query above. This also shows how strace
can help us to gain a deep understanding about our system. We can see how the sql queries are comunicating using de recvfrom
and sendfrom
calls. The man 2 recvfrom
says the first number, 52, represents the file descriptor associated with a unix socket.
We can use this approach to investigate $IO$ problems ( using-strace-in-linux-to-troubleshoot-database-performance-issues) as well many others. But let’s simulate a lock condition and see what happens.
Start a session and initiate any transaction. Don’t finish with a COMMIT; command!
# first session
MariaDB [you_db]> BEGIN;
Query OK, 0 rows affected (0.001 sec)
MariaDB [your_db]> UPDATE customer SET customer_first_name = 'something' WHERE customer_id=1;
Query OK, 0 rows affected (0.001 sec)
Rows matched: 1 Changed: 0 Warnings: 0
MariaDB [you_db]>
Looks in the strace
output, something like this should appear
[pid 14334] recvfrom(52, "I\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 14334] recvfrom(52, "\3UPDATE customer SET customer_first_name = 'Brun"..., 73, MSG_DONTWAIT, NULL, NULL) = 73
[pid 14334] sendto(52, "0\0\0\1\0\0\0\3\0\0\0(Rows matched: 1 Changed: 0 Warnings:"..., 52, MSG_DONTWAIT, NULL, 0) = 52
[pid 14334] recvfrom(52, 0x7fd354007348, 4, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN
[pid 14334] poll([{fd=52, events=POLLIN|POLLPRI}], 1, 28800000
The poll
is a system call that will wait for any change in the file descriptor 52. The process is waiting for the COMMIT;
clause in our first mysql session. Note the absence of the enclosing )
in the last line.
Open another mysql session, try to execute the same SQL query (whitouth the BEGIN)
# second session
MariaDB [your_db]> UPDATE customer SET customer_first_name = 'something' WHERE customer_id=1;
The db row it’s in a lock state. If we look in the strace
output just some new lines will be printed, like this
<unfinished ...>
[pid 29884] recvfrom(83, "I\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 29884] recvfrom(83, "\3UPDATE customer SET customer_first_name = 'someth"..., 73, MSG_DONTWAIT, NULL, NULL) = 73
Our second transction is waiting for the first to be commited in our db. If you perform a commit in the first session this will hapen in the strace
output:
[pid 14334] <... poll resumed>) = 1 ([{fd=52, revents=POLLIN}])
[pid 14334] recvfrom(52, "\7\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 14334] recvfrom(52, "\3COMMIT", 7, MSG_DONTWAIT, NULL, NULL) = 7
...
Your first transaction was performed and the lock was released allowing the second transaction to be executed.
Extras related to files (/proc/
) and strace
I gave you some examples about connection issues, regular files accidents and database management. Here I’ll give more examples that I believe are not so useful although they are very interesting.
Creating a SYS CALL summary: what does my program do?
An overview of what your program does can help you to perform some optimizations or discover something strange. We can use the strace -c
to get an overview of the system calls. For example, the following command gave me a summary of system calls of a make sync-env
that I’m using in one of my projects:
$ strace -c -e trace=!\wait4 make sync-env
The exclamation mark, -e trace=!\wait4
, in the above command tells strace
to ignore any wait4
system call.
What I’ve obtained was this
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
14,54 0,000209 6 33 13 openat
13,01 0,000187 17 11 vfork
12,32 0,000177 7 25 mmap
8,49 0,000122 3 31 close
8,42 0,000121 5 21 rt_sigprocmask
8,14 0,000117 6 17 read
6,89 0,000099 5 19 11 stat
5,85 0,000084 3 23 fstat
2,85 0,000041 8 5 mprotect
2,64 0,000038 9 4 write
2,51 0,000036 2 16 fcntl
2,02 0,000029 3 9 rt_sigaction
1,95 0,000028 14 2 readlink
1,95 0,000028 14 2 getdents64
1,25 0,000018 4 4 brk
1,25 0,000018 18 1 1 access
1,25 0,000018 3 5 pipe
1,11 0,000016 4 4 ioctl
0,84 0,000012 6 2 getcwd
0,70 0,000010 10 1 munmap
0,49 0,000007 7 1 lstat
0,49 0,000007 7 1 execve
0,49 0,000007 3 2 prlimit64
0,35 0,000005 5 1 chdir
0,21 0,000003 3 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0,001437 241 25 total
What can we extract from the above output? A lot of things. For example make sync-env
spent 14% of the time doing system calls of the type opennat
and 20 of these 33 openat
calls had some problem.
Did this process start with the correct environment variables?
We have several reasons to use environment variables. These variables are easy to configure, improve security, and prevent errors. So they are used everywhere to store a secret, point to a lib and much more. However, sometimes we are not so sure whether a process is using the correct environment variables or not.
Let’s try something simple
$ ANSWER=42 python script.py
As I said previously the /proc
is responsible for storing the state of any process running in your machine. So, it’s not a surprise that we can extract the environment variables from that.
To print the environment variables of the process with a PID 4301
just call this cat /proc/4031/environ
and get an ugly output. To improve the output we can use the tr
and replace the null characters \0
by break lines \n
. Like this:
$ tr '\0' '\n' < /proc/4031/environ
You will have a output similar to this
ANSWER=42
SHELL=/bin/bash
LANGUAGE=en_US
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/bin/java
...more stuff
If you want to look just at the environment variables with a given string pattern you can use awk
, grep
, or anything that you feel more comfortable. For example, doing this
$ tr '\0' '\n' < /proc/4031/environ 2>&1 | awk '/^CONDA/ {print}'
I’ve obtained this
CONDA_EXE=/home/devmessias/anaconda3/bin/conda
CONDA_PREFIX=/home/devmessias/anaconda3
CONDA_PROMPT_MODIFIER=(base)
CONDA_SHLVL=1
CONDA_PYTHON_EXE=/home/devmessias/anaconda3/bin/python
CONDA_DEFAULT_ENV=base
I forgot to redirect the outputs! What can I do now?
Suppose you started a process without redirecting the output to a file. Maybe you forgot or you are too optimistic about the problems and now you want to persist the errors. If restart the process it’s out of question you can use the strace
to solve your headache. Let’s see how we can solve that.
The system calls responsible to request the kernel to write in the stdin, stdout and stderr is the write
. If you want to know more you should read the manual
$ man 2 write
But the most important part of the write
manual is on the top and is this:
NAME
write - write to a file descriptor
SYNOPSIS
#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);
As you can see the first argument is an integer that represents the file descriptor. If fd=1 this means a writing in the stdout and if fd=2 the writing will be in the stderr. So, it’s an easy piece here. We just need to monitor any sys call write
with the fd
equals to $1$ or $2$ and save the values in a file.
When I need to do this (two or three times in my life time) I use the following pattern
$ strace -f -t -etrace=write -s 1000 -p 4320 2>&1 | grep --line-buffered -e 'write(2, ' -e 'write(1, ' > out.txt
I’m asking strace
to monitor any SYS CALL write
from the process with the PID 4320
or children created by them (-f
) . And saving the output in the out.txt
file.
The following code changes the server_mlops.py
to help you to explore more this scenario.
# server_mlops.py
import time
import flask
import sys
app = flask.Flask(__name__)
@app.route('/')
def hello_world():
sleep_time = flask.request.args.get('sleep', default=10, type=int)
print('sleep_time:', sleep_time)
for i in range(sleep_time):
print(f'INFO: {i} of sleep_time \n asdf \t ')
print(f'ERROR: Example msg {i}', file=sys.stderr)
time.sleep(1)
return 'Hello World!'
if __name__ == '__main__':
app.run()
How this program has been called? What is the working dir of the process?
Ok, you can answer these questions using htop
. However we can get the same information without installing anything, just looking in a file inside of the /proc/
folder, like this
$ tr '\0' '\t' < /proc/A_PID_NUMBER/cmdline
In my case I’ve obtined this
python client.py --sleep 1000
To discover the working directory do this
$ readlink /proc/A_PID_NUMBER/cwd
Conclusion & Suggestions
I hope that after reading this post you can be more prepared to face problems in your daily tasks. But if you have any suggestion you can send me an email devmessias@gmail.com. Thanks!
If you want to know more about strace
and linux in general I strongly recommend spent some hours navigating and reading Julia Evans’ blog
https://jvns.ca/.