In DevOps or MLOps, discovering what a process is doing right now can save your system from a catastrophe. But sometimes the failure has already happened. When it does, questions like these appear:

  • The process is hanging and I don’t know why!

  • What is the cause of the problem?

  • Is it a network issue?

  • …and so on and so forth.

Being able to answer these questions quickly and precisely can save your day, or even earn you a promotion. I’ll show you some simple examples of how those questions can be answered.

The strace logo is an ostrich.

Introduction

Linux is a very transparent operating system. Transparency here means that it is easy to dig into the system’s behavior to understand how it works. Linux is also easy to control, even at the low level. But how can this help us understand a process and, consequently, a failure?

The first step to understanding a process is to analyze its output. But sometimes the output doesn’t give us enough information. Here I’ll talk about how we can extract useful information using the strace and lsof commands. To use these commands it helps to know two concepts: the “everything is a file” mantra and the system call mechanism.

Concepts

“Everything is a file.” - The UNIX mantra.


What comes to your mind when someone talks about files? Maybe a JPEG, or a CSV if you work as a data scientist. But in the UNIX approach, everything can be a file, even a network connection. And when something is not literally a file, it has at least a file descriptor associated with it. Maybe you’re thinking that I’m wandering off topic: “How can this help improve our comprehension of a process or failure?” The answer is straightforward: if everything is a file, we can use the same set of tools to list, read and interact with files to analyze a generic process. This is where lsof comes in.
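To make this concrete, here is a small Python sketch showing that a regular file and a network socket both receive plain integer file descriptors, visible under /proc/self/fd, which is exactly where lsof gets its data on Linux:

```python
import os
import socket
import tempfile

# A regular file and a network socket: the kernel hands back a plain
# integer file descriptor for both.
f = tempfile.TemporaryFile()
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

print('file fd:  ', f.fileno())
print('socket fd:', s.fileno())

# On Linux, every open descriptor of this process appears as an entry
# under /proc/self/fd.
open_fds = os.listdir('/proc/self/fd')
assert str(f.fileno()) in open_fds
assert str(s.fileno()) in open_fds

s.close()
f.close()
```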

LSOF

lsof is an acronym for “list of open files”. In simple terms, lsof is a command-line tool that lists the open file descriptors on your machine. Besides that, lsof offers a set of different filters to narrow that list down. Thus, we can list the open file descriptors of a specific user or process.

To list all the open files on your machine, use the following command:

username:/$ lsof > lsof_everything.txt

The file lsof_everything.txt is huge and will look similar to this

COMMAND     PID   TID TASKCMD               USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
systemd       1                             root  cwd   unknown                                         /proc/1/cwd (readlink: Permission denied)
systemd       1                             root  rtd   unknown                                         /proc/1/root (readlink: Permission denied)
systemd       1                             root  txt   unknown                                         /proc/1/exe (readlink: Permission denied)

Take some time to analyze the output. The first lines relate to the root user, and you don’t have permission to gain information about those files, which is good. Let’s remove this useless information by filtering the open files down to just your user:

username:/$ lsof -u username > lsof_my.txt

The txt file is still big. Try to look into this file to see if you can find anything interesting, like a webpage address.

We have a lot of different columns in the file. But don’t worry, I’ll show you the columns that I believe are the most important ones.

  • COMMAND
    • The command name used to initiate the process
  • PID
    • This is an integer number that identifies a process, Process IDentification number.
  • USER
    • The user to whom the process belongs.
  • TYPE
    • This may be one of the most important columns. It has more than 60 possible values. This column gives the type of the node associated with the file. If the file is related to connections and sockets you will see values like IPv4, IPv6, unix, INET, etc. If it’s a regular file (csv, jpeg, txt, etc.) you will see the REG value.
  • NODE
    • This helps us identify the node associated with the file descriptor. It can be a number, a string, etc. For internet protocols this column will hold values like TCP or UDP.
  • NAME
    • The values here vary a lot. For example, sometimes it can be a web server address, sometimes just a cryptic string. Regardless of how hard the values in this column are to interpret, you should look carefully here.

If you want a deeper understanding of the columns, just use man lsof.
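As a sketch of how these columns line up, the following Python snippet splits one captured lsof line (static sample data, no live lsof call) into the nine columns described above:

```python
# Split one captured lsof output line into the nine standard columns.
COLUMNS = ['COMMAND', 'PID', 'USER', 'FD', 'TYPE',
           'DEVICE', 'SIZE/OFF', 'NODE', 'NAME']

sample = ('python  22157 devmessias    4u  IPv4 9474517      0t0  '
          'TCP localhost:35932->localhost:mysql (ESTABLISHED)')

# NAME may contain spaces, so split at most 8 times and keep the rest.
row = dict(zip(COLUMNS, sample.split(None, 8)))

print(row['TYPE'])   # IPv4: a network connection, not a REG file
print(row['NODE'])   # TCP: the protocol in use
print(row['NAME'])
```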

System Calls and strace

A system call is a mechanism that allows a program to ask the kernel for some resource, like access to data stored on disk. Therefore, if we have a tool to intercept those calls, we can gain a deep comprehension of what a program is doing, or what it wants to do, in your system. A celebrated tool for intercepting system calls is strace.

If the strace command is not available on your system, install it. In apt-based distros the following command should be enough:

$ apt install strace

In older Debian distros (<=10) you can get the latest deb package from here: strace_5.10-1_amd64.deb. The newer versions of strace have some cool features that will facilitate our job.

You can use strace in two different ways. The first one is to call strace followed by the strace arguments and the command to be intercepted:

$ strace ARGS command

In the second way we will replace the command by the -p argument followed by the process identification number (PID) of the process to be intercepted:

$ strace ARGS -p PID

To discover the PID of a process, you can use the command ps aux | grep -i '[p]rocess_name'.

Let’s see an example. We will ask strace to intercept any system call performed by the ls command and we want to record the time that the system call was performed using the -t argument.

$ strace -t ls

The output will be something similar to this:

18:02:23 execve("/usr/bin/ls", ["ls"], 0x7fffa727a418 /* 54 vars */) = 0
18:02:23 brk(NULL)                      = 0x55ebef60c000
18:02:23 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
18:02:23 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...

Each line of the strace output represents a single system call. It’s easy to see the following pattern:

Name of the SYS CALL(Arguments to be used in the system call) = Result of the SYS CALL

Yes, the output is hard to understand, even impossible without a manual. For each system call you can use the man command to learn what the line represents. For example, to see what openat does and the meaning of each argument, use the following command:

$ man 2 openat

openat is a system call responsible for requesting that a file be opened. The result of the last line in the output of the ls trace, 3, is the file descriptor assigned to the opened file, which means the call succeeded.
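You can check where that 3 comes from yourself: descriptors 0, 1 and 2 are taken by stdin, stdout and stderr, so the first file a process opens usually lands on the lowest free number. A small sketch:

```python
import os
import tempfile

# Descriptors 0, 1 and 2 are stdin, stdout and stderr, so a newly
# opened file receives the lowest free descriptor, typically 3.
fd_tmp, path = tempfile.mkstemp()
os.close(fd_tmp)

fd = os.open(path, os.O_RDONLY)
print('first free descriptor:', fd)

os.close(fd)
os.unlink(path)
```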

A deep dive into failures

Here we will look at failures and issues related to files, connections, etc. But don’t be afraid to explore other problems; we are just scratching the surface of strace and lsof.

Network issues

First, install the following packages:

$ python -m pip install requests flask

Create a script server_mlops.py, which we will use to simulate a server with network issues:

# server_mlops.py
import time
import flask

app = flask.Flask(__name__)

@app.route('/')
def hello_world():
    sleep_time = flask.request.args.get('sleep', default=10, type=int)
    print('sleep_time:', sleep_time)
    time.sleep(sleep_time)
    return 'Hello World!'

if __name__ == '__main__':
    app.run()

Start the server:

$ python server_mlops.py

Get the PID of the process:

$ ps aux | grep -i '[s]erver_mlops.py'

You will see an output similar to this:

devmess+ 19321 18.0  0.3  29716 24792 pts/5    S+   14:27   0:00 python server_mlops.py

The number in front of the username (19321) is the PID of the process.
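If you prefer to stay inside Python, the same lookup can be done by reading /proc directly; this is a sketch of what ps itself does (the helper name pids_matching is mine):

```python
import os

def pids_matching(fragment):
    """Return PIDs whose command line contains `fragment`,
    by reading /proc/<pid>/cmdline (the same data ps shows)."""
    hits = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/{}/cmdline'.format(entry), 'rb') as fh:
                cmdline = fh.read().replace(b'\0', b' ').decode(errors='replace')
        except OSError:
            continue  # process vanished or permission denied
        if fragment in cmdline:
            hits.append(int(entry))
    return hits

print(pids_matching('server_mlops.py'))
```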

Is my server alive?

We have to use a specific set of filters to answer this question with lsof. We need the argument -a, which represents an AND operator. The -i argument restricts the list to files associated with connections, and the -p PID argument makes lsof list just the files opened by the process with that PID.

$ lsof -a -i -p 19321

You will have an output similar to this:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 19321 devmessias 4u IPv4 16108218 0t0 TCP localhost:5000 (LISTEN)

The output shows that, at the very least, our server is listening on port 5000. Try removing the -a argument and see what happens.

Is my process stuck waiting for someone? What is causing the process to hang?

This can happen in a myriad of cases, for example in dependency management systems like pip/conda. So it is valuable to be able to answer quickly whether the problem is on your side or not. Let’s create a simple simulation of this issue. To do so, create client_mlops.py with the following code:

#!/usr/bin/env python
#client_mlops.py

import requests
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--sleep', type=int, help='time to sleep', default=0)
args = parser.parse_args()

print('Ask for localhost:5000 to sleep for {} seconds'.format(args.sleep))
r = requests.get('http://localhost:5000', params={'sleep': int(args.sleep)})
print(r.text)

In the above code we have the sleep argument. It asks server_mlops to wait a number of seconds before sending the answer. Let’s see how these situations look at the system call level.

Start client_mlops.py under strace:

$ strace -e poll,select,connect,recvfrom,sendto python client_mlops.py --sleep=20

The -e argument followed by poll,select,connect,recvfrom,sendto asks strace to trace just the system calls related to connections. The output of this fake failure will be something like this:

connect(4, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
connect(4, {sa_family=AF_INET6, sin6_port=htons(5000), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
connect(4, {sa_family=AF_INET6, sin6_port=htons(5000), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 ECONNREFUSED (Connection refused)
connect(4, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "GET /?sleep=10 HTTP/1.1\r\nHost: l"..., 154, 0, NULL, 0) = 154
recvfrom(4, 

In the last line we see an unfinished recvfrom system call. Do you want to know more about recvfrom? Execute man 2 recvfrom in a terminal session. But what matters here? The point is this: strace is telling us there is no problem with our client program; something is problematic on the server side.
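By the way, once strace shows the client is simply blocked in a receive call, the usual defense is a client-side timeout. Here is a self-contained sketch of that hang using a raw socket with settimeout; the 1-second limit is an arbitrary choice:

```python
import socket

# A listener that never answers: the kernel completes the TCP
# handshake via the backlog, but no application-level reply ever comes.
srv = socket.socket()
srv.bind(('127.0.0.1', 0))   # port 0: let the OS pick a free port
srv.listen(1)

# create_connection(timeout=...) also sets the timeout used by recv().
cli = socket.create_connection(srv.getsockname(), timeout=1)
try:
    cli.recv(1024)           # without the timeout, this blocks forever
    timed_out = False
except socket.timeout:
    timed_out = True

print('timed out:', timed_out)
cli.close()
srv.close()
```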

You can also use lsof to investigate this problem. Let’s simulate this scenario.

Start the client again:

$ python client_mlops.py --sleep=100

Now, get the PID using ps aux | grep -i '[c]lient_mlops.py' and execute lsof with the obtained PID:

lsof -a -i -p 31551

The output will be something like this:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 31551 devmessias 4u IPv4 16622065 0t0 TCP localhost:57314->localhost:5000 (ESTABLISHED)

What lsof is telling us is: “Your client seems fine; at least it is connected to the server.” What does this mean? It means the answer, on its own, was not of great help in pinpointing the hang; it only rules out a client-side connection problem.

Problems with regular files

Sometimes you have a problem with regular files and you don’t even know it, because nothing was printed in the output. For example: a wrong cache being used, a program trying to access a file without the right permission, a malicious or badly written process accessing or creating files in your system, etc.

Here we will simulate simple examples related to regular files. First, copy any file to the /tmp folder, or just call the following command:

$ man strace > /tmp/dummy_file.txt 

Which processes are attached to this file?

Being able to answer this question can be quite useful. For example, suppose there is a huge file being created on your disk, almost filling up your system, and you want to discover which process is doing this.

First, create the following script.

#!/usr/bin/env python
# file_open.py

f = open('/tmp/dummy_file.txt', 'r')
input('Press Enter to continue...')

Now open two different terminal sessions and run python file_open.py in each one.

To find all the processes attached to dummy_file.txt, you just need to call lsof like this:

$ lsof /tmp/dummy_file.txt

The output should be something similar to this:

COMMAND   PID       USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
python  15411 devmessias    3r   REG    8,2        0 2911031 /tmp/dummy_file.txt
python  20777 devmessias    3r   REG    8,2        0 2911031 /tmp/dummy_file.txt

We have two distinct processes, two different PIDs, using the same file.

I made a mistake! Deleted an important file! How can I recover it?

Here we will simulate a scenario in which one or more processes are attached to a file and you accidentally perform some action that deletes it.

Create a file accident.txt. Then open a terminal session and run the following command. Don’t close the session!

$ python -c 'f=open("accident.txt", "r");input("...")'

In another session, perform the following commands:

$ rm accident.txt
$ ls accident.txt

And it’s gone :(

ls: cannot access 'accident.txt': No such file or directory

Don’t worry! Linux has a lot of cool aspects and one of them will help us to recover our file.

In Linux any process has a directory associated with it inside the /proc folder. And what do these directories store? A lot of things! You may be surprised. Among them are the file descriptors opened by the process. Let’s see if any process still has accident.txt open.

$ lsof -u your_username | grep 'accident.txt'

In my case, I’ve obtained the following output:

python    22465 devmessias    3r      REG                8,2     37599   14288174 path/accident.txt (deleted)

This is great news! The process with PID 22465 still has a file descriptor associated with accident.txt. Now we can use a simple cp command to recover the data. To do so, we need the file descriptor number. In my case it is 3, the number in front of the r character in the output above.

$ cp /proc/22465/fd/3 recovered.txt

Now just call nano recovered.txt and verify the result. It’s not magic! It’s just how the process pseudo-filesystem works!
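The whole recovery trick can also be scripted. This sketch creates a file, deletes it while a descriptor is still open, then copies the data back out of /proc (the file names are illustrative):

```python
import os
import shutil

# Create the file, keep it open, then "accidentally" delete it.
with open('accident.txt', 'w') as fh:
    fh.write('important data')

f = open('accident.txt', 'r')      # this descriptor keeps the data alive
os.unlink('accident.txt')          # the rm accident

# Copy the still-open descriptor back out of the proc pseudo-filesystem.
src = '/proc/{}/fd/{}'.format(os.getpid(), f.fileno())
shutil.copyfile(src, 'recovered.txt')
f.close()

print(open('recovered.txt').read())
```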

Silent errors associated with files and permissions

Let’s create a simple example of undue access to files using the following script:

#!/usr/bin/env python
# file_404.py

try:
    f = open('/tmp/file_that_dosent_exist.csv', 'r')
except FileNotFoundError:
    pass

try:
    # create a file with sudo and then change the permission using chmod 700
    f = open('/tmp/file_wrong_permission.csv', 'r')
except PermissionError:
    pass

input('Press Enter to continue...')

Call python file_404.py and nothing will appear.

To trace any system call related to regular files made by python file_404.py, you just need the -e trace=file argument, like this:

$ strace -f -e trace=file python file_404.py

The -f argument tells strace to also monitor any child processes. You will probably use -f every time you deal with strace in real-case scenarios.

The output of the previous command should be like this

lstat("something/file_404.py", {st_mode=S_IFREG|0644, st_size=242, ...}) = 0
openat(AT_FDCWD, "file_404.py", O_RDONLY) = 3
openat(AT_FDCWD, "/tmp/file_that_dosent_exist.csv", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/file_wrong_permission.csv", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)

Cool! With strace we are able to identify errors even when the programmer used a dangerous practice (silently swallowing the exceptions) in the code.

Ok, but let’s improve our output. We can filter the strace output by piping it into awk (or grep) and checking that each line starts with the open string and contains the pattern = -1. The -1 means that the openat system call returned an error.

$ strace -f -e trace=file python file_404.py 2>&1 | awk '/^open/ && /= -1/ {print}'

The output now will be simpler and easier to analyze:

openat(AT_FDCWD, "/home/devmessias/anaconda3/pyvenv.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/file_that_dosent_exist.csv", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/file_wrong_permission.csv", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)

If you are using a recent version of strace (>= 5.2) you can use a simpler command like this:

$ strace -f -e trace=openat -e status=failed python file_404.py

Is this process using a cache? Where can I find this cache? Which configs files does this process use?

It is a tedious task to search for the cache or config files used by a process, and sometimes we need to delete these cache files. Another frequent task is to discover which files a process is using, or whether it is doing anything strange in your system. Maybe you want to check that your Python script is using the correct libs and files. For all these situations, you can use the following command:

$ strace -f -e trace=file command

If you want just the system calls that were performed successfully, do this:

$ strace -f -e trace=file command 2>&1 | awk '/^open/ && !/= -1/ {print}'

In the above command the ! acts as a negation operator in the awk pattern.

Again, the newer versions of strace allow the status flag:

$ strace -f -e trace=openat -e status=successful python file_404.py

Dissecting your database system

strace and lsof are powerful tools for discovering and solving bugs. You don’t need to believe me: just check the number of bugs in MySQL that strace was used to tackle in “using strace for mysql troubleshooting”. Thus, it’s no surprise that we can use strace in our daily life when dealing with databases.

Is this process well behaved? How many connections does it have?

In MLOps or DevOps we always need to deal with database connections. Sometimes we must check whether a process is closing these connections, or creating more than necessary. If these connections use the TCP protocol, you can list all established connections with the following command:

$ lsof -iTCP -sTCP:ESTABLISHED 

As expected, we get a lot of unwanted information, like website addresses and other application communications. If we want to list just the TCP connections of a single process, we must pass the PID like this:

$ lsof -iTCP -sTCP:ESTABLISHED -p 22157

COMMAND   PID       USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
python  22157 devmessias    4u  IPv4 9474517      0t0  TCP localhost:35932->localhost:mysql (ESTABLISHED)
python  22157 devmessias    5u  IPv4 9474518      0t0  TCP localhost:35934->localhost:mysql (ESTABLISHED)
python  22157 devmessias    6u  IPv4 9475529      0t0  TCP localhost:37048->localhost:5000 (ESTABLISHED)

As you can see, we also have some connections in our server that are not related to mysql. If we want to investigate just the mysql connections across all the processes in our system, do:

$ lsof -iTCP:mysql -sTCP:ESTABLISHED 
python  22157 devmessias    4u  IPv4 9474517      0t0  TCP localhost:35932->localhost:mysql (ESTABLISHED)
python  22157 devmessias    5u  IPv4 9474518      0t0  TCP localhost:35934->localhost:mysql (ESTABLISHED)

The process identified by 22157 PID has two connections with our mysql server.

Notice that we used a pattern in the -i argument. This pattern follows the structure:

lsof -i[protocol][@hostname|hostaddr][:service|port]

Is everything working properly?

Let me give you a taste of what we can extract from the mysql service using strace, in order to gain some comprehension of its processes.

Get the PID of mysqld

$ ps aux | grep -i '[m]ysqld'
mysql    14001     1  0 13:41 ?        00:00:18 /usr/sbin/mysqld

We will ask strace to print strings up to 50 characters long (-s 50), and we will omit any system calls of the types io_getevents, nanosleep and futex:

$ sudo strace -s 50 -f -e 'trace=!io_getevents,nanosleep,futex' -p 14001

Choose one of your databases and tables for the following examples. Here, I’ve performed this SQL query:

SELECT * FROM  product WHERE product_id = 1;

My output showed some stuff like this:

[pid 14334] recvfrom(52, ",\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 14334] recvfrom(52, "\3SELECT * FROM  product WHERE product_id = 1", 44, MSG_DONTWAIT, NULL, NULL) = 44
[pid 14334] sendto(52, "\1\0\0\1\5F\0\0\2\3def\16farmers_market\7product\7product\nprodu"..., 477, MSG_DONTWAIT, NULL, 0 <unfinished ...>
[pid 14207] sched_yield( <unfinished ...>
[pid 14334] <... sendto resumed>)       = 477
[pid 14207] <... sched_yield resumed>)  = 0
...

We can see the SQL query above. This also shows how strace can help us gain a deep understanding of our system. We can see how the SQL queries are communicated using the recvfrom and sendto calls. According to man 2 recvfrom, the first number, 52, is the file descriptor associated with a socket.

We can use this approach to investigate I/O problems (using-strace-in-linux-to-troubleshoot-database-performance-issues), as well as many others. But let’s simulate a lock condition and see what happens.

Start a session and initiate a transaction. Don’t finish it with a COMMIT; command!

# first session
MariaDB [your_db]> BEGIN;
Query OK, 0 rows affected (0.001 sec)
MariaDB [your_db]> UPDATE customer SET customer_first_name = 'something' WHERE customer_id=1;
Query OK, 0 rows affected (0.001 sec)
Rows matched: 1  Changed: 0  Warnings: 0
MariaDB [your_db]>

Look at the strace output; something like this should appear:

[pid 14334] recvfrom(52, "I\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 14334] recvfrom(52, "\3UPDATE customer SET customer_first_name = 'Brun"..., 73, MSG_DONTWAIT, NULL, NULL) = 73
[pid 14334] sendto(52, "0\0\0\1\0\0\0\3\0\0\0(Rows matched: 1  Changed: 0  Warnings:"..., 52, MSG_DONTWAIT, NULL, 0) = 52
[pid 14334] recvfrom(52, 0x7fd354007348, 4, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN 
[pid 14334] poll([{fd=52, events=POLLIN|POLLPRI}], 1, 28800000

poll is a system call that waits for any change in, here, the file descriptor 52. The process is waiting for the COMMIT; clause in our first mysql session. Note the absence of the enclosing ) in the last line: the call is still unfinished.
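The poll mechanism itself is easy to reproduce. This sketch registers one end of a pipe and polls it before and after data arrives, mirroring the poll([{fd=52, events=POLLIN|POLLPRI}], ...) line in the trace; the 100 ms timeout is an arbitrary choice:

```python
import os
import select

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)

# Nothing readable yet: poll() waits up to 100 ms and returns [].
print('before write:', p.poll(100))

os.write(w, b'COMMIT')

# Now the descriptor is readable and poll() reports POLLIN immediately.
print('after write :', p.poll(100))
```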

Open another mysql session and try to execute the same SQL query (without the BEGIN):

# second session
MariaDB [your_db]> UPDATE customer SET customer_first_name = 'something' WHERE customer_id=1;

The db row is in a locked state. If we look at the strace output, just a few new lines will be printed, like this:

 <unfinished ...>
[pid 29884] recvfrom(83, "I\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 29884] recvfrom(83, "\3UPDATE customer   SET customer_first_name = 'someth"..., 73, MSG_DONTWAIT, NULL, NULL) = 73

Our second transaction is waiting for the first to be committed in our db. If you perform a commit in the first session, this will happen in the strace output:

[pid 14334] <... poll resumed>)         = 1 ([{fd=52, revents=POLLIN}])
[pid 14334] recvfrom(52, "\7\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 14334] recvfrom(52, "\3COMMIT", 7, MSG_DONTWAIT, NULL, NULL) = 7
...

Your first transaction was performed and the lock was released allowing the second transaction to be executed.

I gave you some examples about connection issues, regular file accidents and database management. Now I’ll give a few more examples that, while maybe less essential, are very interesting.

Creating a SYS CALL summary: what does my program do?

An overview of what your program does can help you perform some optimizations or discover something strange. We can use strace -c to get an overview of the system calls. For example, the following command gave me a summary of the system calls of a make sync-env target that I use in one of my projects:

$ strace -c -e 'trace=!wait4' make sync-env

The exclamation mark in -e 'trace=!wait4' tells strace to ignore any wait4 system call.

What I obtained was this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 14,54    0,000209           6        33        13 openat
 13,01    0,000187          17        11           vfork
 12,32    0,000177           7        25           mmap
  8,49    0,000122           3        31           close
  8,42    0,000121           5        21           rt_sigprocmask
  8,14    0,000117           6        17           read
  6,89    0,000099           5        19        11 stat
  5,85    0,000084           3        23           fstat
  2,85    0,000041           8         5           mprotect
  2,64    0,000038           9         4           write
  2,51    0,000036           2        16           fcntl
  2,02    0,000029           3         9           rt_sigaction
  1,95    0,000028          14         2           readlink
  1,95    0,000028          14         2           getdents64
  1,25    0,000018           4         4           brk
  1,25    0,000018          18         1         1 access
  1,25    0,000018           3         5           pipe
  1,11    0,000016           4         4           ioctl
  0,84    0,000012           6         2           getcwd
  0,70    0,000010          10         1           munmap
  0,49    0,000007           7         1           lstat
  0,49    0,000007           7         1           execve
  0,49    0,000007           3         2           prlimit64
  0,35    0,000005           5         1           chdir
  0,21    0,000003           3         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0,001437                   241        25 total

What can we extract from the above output? A lot of things. For example, make sync-env spent about 14% of its syscall time performing system calls of the type openat, and 13 of those 33 openat calls had some problem.

Did this process start with the correct environment variables?

We have several reasons to use environment variables. These variables are easy to configure, improve security, and prevent errors. So they are used everywhere to store a secret, point to a lib and much more. However, sometimes we are not so sure whether a process is using the correct environment variables or not.

Let’s try something simple

$ ANSWER=42 python script.py

As I said previously, the /proc filesystem is responsible for storing the state of any process running on your machine. So it’s no surprise that we can extract the environment variables from there.

To print the environment variables of the process with PID 4031, just call cat /proc/4031/environ and you will get an ugly output. To improve it, we can use tr to replace the null characters \0 with line breaks \n, like this:

$ tr '\0' '\n' < /proc/4031/environ

You will have an output similar to this:

ANSWER=42
SHELL=/bin/bash
LANGUAGE=en_US
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/bin/java
...more stuff

If you want to look at just the environment variables matching a given string pattern, you can use awk, grep, or anything you feel more comfortable with. For example, by doing this:

$ tr '\0' '\n' < /proc/4031/environ 2>&1 | awk '/^CONDA/ {print}'

I’ve obtained this

CONDA_EXE=/home/devmessias/anaconda3/bin/conda
CONDA_PREFIX=/home/devmessias/anaconda3
CONDA_PROMPT_MODIFIER=(base) 
CONDA_SHLVL=1
CONDA_PYTHON_EXE=/home/devmessias/anaconda3/bin/python
CONDA_DEFAULT_ENV=base
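The same parsing can be done in Python; this sketch (the helper name environ_of is mine) reads /proc/&lt;pid&gt;/environ and splits it on the NUL separators, just like the tr trick above:

```python
import os

def environ_of(pid):
    """Return the environment a process was started with, read from
    /proc/<pid>/environ (NUL-separated KEY=VALUE entries)."""
    with open('/proc/{}/environ'.format(pid), 'rb') as fh:
        raw = fh.read()
    entries = (chunk.decode(errors='replace')
               for chunk in raw.split(b'\0') if chunk)
    return dict(e.split('=', 1) for e in entries if '=' in e)

env = environ_of(os.getpid())
print(sorted(env)[:5])   # a few variable names from our own process
```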

I forgot to redirect the outputs! What can I do now?

Suppose you started a process without redirecting the output to a file. Maybe you forgot, or you were too optimistic, and now you want to persist the errors. If restarting the process is out of the question, you can use strace to solve your headache. Let’s see how.

The system call responsible for asking the kernel to write to stdout and stderr is write. If you want to know more, you should read the manual:

$ man 2 write

But the most important part of the write manual is on the top and is this:

NAME
       write - write to a file descriptor
SYNOPSIS
       #include <unistd.h>
       ssize_t write(int fd, const void *buf, size_t count);

As you can see, the first argument is an integer that represents the file descriptor. If fd=1 the write goes to stdout, and if fd=2 it goes to stderr. So this part is easy: we just need to monitor any write system call with fd equal to 1 or 2 and save the values in a file.
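You can see those descriptor numbers in action from Python by writing through them directly; os.write takes the raw fd as its first argument, exactly as it appears in the strace output:

```python
import os
import sys

# Flush the buffered stdout first so the raw writes don't interleave.
sys.stdout.flush()

n1 = os.write(1, b'via fd 1 (stdout)\n')   # shows up as write(1, ...)
n2 = os.write(2, b'via fd 2 (stderr)\n')   # shows up as write(2, ...)

print('bytes written:', n1, n2)
```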

When I need to do this (it has happened two or three times in my lifetime) I use the following pattern:

$ strace -f -t -e trace=write -s 1000 -p 4320 2>&1 | grep --line-buffered -e 'write(2, ' -e 'write(1, ' > out.txt

I’m asking strace to monitor any write system call from the process with PID 4320, or any children it creates (-f), and saving the output in the out.txt file.

The following version of server_mlops.py will help you explore this scenario further:

# server_mlops.py
import time
import flask
import sys

app = flask.Flask(__name__)

@app.route('/')
def hello_world():
    sleep_time = flask.request.args.get('sleep', default=10, type=int)
    print('sleep_time:', sleep_time)
    for i in range(sleep_time):
        print(f'INFO: {i} of sleep_time \n asdf \t ')
        print(f'ERROR: Example msg {i}', file=sys.stderr) 
        time.sleep(1)
    return 'Hello World!'

if __name__ == '__main__':
    app.run()

How was this program called? What is the working directory of the process?

Ok, you can answer these questions using htop. However, we can get the same information without installing anything, just by looking at a file inside the /proc folder, like this:

$ tr '\0' '\t' < /proc/A_PID_NUMBER/cmdline

In my case I obtained this:

python client.py --sleep 1000

To discover the working directory, do this:

$ readlink /proc/A_PID_NUMBER/cwd
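From Python the same two answers come straight out of the pseudo-filesystem; this sketch inspects the current process via the /proc/self shortcut:

```python
import os

# argv of the process: NUL-separated entries in /proc/self/cmdline.
with open('/proc/self/cmdline', 'rb') as fh:
    argv = [a.decode() for a in fh.read().split(b'\0') if a]

# cwd is a symlink pointing at the real working directory.
cwd = os.readlink('/proc/self/cwd')

print('command line:', argv)
print('working dir :', cwd)
```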

Conclusion & Suggestions

I hope that after reading this post you are more prepared to face problems in your daily tasks. If you have any suggestions, you can send me an email at devmessias@gmail.com. Thanks!

If you want to know more about strace and Linux in general, I strongly recommend spending some hours navigating and reading Julia Evans’ blog: https://jvns.ca/.

Bruno Messias
Ph.D Candidate/Software Developer

Free-software enthusiast, researcher, and software developer. Currently working in the field of Graphs, Complex Systems and Machine Learning.