I don’t know how to continue, so I will summarize my findings about the mce daemon and libdsme issue in nemomobile and hope for the power of the internet.
Symptoms:
[root@manjaro-arm ~]# mce
systemctl status mce
*** UNRECOVERABLE FAILURE ***
Segmentation fault (core dumped)
When I started mce -vvv -T (super verbose, logging to stderr), it crashed inside the libdsme initialization.
mce: T+12.080 X: mce-dsme.c: mce_dsme_datapipe_dsme_service_state_cb(): DSME dbus service: undefined -> running
mce: T+12.080 D: mce-dsme.c: mce_dsme_socket_connect(): Opening DSME socket
Thread 1 "mce" received signal SIGSEGV, Segmentation fault.
I looked into libdsme, specifically at the code below. My first suspect was the DSME_SOCKFILE environment variable. I tried setting it to the same value as the default, because a socket with that name existed, and mce started to work. I then looked for anything that could have overridden the default, but I didn’t find it, so I started debugging.
const char* dsmesock_default_location = "/run/dsme.socket";
dsmesock_connection_t* dsmesock_connect(void)
{
    dsmesock_connection_t* ret = 0;
    int fd;
    struct sockaddr_un c_addr;
    const char* dsmesock_filename = NULL;
    dsmesock_filename = getenv("DSME_SOCKFILE");
    if (dsmesock_filename == 0 || *dsmesock_filename == '\0') {
        dsmesock_filename = dsmesock_default_location;
    }
    if ((fd = socket(PF_UNIX, SOCK_STREAM, 0)) != -1) {
    ....
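To make the lookup easier to reason about, here is a minimal sketch of the same "environment first, default second" logic in isolation. It is not libdsme code: pick_socket_path and fill_unix_addr are hypothetical names, and I am only assuming that the elided part of dsmesock_connect() copies the chosen path into c_addr.sun_path. The sketch adds a NULL and length check so that a missing default is reported as an error instead of crashing in strcpy().

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper (not part of libdsme): choose the socket path,
 * preferring a non-empty environment variable over the fallback. */
static const char *pick_socket_path(const char *env_name, const char *fallback)
{
    const char *path = getenv(env_name);
    if (path == NULL || *path == '\0') {
        path = fallback;
    }
    return path; /* still NULL if the fallback itself is NULL */
}

/* Hypothetical helper: fill a sockaddr_un from a path.
 * Returns 0 on success, -1 if the path is NULL or does not fit. */
static int fill_unix_addr(struct sockaddr_un *addr, const char *path)
{
    if (path == NULL || strlen(path) >= sizeof(addr->sun_path)) {
        return -1;
    }
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path); /* length was checked above */
    return 0;
}
```

With a guard like this, a NULL default would show up as a failed connection attempt rather than a segmentation fault, although it would not explain why the default is NULL in the first place.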
I added the following line:
fprintf(stderr, "env = %s\ndefault = %s\n", dsmesock_filename, dsmesock_default_location);
Surprisingly, it printed:
env = (null)
default = (null)
I started to review my knowledge of C, because I had no idea how this could happen. After a discussion with our C/C++ guru I realized that this part of the code could be correct and that memory corruption somewhere nearby might be to blame.
We tried adding a second const to its definition:
const char* const dsmesock_default_location = "/run/dsme.socket";
This fixed the behavior of the dsmesock_default_location variable, but we were afraid that we had only moved the memory corruption somewhere else. I tried to set a gdb watchpoint:
$ gdb mce
# watch (const char*) dsmesock_default_location
# r
But it was very slow, and I didn’t have enough patience to let it run to the end.
The backtrace of the crash looks like this (hey, I should have started with it). Only libdsme has debug symbols at the moment.
0x0000fffff789b98c in strcpy () from /usr/lib/libc.so.6
(gdb) bt
#0 0x0000fffff789b98c in strcpy () at /usr/lib/libc.so.6
#1 0x0000fffff7b02720 in dsmesock_connect () at protocol.c:65
#2 0x0000aaaaaaade054 in ()
#3 0x0000aaaaaaabd294 in datapipe_exec_full_real ()
#4 0x0000aaaaaaad977c in ()
#5 0x0000aaaaaaadad10 in ()
#6 0x0000fffff7b29adc in () at /usr/lib/libdbus-1.so.3
#7 0x0000fffff7b2e6bc in dbus_connection_dispatch ()
at /usr/lib/libdbus-1.so.3
#8 0x0000aaaaaaaffa00 in ()
#9 0x0000fffff7be5560 in g_main_context_dispatch ()
at /usr/lib/libglib-2.0.so.0
#10 0x0000fffff7be58ac in () at /usr/lib/libglib-2.0.so.0
#11 0x0000fffff7be5cb8 in g_main_loop_run () at /usr/lib/libglib-2.0.so.0
#12 0x0000aaaaaaabaf64 in main ()
When I compiled it with debug symbols, it suddenly started working again. This could be caused by data moving to a different part of memory, so that the evil pointer now breaks something else instead. It may also point to data being removed by the strip / !debug stripping step of the makepkg process.
The same behavior can be observed right after entering the main function. Unlike libhybris, the shared library is not loaded by dlopen(), so the symbol should be reachable at this point.
(gdb) b main
Breakpoint 1 at 0x1a700
(gdb) r
Starting program: /usr/bin/mce
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Breakpoint 1, 0x0000aaaaaaaba700 in main ()
(gdb) p dsmesock_default_location
$1 = 0x0
(gdb)
(gdb) p &dsmesock_default_location
$2 = (const char **) 0xaaaaaab3d850 <dsmesock_default_location>
(gdb) x/30 0xaaaaaab3d820
0xaaaaaab3d820: 0x00000000 0x00000000 0x00000000 0x00000000
0xaaaaaab3d830: 0x00000000 0x00000000 0x00000000 0x00000000
0xaaaaaab3d840: 0x00000000 0x00000000 0x00000000 0x00000000
0xaaaaaab3d850 <dsmesock_default_location>: 0x00000000 0x00000000 0xaab66600 0x0000aaaa
0xaaaaaab3d860: 0xaab99820 0x0000aaaa 0x00000000 0x00000000
0xaaaaaab3d870: 0xaab83eb0 0x0000aaaa 0x00000000 0x00000074
0xaaaaaab3d880: 0xaab8c290 0x0000aaaa 0x00000001 0x00000000
0xaaaaaab3d890: 0x00000000 0x00000000
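As an aside, one way to find out which loaded object an address like the one above belongs to is the process memory map. The sketch below is Linux-only and is not part of mce or libdsme; mapping_for_address is a name made up for illustration. It scans /proc/self/maps for the file-backed mapping that contains a given address. The address printed above starts with 0xaaaa, like the mce frames in the backtrace, while the libdsme frame sits at 0xffff; a lookup like this could confirm or refute whether that means anything.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: look up addr in /proc/self/maps (Linux only).
 * On success, copies the name of the mapped file into out and returns 1.
 * Returns 0 if the address is not inside any named mapping. */
static int mapping_for_address(const void *addr, char *out, size_t out_size)
{
    char line[512];
    int found = 0;
    uintptr_t target = (uintptr_t)addr;
    FILE *maps;

    if (out == NULL || out_size == 0) {
        return 0;
    }
    maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        return 0;
    }
    while (!found && fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        char path[256];
        /* Line format: start-end perms offset dev inode [path] */
        int fields = sscanf(line, "%lx-%lx %*s %*s %*s %*s %255[^\n]",
                            &start, &end, path);
        /* fields == 3 means the mapping has a name (file or pseudo-file). */
        if (fields == 3 && target >= start && target < end) {
            snprintf(out, out_size, "%s", path);
            found = 1;
        }
    }
    fclose(maps);
    return found;
}
```

Called from inside the affected process with &dsmesock_default_location, this would report whether the variable lives in the mce binary or in libdsme.so. In gdb, info proc mappings shows the same table for the debugged process.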
I tried to use the address sanitizer to find something, but using libhybris requires dlopen with the RTLD_DEEPBIND flag, which cannot be used with the address sanitizer. After disabling the libhybris part, I got a strange compilation error (DSO@ something) which I could not solve.
By the way, the string is present in the shared object:
$ strings libdsme.so | grep -c /run/dsme.socket
1
Now is the time for the magic power of the internet:
- Why is the original code not working?
- Is it memory corruption, the result of an invalid compilation, or the result of stripping the binary?
Update: what is the solution
I focused on libdsme because the backtrace showed the crash inside it. It turned out that any recompilation of the mce package produces a binary which no longer crashes.
I cannot debug it when I cannot reproduce it.
Update: collision of names
If I understand correctly, the issue was caused by a collision of names. Thomas Perl found it and created a merge request. Thanks!
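For readers who hit something similar: a common general way to keep a library-private global from colliding with a same-named symbol elsewhere is to give it internal linkage. The sketch below only illustrates that pattern. I do not know whether the merge request does this, and default_socket_path and example_default_socket_path are hypothetical names, not libdsme identifiers.

```c
#include <stddef.h>

/* Internal linkage: "static" keeps this name out of the exported symbol
 * table, so a same-named global in the executable or in another library
 * cannot be bound in its place. The second "const" also makes the pointer
 * itself read-only. */
static const char *const default_socket_path = "/run/dsme.socket";

/* Hypothetical accessor: the only symbol this sketch would export. */
const char *example_default_socket_path(void)
{
    return default_socket_path;
}
```

Code outside the file can then only reach the value through the accessor, which makes an accidental clash on the variable name impossible, at the cost of one extra function call.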