NYLXS Mailing Lists and Archives
NYLXS Members have a lot to say and share but we don't keep many secrets. Join the Hangout Mailing List and say your piece.

MESSAGE
DATE 2017-01-20
FROM ruben safir
SUBJECT [Learn] Fwd: Re: threads and exit() woes
From learn-bounces-at-nylxs.com Fri Jan 20 15:03:12 2017
Return-Path:
X-Original-To: archive-at-mrbrklyn.com
Delivered-To: archive-at-mrbrklyn.com
Received: from www.mrbrklyn.com (www.mrbrklyn.com [96.57.23.82])
by mrbrklyn.com (Postfix) with ESMTP id EA96B161312;
Fri, 20 Jan 2017 15:03:11 -0500 (EST)
X-Original-To: learn-at-nylxs.com
Delivered-To: learn-at-nylxs.com
Received: from [10.0.0.62] (flatbush.mrbrklyn.com [10.0.0.62])
by mrbrklyn.com (Postfix) with ESMTP id 45C08160E77
for ; Fri, 20 Jan 2017 15:03:08 -0500 (EST)
References:
<877f73ke3n.fsf-at-doppelsaurus.mobileactivedefense.com>




<20161213233015.ccfd8a0248833f37069ca9c6-at-speakeasy.net>



<87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>

<686dnZXvcrwrAsXFnZ2dnUU7-X3NnZ2d-at-posted.internetamerica>


To: "learn-at-nylxs.com"
From: ruben safir
X-Forwarded-Message-Id:
<877f73ke3n.fsf-at-doppelsaurus.mobileactivedefense.com>




<20161213233015.ccfd8a0248833f37069ca9c6-at-speakeasy.net>



<87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>

<686dnZXvcrwrAsXFnZ2dnUU7-X3NnZ2d-at-posted.internetamerica>


Message-ID: <058c1891-9b5e-70cf-11cb-bca30e16e32b-at-mrbrklyn.com>
Date: Fri, 20 Jan 2017 15:03:08 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.5.1
MIME-Version: 1.0
In-Reply-To:
Content-Type: multipart/mixed; boundary="------------03A2DB7651F09BFE8C70F091"
Subject: [Learn] Fwd: Re: threads and exit() woes
X-BeenThere: learn-at-nylxs.com
X-Mailman-Version: 2.1.17
Precedence: list
List-Id:
List-Unsubscribe: ,

List-Archive:
List-Post:
List-Help:
List-Subscribe: ,

Errors-To: learn-bounces-at-nylxs.com
Sender: "Learn"

This is a multi-part message in MIME format.
--------------03A2DB7651F09BFE8C70F091
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

ditto

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin1!goblin.stu.neva.ru!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: Jorgen Grahn
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 13 Dec 2016 06:44:58 GMT
Message-ID:
References:
X-Trace: individual.net fr/oyQQ/5RQ5DNqAWUHJhgDKJeY8gXh+VnMMso8PNOLWd+vzbp
Cancel-Lock: sha1:+FwHL5SqESz0ScoXHd828SQxUq4=
User-Agent: slrn/pre1.0.0-18 (Linux)
Xref: panix comp.unix.programmer:236705

On Mon, 2016-12-12, Jens Thoms Toerring wrote:
> Hi,
>
> I've to deal with a multi-threaded program that has, as
> one of its threads a "watchdog thread" that, when it doesn't
> notice some variable getting set within a certain time,
> is supposed to stop the whole program (at any cost, no
> worries about data lost). It does attempt to shut down the
> program by calling exit(). Now, all the references I have
> consulted (TLPI, APUE 3rd ed. etc.) all claim that when one
> of the threads calls exit() the program will be ended. A
> look at SUSv4 just mentions in addition that the end of
> the program might be delayed if there are outstanding
> asynchronous I/O operations that can't be cancelled
> (nothing I guess I'm having).
>
> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL'). I've also tried to replace the exit() call with
> a kill(getpid(), SIGKILL) but also with no luck. Attaching
> with gdb fails as well (hangs indefinitely). Looks like a
> real zombie: dead and very active at the same time:-(
>
> Does that ring a bell with anyone of you? One of the threads
> is rather likely to do a lot of epoll() calls.
>
> Please keep in mind that I can't simply change the whole
> architecture - this is an embedded system already out in
> the field, and my role in this is to get a new kernel
> version to work, not upset a more or less working application
> (unless I can come up with very convincing arguments;-)

Apart from what the others wrote:

- Can you use strace or pstack or something to find out what that
remaining thread is doing? Even looking in /proc can be useful.

- Keep in mind that exit() does things before exiting, e.g. run exit
handlers.

Also shots in the dark ...
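
To illustrate the point about exit handlers, here is a minimal C sketch, assuming a POSIX system. A forked child registers an atexit() handler and then terminates via either exit() or _exit(). The names handler_pipe, report_handler_ran and handler_runs are made up for this illustration and have nothing to do with the application under discussion.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe used by the child's exit handler (hypothetical name). */
static int handler_pipe = -1;

/* Exit handler: send one byte so the parent can tell that it ran. */
static void report_handler_ran(void)
{
    if (handler_pipe >= 0) {
        ssize_t n = write(handler_pipe, "x", 1);
        (void)n;
    }
}

/*
 * Fork a child that registers an exit handler and then terminates with
 * exit() (use_underscore_exit == 0) or _exit() (use_underscore_exit != 0).
 * Returns 1 if the handler ran, 0 if it did not, -1 on error.
 */
int handler_runs(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        handler_pipe = fds[1];
        atexit(report_handler_ran);
        if (use_underscore_exit)
            _exit(0);   /* skips atexit() handlers and stdio flushing */
        exit(0);        /* runs atexit() handlers first */
    }

    close(fds[1]);
    char c;
    ssize_t n = read(fds[0], &c, 1);   /* 0 means EOF: nothing was written */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```

Calling handler_runs(0) and handler_runs(1) shows the difference. Whether a handler is actually what hangs in the program described above is not known from this thread; the sketch only shows which call runs them.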

/Jorgen

--
// Jorgen Grahn \X/ snipabacken.se> O o .

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin1!goblin.stu.neva.ru!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: Rainer Weikusat
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 16:06:52 +0000
Message-ID: <877f73ke3n.fsf-at-doppelsaurus.mobileactivedefense.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: individual.net mkuINJo5GrcQrgzSegXW2QZrlNY62iGJ9KJV8gx6VDJXA5MKY=
Cancel-Lock: sha1:n4T5frj4oLH11ajhLfBYX0G3538= sha1:NKqF1FZDtckQfWzDBuu6k4+q/I8=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
Xref: panix comp.unix.programmer:236707

jt-at-toerring.de (Jens Thoms Toerring) writes:

[terminate program via exit run by watchdog thread]

> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL').

This suggests that the thread is in a D state (uninterruptible sleep)
which persists for some reason. Trying to determine what it's doing in
the kernel (e.g., strace, /proc/<pid>/wchan) might be useful.
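
The state letter can also be read programmatically. This is a minimal sketch, assuming Linux and the stat file layout described in proc(5): the pid, the command name in parentheses, then the state letter. The helper names stat_line_state and task_state are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the state letter from one line in the proc(5) "stat" format.
 * The command name may itself contain spaces and parentheses, so the
 * search starts from the LAST ')' in the line.
 * Returns '?' if the line does not look like a stat line.
 */
char stat_line_state(const char *line)
{
    const char *p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/*
 * Read the state of a process or thread from its stat file, for example
 * a per-process stat file or a per-thread one under the task directory.
 * Returns '?' if the file cannot be read.
 */
char task_state(const char *stat_path)
{
    char buf[1024];
    FILE *f = fopen(stat_path, "r");
    if (f == NULL)
        return '?';
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok != NULL ? stat_line_state(buf) : '?';
}
```

A result of 'D' would match the uninterruptible-sleep guess, 'Z' a zombie, and 'R' a thread that is actually running.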

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin1!goblin3!goblin.stu.neva.ru!news.mb-net.net!open-news-network.org!.POSTED!not-for-mail
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 18:34:23 +0100
Organization: MB-NET.NET for Open-News-Network e.V.
Message-ID:
References:
NNTP-Posting-Host: aftr-95-222-29-234.unity-media.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: gwaiyur.mb-net.net 1481650463 32695 95.222.29.234 (13 Dec 2016 17:34:23 GMT)
X-Complaints-To: abuse-at-open-news-network.org
NNTP-Posting-Date: Tue, 13 Dec 2016 17:34:23 +0000 (UTC)
User-Agent: Mozilla/5.0 (OS/2; Warp 4.5; rv:24.0) Gecko/20100101 Thunderbird/24.8.1
In-Reply-To:
Xref: panix comp.unix.programmer:236708

On 13.12.16 00.03, Jens Thoms Toerring wrote:
> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL'). I've also tried to replace the exit() call with
> a kill(getpid(), SIGKILL) but also with no luck. Attaching
> with gdb fails as well (hangs indefinitely). Looks like a
> real zombie: dead and very active at the same time:-(

Probably an exit handler does unexpected things. This could be part of
the C runtime as well as part of a used library or even your code.

Maybe shutting down your program this way runs into badly tested code
paths with some race conditions.

Try abort(), which does not invoke as many exit handlers.

> Does that ring a bell with anyone of you? One of the threads
> is rather likely to do a lot of epoll() calls.

Definitely I/O. It should check for the exit condition before invoking
another I/O. The Linux kernel behaves quite badly when killing processes
with outstanding I/O. Requests like that are simply ignored.
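
One way to "check for the exit condition before invoking another I/O" is sketched below. It is a minimal C example under stated assumptions: it uses poll() with a timeout instead of the epoll calls mentioned in the original post, to stay portable, and the names stop_requested, request_stop and io_loop are hypothetical. It does not help a thread that is already stuck inside the kernel.

```c
#include <poll.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Flag a watchdog thread would set to ask the I/O thread to stop. */
static atomic_bool stop_requested = false;

void request_stop(void)
{
    atomic_store(&stop_requested, true);
}

/*
 * Read from fd until EOF, an error, or a stop request.
 * The flag is re-checked before every poll(), and the poll() timeout
 * bounds how long a stop request can go unnoticed.
 * Returns the number of bytes consumed.
 */
long io_loop(int fd, int timeout_ms)
{
    long total = 0;
    char buf[256];

    while (!atomic_load(&stop_requested)) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r = poll(&p, 1, timeout_ms);
        if (r < 0)
            break;              /* error, including EINTR */
        if (r == 0)
            continue;           /* timeout: go back and re-check the flag */
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)
            break;              /* EOF or read error */
        total += n;
    }
    return total;
}
```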


Marcel

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!168.235.88.217.MISMATCH!feeder.erje.net!2.us.feeder.erje.net!news.glorb.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!post01.iad.highwinds-media.com!fx36.iad.POSTED!not-for-mail
X-Newsreader: xrn 9.03-beta-14-64bit
Sender: scott-at-dragon.sl.home (Scott Lurndal)
From: scott-at-slp53.sl.home (Scott Lurndal)
Reply-To: slp53-at-pacbell.net
Subject: Re: threads and exit() woes
Newsgroups: comp.unix.programmer
References:
Message-ID:
X-Complaints-To: abuse-at-usenetserver.com
NNTP-Posting-Date: Tue, 13 Dec 2016 18:13:40 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Tue, 13 Dec 2016 18:13:40 GMT
X-Received-Bytes: 2015
X-Received-Body-CRC: 2604350729
Xref: panix comp.unix.programmer:236709

Marcel Mueller writes:
>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>> This did work with a 3.4 Linux kernel. But after switching
>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>> If it fails one thread seems to run amok, using about 50%
>> of the CPU time, the other 50% being used by ksoftirqd. The
>> whole thing can't be stopped in any way (not even with 'kill
>> -SIGKILL'). I've also tried to replace the exit() call with
>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>> with gdb fails as well (hangs indefinitely). Looks like a
>> real zombie: dead and very active at the same time:-(
>
>Probably an exit handler does unexpected things. This could be part of
>the C runtime as well as part of a used library or even your code.
>
>Maybe shutting down your program this way runs into badly tested code
>paths with some race conditions.
>
>Try abort(), which does not invoke as many exit handlers.
>
>> Does that ring a bell with anyone of you? One of the threads
>> is rather likely to do a lot of epoll() calls.
>
>Definitely I/O. It should check for the exit condition before invoking
>another I/O. The Linux kernel behaves quite badly when killing processes
>with outstanding I/O. Requests like that are simply ignored.
>

If SIGKILL doesn't kill the process, you've a kernel bug.

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!168.235.88.217.MISMATCH!2.us.feeder.erje.net!feeder.erje.net!2.eu.feeder.erje.net!news.swapon.de!eternal-september.org!feeder.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Lew Pitcher
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 13:27:48 -0500
Organization: The Pitcher Digital Freehold
Message-ID:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Injection-Info: mx02.eternal-september.org; posting-host="3010cfc25bc10d40bae4e65aed6697c7";
logging-data="31197"; mail-complaints-to="abuse-at-eternal-september.org"; posting-account="U2FsdGVkX18096ZIB0NkFx0gkx7L0aWo+EKfDPDtt4E="
Cancel-Lock: sha1:AlAmiMqQBBT0sbC81Q36rXJOV9Q=
Xref: panix comp.unix.programmer:236711

On Tuesday December 13 2016 13:13, in comp.unix.programmer, "Scott Lurndal"
wrote:

> Marcel Mueller writes:
>>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>>> This did work with a 3.4 Linux kernel. But after switching
>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>> If it fails one thread seems to run amok, using about 50%
>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>> whole thing can't be stopped in any way (not even with 'kill
>>> -SIGKILL'). I've also tried to replace the exit() call with
>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>> with gdb fails as well (hangs indefinitely). Looks like a
>>> real zombie: dead and very active at the same time:-(
>>
>>Probably an exit handler does unexpected things. This could be part of
>>the C runtime as well as part of a used library or even your code.
>>
>>Maybe shutting down your program this way runs into badly tested code
>>paths with some race conditions.
>>
>>Try abort(), which does not invoke as many exit handlers.
>>
>>> Does that ring a bell with anyone of you? One of the threads
>>> is rather likely to do a lot of epoll() calls.
>>
>>Definitely I/O. It should check for the exit condition before invoking
>>another I/O. The Linux kernel behaves quite badly when killing processes
>>with outstanding I/O. Requests like that are simply ignored.
>>
>
> If SIGKILL doesn't kill the process, you've a kernel bug.

Even with a non-buggy kernel, SIGKILL won't terminate a zombie process, nor a
process stuck in "uninterruptible sleep" state.

It would be helpful to see the state of the hung thread, as reported by ps or
some other tool.

--
Lew Pitcher
"In Skills, We Trust"
PGP public key available upon request


--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!usenet.stanford.edu!news.glorb.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!post01.iad.highwinds-media.com!fx16.iad.POSTED!not-for-mail
X-Newsreader: xrn 9.03-beta-14-64bit
Sender: scott-at-dragon.sl.home (Scott Lurndal)
From: scott-at-slp53.sl.home (Scott Lurndal)
Reply-To: slp53-at-pacbell.net
Subject: Re: threads and exit() woes
Newsgroups: comp.unix.programmer
References:
Message-ID:
X-Complaints-To: abuse-at-usenetserver.com
NNTP-Posting-Date: Tue, 13 Dec 2016 18:44:59 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Tue, 13 Dec 2016 18:44:59 GMT
X-Received-Bytes: 2983
X-Received-Body-CRC: 2032316764
Xref: panix comp.unix.programmer:236712

Lew Pitcher writes:
>On Tuesday December 13 2016 13:13, in comp.unix.programmer, "Scott Lurndal"
> wrote:
>
>> Marcel Mueller writes:
>>>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>>>> This did work with a 3.4 Linux kernel. But after switching
>>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>>> If it fails one thread seems to run amok, using about 50%
>>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>>> whole thing can't be stopped in any way (not even with 'kill
>>>> -SIGKILL'). I've also tried to replace the exit() call with
>>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>>> with gdb fails as well (hangs indefinitely). Looks like a
>>>> real zombie: dead and very active at the same time:-(
>>>
>>>Probably an exit handler does unexpected things. This could be part of
>>>the C runtime as well as part of a used library or even your code.
>>>
>>>Maybe shutting down your program this way runs into badly tested code
>>>paths with some race conditions.
>>>
>>>Try abort(), which does not invoke as many exit handlers.
>>>
>>>> Does that ring a bell with anyone of you? One of the threads
>>>> is rather likely to do a lot of epoll() calls.
>>>
>>>Definitely I/O. It should check for the exit condition before invoking
>>>another I/O. The Linux kernel behaves quite badly when killing processes
>>>with outstanding I/O. Requests like that are simply ignored.
>>>
>>
>> If SIGKILL doesn't kill the process, you've a kernel bug.
>
>Even with a non-buggy kernel, SIGKILL won't terminate a zombie process, nor a
>process stuck in "uninterruptible sleep" state.

A zombie no longer holds resources, with the exception of the exit status
(say 32 bits) and the pid.

It's the parent's responsibility to reap the status.

An operating system that allows an application to enter an
"uninterruptible sleep" state is broken.

It used to be in SVR3 that one could end up in an uninterruptible
sleep state during close(2) when the file descriptor referenced a
character special device for a parallel port (e.g. printer) and the
printer was off-line. Bugs like that were mainly fixed a quarter
century ago.

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin2!goblin1!goblin.stu.neva.ru!eternal-september.org!feeder.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Lew Pitcher
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 14:18:22 -0500
Organization: The Pitcher Digital Freehold
Message-ID:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Injection-Info: mx02.eternal-september.org; posting-host="3010cfc25bc10d40bae4e65aed6697c7";
logging-data="11428"; mail-complaints-to="abuse-at-eternal-september.org"; posting-account="U2FsdGVkX1+hbyxoCh68E1Z6eEdVlVzkJFeeQOiUxCc="
Cancel-Lock: sha1:Gog6ChEHhvzP8J7O1khU2ydBnRA=
Xref: panix comp.unix.programmer:236713

On Tuesday December 13 2016 13:44, in comp.unix.programmer, "Scott Lurndal"
wrote:

> Lew Pitcher writes:
>>On Tuesday December 13 2016 13:13, in comp.unix.programmer, "Scott Lurndal"
>> wrote:
>>
>>> Marcel Mueller writes:
>>>>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>>>>> This did work with a 3.4 Linux kernel. But after switching
>>>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>>>> If it fails one thread seems to run amok, using about 50%
>>>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>>>> whole thing can't be stopped in any way (not even with 'kill
>>>>> -SIGKILL'). I've also tried to replace the exit() call with
>>>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>>>> with gdb fails as well (hangs indefinitely). Looks like a
>>>>> real zombie: dead and very active at the same time:-(
>>>>
>>>>Probably an exit handler does unexpected things. This could be part of
>>>>the C runtime as well as part of a used library or even your code.
>>>>
>>>>Maybe shutting down your program this way runs into badly tested code
>>>>paths with some race conditions.
>>>>
>>>>Try abort(), which does not invoke as many exit handlers.
>>>>
>>>>> Does that ring a bell with anyone of you? One of the threads
>>>>> is rather likely to do a lot of epoll() calls.
>>>>
>>>>Definitely I/O. It should check for the exit condition before invoking
>>>>another I/O. The Linux kernel behaves quite badly when killing processes
>>>>with outstanding I/O. Requests like that are simply ignored.
>>>>
>>>
>>> If SIGKILL doesn't kill the process, you've a kernel bug.
>>
>>Even with a non-buggy kernel, SIGKILL won't terminate a zombie process, nor
>>a process stuck in "uninterruptible sleep" state.
>
> A zombie no longer holds resources, with the exception of the exit status
> (say 32 bits) and the pid.
>
> It's the parent's responsibility to reap the status.

True. It remains in the process table (and visible through ps(1)) until the
parent reaps the status, or permits init(8) to reap the status. Since the
process is already dead, it CANNOT be "killed" (terminated and removed from
the process table) by SIGKILL.

> An operating system that allows an application to enter an
> "uninterruptible sleep" state is broken.

OK. Thanks for the opinion.

However, whether or not the OS is, in your opinion, "broken", "uninterruptible
sleep" is still a permitted state. And, because the process cannot be
scheduled, it cannot receive /any/ signal, let alone SIGKILL.

> It used to be in SVR3 that one could end up in an uninterruptible
> sleep state during close(2) when the file descriptor referenced a
> character special device for a parallel port (e.g. printer) and the
> printer was off-line. Bugs like that were mainly fixed a quarter
> century ago.


--
Lew Pitcher
"In Skills, We Trust"
PGP public key available upon request


--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin3!goblin.stu.neva.ru!news.mb-net.net!open-news-network.org!.POSTED!not-for-mail
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 21:01:54 +0100
Organization: MB-NET.NET for Open-News-Network e.V.
Message-ID:
References:
NNTP-Posting-Host: aftr-95-222-29-234.unity-media.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: gwaiyur.mb-net.net 1481659314 22785 95.222.29.234 (13 Dec 2016 20:01:54 GMT)
X-Complaints-To: abuse-at-open-news-network.org
NNTP-Posting-Date: Tue, 13 Dec 2016 20:01:54 +0000 (UTC)
User-Agent: Mozilla/5.0 (OS/2; Warp 4.5; rv:24.0) Gecko/20100101 Thunderbird/24.8.1
In-Reply-To:
Xref: panix comp.unix.programmer:236714

On 13.12.16 19.13, Scott Lurndal wrote:
>> Definitely I/O. It should check for the exit condition before invoking
>> another I/O. The Linux kernel behaves quite badly when killing processes
>> with outstanding I/O. Requests like that are simply ignored.
>
> If SIGKILL doesn't kill the process, you've a kernel bug.

Well, welcome to the real world.
A process hanging in state D is one of the most frequent causes of system
reboots. This has not changed significantly over the last 15 years, from
Debian Woody to recent Raspbian with kernel 4.4. Of course, it is not
that often that I have serious trouble. Once or twice per year or
something like that.
AFAIK there is absolutely no recovery from a process blocked in state D.
This seems to be a Linux-specific "feature".


Marcel

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!168.235.88.217.MISMATCH!feeder.erje.net!2.us.feeder.erje.net!news.glorb.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!buffer1.nntp.dca1.giganews.com!news.giganews.com.POSTED!not-for-mail
NNTP-Posting-Date: Tue, 13 Dec 2016 18:30:02 -0600
Message-ID:
From:
Subject: Re: threads and exit() woes
Newsgroups: comp.unix.programmer
References:
User-Agent: tin/2.2.1-20140504 ("Tober an Righ") (UNIX) (OpenBSD/5.9 (amd64))
Date: Tue, 13 Dec 2016 16:23:08 -0800
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-WeYPoU4YB163az2eTaLRJKMASQKmTMcWZWQgiXior0JywE5Za6CK4GPE7Q2Nxso/BhRjun0G0uciwXE!6KNyh69LTZ6OeupiBMZjusXwl2dkyf8Y+FFVl2HW8o1wgGh8z0MFpmj6xI8ld8QYMQ==
X-Complaints-To: abuse-at-giganews.com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 4490
Xref: panix comp.unix.programmer:236717

Marcel Mueller wrote:
> On 13.12.16 19.13, Scott Lurndal wrote:
>>> Definitely I/O. It should check for the exit condition before invoking
>>> another I/O. The Linux kernel behaves quite badly when killing processes
>>> with outstanding I/O. Requests like that are simply ignored.
>>
>> If SIGKILL doesn't kill the process, you've a kernel bug.
>
> Well, welcome to the real world.
> A process hanging in state D is one of the most frequent causes of system
> reboots. This has not changed significantly over the last 15 years, from
> Debian Woody to recent Raspbian with kernel 4.4. Of course, it is not
> that often that I have serious trouble. Once or twice per year or
> something like that.
> AFAIK there is absolutely no recovery from a process blocked in state D.
> This seems to be a Linux-specific "feature".

The classic stumbling block is that the block device subsystems in Linux as
well as the *BSDs are fundamentally synchronous. This is related historically
to why polling I/O on regular (block device) files is defined by POSIX to
always immediately return ready. Given the expectations engendered by the
history, it was apparently too convenient for implementations to bake
synchronous interfaces into their block device and driver models.
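
The poll() behaviour on regular files is easy to observe. A minimal C sketch, assuming a POSIX system where tmpfile() yields a regular file; poll_revents and empty_regular_file_polls_ready are hypothetical names.

```c
#include <poll.h>
#include <stdio.h>

/*
 * Ask poll() with a zero timeout which of POLLIN/POLLOUT are ready on fd.
 * Returns the revents bitmask, or -1 on error.
 */
int poll_revents(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN | POLLOUT };
    int r = poll(&p, 1, 0);
    if (r < 0)
        return -1;
    return p.revents;
}

/*
 * Poll a freshly created, still empty regular file.
 * Returns 1 if it is reported readable and writable, 0 if not,
 * -1 if the temporary file could not be created.
 */
int empty_regular_file_polls_ready(void)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    int ev = poll_revents(fileno(f));
    fclose(f);
    return ev >= 0 && (ev & POLLIN) && (ev & POLLOUT);
}
```

The file is reported readable even though it is empty and a read would return end-of-file at once, so poll() gives no way to avoid blocking on a slow disk or a hung NFS server.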

NFS implementations on Linux (and I assume other Unix systems) were
especially notorious in this regard, because the kernel implementations
adopted the same synchronous interface model, but for obvious reasons were
much more prone to putting processes into a prolonged, uninterruptible
state.

AFAIU, making block device I/O asynchronous (and thus interruptible)
requires extensive refactoring of the driver model as well as the individual
drivers for those operating systems.

POSIX AIO on those systems simply uses kernel threads to do the synchronous
calls, which only hides the issue. The kernel thread could still block,
consuming system resources indefinitely even after the requesting process
has long exited. You get a slightly cleaner user process tree, yes, but
requests still linger behind the scenes, and resource accounting can no
longer be kept deterministic without some ugly compromises.

Given the pedigree of Solaris, AIX, and HP-UX, I'm curious what those
systems did. Did they refactor their driver model? Officially commit to the
kernel thread hack? Or find some sort of compromise, e.g. a
quasi-synchronous interface where updated drivers could bubble up through
the call stack an interrupt or timeout?

There have been several attempts over the years to systematize the kernel
thread hack in Linux. See, e.g., these articles from 2007 and 2009

"Fibrils and asynchronous system calls", https://lwn.net/Articles/219954/
"LCA: A new approach to asynchronous I/O" https://lwn.net/Articles/316806/

and most recently from 2016

"Fixing asynchronous I/O, again" https://lwn.net/Articles/671649/

I like to think they always fail because at the end of the day using slave
threads can be easily done in userspace. And interfaces like splice(2),
sendfile(2), eventfd(2), etc that can allow the userspace solution to match
or even exceed the kernel-space solution are useful in their own right. That
reality makes it difficult to accept the maintenance burden of an in-kernel
overlay solution that doesn't address the underlying issues. But maybe
that's just wishful thinking.


--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!news.linkpendium.com!news.linkpendium.com!news.glorb.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!post02.iad.highwinds-media.com!fx23.iad.POSTED!not-for-mail
From: "James K. Lowden"
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Message-ID: <20161213233015.ccfd8a0248833f37069ca9c6-at-speakeasy.net>
References:




X-Newsreader: Sylpheed 3.4.3 (GTK+ 2.24.28; x86_64--netbsd)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse-at-newsdemon.com
NNTP-Posting-Date: Wed, 14 Dec 2016 04:30:15 UTC
Organization: http://www.NewsDemon.com
Date: Tue, 13 Dec 2016 23:30:15 -0500
X-Received-Bytes: 1776
X-Received-Body-CRC: 134682496
Xref: panix comp.unix.programmer:236719

On Tue, 13 Dec 2016 16:23:08 -0800
wrote:

> The classic stumbling block is that the block device subsystems in
> Linux as well as the *BSDs are fundamentally synchronous.

It's not clear to me why they should be anything other than
synchronous. The devices themselves might in some cases support a
queued command interface (e.g. SCSI) but that view of the device is
very different from a linear-store-of-bytes abstraction.

The kernel provides applications with a perfectly good asynchronous
interface: the timeslice. If the application has something better to
do while it's blocked against I/O, it can put that processing on
another pid. In the typical case, the application blocks against
needed input, and the kernel can schedule CPU time for something else.
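
Putting the blocking work "on another pid" might look like the following. It is a minimal C sketch, assuming a POSIX system; spawn_reader and poll_result are hypothetical names, and the helper process does nothing more than one blocking read() whose byte count it reports back over a pipe.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a helper process that performs one blocking read() on in_fd and
 * writes the resulting byte count to a pipe.
 * On success returns the helper's pid and stores the read end of the
 * result pipe in *result_fd; returns -1 on error.
 */
pid_t spawn_reader(int in_fd, int *result_fd)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char buf[256];
        close(fds[0]);
        ssize_t n = read(in_fd, buf, sizeof buf);   /* may block */
        ssize_t w = write(fds[1], &n, sizeof n);
        _exit(w == (ssize_t)sizeof n ? 0 : 1);
    }

    close(fds[1]);
    *result_fd = fds[0];
    return pid;
}

/*
 * Check whether the helper has finished, waiting at most timeout_ms.
 * Returns 1 and fills *count when a result is available, 0 if the helper
 * is still busy, -1 on error.
 */
int poll_result(int result_fd, int timeout_ms, ssize_t *count)
{
    struct pollfd p = { .fd = result_fd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);
    if (r <= 0)
        return r;
    ssize_t got = read(result_fd, count, sizeof *count);
    return got == (ssize_t)sizeof *count ? 1 : -1;
}
```

The parent stays free to do other work between calls to poll_result(), and once a result has arrived it should reap the helper with waitpid(). If the helper itself ends up in state D the parent is unaffected, but the helper lingers just as described earlier in the thread.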

> I like to think they always fail because at the end of the day using
> slave threads can be easily done in userspace.

Exactly.

--jkl

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!news.linkpendium.com!news.linkpendium.com!news.glorb.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!post01.iad.highwinds-media.com!fx33.iad.POSTED!not-for-mail
X-Newsreader: xrn 9.03-beta-14-64bit
Sender: scott-at-dragon.sl.home (Scott Lurndal)
From: scott-at-slp53.sl.home (Scott Lurndal)
Reply-To: slp53-at-pacbell.net
Subject: Re: threads and exit() woes
Newsgroups: comp.unix.programmer
References:
Message-ID:
X-Complaints-To: abuse-at-usenetserver.com
NNTP-Posting-Date: Wed, 14 Dec 2016 13:39:18 UTC
Organization: UsenetServer - www.usenetserver.com
Date: Wed, 14 Dec 2016 13:39:18 GMT
X-Received-Bytes: 1575
X-Received-Body-CRC: 3603917634
Xref: panix comp.unix.programmer:236728

writes:

>Given the pedigree of Solaris, AIX, and HP-UX, I'm curious what those
>systems did. Did they refactor their driver model? Officially commit to the
>kernel thread hack? Or find some sort of compromise, e.g. a
>quasi-synchronous interface where updated drivers could bubble up through
>the call stack an interrupt or timeout?

SVR4.2 ES/MP completely redesigned the I/O system to handle
asynchronicity natively (along with eliminating the BFKL[*]).
The POSIX asynchronous I/O APIs were implemented naturally
throughout the I/O stack.

Our Chorus microkernel-based port of SVR4.2 ES/MP (called SVR4/MK,
or project Amadeus in Europe) also supported the asynchronous interfaces
internally, and they were heavily used by Oracle for performance.


[*] Big F'ing Kernel Lock

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!newsswitch.lcs.mit.edu!ottix-news.ottix.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!buffer1.nntp.dca1.giganews.com!nntp.posted.internetamerica!news.posted.internetamerica.POSTED!not-for-mail
NNTP-Posting-Date: Thu, 15 Dec 2016 00:35:59 -0600
Sender: Gordon Burditt
From: gordonb.ci1jn-at-burditt.org (Gordon Burditt)
Subject: Re: threads and exit() woes
Newsgroups: comp.unix.programmer
References:
User-Agent: tin/2.3.4-20160628 ("Newton") (UNIX) (FreeBSD/10.0-RELEASE (i386))
Message-ID:
Date: Thu, 15 Dec 2016 00:35:59 -0600
X-Usenet-Provider: http://www.giganews.com
NNTP-Posting-Host: 108.65.82.77
X-Trace: sv3-C8SOH6kikRcZtGxbKrqgv4cxAAQ7mKQRg4eGRVI3YZ6S4TXi15VhPpDjsLxw9gnv2dHwlEqn1nbdTr7!BIvEhENUIXve4TJFe07KOvlfMnc42HSTT4wvb2wFCQZpdsICwbvg8zNgbQuYrTkn61U2PimtyPrP!sKX+qbfb3RWdeOHA/exDym6ZSaHx
X-Complaints-To: abuse-at-airmail.net
X-DMCA-Complaints-To: abuse-at-airmail.net
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 2107
Xref: panix comp.unix.programmer:236736

> Well, welcome to the real world.
> A process hanging in state D is one of the most frequent causes of system
> reboots. This did not change significantly over the last 15 years from
> Debian Woody to recent Raspbian with kernel 4.4. Of course, it is not
> that often that I have serious trouble. Once or twice per year or
> something like that.
> AFAIK there is absolutely no recovery from a process blocked in state D.
> This seems to be a Linux specific "feature".

I'm not sure I agree with that. Hanging device drivers (in state
"D"), specifically due to USB devices being disconnected at
inconvenient times, seems to be a bigger problem than just Linux.
I've observed it occasionally on the *BSDs. Usually it's quite
obvious that the device shouldn't have been intentionally disconnected,
but that the cable/connector was a little loose and someone wiggled
it.

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!168.235.88.217.MISMATCH!2.us.feeder.erje.net!feeder.erje.net!1.eu.feeder.erje.net!weretis.net!feeder4.news.weretis.net!news.mb-net.net!open-news-network.org!.POSTED!not-for-mail
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Thu, 15 Dec 2016 19:32:10 +0100
Organization: MB-NET.NET for Open-News-Network e.V.
Message-ID:
References:
NNTP-Posting-Host: aftr-95-222-29-234.unity-media.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: gwaiyur.mb-net.net 1481826730 25536 95.222.29.234 (15 Dec 2016 18:32:10 GMT)
X-Complaints-To: abuse-at-open-news-network.org
NNTP-Posting-Date: Thu, 15 Dec 2016 18:32:10 +0000 (UTC)
User-Agent: Mozilla/5.0 (OS/2; Warp 4.5; rv:24.0) Gecko/20100101 Thunderbird/24.8.1
In-Reply-To:
Xref: panix comp.unix.programmer:236738

On 15.12.16 07.35, Gordon Burditt wrote:
>> AFAIK there is absolutely no recovery from a process blocked in state D.
>> This seems to be a Linux specific "feature".
>
> I'm not sure I agree with that. Hanging device drivers (in state
> "D"), specifically due to USB devices being disconnected at
> inconvenient times, seems to be a bigger problem than just Linux.
> I've observed it occasionally on the *BSDs. Usually it's quite
> obvious that the device shouldn't have been intentionally disconnected,
> but that the cable/connector was a little loose and someone wiggled
> it.

Bugs and I/O errors can occur everywhere. Not that nice, but that's life.
The only problem is the kernel is unable to recover from these errors
without reboot. This is not contemporary.


Marcel

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin2!goblin1!goblin.stu.neva.ru!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: Rainer Weikusat
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Fri, 16 Dec 2016 17:38:09 +0000
Message-ID: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
References:




Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: individual.net Qx6DB7EzpoHiBHQL+DxPrgcRb3601ITjULtqHZJ1048dcFktA=
Cancel-Lock: sha1:FdcxYpsgHq4gzzOxnPfR3zvQv1A= sha1:jOKKe3MrmzUAcpZMMDLSjPwLbPo=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
Xref: panix comp.unix.programmer:236742

Marcel Mueller writes:
> On 15.12.16 07.35, Gordon Burditt wrote:
>>> AFAIK there is absolutely no recovery from a process blocked in state D.
>>> This seems to be a Linux specific "feature".
>>
>> I'm not sure I agree with that. Hanging device drivers (in state
>> "D"), specifically due to USB devices being disconnected at
>> inconvenient times, seems to be a bigger problem than just Linux.
>> I've observed it occasionally on the *BSDs. Usually it's quite
>> obvious that the device shouldn't have been intentionally disconnected,
>> but that the cable/connector was a little loose and someone wiggled
>> it.
>
> Bugs and I/O errors can occur everywhere. Not that nice, but that's life.
> The only problem is the kernel is unable to recover from these errors
> without reboot. This is not contemporary.

It is contemporary because it's happening now.

'Uninterruptible sleep' state usually means 'the operation being waited
for is always expected to complete' as it's entirely within the domain
of the local system. Insofar the state persists when talking to a
device, that's usually a hardware failure. Another possible cause would
be a kernel mutex deadlock.

Interruptible sleeping needs correct support code for every instance of
a sleep. That's a whole load of opportunities for additional bugs as
this will usually need 'resource allocation unwinding' back up the
complete callstack. It also needs to be handled correctly in all
applications. IMHO, it is very questionable whether this is really a good idea
"just in case there's a kernel bug".

It's entirely unclear what "recovery in case of hardware errors" should
look like. If a mass storage device fails, the result is going to be
"unpleasant" regardless of requiring a reboot to paper over the issue
for some time.

The idea to use 'D' state for network filesystems is obviously moronic
and there should be some kind of 'emergency abort' for removable storage
devices, too.


--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin2!goblin.stu.neva.ru!aioe.org!.POSTED!not-for-mail
From: spud-at-potato.field
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 19 Dec 2016 09:38:04 +0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID:
References:




<87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
NNTP-Posting-Host: IkJkuvU+mf0C8Ve1AyJG/g.user.gioia.aioe.org
X-Complaints-To: abuse-at-aioe.org
X-Newsreader: :redaersweN-X
X-Notice: Filtered by postfilter v. 0.8.2
Xref: panix comp.unix.programmer:236744

On Fri, 16 Dec 2016 17:38:09 +0000
Rainer Weikusat wrote:
>It's entirely unclear what "recovery in case of hardware errors" should
>look like. If a mass storage device fails, the result is going to be
>"unpleasant" regardless of requiring a reboot to paper over the issue
>for some time.

Unless the device is the drive the OS system files are hosted on or some other
critical main board component, then any hardware failure should be dealt with
gracefully. Period. Hardware failures should be expected and the OS should help
the admins diagnose the problem, not just give up and die.

>The idea to use 'D' state for network filesystems is obviously moronic
>and there should be some kind of 'emergency abort' for removable storage
>devices, too.

FreeBSD had a nice bug back in the day (maybe still does) whereby if you
mounted a floppy disk as a filesystem then removed the disk the kernel would
crash. Despite numerous people including myself pointing this out they still
hadn't fixed it by 6.0, at which point I switched to linux for other reasons.

--
Spud


--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!newsswitch.lcs.mit.edu!ottix-news.ottix.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!buffer1.nntp.dca1.giganews.com!buffer2.nntp.dca1.giganews.com!nntp.posted.internetamerica!news.posted.internetamerica.POSTED!not-for-mail
NNTP-Posting-Date: Mon, 19 Dec 2016 21:04:22 -0600
Sender: Gordon Burditt
From: gordonb.b6e0s-at-burditt.org (Gordon Burditt)
Subject: Re: threads and exit() woes
Newsgroups: comp.unix.programmer
References: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
User-Agent: tin/2.3.4-20160628 ("Newton") (UNIX) (FreeBSD/10.0-RELEASE (i386))
Message-ID: <686dnZXvcrwrAsXFnZ2dnUU7-X3NnZ2d-at-posted.internetamerica>
Date: Mon, 19 Dec 2016 21:04:22 -0600
X-Usenet-Provider: http://www.giganews.com
NNTP-Posting-Host: 108.65.82.77
X-Trace: sv3-eRfoIVM4LYkHTQNUg5bmtY+uAQHhj/2tqYECM+4OZEBQF6etbDK0BOqfpb8vPx1BgASeh9cDSUnm7RW!/xFB4Z4qtAXvSEIDkOx7ueBJsd3MEttDvqZJXjO1k03olvCJ2EAZ1v06VdnLDMmb3EH6Fl+r1DRv!VIb/W0sVliknDlKE+H2Tzv3ZNvZN
X-Complaints-To: abuse-at-airmail.net
X-DMCA-Complaints-To: abuse-at-airmail.net
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 3809
Xref: panix comp.unix.programmer:236746

>>The idea to use 'D' state for network filesystems is obviously moronic
>>and there should be some kind of 'emergency abort' for removable storage
>>devices, too.
>
> FreeBSD had a nice bug back in the day (maybe still does) whereby if you
> mounted a floppy disk as a filesystem then removed the disk the kernel would
> crash. Despite numerous people including myself pointing this out they still
> hadn't fixed it by 6.0, at which point I switched to linux for other reasons.

I expect that you would have the same problem for *ANY* removable
device with a UFS filesystem with soft updates enabled (on FreeBSD
10.1, and I think on 11.0). I've managed to trigger some kind of
panic related to soft updates by accidental removal of a mounted
filesystem (as in "accidentally yanked the cable out"). Floppies
using a FAT-16 filesystem probably won't have this issue. Neither,
it seems, will a UFS filesystem with soft updates turned off. The
data is inconsistent, but the system doesn't panic. Sometimes, the
panic was triggered after the program that wrote the data had already
terminated (but not all data flushed to disk). Soft updates does
seem to work well for actually non-removable drives. The problem
of panics doesn't exist when non-removable drives are removed from
the system by a power failure.

I'm not sure about journaling on UFS, but journaling is usually
unsuitable for my application for removable media: large copy to
the drive, followed by the data being read-only for a long time
(maybe months), or else read a few times (usually by different
systems) and then deleted. Journaling increases the number of
writes (possibly wearing out flash drives earlier), and I don't
really care about the integrity of the data *between* the time the
copy starts and everything gets written. I do care about data
integrity after it's unmounted and re-mounted.

No, this wasn't any essential filesystem like /, swap, /usr, or
/var. Most of the time it was /mnt or /mnt2, filesystems used for
data transfer or archive using USB memory sticks, or a USB hard
drive. I suppose it would also happen with a USB or normal floppy
drive. Nothing is permanently mounted on /mnt. In case of accidental
disconnection, I'd expect the data in process of being transferred
to be toast, and I really don't care much about that. I can't trust
the copy anyway.


--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!goblin3!goblin.stu.neva.ru!news.mb-net.net!open-news-network.org!.POSTED!not-for-mail
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 19 Dec 2016 20:29:59 +0100
Organization: MB-NET.NET for Open-News-Network e.V.
Message-ID:
References: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
NNTP-Posting-Host: aftr-95-222-29-121.unity-media.net
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: gwaiyur.mb-net.net 1482175799 32441 95.222.29.121 (19 Dec 2016 19:29:59 GMT)
X-Complaints-To: abuse-at-open-news-network.org
NNTP-Posting-Date: Mon, 19 Dec 2016 19:29:59 +0000 (UTC)
User-Agent: Mozilla/5.0 (OS/2; Warp 4.5; rv:24.0) Gecko/20100101 Thunderbird/24.8.1
In-Reply-To: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
Xref: panix comp.unix.programmer:236745

On 16.12.16 18.38, Rainer Weikusat wrote:
>> Bugs and I/O errors can occur everywhere. Not that nice, but that's life.
>> The only problem is the kernel is unable to recover from these errors
>> without reboot. This is not contemporary.
>
> It is contemporary because it's happening now.
>
> 'Uninterruptible sleep' state usually means 'the operation being waited
> for is always expected to complete' as it's entirely within the domain
> of the local system. Insofar the state persists when talking to a
> device, that's usually a hardware failure. Another possible cause would
> be a kernel mutex deadlock.

Even if DMA is involved it should be possible to cancel this operation.
And well, if a hardware DMA does not complete within a few minutes it
will likely never complete. So unloading the driver is just fine in
99.9% of the cases.

> Interruptible sleeping needs correct support code for every instance of
> a sleep. That's a whole load of opportunities for additional bugs as
> this will usually need 'resource allocation unwinding' back up the
> complete callstack.

Agree.
But I am not talking about a graceful exit. Just cancel all related threads.
Of course, this might leave the driver in an inconsistent state. Not too
surprising since there is a bug. So the next action is to forcibly
unload the driver. Since most drivers reset their device when loaded
(again) it is likely that the hardware could recover from this error.

> It also needs to be handled correctly in all
> applications. IMHO, it is very questionable whether this is really a good idea
> "just in case there's a kernel bug".

I do not see any action other than "kill" that could be executed in this
state. So I see no need for any action in userspace.


> It's entirely unclear what "recovery in case of hardware errors" should
> look like. If a mass storage device fails, the result is going to be
> "unpleasant" regardless of requiring a reboot to paper over the issue
> for some time.

If it is the root filesystem or swap, yes. There is no reasonable recovery.
But most of the time state D is not related to the system disk. More
likely it is a WLAN device (this kind of hardware is amazingly unreliable)
or a USB stick or some other less important device.

> The idea to use 'D' state for network filesystems is obviously moronic
> and there should be some kind of 'emergency abort' for removable storage
> devices, too.

Indeed. NFS is really annoying if the network is not 100% solid.


Marcel

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!168.235.88.217.MISMATCH!2.us.feeder.erje.net!feeder.erje.net!1.eu.feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: jt-at-toerring.de (Jens Thoms Toerring)
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 13 Dec 2016 22:32:32 GMT
Organization: Freie Universitaet Berlin
Message-ID:
References:
X-Trace: news.uni-berlin.de xD10nltki5lG8On22aX5TgJ6Vu+omUj+ScqRv68mkSTi2x
X-Orig-Path: not-for-mail
User-Agent: tin/2.1.1-20120623 ("Mulindry") (UNIX) (Linux/3.2.0-4-amd64 (x86_64))
Xref: panix comp.unix.programmer:236715

Hi,

thank you all - I'm quite overwhelmed by the number and
quality of responses! So please don't be annoyed if I don't
respond to each post in detail.

As usual I guess I've looked too much at "red herrings".
It doesn't seem to have been something really related to
threads. After a lot more of looking at the rather longish
output of strace I started to notice a pattern, i.e. that
one of the threads got interrupted in a call of close().
This often happened a long (relatively speaking) time
before the software watchdog tried to stop the program - and
that thread never got re-scheduled.

So I switched my attention to the serial driver (that close()
call was for a device file for one of the serial ports of the
processor) and found a different version of it. And, lo and
behold, with that updated driver I haven't seen any of that
strange behaviour anymore for about 400 test runs. While
that is, of course, no proof that everything is well, it is at
least encouraging ;)

Unfortunately, the somewhat restricted tools I have at my
disposal don't tell me much about what state a process is in.
'ps' is rather terse in what it tells you (no D/S/R etc., i.e.
no STAT field at all) compared to what one is used to from a PC.
But the process/thread was definitely neither sleeping nor a
zombie - it was so active that it used up about 50% of the CPU
time, and obviously somehow kept [ksoftirqd] busy as well ;-)

So from what I can say at the moment it was a slightly buggy
driver that, in what manner I can't tell yet, didn't close
the device file as requested and thus kept the program from
exiting. At least my belief in TLPI/APUE has been restored
in that it most likely was a situation where an exit() would
have killed all threads had a buggy driver not intervened ;-)

Thank you all and best regards, Jens
--
\ Jens Thoms Toerring ___ jt-at-toerring.de
\__________________________ http://toerring.de

--------------03A2DB7651F09BFE8C70F091
Content-Type: message/rfc822;
name="Re: threads and exit() woes.eml"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="Re: threads and exit() woes.eml"

Path: reader1.panix.com!panix!bloom-beacon.mit.edu!bloom-beacon.mit.edu!168.235.88.217.MISMATCH!feeder.erje.net!2.us.feeder.erje.net!newsfeed.fsmpi.rwth-aachen.de!newsfeed.straub-nv.de!news-1.dfn.de!news.dfn.de!news.informatik.hu-berlin.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: Jorgen Grahn
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 14 Dec 2016 00:10:17 GMT
Message-ID:
References:

X-Trace: individual.net KXStxWAi7DhXSdgmn1cKVAhGNQTmm7nj6VeFQboa2GiXvI7hJP
Cancel-Lock: sha1:9Zpq4aMyThIc9MN998m/x380B/w=
User-Agent: slrn/pre1.0.0-18 (Linux)
Xref: panix comp.unix.programmer:236716

On Tue, 2016-12-13, Jens Thoms Toerring wrote:
> Hi,
>
> thank you all - I'm quite overwhelmed by the number and
> quality of responses! So please don't be annoyed if I don't
> respond to each post in detail.
>
> As usual I guess I've looked too much at "red herrings".
> It doesn't seem to have been something really related to
> threads. After a lot more of looking at the rather longish
> output of strace I started to notice a pattern, i.e. that
> one of the threads got interrupted in a call of close().
> This often happened a long (relatively speaking) time
> before the software watchdog tried to stop the program - and
> that thread never got re-scheduled.
>
> So I switched my attention to the serial driver (that close()
> call was for a device file for one of the serial ports of the
> processor)

Seems that was the turning point. Nice!

> and found a different version of it. And, lo and
> behold, with that updated driver I haven't seen any of that
> strange behaviour anymore for about 400 test runs. While
> that is, of course, no proof that everything is well, it is at
> least encouraging ;)
>
> Unfortunately, the somewhat restricted tools I have at my
> disposal don't tell me much about what state a process is in.
> 'ps' is rather terse in what it tells you (no D/S/R etc., i.e.
> no STAT field at all) compared to what one is used to from a PC.

One useful trick is to look in the Linux /proc file system. I think
that's where ps gets its information anyway, and there's more useful
information in there too. The proc(5) man page et cetera may be
needed to interpret it.
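
A small sketch of that /proc lookup, assuming Linux and a Python
interpreter on the target. The function names are made up for
illustration. Per proc(5), the third field of /proc/<pid>/stat is the
one-letter process state (R, S, D, Z, ...); the second field is the
command name in parentheses, which can itself contain spaces or ')'.

```python
def parse_stat_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def process_state(pid="self"):
    """Read the state of a process (e.g. 'R', 'S', 'D', 'Z') from /proc."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat_state(f.read())
```

Each thread of a process has its own entry under /proc/<pid>/task/<tid>/stat
in the same format, which matters when only one thread is stuck.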

> But the process/thread was definitely neither sleeping nor a
> zombie - it was so active that it used up about 50% of the CPU
> time, and obviously somehow kept [ksoftirqd] busy as well ;-)

> So from what I can say at the moment it was a slightly buggy
> driver that, in what manner I can't tell yet, didn't close
> the device file as requested and thus kept the program from
> exiting.

A guess: the buggy serial driver sometimes couldn't deal with the
resource cleanup caused by the file descriptor closing. close() never
returned but initiated some work: partly attributed to the process,
and partly to the kernel itself. Maybe the work was actual I/O.

Probably you'd have triggered the same thing with a 'kill -9' or an
abort() as with exit(). In both cases there's a freeing of kernel
resources associated with that file descriptor.
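
A minimal sketch of that point, assuming a POSIX system and Python;
read_end_after_child_dies is a made-up name. The child never calls
close() on its pipe descriptor, yet the parent sees end-of-file both
after a plain exit and after SIGKILL ('kill -9'), because the kernel
releases the descriptor when the process dies. It uses os._exit(), the
immediate variant, rather than the C library exit() discussed above,
and it says nothing about the serial driver itself.

```python
import os
import signal


def read_end_after_child_dies(how):
    """Fork a child that holds the write end of a pipe and never closes it.

    Returns what the parent reads once the child is gone; b"" means EOF,
    i.e. the kernel released the child's descriptor on its behalf.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        if how == "exit":
            os._exit(0)      # plain process exit, no explicit close()
        signal.pause()       # wait here until the parent sends SIGKILL
        os._exit(1)          # only reached if some other signal woke us
    os.close(write_fd)       # parent keeps only the read end
    if how == "kill":
        os.kill(pid, signal.SIGKILL)
    data = os.read(read_fd, 1)   # blocks until every write end is closed
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data
```

If the release path inside a driver hangs, as guessed above, that same
kernel-side cleanup is where the process would get stuck, whichever way
it was told to die.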

> At least my belief in TLPI/APUE has been restored
> in that it most likely was a situation where an exit() would
> have killed all threads had a buggy driver not intervened ;-)
>
> Thank you all and best regards, Jens

/Jorgen

--
// Jorgen Grahn \X/ snipabacken.se> O o .

--------------03A2DB7651F09BFE8C70F091
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Learn mailing list
Learn-at-nylxs.com
http://lists.mrbrklyn.com/mailman/listinfo/learn

--------------03A2DB7651F09BFE8C70F091--
