Now that we use kill() instead of raise(), everything should work just
fine.
This reverts commit a86f89d333d870e6714bd28c695ba1774df3d7f5.
Fixed-in: 728c5dc1 ("Use kill() over raise() for raising the signal (fixes osx 10.14 with kqueue)")
Fixes: #747
On OSX 10.14+ raise() uses pthread_kill() (verified with dtruss), and
for some reason signals raised with pthread_kill() are not received by
kqueue EVFILT_SIGNAL.
On OSX 10.11, by contrast, raise()/pthread_kill() uses plain kill() and
everything works fine (linux does the same, but uses tgkill() instead
of kill()).
Here is a simple reproducer that installs an alarm to show that the
signal is not received by the kqueue backend:
https://gist.github.com/azat/73638b8e3b0fa563a20dadcca9e652a1
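For illustration, a minimal sketch of the "kill() over raise()" idea in
plain POSIX C. The demo_* names are hypothetical and not taken from
libevent; the sketch only shows a process-directed signal being sent to
ourselves and counted by an ordinary handler, it does not exercise
kqueue.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of deliveries seen; sig_atomic_t is safe to touch in a handler. */
static volatile sig_atomic_t demo_signal_count;

static void demo_on_signal(int sig)
{
	(void)sig;
	demo_signal_count++;
}

/* Install the counting handler for sig; returns 0 on success. */
static int demo_install_handler(int sig)
{
	struct sigaction sa;

	sa.sa_handler = demo_on_signal;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	return sigaction(sig, &sa, NULL);
}

/*
 * Send sig to the whole process with kill(getpid(), sig) rather than
 * raise(sig), so the signal is process-directed, not thread-directed.
 */
static int demo_signal_self(int sig)
{
	assert(sig > 0);
	return kill(getpid(), sig);
}
```

Usage: call demo_install_handler(SIGUSR1) once, then
demo_signal_self(SIGUSR1) wherever raise(SIGUSR1) would have been used.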
Refs: #747
Fixes: #765
There can be tricky cases (reproducible by reducing
SO_RCVBUF/SO_SNDBUF to 6144 on linux; be aware that linux doubles
this value) where a write event is still pending although we have read
enough.
This should be fixed in a more sophisticated way, but to make the patch
easy to backport, let's simply break the loop manually.
The ssl/bufferevent_wm originally failed on solaris.
Windows uses intptr_t for sockets instead of a regular int.
Also introduce tt_fd_op(), since tt_int_op() cannot be used for
comparing fds, which are not always int.
The following code will not work correctly under win x64:
evutil_socket_t very_long_pair_name[2];
int *pair = very_long_pair_name; // <-- accessing the second word of the first element
Because sizeof(evutil_socket_t) == sizeof(intptr_t) == 8
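A small sketch of the same pitfall that can be tried on any platform.
demo_socket_t is a hypothetical stand-in for a pointer-sized socket type
(it is not evutil_socket_t), and memcpy() is used to look at the bytes an
"int *" alias would see without actually aliasing the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pointer-sized socket type, as on win x64. */
typedef intptr_t demo_socket_t;

/*
 * Return the int that an "int *" alias of pair would read at index idx.
 * On a platform where sizeof(demo_socket_t) > sizeof(int), index 1 lands
 * inside the first element instead of on the second one.
 */
static int demo_int_view(const demo_socket_t *pair, size_t idx)
{
	int value;

	assert(idx < 2);
	memcpy(&value, (const unsigned char *)pair + idx * sizeof(int),
	    sizeof(value));
	return value;
}

/* The correct way: index the array with its real element type. */
static demo_socket_t demo_socket_at(const demo_socket_t *pair, size_t idx)
{
	assert(idx < 2);
	return pair[idx];
}
```

With pair = {3, 4}, demo_socket_at(pair, 1) always yields 4, while
demo_int_view(pair, 1) yields 4 only where the socket type has the same
size as int.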
P.S. 5334762f fixed a different test instead of the one that really
fails.
Fixes: 5334762f ("test/et/et: fix it by using appropriate type for the SOCKET (evutil_socket_t)")
Refs: #750
This patch marks testcases that only fail under travis-ci/appveyor with
TT_RETRIABLE, since otherwise there is too much noise; other issues
(like failures under vagrant boxes) will be investigated separately.
linux (from travis-ci only):
- http/cancel_by_host_no_ns
- http/cancel_by_host_inactive_server
- http/cancel_by_host_ns_timeout
- http/cancel_by_host_ns_timeout_inactive_server
- thread/conditions_simple
- util/monotonic_prc_precise
- util/usleep
- main/del_wait
vagrant/ubuntu box (this is the only exception):
- thread/no_events
win32 (from appveyor only):
- main/active_later
- main/persistent_active_timeout
And we should use TT_RETRIABLE over TT_OFF_BY_DEFAULT/TT_SKIP when it
makes sense.
But "test-ratelim__group_lim" is still left.
We have some tests that produce false positives because they are bound
to real/CPU time, but they are pretty generic and we do not want to skip
them by default.
TT_RETRIABLE is a flag that tells tinytest to retry the test in case of
failure; use it to avoid the following kinds of false positives:
- real time-related
- CPU time-related
I think it is better to see (or grep for) RETRYING messages than to
ignore failed builds completely.
There is deliberately no configuration switch for the number of retries
(only 3 retries and no more).
And this is how it looks BTW:
$ gcc ../test/tinytest_demo.c ../test/tinytest.c
$ ./a.out --verbose --no-fork
demo/timeout_retry
demo/timeout_retry:
FAIL ../test/tinytest_demo.c:201: assert(i != 1): 1 vs 1
[timeout_retry FAILED]
[RETRYING timeout_retry (3)]
demo/timeout_retry:
OK ../test/tinytest_demo.c:201: assert(i != 1): 2 vs 1
OK ../test/tinytest_demo.c:213: assert(t2-t1 >= 4): 5 vs 4
OK ../test/tinytest_demo.c:215: assert(t2-t1 <= 6): 5 vs 6
1 tests ok. (0 skipped)
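The retry behaviour described above can be sketched roughly as follows.
This is not tinytest's code: the demo_* names are hypothetical, and the
sketch assumes that "only 3 retries" means up to three re-runs after the
first failed attempt.

```c
#include <assert.h>
#include <stddef.h>

enum demo_outcome { DEMO_FAIL = 0, DEMO_OK = 1 };
typedef enum demo_outcome (*demo_test_fn)(void *arg);

/* Fixed on purpose, mirroring "no configuration switch". */
enum { DEMO_MAX_RETRIES = 3 };

/*
 * Run fn once and, while it keeps failing, re-run it up to
 * DEMO_MAX_RETRIES more times. *attempts receives the number of runs.
 */
static enum demo_outcome
demo_run_retriable(demo_test_fn fn, void *arg, int *attempts)
{
	int retries_left = DEMO_MAX_RETRIES;
	enum demo_outcome res;

	assert(fn != NULL && attempts != NULL);
	*attempts = 0;
	for (;;) {
		res = fn(arg);
		++*attempts;
		if (res == DEMO_OK || retries_left == 0)
			return res;
		--retries_left;
	}
}

/* Sample "flaky" test: fails on the first call, passes afterwards. */
static enum demo_outcome demo_fails_once(void *arg)
{
	int *calls = arg;

	return ++*calls > 1 ? DEMO_OK : DEMO_FAIL;
}

/* Sample test that never passes, to show the retry budget is bounded. */
static enum demo_outcome demo_always_fails(void *arg)
{
	(void)arg;
	return DEMO_FAIL;
}
```

A test that fails once is reported OK after two runs, while a test that
always fails is given up on after four runs in total.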
Even after the referenced patch, recursive callbacks from
evbuffer_drain(bev_input) are still possible, i.e.:
wm_transfer() -> evbuffer_drain() -> wm_transfer()
inc(ctx->get)
But if we increment ctx->get before the drain, then we will not add more
data to the buffer.
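A minimal simulation of why the ordering matters. Everything here is a
hypothetical stand-in (demo_on_read for the read callback, demo_drain for
a drain that re-enters the callback once); it is not the code from
regress_ssl.c.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_CHUNK = 512, DEMO_LIMIT = 40960 };

struct demo_ctx {
	size_t got;	/* bytes accounted so far */
	size_t limit;	/* stop once this many bytes were read */
	int depth;	/* guards against unbounded recursion in the demo */
};

static void demo_on_read(struct demo_ctx *ctx, int count_before_drain);

/* Stand-in for a drain that invokes the callback in place, once. */
static void demo_drain(struct demo_ctx *ctx, int count_before_drain)
{
	if (ctx->depth == 0) {
		ctx->depth++;
		demo_on_read(ctx, count_before_drain);
		ctx->depth--;
	}
}

static void demo_on_read(struct demo_ctx *ctx, int count_before_drain)
{
	assert(ctx != NULL);
	if (ctx->got >= ctx->limit)
		return;	/* "break" */
	if (count_before_drain) {
		/* Nested call already sees the updated counter and breaks. */
		ctx->got += DEMO_CHUNK;
		demo_drain(ctx, count_before_drain);
	} else {
		/* Nested call sees the stale counter and reads once more. */
		demo_drain(ctx, count_before_drain);
		ctx->got += DEMO_CHUNK;
	}
}

/* Start one chunk below the limit and report the final counter. */
static size_t demo_run(int count_before_drain)
{
	struct demo_ctx ctx = { DEMO_LIMIT - DEMO_CHUNK, DEMO_LIMIT, 0 };

	demo_on_read(&ctx, count_before_drain);
	return ctx.got;
}
```

Counting after the drain overshoots the limit by one chunk, while
counting before the drain stops exactly at the limit.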
Refs: 54c6fe3c ("regress_ssl: make ssl/bufferevent_wm_filter more fault-tolerance")
CI: https://ci.appveyor.com/project/nmathewson/libevent/build/job/f0rv299i71wnuxdq#L2546
Due to in-place callbacks (i.e. no BEV_OPT_DEFER_CALLBACKS) we cannot be
sure that wm_transfer() will not be called recursively, and indeed it
still happens sometimes; the referenced patch increases how often this
happens, especially for linux/poll.
Fixes: 66304a23cf748714159c988e78f35401c5352827 ("Fix
ssl/bufferevent_wm_filter when bev does not reach watermark on break")
EVHTTP_CON_READ_ON_WRITE_ERROR works only if the error has already been
read from the socket, but once we get EPIPE on write we cannot read from
the socket anymore, and win32 does not guarantee that the read will
happen before that (although it happens from time to time).
In the referenced patch I just replaced the callback with one that does
not expect 417, but as I already wrote, this is not always true (i.e. it
is flaky).
Fixes: 3b581693ac1967f7f8d98491cb772a1b415eb4cd ("test/http:
read_on_write_error: fix it for win32")
For ssl/bufferevent_wm* we have the following configuration:
- payload_len = 1024
- wm_high = 5120
- limit = 40960
- to_read = 512
In this test we expect that, with the high watermark set to "wm_high",
we will read "limit" bytes by reading "to_read" at a time while adding
"payload_len" at a time (the "to_read"/"payload_len" sizes are chosen so
that the watermark eventually overflows).
Once we have read "limit" bytes we break, by disabling EV_READ and
resetting the callbacks. However, this does not work if the watermark
has not been reached when we want to break: watermarks install an
evbuffer callback on the input buffer, and if the watermark has not been
reached that callback enables EV_READ, whereupon be_openssl_enable()
reads from the underlying buffer (when the openssl bufferevent was
created via bufferevent_openssl_filter_new()) and calls the callback
again (until it reaches the watermark or has read everything from the
underlying buffer, which is why it stops in our case).
And this is exactly what happened in win32, you can see this in the
following logs:
- win32 before:
OK C:\vagrant\test\regress_ssl.c:829: wm_transfer-client(00DC2750): in: 4608, out: 0, got: 40960
OK C:\vagrant\test\regress_ssl.c:834: wm_transfer-client(00DC2750): break
OK C:\vagrant\test\regress_ssl.c:829: wm_transfer-client(00DC2750): in: 4608, out: 0, got: 41472
OK C:\vagrant\test\regress_ssl.c:834: wm_transfer-client(00DC2750): break
OK C:\vagrant\test\regress_ssl.c:829: wm_transfer-client(00DC2750): in: 4608, out: 0, got: 41984
OK C:\vagrant\test\regress_ssl.c:834: wm_transfer-client(00DC2750): break
OK C:\vagrant\test\regress_ssl.c:829: wm_transfer-client(00DC2750): in: 4608, out: 0, got: 42496
OK C:\vagrant\test\regress_ssl.c:834: wm_transfer-client(00DC2750): break
- win32 after:
OK C:\vagrant\test\regress_ssl.c:821: wm_transfer-client(00FC26F0): break
OK C:\vagrant\test\regress_ssl.c:836: wm_transfer-client(00FC26F0): in: 4800, out: 0, got: 40960
- linux before:
OK ../test/regress_ssl.c:829: wm_transfer-client(0x55555566f5e0): in: 5120, out: 0, got: 40960
OK ../test/regress_ssl.c:834: wm_transfer-client(0x55555566f5e0): break
- linux after:
OK ../test/regress_ssl.c:821: wm_transfer-client(0x55555566f5e0): break
OK ../test/regress_ssl.c:836: wm_transfer-client(0x55555566f5e0): in: 5120, out: 0, got: 40960
(As you can see, in the linux case we had already reached the watermark,
hence it passed before.)
So fix the issue by breaking before draining.
But while fixing this I wondered: is it right to read from
be_openssl_enable()? Maybe we should at least force deferred callbacks?
The following tests use fds without the O_NONBLOCK flag:
- main/free_active_base
- main/many_events
- et/et (has some other bits cleaned up by using TT_* flags and test
setup/cleanup callbacks)
And hence they will fail in debug mode (EVENT_DEBUG_MODE=):
Assertion flags & O_NONBLOCK failed in event_debug_assert_socket_nonblocking_
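For reference, a minimal POSIX sketch of what "having O_NONBLOCK set"
means for an fd. In the libevent tree evutil_make_socket_nonblocking()
serves this purpose; the demo_* helpers below are hypothetical and use
plain fcntl() only so that the example is self-contained.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Set O_NONBLOCK on fd; returns 0 on success, -1 on error. */
static int demo_make_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	if (flags & O_NONBLOCK)
		return 0;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* The same condition the debug-mode assertion checks: flags & O_NONBLOCK. */
static int demo_is_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	return flags != -1 && (flags & O_NONBLOCK) != 0;
}

/* Create a pipe with both ends non-blocking; returns 0 on success. */
static int demo_nonblocking_pipe(int fds[2])
{
	assert(fds != NULL);
	if (pipe(fds) == -1)
		return -1;
	if (demo_make_nonblocking(fds[0]) == -1 ||
	    demo_make_nonblocking(fds[1]) == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}
```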
And use it in places where event_debug() should be called (since it
requires access to "event_debug_logging_mask_" and in win32 it is
tricky).
One of the places covered by this patch is the test for
event_debug().
One tricky bit is the reply to the BIO_C_GET_FD command, since otherwise
it will try to close(0) and the accepted bev in
ssl/bufferevent_connect_sleep will leak. The others seem more or less
trivial.
This was done to make sure that at least the generic cases do not leak
(tricky cases were listed in nmathewson/Libevent#83).
And this will allow running ssl/.. with --no-fork
- http/basic_trailing_space -- covers cases when there is trailing space
after the request line (nginx handles this)
- http/simple_nonconformant -- covers non RFC3986 conformant URIs
Between the start (setting "started_at") and the comparison of the
timeout trigger times against the start (test_timeval_diff_eq), there
are too many things that can introduce extra delays and eventually fail
the test on a machine that is short of CPU.
And this is exactly what happened on:
- travis-ci
- #262
Here is a simple reproducer that I came up with for this issue:
docker run --cpus=0.01 -e LD_LIBRARY_PATH=$PWD/lib -e PATH=/usr/bin:/bin:$PWD/bin -v $PWD:$PWD --rm -it debian:testing regress --no-fork --verbose bufferevent/bufferevent_timeout
Under limited CPU (see reproducer) the test almost always fails because
"write_timeout_at" exceeds the default timeval diff tolerance
(test_timeval_diff_eq() has a tolerance of 50 ms), i.e.:
FAIL ../test/regress_bufferevent.c:1040: assert(labs(timeval_msec_diff(((&started_at)), ((&res1.write_timeout_at))) - (100)) <= 50): 101 vs 50
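The shape of that check can be sketched as follows. The demo_* helpers
are hypothetical and only mimic the assert shown above (difference in
milliseconds, compared against an expected value with a tolerance); they
are not the helpers from the regress sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/time.h>

/* Milliseconds elapsed from start to end (truncated toward zero). */
static long demo_msec_diff(const struct timeval *start,
    const struct timeval *end)
{
	long long usec;

	assert(start != NULL && end != NULL);
	usec = (long long)(end->tv_sec - start->tv_sec) * 1000000LL +
	    (long long)(end->tv_usec - start->tv_usec);
	return (long)(usec / 1000);
}

/* Non-zero if the elapsed time is within tolerance_ms of expected_ms. */
static int demo_diff_within(const struct timeval *start,
    const struct timeval *end, long expected_ms, long tolerance_ms)
{
	return labs(demo_msec_diff(start, end) - expected_ms) <= tolerance_ms;
}
```

For example, a timeout that was expected after 100 ms but fired 201 ms
after the start is off by 101 ms, which is outside a 50 ms tolerance,
matching the "101 vs 50" in the failure above.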
But under some setups the write timeout may not trigger at all; the
reason is that we write 1024*1024 bytes to the bufferevent, so if
evbuffer_write_iovec() is delayed after writev() and does not send more
than one vector at a time [1], this is pretty simple to trigger, i.e.:
FAIL ../test/regress_bufferevent.c:1040: assert(labs(timeval_msec_diff(((&started_at)), ((&res1.write_timeout_at))) - (100)) <= 50): 1540155888478 vs 50
[1]: https://gist.github.com/azat/b72773dfe7549fed865d439e03de05c1
So this patch just sends a small static payload in all cases (plus a few
more asserts added).
The outcome of this patch is that all regression tests pass on
travis-ci for the linux box [2], while before they failed almost always
[3].
Also, the reproducer with CPU limiting via docker survives some
iterations (and strictly speaking it should have less CPU than the
travis-ci workers, I guess).
[2]: https://travis-ci.org/azat/libevent/builds/444391481
[3]: https://travis-ci.org/libevent/libevent/builds/444336505
Only one test, getaddrinfo_async, uses a service name and manually
checks whether that works; the others should not assume that it is
available and works.
There was already an attempt to overcome some possible limitations, like
lack of "http" in /etc/services in
d6bafbbeb27ff3943d6f3b6783bcded76384c31e ("test/dns: replace servname
since solaris does not have "http"")
When there is no /etc/services file, evdns_getaddrinfo() will fail (with
service="ssh") and hence jump to the "end" label with rp.lock still
locked, which with debug locking checks will bail with:
[err] ../evthread.c:220: Assertion lock->count == 0 failed in debug_lock_free
So add an rp.locked flag, and unlock the lock before freeing it if it is
still locked.
And here is how you can reproduce the issue:
$ docker run -e LD_LIBRARY_PATH=$PWD/lib -e PATH=/usr/bin:/bin:$PWD/bin -v $PWD:$PWD --rm -it debian:testing regress dns/getaddrinfo_race_gotresolve
(since debian:testing does not have /etc/services)
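The "remember whether we hold the lock, and release it before freeing"
pattern can be sketched with a plain pthread mutex. This is only an
illustration under that assumption: the test itself uses libevent's own
lock wrappers, and the demo_* names below are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct demo_race_param {
	pthread_mutex_t lock;
	int locked;	/* non-zero while we hold lock */
};

static int demo_param_init(struct demo_race_param *rp)
{
	assert(rp != NULL);
	rp->locked = 0;
	return pthread_mutex_init(&rp->lock, NULL);
}

static int demo_param_lock(struct demo_race_param *rp)
{
	int err = pthread_mutex_lock(&rp->lock);

	if (err == 0)
		rp->locked = 1;
	return err;
}

static int demo_param_unlock(struct demo_race_param *rp)
{
	int err = pthread_mutex_unlock(&rp->lock);

	if (err == 0)
		rp->locked = 0;
	return err;
}

/*
 * Cleanup path (the "end" label): an early failure may arrive here with
 * the lock still held, so release it before destroying it.
 */
static int demo_param_free(struct demo_race_param *rp)
{
	if (rp->locked) {
		int err = demo_param_unlock(rp);

		if (err != 0)
			return err;
	}
	return pthread_mutex_destroy(&rp->lock);
}
```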
Avoid warnings from any modern C99 compiler about the implicit
declaration of pthread_create, as shown by the following:
test/regress_dns.c:2226:2: warning: implicit declaration of function
'pthread_create' is invalid in C99 [-Wimplicit-function-declaration]
THREAD_START(thread[0], race_base_run, &rp);
^
test/regress_thread.h:35:2: note: expanded from macro 'THREAD_START'
pthread_create(&(threadvar), NULL, fn, arg)
^
test/regress_dns.c:2226:2: warning: this function declaration is not a prototype
[-Wstrict-prototypes]
test/regress_thread.h:35:2: note: expanded from macro 'THREAD_START'
pthread_create(&(threadvar), NULL, fn, arg)
^
$ clang --version
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Closes: #686 (cherry-picked)