summaryrefslogtreecommitdiff
path: root/migration
AgeCommit message (Collapse)Author
2019-01-23migration: introduce pages-per-secondXiao Guangrong
It introduces a new statistic, pages-per-second, as bandwidth or mbps is not enough to measure the performance of posting pages out as we have compression, xbzrle, which can significantly reduce the amount of the data size, instead, pages-per-second is the one we want Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Message-Id: <20190111063732.10484-2-xiaoguangrong@tencent.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> With typo's Eric spotted fixed
2019-01-23vmstate: constify SaveVMHandlersMarc-André Lureau
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20181114133139.27346-1-marcandre.lureau@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2019-01-23migration/rdma: unregister fd handlerDr. David Alan Gilbert
Unregister the fd handler before we destroy the channel, otherwise we've got a race where we might land in the fd handler just as we're closing the device. (The race is quite data dependent, you just have to have the right set of devices for it to trigger). Corresponds to RH bz: https://bugzilla.redhat.com/show_bug.cgi?id=1666601 Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20190122173111.29821-1-dgilbert@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2019-01-23migration: unify error handling for process_incoming_migration_coFei Li
In the current code, if process_incoming_migration_co() fails we do the same error handing: set the error state, close the source file, do the cleanup for multifd, and then exit(EXIT_FAILURE). To make the code clearer, add a "goto fail" to unify the error handling. Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Fei Li <fli@suse.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20190113140849.38339-6-lifei1214@126.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2019-01-23migration: add more error handling for postcopy_ram_enable_notifyFei Li
Call postcopy_ram_incoming_cleanup() to do the cleanup when postcopy_ram_enable_notify fails. Besides, report the error message when qemu_ram_foreach_migratable_block() fails. Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Fei Li <fli@suse.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20190113140849.38339-5-lifei1214@126.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2019-01-23migration: multifd_save_cleanup() can't fail, simplifyFei Li
multifd_save_cleanup() takes an Error ** argument and returns an error code even though it can't actually fail. Its callers dutifully check for failure. Remove the useless argument and return value, and simplify the callers. Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Markus Armbruster <armbru@redhat.com> Signed-off-by: Fei Li <fli@suse.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20190113140849.38339-4-lifei1214@126.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2019-01-23migration: fix the multifd code when receiving less channelsFei Li
In our current code, when multifd is used during migration, if there is an error before the destination receives all new channels, the source keeps running, however the destination does not exit but keeps waiting until the source is killed deliberately. Fix this by dumping the specific error and let users decide whether to quit from the destination side when failing to receive packet via some channel. And update the comment for multifd_recv_new_channel(). Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Markus Armbruster <armbru@redhat.com> Signed-off-by: Fei Li <fli@suse.com> Reviewed-by: Peter Xu <peterx@redhat.com> Message-Id: <20190113140849.38339-3-lifei1214@126.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2019-01-21migration: Add post_save function to VMStateDescriptionAaron Lindsay
In some cases it may be helpful to modify state before saving it for migration, and then modify the state back after it has been saved. The existing pre_save function provides half of this functionality. This patch adds a post_save function to provide the second half. Signed-off-by: Aaron Lindsay <aclindsa@gmail.com> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-id: 20181211151945.29137-2-aaron@os.amperecomputing.com Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-01-17migration: Use strnlen() for fixed-size stringPhilippe Mathieu-Daudé
GCC 8 introduced the -Wstringop-overflow, which detect buffer overflow by string-modifying functions declared in <string.h>, such strncpy(), used in global_state_store_running(). GCC indeed found an incorrect use of strlen(), because this array is loaded by VMSTATE_BUFFER(runstate, GlobalState) then parsed using qapi_enum_parse which does not get the buffer length. Use strnlen() which returns sizeof(s->runstate) if the array is not NUL-terminated, assert the size is within range, and enforce the array to be NUL-terminated to avoid an overflow in qapi_enum_parse(). This fixes: CC migration/global_state.o qemu/migration/global_state.c: In function 'global_state_pre_save': qemu/migration/global_state.c:109:15: error: 'strlen' argument 1 declared attribute 'nonstring' [-Werror=stringop-overflow=] s->size = strlen((char *)s->runstate) + 1; ^~~~~~~~~~~~~~~~~~~~~~~~~~~ qemu/migration/global_state.c:24:13: note: argument 'runstate' declared here uint8_t runstate[100] QEMU_NONSTRING; ^~~~~~~~ cc1: all warnings being treated as errors make: *** [qemu/rules.mak:69: migration/global_state.o] Error 1 Suggested-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-01-17migration: Fix stringop-truncation warningMarc-André Lureau
GCC 8 added a -Wstringop-truncation warning: The -Wstringop-truncation warning added in GCC 8.0 via r254630 for bug 81117 is specifically intended to highlight likely unintended uses of the strncpy function that truncate the terminating NUL character from the source string. This new warning leads to compilation failures: CC migration/global_state.o qemu/migration/global_state.c: In function 'global_state_store_running': qemu/migration/global_state.c:45:5: error: 'strncpy' specified bound 100 equals destination size [-Werror=stringop-truncation] strncpy((char *)global_state.runstate, state, sizeof(global_state.runstate)); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ make: *** [qemu/rules.mak:69: migration/global_state.o] Error 1 Adding an assert is enough to silence GCC. (alternatively, we could hard-code "running") Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> [PMD: More verbose commit message] Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-01-11qemu/queue.h: leave head structs anonymous unless necessaryPaolo Bonzini
Most list head structs need not be given a name. In most cases the name is given just in case one is going to use QTAILQ_LAST, QTAILQ_PREV or reverse iteration, but this does not apply to lists of other kinds, and even for QTAILQ in practice this is only rarely needed. In addition, we will soon reimplement those macros completely so that they do not need a name for the head struct. So clean up everything, not giving a name except in the rare case where it is necessary. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-18qmp hmp: Make system_wakeup check wake-up support and run stateDaniel Henrique Barboza
The qmp/hmp command 'system_wakeup' is simply a direct call to 'qemu_system_wakeup_request' from vl.c. This function verifies if runstate is SUSPENDED and if the wake up reason is valid before proceeding. However, no error or warning is thrown if any of those pre-requirements isn't met. There is no way for the caller to differentiate between a successful wakeup or an error state caused when trying to wake up a guest that wasn't suspended. This means that system_wakeup is silently failing, which can be considered a bug. Adding error handling isn't an API break in this case - applications that didn't check the result will remain broken, the ones that check it will have a chance to deal with it. Adding to that, the commit before previous created a new QMP API called query-current-machine, with a new flag called wakeup-suspend-support, that indicates if the guest has the capability of waking up from suspended state. Although such guest will never reach SUSPENDED state and erroring it out in this scenario would suffice, it is more informative for the user to differentiate between a failure because the guest isn't suspended versus a failure because the guest does not have support for wake up at all. All this considered, this patch changes qmp_system_wakeup to check if the guest is capable of waking up from suspend, and if it is suspended. After this patch, this is the output of system_wakeup in a guest that does not have wake-up from suspend support (ppc64): (qemu) system_wakeup wake-up from suspend is not supported by this guest (qemu) And this is the output of system_wakeup in a x86 guest that has the support but isn't suspended: (qemu) system_wakeup Unable to wake up: guest is not in suspended state (qemu) Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20181205194701.17836-4-danielhb413@gmail.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Acked-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2018-12-14qapi: add conditions to REPLICATION type/commands on the schemaMarc-André Lureau
Add #if defined(CONFIG_REPLICATION) in generated code, and adjust the code accordingly. Made conditional: * xen-set-replication, query-xen-replication-status, xen-colo-do-checkpoint Before the patch, we first register the commands unconditionally in generated code (requires a stub), then conditionally unregister in qmp_unregister_commands_hack(). Afterwards, we register only when CONFIG_REPLICATION. The command fails exactly the same, with CommandNotFound. Improvement, because now query-qmp-schema is accurate, and we're one step closer to killing qmp_unregister_commands_hack(). * enum BlockdevDriver value "replication" in command blockdev-add * BlockdevOptions variant @replication and related structures. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20181213123724.4866-23-marcandre.lureau@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2018-11-27vmstate: constify VMStateFieldMarc-André Lureau
Because they are supposed to remain const. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20181114132931.22624-1-marcandre.lureau@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27migration: savevm: consult migration blockersPaolo Bonzini
There is really no difference between live migration and savevm, except that savevm does not require bdrv_invalidate_cache to be implemented by all disks. However, it is unlikely that savevm is used with anything except qcow2 disks, so the penalty is small and worth the improvement in catching bad usage of savevm. Only one place was taking care of savevm when adding a migration blocker, and it can be removed. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-21migration/migration.c: Add COLO dependency checksZhang Chen
Current COLO mode(independent disk mode) need replication module work together. Suggested by Dr. David Alan Gilbert <dgilbert@redhat.com>. Signed-off-by: Zhang Chen <chen.zhang@intel.com> Message-Id: <20181114190912.7242-1-chen.zhang@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-11-21migration/colo.c: Fix compilation issue when disable replicationZhang Chen
This compilation issue will occur when user use --disable-replication to config Qemu. Reported-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Message-Id: <20181101021226.6353-1-zhangckid@gmail.com> Tested-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-10-31migration: avoid segmentfault when take a snapshot of a VM which being migratedJia Lina
During an active background migration, snapshot will trigger a segmentfault. As snapshot clears the "current_migration" struct and updates "to_dst_file" before it finds out that there is a migration task, Migration accesses the null pointer in "current_migration" struct and qemu crashes eventually. Signed-off-by: Jia Lina <jialina01@baidu.com> Signed-off-by: Chai Wen <chaiwen@baidu.com> Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Message-Id: <20181026083620.10172-1-jialina01@baidu.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-10-29dirty-bitmaps: clean-up bitmaps loading and migration logicVladimir Sementsov-Ogievskiy
This patch aims to bring the following behavior: 1. We don't load bitmaps, when started in inactive mode. It's the case of incoming migration. In this case we wait for bitmaps migration through migration channel (if 'dirty-bitmaps' capability is enabled) or for invalidation (to load bitmaps from the image). 2. We don't remove persistent bitmaps on inactivation. Instead, we only remove bitmaps after storing. This is the only way to restore bitmaps, if we decided to resume source after [failed] migration with 'dirty-bitmaps' capability enabled (which means, that bitmaps were not stored). 3. We load bitmaps on open and any invalidation, it's ok for all cases: - normal open - migration target invalidation with dirty-bitmaps capability (bitmaps are migrating through migration channel, the are not stored, so they should have IN_USE flag set and will be skipped when loading. However, it would fail if bitmaps are read-only[1]) - migration target invalidation without dirty-bitmaps capability (normal load of the bitmaps, if migrated with shared storage) - source invalidation with dirty-bitmaps capability (skip because IN_USE) - source invalidation without dirty-bitmaps capability (bitmaps were dropped, reload them) [1]: to accurately handle this, migration of read-only bitmaps is explicitly forbidden in this patch. New mechanism for not storing bitmaps when migrate with dirty-bitmaps capability is introduced: migration filed in BdrvDirtyBitmap. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: John Snow <jsnow@redhat.com>
2018-10-29block/dirty-bitmaps: add user_locked status checkerJohn Snow
Instead of both frozen and qmp_locked checks, wrap it into one check. frozen implies the bitmap is split in two (for backup), and shouldn't be modified. qmp_locked implies it's being used by another operation, like being exported over NBD. In both cases it means we shouldn't allow the user to modify it in any meaningful way. Replace any usages where we check both frozen and qmp_locked with the new check. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20181002230218.13949-2-jsnow@redhat.com [w/edits Suggested-By: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>] Signed-off-by: John Snow <jsnow@redhat.com>
2018-10-23Merge remote-tracking branch 'remotes/armbru/tags/pull-error-2018-10-22' ↵Peter Maydell
into staging Error reporting patches for 2018-10-22 # gpg: Signature made Mon 22 Oct 2018 13:20:23 BST # gpg: using RSA key 3870B400EB918653 # gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" # gpg: aka "Markus Armbruster <armbru@pond.sub.org>" # Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867 4E5F 3870 B400 EB91 8653 * remotes/armbru/tags/pull-error-2018-10-22: (40 commits) error: Drop bogus "use error_setg() instead" admonitions vpc: Fail open on bad header checksum block: Clean up bdrv_img_create()'s error reporting vl: Simplify call of parse_name() vl: Fix exit status for -drive format=help blockdev: Convert drive_new() to Error vl: Assert drive_new() does not fail in default_drive() fsdev: Clean up error reporting in qemu_fsdev_add() spice: Clean up error reporting in add_channel() tpm: Clean up error reporting in tpm_init_tpmdev() numa: Clean up error reporting in parse_numa() vnc: Clean up error reporting in vnc_init_func() ui: Convert vnc_display_init(), init_keyboard_layout() to Error ui/keymaps: Fix handling of erroneous include files vl: Clean up error reporting in device_init_func() vl: Clean up error reporting in parse_fw_cfg() vl: Clean up error reporting in mon_init_func() vl: Clean up error reporting in machine_set_property() vl: Clean up error reporting in chardev_init_func() qom: Clean up error reporting in user_creatable_add_opts_foreach() ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-10-19migration: Fix !replay_can_snapshot() error handlingMarkus Armbruster
Calling error_report() in a function that takes an Error ** argument is suspicious. save_snapshot() and load_snapshot() do that, and then fail without setting an error. Wrong. The HMP commands survive this unscathed, since hmp_handle_error() does nothing when no error has been set. Callers main() (on behalf of -loadvm) and replay_vmstate_init() crash, but I'm not sure the error is possible there. Screwed up when commit 377b21ccea1 (v2.12.0) added incorrect error handling right next to correct examples. Fix by calling error_setg() instead of error_report(). Fixes: 377b21ccea1755a8b0dae822c29567c58dda6939 Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20181017082702.5581-13-armbru@redhat.com>
2018-10-19error: Fix use of error_prepend() with &error_fatal, &error_abortMarkus Armbruster
From include/qapi/error.h: * Pass an existing error to the caller with the message modified: * error_propagate(errp, err); * error_prepend(errp, "Could not frobnicate '%s': ", name); Fei Li pointed out that doing error_propagate() first doesn't work well when @errp is &error_fatal or &error_abort: the error_prepend() is never reached. Since I doubt fixing the documentation will stop people from getting it wrong, introduce error_propagate_prepend(), in the hope that it lures people away from using its constituents in the wrong order. Update the instructions in error.h accordingly. Convert existing error_prepend() next to error_propagate to error_propagate_prepend(). If any of these get reached with &error_fatal or &error_abort, the error messages improve. I didn't check whether that's the case anywhere. Cc: Fei Li <fli@suse.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20181017082702.5581-2-armbru@redhat.com>
2018-10-19COLO: quick failover process by kick COLO threadzhanghailiang
COLO thread may sleep at qemu_sem_wait(&s->colo_checkpoint_sem), while failover works begin, It's better to wakeup it to quick the process. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: notify net filters about checkpoint/failover eventzhanghailiang
Notify all net filters about the checkpoint and failover event. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: flush host dirty ram from cachezhanghailiang
Don't need to flush all VM's ram from cache, only flush the dirty pages since last checkpoint Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19savevm: split the process of different stages for loadvm/savevmZhang Chen
There are several stages during loadvm/savevm process. In different stage, migration incoming processes different types of sections. We want to control these stages more accuracy, it will benefit COLO performance, we don't have to save type of QEMU_VM_SECTION_START sections everytime while do checkpoint, besides, we want to separate the process of saving/loading memory and devices state. So we add three new helper functions: qemu_load_device_state() and qemu_savevm_live_state() to achieve different process during migration. Besides, we make qemu_loadvm_state_main() and qemu_save_device_state() public, and simplify the codes of qemu_save_device_state() by calling the wrapper qemu_savevm_state_header(). Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19qapi: Add new command to query colo statusZhang Chen
Libvirt or other high level software can use this command query colo status. You can test this command like that: {'execute':'query-colo-status'} Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19qapi/migration.json: Rename COLO unknown mode to none mode.Zhang Chen
Suggested by Markus Armbruster rename COLO unknown mode to none mode. Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19qmp event: Add COLO_EXIT event to notify users while exited COLOzhanghailiang
If some errors happen during VM's COLO FT stage, it's important to notify the users of this event. Together with 'x-colo-lost-heartbeat', Users can intervene in COLO's failover work immediately. If users don't want to get involved in COLO's failover verdict, it is still necessary to notify users that we exited COLO mode. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: Flush memory data from ram cacheZhang Chen
During the time of VM's running, PVM may dirty some pages, we will transfer PVM's dirty pages to SVM and store them into SVM's RAM cache at next checkpoint time. So, the content of SVM's RAM cache will always be same with PVM's memory after checkpoint. Instead of flushing all content of PVM's RAM cache into SVM's MEMORY, we do this in a more efficient way: Only flush any page that dirtied by PVM since last checkpoint. In this way, we can ensure SVM's memory same with PVM's. Besides, we must ensure flush RAM cache before load device state. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19ram/COLO: Record the dirty pages that SVM receivedZhang Chen
We record the address of the dirty pages that received, it will help flushing pages that cached into SVM. Here, it is a trick, we record dirty pages by re-using migration dirty bitmap. In the later patch, we will start the dirty log for SVM, just like migration, in this way, we can record both the dirty pages caused by PVM and SVM, we only flush those dirty pages from RAM cache while do checkpoint. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: Load dirty pages into SVM's RAM cache firstlyZhang Chen
We should not load PVM's state directly into SVM, because there maybe some errors happen when SVM is receving data, which will break SVM. We need to ensure receving all data before load the state into SVM. We use an extra memory to cache these data (PVM's ram). The ram cache in secondary side is initially the same as SVM/PVM's memory. And in the process of checkpoint, we cache the dirty pages of PVM into this ram cache firstly, so this ram cache always the same as PVM's memory at every checkpoint, then we flush this cached ram to SVM after we receive all PVM's state. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: Remove colo_state migration structZhang Chen
We need to know if migration is going into COLO state for incoming side before start normal migration. Instead by using the VMStateDescription to send colo_state from source side to destination side, we use MIG_CMD_ENABLE_COLO to indicate whether COLO is enabled or not. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: Add block replication into colo processZhang Chen
Make sure master start block replication after slave's block replication started. Besides, we need to activate VM's blocks before goes into COLO state. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: integrate colo compare with colo frameZhang Chen
For COLO FT, both the PVM and SVM run at the same time, only sync the state while it needs. So here, let SVM runs while not doing checkpoint, change DEFAULT_MIGRATE_X_CHECKPOINT_DELAY to 200*100. Besides, we forgot to release colo_checkpoint_semd and colo_delay_timer, fix them here. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-11migration: Stop postcopy fault thread before notifyingIlya Maximets
POSTCOPY_NOTIFY_INBOUND_END handlers will remove userfault fds from the postcopy_remote_fds array which could be still in use by the fault thread. Let's stop the thread before notification to avoid possible accessing wrong memory. Fixes: 46343570c06e ("vhost+postcopy: Wire up POSTCOPY_END notify") Cc: qemu-stable@nongnu.org Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Message-Id: <20181008160536.6332-2-i.maximets@samsung.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration/ram.c: Avoid taking address of fields in packed MultiFDInit_t structPeter Maydell
Taking the address of a field in a packed struct is a bad idea, because it might not be actually aligned enough for that pointer type (and thus cause a crash on dereference on some host architectures). Newer versions of clang warn about this: migration/ram.c:651:19: warning: taking address of packed member 'magic' of class or structure 'MultiFDInit_t' may result in an unaligned pointer value [-Waddress-of-packed-member] migration/ram.c:652:19: warning: taking address of packed member 'version' of class or structure 'MultiFDInit_t' may result in an unaligned pointer value [-Waddress-of-packed-member] migration/ram.c:737:19: warning: taking address of packed member 'magic' of class or structure 'MultiFDPacket_t' may result in an unaligned pointer value [-Waddress-of-packed-member] migration/ram.c:745:19: warning: taking address of packed member 'version' of class or structure 'MultiFDPacket_t' may result in an unaligned pointer value [-Waddress-of-packed-member] migration/ram.c:755:19: warning: taking address of packed member 'size' of class or structure 'MultiFDPacket_t' may result in an unaligned pointer value [-Waddress-of-packed-member] Avoid the bug by not using the "modify in place" byteswapping functions. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-Id: <20180925161924.7832-1-peter.maydell@linaro.org> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: fix the compression codeFei Li
Add judgement in compress_threads_save_cleanup() to check whether the static CompressParam *comp_param has been allocated. If not, just return; or else segmentation fault will occur when using the NULL comp_param's parameters. One test case can reproduce this is: set the compression on and migrate to a wrong nonexistent host IP address. Our current code does not judge before handling comp_param[idx]'s quit and cond that whether they have been initialized. If not initialized, "qemu_mutex_lock_impl: Assertion `mutex->initialized' failed." will occur. Fix this by squashing the terminate_compression_threads() into compress_threads_save_cleanup() and employing the existing judgement condition. One test case can reproduce this error is: set the compression on and fail to fully setup the default eight compression thread in compress_threads_save_setup(). Signed-off-by: Fei Li <fli@suse.com> Message-Id: <20180925091440.18910-1-fli@suse.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: fix QEMUFile leakMarc-André Lureau
Spotted by ASAN while running: $ tests/migration-test -p /x86_64/migration/postcopy/recovery ================================================================= ==18034==ERROR: LeakSanitizer: detected memory leaks Direct leak of 33864 byte(s) in 1 object(s) allocated from: #0 0x7f3da7f31e50 in calloc (/lib64/libasan.so.5+0xeee50) #1 0x7f3da644441d in g_malloc0 (/lib64/libglib-2.0.so.0+0x5241d) #2 0x55af9db15440 in qemu_fopen_channel_input /home/elmarco/src/qemu/migration/qemu-file-channel.c:183 #3 0x55af9db15413 in channel_get_output_return_path /home/elmarco/src/qemu/migration/qemu-file-channel.c:159 #4 0x55af9db0d4ac in qemu_file_get_return_path /home/elmarco/src/qemu/migration/qemu-file.c:78 #5 0x55af9dad5e4f in open_return_path_on_source /home/elmarco/src/qemu/migration/migration.c:2295 #6 0x55af9dadb3bf in migrate_fd_connect /home/elmarco/src/qemu/migration/migration.c:3111 #7 0x55af9dae1bf3 in migration_channel_connect /home/elmarco/src/qemu/migration/channel.c:91 #8 0x55af9daddeca in socket_outgoing_migration /home/elmarco/src/qemu/migration/socket.c:108 #9 0x55af9e13d3db in qio_task_complete /home/elmarco/src/qemu/io/task.c:158 #10 0x55af9e13ca03 in qio_task_thread_result /home/elmarco/src/qemu/io/task.c:89 #11 0x7f3da643b1ca in g_idle_dispatch gmain.c:5535 Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20180925092245.29565-1-marcandre.lureau@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: cleanup in error paths in loadvmDr. David Alan Gilbert
There's a couple of error paths in qemu_loadvm_state which happen early on but after we've initialised the load state; that needs to be cleaned up otherwise we can hit asserts if the state gets reinitialised later. Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20180914170430.54271-3-dgilbert@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration/postcopy: Clear have_listen_threadDr. David Alan Gilbert
Clear have_listen_thread when we exit the thread. The fallout from this was that various things thought there was an ongoing postcopy after the postcopy had finished. The case that failed was postcopy->savevm->loadvm. This corresponds to RH bug https://bugzilla.redhat.com/show_bug.cgi?id=1608765 Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20180914170430.54271-2-dgilbert@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: use save_page_use_compression in flush_compressed_dataXiao Guangrong
It avoids to touch compression locks if xbzrle and compression are both enabled Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20180906070101.27280-4-xiaoguangrong@tencent.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: show the statistics of compressionXiao Guangrong
Currently, it includes: pages: amount of pages compressed and transferred to the target VM busy: amount of count that no free thread to compress data busy-rate: rate of thread busy compressed-size: amount of bytes after compression compression-rate: rate of compressed size Reviewed-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Message-Id: <20180906070101.27280-3-xiaoguangrong@tencent.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: do not flush_compressed_data at the end of iterationXiao Guangrong
flush_compressed_data() needs to wait all compression threads to finish their work, after that all threads are free until the migration feeds new request to them, reducing its call can improve the throughput and use CPU resource more effectively We do not need to flush all threads at the end of iteration, the data can be kept locally until the memory block is changed or memory migration starts over in that case we will meet a dirtied page which may still exists in compression threads's ring Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20180906070101.27280-2-xiaoguangrong@tencent.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26Add a hint message to loadvm and exits on failureJose Ricardo Ziviani
This patch adds a small hint for the failure case of the load snapshot process. It may be useful for users to remember that the VM configuration has changed between the save and load processes. (qemu) loadvm vm-20180903083641 Unknown savevm section or instance 'cpu_common' 4. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices Error -22 while loading VM state (qemu) device_add host-spapr-cpu-core,core-id=4 (qemu) loadvm vm-20180903083641 (qemu) c (qemu) info status VM status: running It also exits Qemu if the snapshot cannot be loaded before reaching the main loop (-loadvm in the command line). $ qemu-system-ppc64 ... -loadvm vm-20180903083641 qemu-system-ppc64: Unknown savevm section or instance 'cpu_common' 4. Make sure that your current VM setup matches your saved VM setup, including any hotplugged devices qemu-system-ppc64: Error -22 while loading VM state $ Signed-off-by: Jose Ricardo Ziviani <joserz@linux.ibm.com> Message-Id: <20180903162613.15877-1-joserz@linux.ibm.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: handle the error condition properlyXiao Guangrong
ram_find_and_save_block() can return negative if any error hanppens, however, it is completely ignored in current code Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20180903092644.25812-5-xiaoguangrong@tencent.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: fix calculating xbzrle_counters.cache_miss_rateXiao Guangrong
As Peter pointed out: | - xbzrle_counters.cache_miss is done in save_xbzrle_page(), so it's | per-guest-page granularity | | - RAMState.iterations is done for each ram_find_and_save_block(), so | it's per-host-page granularity | | An example is that when we migrate a 2M huge page in the guest, we | will only increase the RAMState.iterations by 1 (since | ram_find_and_save_block() will be called once), but we might increase | xbzrle_counters.cache_miss for 2M/4K=512 times (we'll call | save_xbzrle_page() that many times) if all the pages got cache miss. | Then IMHO the cache miss rate will be 512/1=51200% (while it should | actually be just 100% cache miss). And he also suggested as xbzrle_counters.cache_miss_rate is the only user of rs->iterations we can adapt it to count target guest page numbers After that, rename 'iterations' to 'target_page_count' to better reflect its meaning Suggested-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Message-Id: <20180903092644.25812-3-xiaoguangrong@tencent.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration/rdma: Fix uninitialised rdma_return_pathDr. David Alan Gilbert
Clang correctly errors out moaning that rdma_return_path is used uninitialised in the earlier error paths. Make it NULL so that the error path ignores it. Fixes: 55cc1b5937a8e709e4c102e74b206281073aab82 Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reported-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180830173657.22939-1-dgilbert@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-08-28qapi: Drop qapi_event_send_FOO()'s Error ** argumentPeter Xu
The generated qapi_event_send_FOO() take an Error ** argument. They can't actually fail, because all they do with the argument is passing it to functions that can't fail: the QObject output visitor, and the @qmp_emit callback, which is either monitor_qapi_event_queue() or event_test_emit(). Drop the argument, and pass &error_abort to the QObject output visitor and @qmp_emit instead. Suggested-by: Eric Blake <eblake@redhat.com> Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180815133747.25032-4-peterx@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> [Commit message rewritten, update to qapi-code-gen.txt corrected] Signed-off-by: Markus Armbruster <armbru@redhat.com>