How can I load many (100K+) XML documents using mlcp without hitting an "Argument list too long" error?
When I try to load 160,000 XML documents into MarkLogic 8.0-2 using mlcp on MacOS 10.10.4, the following error is thrown:
mlcp-Hadoop2-1.3-1/bin/mlcp.sh: line 16: /usr/bin/java: Argument list too long
The command I issue is:
mlcp import -database FO -username sss4r -password ******* -host localhost -port 8003 -mode local -input_file_pattern '*\.xml' -output_uri_replace "/Users/sss4r/Documents/FOPOC,''" -input_file_path .
I realize this is probably a Unix shell problem: mlcp uses the filesystem facilities to return a list of names, and there is a system limit on how many filenames can be passed to a command.
What do MarkLogic experts recommend to fix this problem? Loading in smaller chunks? Changing the system limit?
Thanks.
MLCP does not depend on shell expansion to find the files to load. I'm afraid shell expansion is happening inside mlcp.sh, but only unintentionally. If you remove the -input_file_pattern parameter, you will likely see that it loads all files. A quick fix might be to put the files in a sub-directory, drop the file pattern, and just pass that sub-directory as the -input_file_path.
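To illustrate how a quoted pattern can still end up expanded, here is a minimal sketch. It assumes the wrapper script forwards its arguments unquoted (for example with $*); that is an assumption about mlcp.sh, not something confirmed here. The function names and the simplified '*.xml' pattern are hypothetical and exist only for this demo.

```shell
#!/bin/sh
# Hypothetical demo; this is NOT the actual contents of mlcp.sh.

# Stand-in for the Java process: just report how many arguments arrived.
count_args() { echo "$#"; }

# Forwards arguments unquoted, so the shell globs them a second time.
forward_unquoted() { count_args $*; }

# Forwards arguments verbatim, so the pattern stays a single argument.
forward_quoted() { count_args "$@"; }

# Work in a scratch directory containing three XML files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.xml b.xml c.xml

# The pattern is quoted at the call site in both cases.
unquoted_count=$(forward_unquoted '*.xml')
quoted_count=$(forward_quoted '*.xml')

echo "unquoted: $unquoted_count, quoted: $quoted_count"
```

With 160,000 matching files instead of three, the unquoted form would hand every file name to the child process, which is the kind of situation that produces "Argument list too long".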
Rob S. gives another solution that avoids this: put your options in a file, each option on a separate line, and point to it with the -options_file option on the command line. It also saves you trouble with quotes and other special characters being unintentionally interpreted by the shell.
More details here: https://docs.marklogic.com/guide/ingestion/content-pump#id_36150
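As a rough sketch of that approach, the options from the question could be written to a file like this. The file name mlcp-options.txt is arbitrary, the pattern is written as a Java regex, and the outer double quotes around the -output_uri_replace value are dropped on the assumption that no shell will see it; check the linked guide for the exact quoting rules before relying on this.

```shell
#!/bin/sh
# Write one option or one value per line. The quoted 'EOF' keeps the
# shell from touching backslashes, asterisks, or quotes in the body.
cat > mlcp-options.txt <<'EOF'
-database
FO
-username
sss4r
-host
localhost
-port
8003
-mode
local
-input_file_path
.
-input_file_pattern
.*\.xml
-output_uri_replace
/Users/sss4r/Documents/FOPOC,''
EOF

# Then invoke mlcp with the file, supplying the password separately
# (shown as a comment because it needs a running MarkLogic instance):
#   mlcp.sh import -options_file mlcp-options.txt -password <your password>
```

Because the pattern now lives in a file, the shell never gets a chance to expand it.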
HTH!
PS: I filed a bug to improve MLCP (#33670)
First, you will save a lot of grief if you use an options file whenever there are command line argument values that the shell might interpret. Otherwise, you end up fighting an uphill battle against shell quoting. Geert has already provided a link to this syntax, so I won't repeat it.
Second, -input_file_pattern requires a Java regex. *\.xml is probably not what you want; you probably mean .*\.xml. For the pattern language used by mlcp, see
https://docs.marklogic.com/guide/ingestion/content-pump#id_10243