C#: parse and modify strings in YAML
I'm looking for a way to parse a yaml file and change each line, then save the file without changing the structure of the original file. In my opinion I shouldn't be using Regex for this, but some kind of yaml parser. An example of yaml input is below:
receipt: Oz-Ware Purchase Invoice
date: 2007-08-06
customer:
    given: Dorothy
items:
    - part_no: A4786
      descrip: Water Bucket (Filled)
    - part_no: E1628
      descrip: High Heeled "Ruby" Slippers
      size: 8
bill-to: &id001
    street: |
        123 Tornado Alley
        Suite 16
    city: East Centerville
    state: KS
ship-to: *id001
specialDelivery: >
    Follow the Yellow Brick
    Road to the Emerald City.
...
Desired output:
receipt: ###Oz-Ware Purchase Invoice###
date: ###2007-08-06###
customer:
    given: ###Dorothy###
items:
    - part_no: ###A4786###
      descrip: ###Water Bucket (Filled)###
    - part_no: ###E1628###
      descrip: ###High Heeled "Ruby" Slippers###
      size: ###8###
bill-to: ###&id001###
    street: |
        ###123 Tornado Alley
        Suite 16###
    city: ###East Centerville###
    state: ###KS###
ship-to: ###*id001###
specialDelivery: >
    ###Follow the Yellow Brick
    Road to the Emerald City.###
...
Is there a good YAML parser that can handle complex YAML files, modify strings, and save the data without affecting the document structure? Or perhaps you have another idea for solving this problem. Basically, I would like to iterate over each line from the top of the document and make some changes to the line. Any hints are appreciated.
Most YAML parsers are built to read YAML, whether written by other programs or edited by humans, and to write YAML that will be read by other programs. What is notoriously lacking is the ability of parsers to write YAML that is still human-readable:
- the order of mapping keys is undefined
- comments are discarded
- literal block style for scalars, if any, is dropped
- spacing around scalars is discarded
- folding information for scalars, if any, is dropped
Loading the dump of a loaded, hand-edited YAML file yields the same internal data structures as the initial load, but the intermediate dump normally does not look like the original (hand-edited) YAML.
If you have a Python program:
import ruamel.yaml as yaml
yaml_str = """\
receipt: Oz-Ware Purchase Invoice
date: 2007-08-06
customer:
    given: Dorothy
items:
    - part_no: A4786
      descrip: Water Bucket (Filled)
    - part_no: E1628
      descrip: High Heeled "Ruby" Slippers
      size: 8
bill-to: &id001
    street: |
        123 Tornado Alley
        Suite 16
    city: East Centerville
    state: KS
ship-to: *id001
specialDelivery: >
    Follow the Yellow Brick
    Road to the Emerald City.
"""
data1 = yaml.load(yaml_str, Loader=yaml.Loader)
dump_str = yaml.dump(data1, Dumper=yaml.Dumper)
data2 = yaml.load(dump_str, Loader=yaml.Loader)
Then the following statements hold:
assert data1 == data2
assert dump_str != yaml_str
The intermediate dump_str looks like this:
bill-to: &id001 {city: East Centerville, state: KS, street: '123 Tornado Alley

    Suite 16

    '}
customer: {given: Dorothy}
date: 2007-08-06
items:
- {descrip: Water Bucket (Filled), part_no: A4786}
- {descrip: High Heeled "Ruby" Slippers, part_no: E1628, size: 8}
receipt: Oz-Ware Purchase Invoice
ship-to: *id001
specialDelivery: 'Follow the Yellow Brick Road to the Emerald City.

  '
The above is the default behavior for ruamel.yaml, for PyYAML, and for many other YAML parsers in other languages and in online YAML conversion services. For some parsers, it is the only behavior available.
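The same effect (equal data, different text) is not specific to YAML. As a rough analogy only, here is a minimal sketch using Python's standard json module instead of a YAML library; the input string and variable names are made up for illustration.

```python
import json

# Hand-edited input: irregular spacing, a line break, keys in a chosen order.
original = ('{"receipt": "Oz-Ware",   "items": [ "A4786", "E1628" ],\n'
            ' "date": "2007-08-06"}')

data1 = json.loads(original)                   # initial load
dump_str = json.dumps(data1, sort_keys=True)   # intermediate dump
data2 = json.loads(dump_str)                   # load of the dump

# The data survives the round trip ...
assert data1 == data2
# ... but the presentation (spacing, line breaks, key order) does not.
assert dump_str != original
```

The dump here sorts the keys and normalizes the whitespace, which is the JSON counterpart of the lost key order and spacing listed above.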
The reason for starting ruamel.yaml as an enhancement of PyYAML was to make the trip from hand-edited YAML to internal data and back to YAML produce something more human-readable (what I call round-tripping) and to preserve more information (especially comments).
data = yaml.load(yaml_str, Loader=yaml.RoundTripLoader)
print(yaml.dump(data, Dumper=yaml.RoundTripDumper))
gives you:
receipt: Oz-Ware Purchase Invoice
date: 2007-08-06
customer:
  given: Dorothy
items:
- part_no: A4786
  descrip: Water Bucket (Filled)
- part_no: E1628
  descrip: High Heeled "Ruby" Slippers
  size: 8
bill-to: &id001
  street: |
    123 Tornado Alley
    Suite 16
  city: East Centerville
  state: KS
ship-to: *id001
specialDelivery: 'Follow the Yellow Brick Road to the Emerald City.

  '
My focus has been on comments, key order, and literal block style. Spacing around scalars and folded scalars are not (yet) handled specially.
Starting from there (you could do this in PyYAML as well, but you would not have ruamel.yaml's built-in key-order preservation), you can either provide special emitters or hook into the system at a lower level, overriding some of the methods in emitter.py (and making sure you can still call the originals for the cases you don't need to handle):
def rewrite_write_plain(self, text, split=True):
    # Wrap plain scalars that are mapping values and pad them to column 20.
    if self.state == self.expect_block_mapping_simple_value:
        text = '###' + text + '###'
        while self.column < 20:
            text = ' ' + text
            self.column += 1
    self._org_write_plain(text, split)


def rewrite_write_literal(self, text):
    # Wrap literal block scalars, keeping a trailing newline outside the markers.
    if self.state == self.expect_block_mapping_simple_value:
        last_nl = False
        if text and text[-1] == '\n':
            last_nl = True
            text = text[:-1]
        text = '###' + text + '###'
        if False:  # disabled: optional extra indentation of the literal block
            extra_indent = ''
            while self.column < 15:
                text = ' ' + text
                extra_indent += ' '
                self.column += 1
            text = text.replace('\n', '\n' + extra_indent)
        if last_nl:
            text += '\n'
    self._org_write_literal(text)


def rewrite_write_single_quoted(self, text, split=True):
    # Folded scalars arrive here as single-quoted; wrap them and
    # write them out in folded style instead.
    if self.state == self.expect_block_mapping_simple_value:
        last_nl = False
        if text and text[-1] == u'\n':
            last_nl = True
            text = text[:-1]
        text = u'###' + text + u'###'
        if last_nl:
            text += u'\n'
    self.write_folded(text)


def rewrite_write_indicator(self, indicator, need_whitespace,
                            whitespace=False, indention=False):
    # Wrap anchors (&...) and aliases (*...) and pad them to column 20.
    if indicator and indicator[0] in u"*&":
        indicator = u'###' + indicator + u'###'
        while self.column < 20:
            indicator = ' ' + indicator
            self.column += 1
    self._org_write_indicator(indicator, need_whitespace, whitespace,
                              indention)


# Assumed: `dumper` is the round-trip dumper class used earlier
# (the name is otherwise undefined in this snippet).
dumper = yaml.RoundTripDumper

# Keep the original methods reachable, then install the replacements.
dumper._org_write_plain = dumper.write_plain
dumper.write_plain = rewrite_write_plain
dumper._org_write_literal = dumper.write_literal
dumper.write_literal = rewrite_write_literal
dumper._org_write_single_quoted = dumper.write_single_quoted
dumper.write_single_quoted = rewrite_write_single_quoted
dumper._org_write_indicator = dumper.write_indicator
dumper.write_indicator = rewrite_write_indicator

print(yaml.dump(data, Dumper=dumper, indent=4))
gives you:
receipt:             ###Oz-Ware Purchase Invoice###
date:                ###2007-08-06###
customer:
    given:           ###Dorothy###
items:
-   part_no:         ###A4786###
    descrip:         ###Water Bucket (Filled)###
-   part_no:         ###E1628###
    descrip:         ###High Heeled "Ruby" Slippers###
    size:            ###8###
bill-to:             ###&id001###
    street: |
        ###123 Tornado Alley
        Suite 16###
    city:            ###East Centerville###
    state:           ###KS###
ship-to:             ###*id001###
specialDelivery: >
    ###Follow the Yellow Brick Road to the Emerald City.###
which is hopefully acceptable for further processing in C#.
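The hooking pattern used in the code above (keep the original method reachable under a new name, install a wrapper, call through for everything you don't handle) can be shown in isolation. The sketch below uses a made-up ToyEmitter class with a hypothetical in_value flag; it is not ruamel.yaml's emitter and only illustrates the mechanism.

```python
class ToyEmitter:
    """Hypothetical stand-in for a real emitter; not ruamel.yaml's API."""

    def __init__(self):
        self.in_value = False   # True while a mapping value is being written
        self.chunks = []        # collected output, one entry per scalar

    def write_plain(self, text):
        self.chunks.append(text)


def rewrite_write_plain(self, text):
    # Only decorate mapping values; everything else passes through unchanged.
    if self.in_value:
        text = "###" + text + "###"
    self._org_write_plain(text)


# Keep a handle on the original, then install the wrapper on the class.
ToyEmitter._org_write_plain = ToyEmitter.write_plain
ToyEmitter.write_plain = rewrite_write_plain

emitter = ToyEmitter()
emitter.write_plain("receipt")   # a key: written unchanged
emitter.in_value = True
emitter.write_plain("Oz-Ware")   # a value: wrapped in ###
result = emitter.chunks          # ['receipt', '###Oz-Ware###']
```

Patching the class rather than an instance means every emitter created afterwards picks up the wrapper, which is why the original method has to stay available under another name.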
The YAML specification has this to say:
In the representation model, mapping keys do not have an order. To serialize a mapping, it is necessary to impose an ordering on its keys. This order is a serialization detail and should not be used when composing the representation graph (and hence for the preservation of application data). In every case where node order is significant, a sequence must be used. For example, an ordered mapping can be represented as a sequence of mappings, where each mapping is a single key: value pair. YAML provides convenient compact notation for this case.
Thus, you really shouldn't expect YAML to maintain any order when loading and saving documents.
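The representation the specification suggests, an ordered mapping as a sequence of single-pair mappings, can be sketched in plain Python. The helper names to_pair_sequence and from_pair_sequence and the sample data are made up for this illustration and do not come from any YAML library.

```python
from collections import OrderedDict


def to_pair_sequence(mapping):
    """Turn an ordered mapping into a list of single-pair dicts."""
    return [{key: value} for key, value in mapping.items()]


def from_pair_sequence(pairs):
    """Rebuild an OrderedDict from a list of single-pair dicts."""
    result = OrderedDict()
    for pair in pairs:
        if len(pair) != 1:
            raise ValueError("expected exactly one key per mapping")
        (key, value), = pair.items()   # unpack the only item
        result[key] = value
    return result


invoice = OrderedDict([("receipt", "Oz-Ware Purchase Invoice"),
                       ("date", "2007-08-06"),
                       ("customer", "Dorothy")])

pairs = to_pair_sequence(invoice)      # order now lives in the list
restored = from_pair_sequence(pairs)   # and survives the way back
```

Because the order is carried by the list, any loader that keeps sequences in order (which the specification requires) preserves it. In YAML such a list is written as a block sequence whose entries are one-pair mappings, for example "- receipt: Oz-Ware Purchase Invoice".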
Having said that, I completely understand where you are coming from. Since YAML documents are meant for humans, maintaining a certain order is definitely useful. Unfortunately, because of the specification, most implementations use unordered data structures to represent key/value mappings. In C# and Python this would be a dictionary, and dictionaries are traditionally designed without a guaranteed ordering.
But both C# and Python have ordered dictionary types, OrderedDictionary and OrderedDict, and, at least for Python, there have been some efforts in the past to maintain key order using ordered dictionaries:
- The !!omap type is a special ordered mapping. It is supported in PyYAML.
- PyYAML ticket 29 discusses the possible inclusion of an OrderedLoader. It also contains a short workaround using YAML constructors partway through, and a possible loader implementation at the end.
- PyYAML ticket 161 contains a recipe that provides this functionality as well.
- Finally, there is this other question that covers loading YAML into OrderedDicts.
That is the Python side; I'm sure there are similar efforts for C# implementations too.